MaxRokatansky May 24, 2019 at 11:21

Why don't engineers care about application monitoring?

Translation

Happy Friday, everyone! Friends, today we continue our series of publications tied to the DevOps Practices and Tools course, because classes for the new group start at the end of next week. So, let's begin!

Monitoring is easy. This is a known fact. Stand up Nagios, run NRPE on the remote system, point Nagios at NRPE on TCP port 5666, and you have monitoring.
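Before pointing Nagios at a host, it can help to confirm that the NRPE port is reachable at all. The sketch below is only a plain TCP connectivity check written for illustration: it does not speak the NRPE protocol, and the function name `tcp_port_open` is made up for this example. The default port 5666 is the one mentioned above.

```python
import socket


def tcp_port_open(host: str, port: int = 5666, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    This only proves that something is listening on the port; it does not
    verify that the listener is NRPE or that it is configured correctly.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Calling `tcp_port_open("monitored-host.example")` (a placeholder hostname) returns `False` when a firewall or a stopped daemon is in the way, which is often the first thing to rule out.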

It is so easy that it is not interesting. You now have the basic metrics for CPU time, the disk subsystem and RAM that Nagios and NRPE provide by default. But in reality, this is not “monitoring” as such. This is just the beginning.

(Usually people then install PNP4Nagios, RRDtool and Thruk, set up notifications in Slack and head straight to Nagios Exchange, but let's leave that aside for now.)

Good monitoring is actually quite complicated: you really need to know the internals of the application you are monitoring.

Is monitoring difficult?

Any server, whether Linux or Windows, serves some purpose by definition. Apache, Samba, Tomcat, file storage, LDAP - all these services are more or less unique in one or more respects. Each has its own function and its own characteristics, and each offers different ways to obtain the metrics and KPIs (key performance indicators) that interest you when the server is under load.

Photo by Luke Chesser on Unsplash

(I wish my dashboards were neon blue - sighs dreamily - ... hmm ...)

Any software that provides a service should have a mechanism for collecting metrics. Apache has the mod_status module, which displays a server status page. Nginx has stub_status. Tomcat has JMX or special web applications that show key metrics. MySQL has the “show global status” command, and so on.
So why don't developers embed such mechanisms in the applications they create?
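To show how little it takes to turn one of these built-in mechanisms into numbers, here is a minimal sketch that parses the plain-text page produced by Nginx's stub_status. The function name `parse_stub_status` and the dictionary keys are choices made for this example, and the sample values are illustrative, not measurements from a real server.

```python
import re

# Illustrative page in the layout stub_status produces; the numbers are made up.
SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""


def parse_stub_status(text: str) -> dict:
    """Turn a stub_status page into a dict of integer metrics."""
    active = re.search(r"Active connections:\s+(\d+)", text)
    # The three counters sit on a line of their own, below their column names.
    totals = re.search(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE)
    states = re.search(r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text)
    if not (active and totals and states):
        raise ValueError("text does not look like a stub_status page")
    accepts, handled, requests = (int(v) for v in totals.groups())
    reading, writing, waiting = (int(v) for v in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }
```

In practice the text would be fetched over HTTP from whatever location the status page is exposed at in your Nginx configuration; that URL is deployment-specific and deliberately left out here.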

Is it only developers?

A certain indifference to embedding metrics is not limited to developers. I have worked at companies where applications were developed on Tomcat without exposing any metrics of their own and without any logs of service activity, apart from the general Tomcat error logs. Some developers generate an abundance of logs that mean nothing to the system administrator unlucky enough to be reading them at 3:15 in the morning.

Photo by Tim Gouw on Unsplash

System engineers who allow such products to be released should also bear some responsibility for the situation. Few system engineers have the time or inclination to extract meaningful metrics from logs without the context for those metrics and without the ability to interpret them in light of application activity. Some do not see what benefit they could get from them, beyond indicators such as "something is (or soon will be) wrong."

A change in thinking regarding the need for metrics should occur not only among developers, but also among system engineers.

For any system engineer who needs not only to respond to critical events but also to prevent them, the lack of metrics is usually an obstacle.

However, system engineers usually do not dig into the code that makes money for their company. They need lead developers who understand the system engineer's responsibility for detecting problems, raising awareness of performance issues, and the like.

This devops thing

The devops mentality describes the synergy between development (dev) and operations (ops) thinking. Any company claiming to be “doing devops” should:

admit that they are probably not actually doing it (a nod to the meme from the movie “The Princess Bride”: “I don’t think it means what you think it means!”);
promote a position of continuous product improvement.

You cannot improve a product, or know that it has improved, if you do not know how it currently behaves. You cannot find out how a product behaves if you do not understand how its components work, which services it depends on, and where its main pain points and bottlenecks are.
Unless you are watching the potential bottlenecks, you cannot follow the Five Whys technique when writing a postmortem. You cannot bring everything together on one screen to see how the product is doing or to find out what “normal and happy” looks like.

Left shift, LEFT, I SAID, NOOOOOOOO—

For me, one of the key principles of devops is “shift left”. In this context, shifting left means moving the ability (not the responsibility, only the ability) to do what system engineers usually care about, such as creating performance metrics and using logs more effectively, to the left in the software delivery life cycle.

Photo by NESA by Makers on Unsplash

Software developers should know and be able to use the tools the company uses for monitoring in all its forms (metrics, logging, monitoring interfaces) and, most importantly, should watch how their product behaves in production. You cannot get developers to invest time and effort in monitoring until they can see the metrics and influence how they look, how the product owner will present them to the CTO at the next briefing, and so on.

In short

Lead the horse to water. Show developers how many problems they can spare themselves, and help them identify the right KPIs and metrics for their applications, so that the product owner, who is being shouted at by the CTO, shouts less at them. Bring them to the light, gently and calmly. If that does not work, then bribe, threaten and cajole either them or the product owner to get these metrics out of the applications quickly, and then draw the graphs. This will be difficult, because it will not be considered a priority and the product roadmap will be full of revenue-generating projects awaiting implementation. So you will need a business case to justify the time and money spent on building monitoring into the product.
Help system engineers get enough sleep. Show them that using a release checklist for every product is a good thing. Checking that every application in production is covered by metrics will help them sleep at night, because developers can see what works and what does not. However, a sure way to annoy and upset any developer, product owner and CTO is to put a spoke in the wheel and resist. Such behavior will affect the release date of any product if you wait until the last minute, so again, shift left and include these issues in the project plan as early as possible. If necessary, make your way into product meetings. Wear a fake mustache and a felt hat or something like that; it never fails. Report your concerns.
Make sure that both development (dev) and operations (ops) understand the meaning and consequences of product metrics moving into the red zone. Do not leave operations as the sole guardian of the product’s health; make sure developers take part as well (#productsquads).
Logs are a great thing, and so are metrics. Combine them, and do not let your logs turn into garbage in a huge flaming ball of futility. Explain and show developers why no one but them will understand their logs, and show them what it feels like to read useless logs at 3:15 in the morning.
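One possible way to combine logs and metrics, sketched here under assumptions rather than taken from the original article: attach a second handler to the application's logger that counts records per level, so that every log line with context also feeds a number you can graph or alert on. `CountingHandler`, `build_logger` and the logger name are hypothetical names invented for this example; only Python's standard `logging` module is used.

```python
import logging
from collections import Counter
from typing import Tuple


class CountingHandler(logging.Handler):
    """Counts log records per (logger name, level name) so log volume becomes a metric."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Counter = Counter()

    def emit(self, record: logging.LogRecord) -> None:
        self.counts[(record.name, record.levelname)] += 1


def build_logger(name: str) -> Tuple[logging.Logger, CountingHandler]:
    """Return a logger that writes readable lines and feeds a counter at the same time."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the example's output self-contained

    counter = CountingHandler()
    stream = logging.StreamHandler()
    stream.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(counter)
    logger.addHandler(stream)
    return logger, counter


# Usage: the message carries the context a person needs at 3:15 in the morning,
# while the counter can be exported to whatever monitoring system is in place.
logger, counter = build_logger("orders")
logger.info("order accepted order_id=%s", 42)
logger.error("payment failed order_id=%s reason=%s", 42, "card_declined")
errors_so_far = counter.counts[("orders", "ERROR")]
```

The design choice is deliberate: the log line explains what happened to a specific order, and the counter answers how often it is happening, which is the part an alert threshold needs.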

Photo by Marko Horvat on Unsplash

That's all. New material will be released next week. If you want to learn more about the course, we invite you to the open day, which will be held on Monday. And, as usual, we look forward to your comments.
