Web service performance monitoring. Part I
The better the development process is set up, the less likely performance problems are to appear in a release. They cannot be avoided completely, though, for the simple reason that development relies on assumptions about the conditions the web service will run under, and real life constantly makes its own adjustments.
A lot depends on how often such problems appear and how quickly they are fixed: user satisfaction with the service, the developers' reputation, and so on. So how can you deal with performance issues?
One scenario is a passive reaction, i.e. solving problems as they come in. The support team accumulates complaints until some “critical mass” is reached and then brings in the developers. The developers spend some time finding and fixing the problems, and the web service starts processing requests quickly again.
The main drawback of this option is that users get to fully “enjoy” the sluggishness of the web service before the developers get involved, and even then the problem still has to be found. Another drawback is that the developers have to work to “due yesterday” deadlines, which is not great either.
Another option is a pseudo-active one. Some utilities are installed to monitor anything and everything, and then the chief shaman with a tambourine looks at the charts from time to time and tries to spot a problem in them.
This option is not much different from the first, because the shaman gets tired of looking at boring graphs and numbers, and it usually degrades into the first option. Even if the shaman does manage to spot the problem before an avalanche of complaints, it still takes time to find and fix it.
What is needed is a different scenario, one that lets you keep a finger on the pulse at minimal cost and quickly localize the source of problems.
Proactive monitoring
Proactive monitoring means two things:
- Active notification of problems, whether by phone, SMS, Jabber, ICQ, or SOAP;
- A clear plan of action for localizing and fixing problems.
It is convenient to build notifications on top of a ready-made monitoring system, of which there are many. We will not consider homegrown solutions, since those are the domain of enthusiasts. Among the professional ones I would mention Cacti, Nagios, and Zabbix. However, only the dearly beloved Zabbix meets the notification requirements (or did I miss something in Cacti?), and Nagios is not well suited to storing historical data, which is very useful for analyzing problems.
Alerts will seriously unburden the local shamans: instead of staring at hundreds of charts a day, it is enough to carry a phone in your pocket. The only question is what to monitor and how.
If we are talking about performance, and we are not touching on other topics for now, it is enough to monitor a single parameter: the average response time. If it exceeds the value specified in the requirements for the web service, we have a problem and need to deal with it promptly. To be able to do that, you need to set aside some time in advance and plan how you will localize problems.
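As a rough illustration of the check described above, here is a minimal Python sketch that keeps a sliding window of response times and reports when the average exceeds a limit. The class name, the window size, and the 0.5 s threshold are all assumptions made for the example; in practice the monitoring system would perform this comparison and send the notification.

```python
from collections import deque
from statistics import fmean


class ResponseTimeMonitor:
    """Tracks a sliding window of response times and flags a breach."""

    def __init__(self, threshold_s: float, window: int = 100) -> None:
        # Limit taken from the service requirements (assumed value here).
        self.threshold_s = threshold_s
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)

    def record(self, seconds: float) -> None:
        self.samples.append(seconds)

    def average(self) -> float:
        # An empty window is treated as "no load", i.e. an average of 0.
        return fmean(self.samples) if self.samples else 0.0

    def breached(self) -> bool:
        return self.average() > self.threshold_s


monitor = ResponseTimeMonitor(threshold_s=0.5, window=3)
for t in (0.2, 0.4, 0.3):
    monitor.record(t)
breached_before = monitor.breached()  # window average is still under the limit

for t in (0.9, 1.1):
    monitor.record(t)
breached_after = monitor.breached()   # window now holds 0.3, 0.9, 1.1
```

A sliding window is only one possible way to define "average"; a fixed time interval would work just as well, as long as the same definition is used in the requirements.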
Start the plan by drawing up a request processing scheme. An example of such a scheme:
The scheme has two types of components: mandatory and optional. Mandatory components, such as the web server and the handler, always take part in request processing, and access to them is drawn with a solid line. Components such as the file system, memcached, and MySQL may be optional and are therefore drawn with a dashed line.
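The same scheme can also be written down as data, which makes it easier to check against later. The sketch below is one hypothetical way to do it: the component names follow the text (with memcached as the cache and MySQL as the database), while the `Component` class and the `components_on_path` helper are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    mandatory: bool  # True = solid line in the scheme, False = dashed line


# Component names follow the text; the structure itself is hypothetical.
SCHEME = [
    Component("web server", mandatory=True),
    Component("handler", mandatory=True),
    Component("file system", mandatory=False),
    Component("memcached", mandatory=False),
    Component("MySQL", mandatory=False),
]


def components_on_path(used_optional: set[str]) -> list[str]:
    """Return the components a given request touches, in scheme order."""
    return [c.name for c in SCHEME if c.mandatory or c.name in used_optional]


# A request that is served from the cache and never reaches the database.
path = components_on_path({"memcached"})
```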
With such a scheme in hand, write out a list of possible problems for each component and a procedure for resolving each, so that developers need to be involved only in exceptional cases. An example of such a list:
| # | Symptom | Cause | Response |
|---|---|---|---|
| 1 | Average wait for a free handler exceeded the acceptable value | Influx of users | Horizontal scaling |
| 2 | Average time to read the request / transmit the response exceeded the acceptable value | Network resources exhausted | Change of tariff plan |
| 3 | Average request processing time exceeded the acceptable value | Problems with the code, the database, etc. | Time to bring in the developers |
In fact, the third item breaks down into many sub-items, some of which are resolved administratively; developers should be involved only in rare cases.
The advantage of having such a plan is obvious: most problems are resolved, if not instantly, then within a predictable time. However, you need to work hard both to draw up the plan and to adapt the system to it, but more on that later.
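The list above is essentially a lookup from a breached timing to a response, so it can be sketched as a small triage function. Everything in this Python example is an assumption made for illustration: the `StageTimings` fields, the limit values, and the wording of the returned actions. It only mirrors the three rows of the list.

```python
from dataclasses import dataclass


@dataclass
class StageTimings:
    handler_wait_s: float  # average time a request waits for a free handler
    transfer_s: float      # average time to read the request and send the response
    processing_s: float    # average time the handler spends processing


# Hypothetical limits; real values come from the service requirements.
LIMITS = StageTimings(handler_wait_s=0.05, transfer_s=0.10, processing_s=0.30)


def triage(avg: StageTimings, limits: StageTimings = LIMITS) -> list[str]:
    """Map breached stage timings to the responses from the list."""
    actions = []
    if avg.handler_wait_s > limits.handler_wait_s:
        actions.append("scale horizontally")
    if avg.transfer_s > limits.transfer_s:
        actions.append("change the tariff plan")
    if avg.processing_s > limits.processing_s:
        actions.append("bring in developers")
    return actions


# Requests queue up for a handler while the other stages stay within limits.
actions = triage(StageTimings(handler_wait_s=0.20, transfer_s=0.04, processing_s=0.10))
```

Returning a list rather than a single action reflects the fact that several symptoms can show up at once; the order simply follows the rows of the list.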
To be continued...