Distributed and cloud infrastructure monitoring

    In the previous article I reviewed the various ways to monitor simple web projects and websites: cases where the site does not need 99.99% availability and the acceptable reaction time is hours or days. In short, where everything is simple. In this article I will describe the mechanisms for monitoring cloud infrastructure, where a simple available/unavailable signal is not nearly enough to understand what the problems are and how to solve them quickly, and where fixing a problem may require a large number of actions that can only be partially automated.

    Typically, the reliability built into the project infrastructure allows the reaction time to problems to stay the same: hours or even days. But there are a number of places where decisions should be made (semi-)automatically, to eliminate the human factor and minimize system downtime. The triggers for such decisions are discussed below. I want to note right away that almost all of the monitoring techniques described here are used in the new cloud service for the social intranet, Bitrix24.


    One level is never enough


    When building a fault-tolerant distributed infrastructure, in addition to providing several levels of reliability, several levels of system monitoring are usually put in place. Among them:
    • Built-in monitoring reports the physical parameters of the servers (disk, memory, CPU, network, system, and so on). Munin is often used here (Habr has already written a lot about it), with its large set of plugins that let you keep an eye on every problem point. Plugins are, in essence, console scripts that check a specific system parameter at a given interval. In theory, triggers could fire "unloading" actions on the server even at this level, but in practice decision-making is left to the next level; built-in monitoring is used only to collect statistics and analyze system parameters from the inside.
    • Internal monitoring tracks the state of the entire infrastructure, or part of it, from within the infrastructure itself. This means that alongside the working servers (applications, databases), the system must have servers that watch its status and, in critical cases, act on this information: sending SMS notifications, launching new application servers, or updating the list of live application servers on the balancer. The most frequently used solution here is Nagios (articles on Habr) with a large number of checks (usually several hundred or several thousand). In the case of PHP applications, Pinba (Habr) is added for more precise analysis of problems.
    • Usually the two previous levels are enough to detect and fix any infrastructure problem, but in the case of clouds there is often also intermediate monitoring: tracking the status of all infrastructure servers and analyzing all requests passing through the capacity allocated in the cloud. Intermediate monitoring serves as an additional level of quality control (for example, it makes it easy to track the number of 500 errors even when the application servers appear to be operating normally) and as a basis for deciding to switch capacity between geo-clusters (with Amazon, for example, this is possible).
    • Finally, external monitoring analyzes the situation from the users' side. Even if the system is working properly, the connections between servers are intact, and the servers respond quickly and stably, users may still experience problems with the service, depending on the general state of the network. At this level, additional triggers are possible for switching users to other geo-clusters (for example, European Bitrix24 users should be sent to a European data center and American users to an American one) to improve the quality of service. This level can also be used to cross-check the results of internal and intermediate monitoring.
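
    Checks at the internal level are typically small scripts following the Nagios plugin convention: exit code 0 means OK, 1 means WARNING, 2 means CRITICAL. A minimal sketch of such a check (the metric and thresholds are illustrative, not from a real configuration):

```shell
#!/bin/bash
# Minimal Nagios-style check: classify a measured value against thresholds.
# Exit codes follow the Nagios plugin convention: 0=OK, 1=WARNING, 2=CRITICAL.

# classify VALUE WARN CRIT -> prints the state, returns the matching exit code
classify() {
    local value=$1 warn=$2 crit=$3
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value=$value"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value=$value"
        return 1
    fi
    echo "OK - value=$value"
    return 0
}

# Example: treat HTTP response time in milliseconds as the metric
# (a real check would measure it, e.g. with curl).
classify "${1:-120}" 500 2000
```

    A real plugin only has to add the measurement itself; Nagios takes care of scheduling, escalation, and notification based on the exit code.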

    Built-in monitoring


    Munin provides a large amount of information about the state of a given server. The most frequently checked points include:
    • PHP check (execution time, memory usage, hits per process, etc.)
    • Nginx check (response codes, memory usage, number of processes)
    • MySQL check (number of queries, memory usage, query execution time)
    • Disk check
    • Mail daemon check
    • Network check
    • System check

    A few graphs from monitoring a real system (load and database):





    For example, here is a plugin script that checks the free space in the InnoDB tablespace of MySQL:
    #!/bin/bash
    ## Tunable parameters with defaults
    MYSQL="${mysql:-/usr/bin/mysql}"
    MYSQLOPTS="${mysqlopts:---user=munin --password=munin --host=localhost}"
    WARNING=${warning:-2147483648}   # 2GB
    CRITICAL=${critical:-1073741824} # 1GB

    ## No user serviceable parts below
    if [ "$1" = "config" ]; then
        echo 'graph_title MySQL InnoDB free tablespace'
        echo 'graph_args --base 1024'
        echo 'graph_vlabel Bytes'
        echo 'graph_category mysql'
        echo 'graph_info Amount of free bytes in the InnoDB tablespace'
        echo 'free.label Bytes free'
        echo 'free.type GAUGE'
        echo 'free.min 0'
        echo 'free.warning' $WARNING:
        echo 'free.critical' $CRITICAL:
        exit 0
    fi

    # Get free space from MySQL
    freespace=$($MYSQL $MYSQLOPTS --batch --skip-column-names --execute \
        "SELECT table_comment FROM tables WHERE TABLE_SCHEMA = 'munin_innodb'" \
        information_schema)
    retval=$?

    # Sanity checks
    if ((retval > 0)); then
        echo "Error: mysql command returned status $retval" 1>&2
        exit 1
    fi
    if [ -z "$freespace" ]; then
        echo "Error: mysql command returned no output" 1>&2
        exit 1
    fi

    # Report free space (the table comment looks like "InnoDB free: NNN kB")
    echo $freespace | awk '/InnoDB free:/ {print "free.value", $3 * 1024}'

    A large list of popular plugins can be found in the Git repository.

    Internal monitoring


    Nagios is certainly a good monitoring solution, but you need to be prepared to supplement it with your own scripts and/or Pinba (or a similar solution for your programming language). Pinba receives UDP packets from the application and aggregates script execution time, memory usage, and error codes. In principle, this is enough to build a complete picture of what is happening and to maintain the required level of service reliability automatically.
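
    To illustrate the fire-and-forget idea behind Pinba: the application sends a UDP datagram with its metrics and never waits for the collector. Real Pinba packets are protobuf-encoded, so the plain-text payload and the collector address below are purely illustrative:

```shell
#!/bin/bash
# The Pinba idea in miniature: metrics leave the application as a UDP
# datagram; sending takes microseconds and never blocks on the collector.
# NOTE: real Pinba uses a binary protobuf protocol; the plain-text payload
# and the collector address here are placeholders for illustration only.

# format_metric SCRIPT TIME_MS MEM_KB STATUS -> one-line payload
format_metric() {
    echo "script=$1 time_ms=$2 mem_kb=$3 status=$4"
}

# send_metric HOST PORT PAYLOAD -> fire-and-forget via bash's /dev/udp
send_metric() {
    { echo "$3" > "/dev/udp/$1/$2"; } 2>/dev/null || true
}

payload=$(format_metric "/index.php" 42 10240 200)
send_metric 127.0.0.1 30002 "$payload"
echo "$payload"
```

    Because delivery is not guaranteed, such metrics are suitable for statistics and trend analysis, not for transactional accounting; that trade-off is exactly what keeps the overhead per request negligible.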

    At the internal monitoring level, it is already possible to make decisions about allocating additional capacity or shutting it down (if this can be done automatically, it is enough to simply watch the average CPU load on the application or database servers; if it is done manually, the system can send e-mail or Jabber messages). Also, in the event of an abnormal number of errors (usually this happens either when hardware fails or when a new version of the web service introduces a bug; which of the two it is can always be established through additional checks), emergency notifications can be sent by SMS or phone call.
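
    A trivial sketch of such an error-count trigger over the nginx access log (the log path and threshold are illustrative; a real setup would also window or rotate the log between checks):

```shell
#!/bin/bash
# Count 5xx responses in an nginx access log and raise an alert when the
# number looks abnormal. Log path and threshold are illustrative.

# count_5xx FILE -> number of 5xx responses (assumes the combined log
# format, where the status code is the 9th whitespace-separated field)
count_5xx() {
    awk '$9 ~ /^5[0-9][0-9]$/ {n++} END {print n + 0}' "$1"
}

LOG=${1:-/var/log/nginx/access.log}
THRESHOLD=50   # errors per check interval; tune to your traffic

if [ -r "$LOG" ]; then
    errors=$(count_5xx "$LOG")
    if [ "$errors" -gt "$THRESHOLD" ]; then
        # hook SMS / Jabber / phone notification in here
        echo "ALERT: $errors server errors in the last interval"
    else
        echo "OK: $errors server errors"
    fi
else
    echo "no readable log at $LOG"
fi
```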

    It is also very convenient to configure automatic addition (or removal) of checks from predefined templates as the number of checkpoints (for example, servers or user sites) grows: say, a check of the main page, PHP execution time, PHP memory usage, and the number of nginx and PHP errors.
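
    A sketch of template-based check generation: one function emits a standard set of Nagios service definitions for a host, and the config fragment is regenerated whenever the host list changes. The host and check names here are made up for the example:

```shell
#!/bin/bash
# Generate Nagios service definitions from a template for each host, so
# checks appear automatically as new servers come online. Host names and
# check commands are placeholders for illustration.

# gen_services HOST -> Nagios service definitions for the standard checks
gen_services() {
    local host=$1 svc
    for svc in check_http check_php_time check_php_memory check_nginx_errors; do
        printf 'define service {\n'
        printf '    host_name           %s\n' "$host"
        printf '    service_description %s\n' "$svc"
        printf '    check_command       %s\n' "$svc"
        printf '    use                 generic-service\n'
        printf '}\n'
    done
}

# Regenerate the config fragment for every application server, then
# reload Nagios (reload step omitted here).
for host in app01 app02; do
    gen_services "$host"
done > /tmp/auto_services.cfg

echo "wrote $(grep -c 'define service' /tmp/auto_services.cfg) service definitions"
```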

    Intermediate monitoring


    Few providers offer monitoring at the cloud infrastructure level, and it is mostly informational: real decisions are made either on internal data or on the external state of the system. At the intermediate level you can only collect statistics or corroborate the internal picture of the infrastructure.

    The following checks are available for Amazon ( CloudWatch ):
    • Total traffic (per instance and in aggregate).
    • The number of responses from the balancer.
    • The state of the instances (including the infrastructure instances that perform internal monitoring).
    • And a number of others that can be combined with internal checks, although it is better to keep as much of the logic as possible at the internal monitoring level.
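
    For example, the balancer-level 5xx count can be pulled with the AWS CLI. The sketch below only prints the command rather than executing it, so it can be run without credentials; the load balancer name is a placeholder:

```shell
#!/bin/bash
# Query the balancer-level 5xx count from CloudWatch via the AWS CLI.
# The load balancer name is a placeholder; region and credentials are
# assumed to be configured in the usual way.

# build_query LB_NAME MINUTES -> the aws invocation for the last N minutes
build_query() {
    local lb=$1 minutes=$2
    echo "aws cloudwatch get-metric-statistics" \
         "--namespace AWS/ELB" \
         "--metric-name HTTPCode_Backend_5XX" \
         "--dimensions Name=LoadBalancerName,Value=$lb" \
         "--start-time $(date -u -d "-$minutes min" +%Y-%m-%dT%H:%M:%SZ)" \
         "--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)" \
         "--period 300 --statistics Sum"
}

# Print the command instead of running it, so the sketch is safe to try:
build_query "my-elb" 15
```

    Comparing this number with the error counts from internal monitoring is exactly the kind of cross-check the intermediate level is good for.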

    Based on the results of intermediate monitoring (at the balancer level), one can make an informed decision about starting or shutting down machines (instances) in a cluster. This is exactly what Bitrix24 does: as soon as the load on the application servers climbs too high (above 60%), new instances are started. Conversely, when the load drops below 40%, instances are shut down.
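
    The 60%/40% rule reduces to a simple decision function with a hysteresis band in between, so instances are not started and stopped on every fluctuation. The thresholds are the ones quoted above; everything else is illustrative:

```shell
#!/bin/bash
# The 60%/40% autoscaling rule as a pure decision function. The measured
# load would come from monitoring; the sample values are illustrative.

# scale_decision LOAD_PERCENT -> "up" | "down" | "hold"
scale_decision() {
    local load=$1
    if [ "$load" -gt 60 ]; then
        echo "up"     # start new instances
    elif [ "$load" -lt 40 ]; then
        echo "down"   # shut idle instances down
    else
        echo "hold"   # hysteresis band: do nothing
    fi
}

scale_decision 72
scale_decision 35
scale_decision 50
```

    The gap between the two thresholds is the important design choice: without it, load hovering around a single threshold would cause instances to flap up and down continuously.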

    External monitoring


    Here the choice of solutions is very wide. If you need to monitor server status around the world, the best solution is Pingdom . For Russian realities, PingAdmin , Monitorus or WEBO Pulsar will do (the last two have a network of servers in Russia). It is especially convenient to set up checks from several points (for example, Moscow + St. Petersburg) and to pull a remote notification script if the service is unavailable for 1-2 minutes. If at the same time problems are also visible from the inside, you can immediately switch to plan B (shut down idle servers, send out notifications, and so on).
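
    The "remote notification script" pattern can be sketched as follows: count consecutive failed checks and trigger plan B after roughly 1-2 minutes of downtime. The URL, the state file, and the plan-B trigger are placeholders:

```shell
#!/bin/bash
# Sketch of an external availability check: count consecutive failures in
# a state file and trigger plan B once the service has been down for about
# two check intervals. URL, state file and trigger are placeholders.

# is_up URL -> success when the page answers HTTP 200 within 10 seconds
is_up() {
    [ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1")" = "200" ]
}

# next_fail_count PREV RESULT -> new consecutive-failure count
next_fail_count() {
    if [ "$2" = "up" ]; then echo 0; else echo $(( $1 + 1 )); fi
}

STATE=/tmp/ext_check_fails
URL=${1:-https://example.com/}
prev=$(cat "$STATE" 2>/dev/null || echo 0)

if is_up "$URL"; then result=up; else result=down; fi
fails=$(next_fail_count "$prev" "$result")
echo "$fails" > "$STATE"

if [ "$fails" -ge 2 ]; then  # two failed 1-minute checks ~= 1-2 min downtime
    # curl -s https://ops.example.com/plan-b   # hypothetical trigger URL
    echo "service down for $fails checks - trigger plan B"
else
    echo "status: $result (consecutive failures: $fails)"
fi
```

    Requiring two consecutive failures filters out single lost probes; the external service runs the check every minute from each monitoring point.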

    Additional advantages of external monitoring include measuring the actual server-side response time (and the real network delays); notifications can be configured for this parameter as well. If you use a CDN, there is a further opportunity: you can track the total load time of the service pages and disable or enable the CDN for individual regions.
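
    curl exposes the same timings an external monitor reports (DNS, connect, time to first byte, total), which makes it easy to prototype such a per-region decision locally. The URL and the 1.5-second limit below are placeholders:

```shell
#!/bin/bash
# Break a page load into the timings external monitors report, and compare
# the total against a limit to decide whether a region needs the CDN.
# The URL and the limit are placeholders for illustration.

# timings URL -> one line with DNS, connect, TTFB and total times in seconds
timings() {
    curl -s -o /dev/null --max-time 15 \
         -w 'dns=%{time_namelookup} connect=%{time_connect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
         "$1" || true
}

# exceeds LIMIT VALUE -> succeeds (exit 0) when VALUE > LIMIT (floats ok)
exceeds() {
    awk -v l="$1" -v v="$2" 'BEGIN { exit !(v > l) }'
}

timings "https://example.com/"
# Placeholder values: a real script would parse the measured total instead.
if exceeds 1.5 2.0; then
    echo "would enable the CDN for this region"
fi
```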

    P.S. This is an overview article about the monitoring architecture of large projects; I will cover specific applied details in the following articles.
