Project Monitoring: Comparative Analysis of Existing Solutions



    We bring to your attention a publication based on the speech of Anton Baranov , ITSumma, at the BitrixSummerFest summer conference .

    In this post I will tell you what you need to monitor in the projects from the very beginning, including highly loaded ones. We will analyze aspects related to open source monitoring systems, in particular Zabbix and Graphite, as well as a look at several SaaS monitoring solutions.

    So, from the very beginning you need to decide for yourself which system parameters you need to monitor. Conventionally, there are three areas of monitoring:

    1. Resource consumption.
    2. The work of server software.
    3. Business logic application.

    Resources


    They are divided into two types: critical and important. Critical resources are those whose lack is guaranteed to lead to the inoperability of your application. For example, if the processor is fully loaded, then new requests will not be processed quickly. And this can lead to users leaving.

    Also critical resources include RAM, disk subsystem bandwidth, network bandwidth. With a lack of each of these resources, the responsiveness of your service can drop so much that you begin to lose your audience.

    Important resources are those whose dynamics of changes suggest that problems will soon arise with the project. For example, if you are starting to use a swap file, then in most cases this indicates a lack of RAM. In addition, the use of a swap file reduces system performance, and as a result, this can lead to inaccessibility of the project for users.

    The status of RAID, that is, the disk subsystem, the status of the "health" of the drives themselves, may indicate an imminent failure and loss of information. Also an important metric is IOPS - the number of I / O operations per second. Pay attention to the AVIO disk - the average response time to disk requests. If, for example, AVIO is about 10 milliseconds, then it's time to think about whether to change a particular drive. It is highly recommended to monitor the number of used sockets on the server, processor temperature and fan speed.

    I especially draw your attention to the temperature of the processor - there are cases that when the processor overheats, the server simply cannot process all incoming requests normally.

    Server software


    Here it is necessary to clearly understand what is being monitored and why. That is, you need to be fully aware of what each parameter is for, which we put on monitoring, for which we set up notifications, for which it is responsible.

    Nginx . For this product, RPS is by far the most important feature. This abbreviation stands for requests per second, that is, the number of requests to your site per second.

    PHP-FPM . Here you also need to monitor RPS, the number of active processes, messages that the pool PHP-FPM has reached the limit on the number of processes.

    Sphinx. This is a full-text search engine on sites. Here you need to monitor the number of requests and Fatal Error. The fact is that in Sphinx versions lower than 3.x in some cases errors can be logged, a log is kept, but the system is working, accepting requests and not responding to them. In this case, we get a very bad situation: everything seems to work fine, requests for Sphinx go, but the site has an empty page with code 200.

    MySQL. It is necessary to monitor the number of queries to the database: select, update, insert. You need to track the number of rows processed. For example, you changed the logic of the project a little and you now use other queries, but you forgot to add indexes. The next parameter to track: column processing speed. Say, if you used to process 10 million per minute and everything was fine, now 120 million columns are processed all of a sudden, disks are running to the limit, everything slows down. So tracking this parameter can make it clear what exactly is the problem after the next deployment.

    It is also necessary to control buffer pool usage and replica status. The status of the replica is very important when a fault tolerant scheme is used. Thus, we monitor that the reserve is in an active state.

    Exim . Monitor the size of the mail server queue. If it grows very fast or does not decrease for a long time, this may indicate that, for example, some mail service has banned our server and considers all letters from it to be spam.

    Business logic


    It is necessary to monitor a group of parameters that relate to the business logic of the application. For some reason, many people forget about them, focusing on the server and software logs. In the project itself, you can monitor the components of business processes. For example, the number of new registrations of users who want to make purchases. In some cases, a decrease in the quantity may also be an indirect evidence of some breakdowns in the project.

    But if the registration process does not work correctly, then the user will not be able to place an order, place a purchase, pay, etc. Therefore, in some situations, when we are not sure that the registration process has not broken down, it makes sense to monitor the entire process from beginning to end. That is, from the moment the user enters his data until he normally logs in and sees his goods in the basket.

    Depending on the adopted business logic on sites that distribute applications, it is necessary to monitor, for example, the number of downloads. In advertising projects, it is recommended to track the number of banner impressions, their views, the ratio of views to clicks, etc.

    An indirect, but no less important parameter is the availability of the services of partner companies used in your project. For example, a Java script that is loaded from a third-party server. If the partner's servers are slowed down, then the Java script will load slowly. This will be especially bad if your site is built so that without this Java script the page loading does not go any further. It turns out that someone else’s uncle has problems, and your users suffer.

    Also, in some situations, statistics on requests with a particular referrer helps. This is very relevant for affiliates: we can understand where and who is driving traffic. If you or your client assume that somewhere in the code an error crept in and the statistics are considered incorrect, then you can correct it by the logs.

    How to monitor complex business logic chains? For example, check the registration form? You can do this:

    • Create a mail server.
    • We are writing a PHP script that goes to the client site and registers as a regular user.
    • After that, the script checks the contents of the box, parses the letter, takes a registration confirmation link from it, goes through it, receives a response, parses it and logs in with the data with which it is registered.
    • At the end of all procedures, the script sends 1 or 0 to the monitoring system, depending on the results of the check. Naturally, all script actions should be logged in detail.

    Problems with the monitoring system


    Monitoring can also break. Therefore, it is very important to monitor the performance of the monitoring system itself. It can be some separate script running on cron, or some other check. But the most important thing is that you need to check that monitoring is available, working, sends notifications, checks values, compares them with notifications you set up, etc. Without this check, at one point, you may find that it seems that the graphs are being drawn, and the service is on its third day.

    Monitoring tools


    Let's start with open source systems, Zabbix and Graphite.

    Zabbix


    The backend of the system is written in PHP; MySQL, PostgreSQL, SQLite or Oracle can be used as a database for storing metrics. On the client side, either a Zabbix agent, or SNMP, or some scripts that send data to Zabbix can be used. By “client” is meant the server that we are monitoring.

    How the monitoring process is built in Zabix:

    • Create a host in Zabbix.
    • Add a server that we will monitor.
    • Install the Zabbix agent on it.
    • We configure in Zabbix templates that need to be monitored, for example, Nginx, MySQL. If there are no templates, you can google them or write them.
    • We configure the agent to send the necessary parameters to Zabbix.
    • We set up a web script to check the URL on the project.

    Nginx Monitoring:

    • We find and import templates, since in Zabbix itself there is no built-in Nginx check.
    • On the client side, we write a bash script that will send data from the Nginx status in the desired format to Zabbix.
    • We write include in the Zabbix agent.

    After all these manipulations, Zabbix starts monitoring Nginx.

    Notifications in Zabbix are built on the principle of the template. Each of them has metrics for monitoring. For example, RPS, number of requests, etc.

    Each metric has a trigger. Say, when monitoring Nginx, we can make a trigger for RPS that will work if RPS = 0 for five minutes. When the trigger is triggered, a notification is sent to the mailbox or to any other notification system that we have chosen.

    In general, Zabbix is ​​quite difficult to configure. Zabbix-server scales as well as any other web-application, that is, we can parallelize databases, web-servers. If desired, you can automate the monitoring of new services. For example, take your favorite configuration management system and configure it so that a new node is automatically added to Zabbix after creation.

    Cacti


    It is an open source system. MySQL is used as a database backend for internal moments. Information about metrics is stored in rrdtool format, that is, all metric data, like Cacti itself, is written in PHP. The client side uses an SNMP daemon that sends data to the Cacti server.

    How the monitoring process is built in Cacti:

    • Install snmppd on the server.
    • We configure rocommunity in snmpd - the identifier of this server.
    • Add a Cacti host.


    As you can see, everything is extremely simple. Difficulties begin further. By default, monitoring is available only for the most basic resources - CPU, memory, traffic. All other parameters (Nginx, MySQL, mail, etc.) are monitored using third-party plugins that are installed additionally. Even for email notifications, a plugin is required. The same applies to monitoring URLs (accessibility of your site), setting up notifications when some values ​​are exceeded, and much more.

    The process of creating an alert looks like this:



    Choose a server, chart and metric. In this graph, for example, we have the metrics System, User, Nice, CPU load. A notification is sent when the CPU User load is exceeded.

    Cacti is not a flexible system. If Zabbix, for example, allows using plugins to send notifications through a variety of systems, including Jabber, Slack, etc., then in Cacti, notifications are sent only via e-mail through a separate plug-in.

    This is how the notification sent by e-mail looks like:



    To summarize. Monitoring in Cacti is carried out either through SNMPD, in which we prescribed sending information to the server through some of our scripts, or using some plug-ins that are installed on the Cacti server itself. Notifications are sent only by e-mail.

    With automation, Cacti is also quite bad: there are only cli-scripts from the command line that run on the server side and have poor flexibility. Cacti scales like a normal web application, that is, you can scale MySQL horizontally, you can install Nginx, followed by several web-nods, etc.

    Graphite


    This open source system is more interesting than Zabbix and Cacti. The server part of the system consists of three parts - Carbon, Whisper and the Graphite web application. The last component is a subsystem for rendering graphs and displaying them on the web.

    • Carbon is a daemon in which monitoring data from customers flows.
    • Whisper is a server-side database in which monitoring statistics are accumulated.
    • Graphite web application is a subsystem that displays a beautiful page with graphs in the browser.

    On the client side, we can install any software that will send Graphite data in the desired format. There are a lot of solutions, you can write your own if the ready ones do not suit you with something.

    Graphite monitoring is extremely simple:

    • We are not installing a client, for example, collectd.
    • We include the necessary plugins.
    • We indicate which parameters should be monitored.
    • We configure them accordingly. For example, we prescribe access to MySQL or to the status of Nginx, and configure the data setting in Graphite.


    
            Host "127.0.0.1"
            User "root"
            Password "sohW2ax0Eenox9t"
            Database "test1"
            MasterStats false
        

    You can monitor anything and anything. If only your application normally sent data to Graphite. It looks like a standard jungle application that draws data with Graphite.



    If you need more beautiful graphics, then you can use other applications.

    Notifications are configured quite flexible. For example, there are solutions that allow you to send notifications via mail, HipChat, Slack, or to your own services. Example notification settings:

    "alerts": [
        {
          "name": "Memory",
          "query": "aliasByNode(collectd.*.memory.memory-free, 1)",
          "interval": "1minute",
          "format": "bytes",
          "rules": ["warning: < 300MB", "critical: < 200MB"]
        },
    


    Everything is done in text format, after which we reboot the notification system, and it picks up new parameters. An example of a notification that came to the mail:



    Graphite is a fairly flexible system that allows you to monitor almost anything. It is necessary to configure mainly on the client side, that is, the part that sends monitoring data to the server. Notifications are customizable. Automation is simple, in text format.

    Graphite scales horizontally. The Whisper storage engine used in it has limitations: with a large number of servers (from several tens), we run into a limitation on the format itself, in which case the backend needs to be changed. But this does not happen often, so initially only Whisper can be used.

    If you have many working servers, and one monitoring server does not cope with all the metrics, then you can use the Graphite-relay solution for scaling.

    SaaS monitoring


    This approach is used when the monitoring server is on the side of the development company.

    As an example, consider the Density server. We register in the service and add our server there. We are given a link where we download the agent and install it on our server.



    We have our server added to the SaaS service and the Python daemon running on our server. We now only need to configure this Python daemon: for example, add data for the Nginx status and access data for MySQL. After rebooting, in the Server Density settings, you need to activate the graph display for MySQL and Nginx.



    When using SaaS monitoring, you can monitor many parameters. For example, the URL of our online store. In this case, the response code for the request, response time, etc. will be monitored. You can create the notifications we need for any parameter.

    Most SaaS monitoring services are structured as follows: there is a certain web interface through which we add a server, install an agent on our server, and that’s all. No more complicated steps are required to configure monitoring from us. Only in a clear web-based interface do we configure notifications according to the parameters we need.

    Which tool to choose?


    The main factors influencing your choice: the number of nodes that need to be monitored and the tasks that you plan to solve. For example, for a small startup with one server, Server Density is ideal. For an organization with a distributed network infrastructure and a large number of servers in different cities, Zabbix is ​​the best solution. For a small office where you need to monitor some basic server performance, Cacti is great.

    Also popular now: