Functional monitoring at Yandex

    Do you monitor your services in production? Whose area of responsibility is that?

    Often, when monitoring comes up, what comes to mind are backend developers, system administrators, and DBAs watching data-processing queues, free disk space, the health of individual hosts, and the load.
    Such monitoring really does tell you a lot about a service, but it does not always show how the service works for a real user. So, in addition to system monitoring, at Yandex we built a system of functional monitoring that checks the state of a service through its end interfaces: how the application looks and behaves in the browser, and how it works at the API level.
    What do we mean by functional monitoring? To answer that, let's look at how it evolved.

    It all started, of course, with autotests for regression testing. These same tests were also run after a release, in production, to verify that the service worked under real-world conditions. The fact that regression tests run in production sometimes found bugs made us think.

    What is this and why


    Why do functional tests that were written for regression testing, and that pass in the test environment, find problems in production?
    We have identified several reasons:

    • Differences between the test environment configuration and production.
    • Problems with internal or external data providers.
    • Hardware problems affecting functionality.
    • Problems that appear over time and/or under a specific load.


    Since such tests can find problems, we decided to try running them regularly in production and monitoring the state of our services.
    Let's take a closer look at the problems and tasks that functional monitoring can help solve.

    Data providers


    A good example of a page that depends on data providers is the Yandex homepage.
    Weather and News, the Afisha events listing and the TV guide, even the photo of the day with a search digest — all of this is data from external and internal suppliers.
    For example, in Arkhangelsk the Afisha block once looked like this:
    [image: broken Afisha block in Arkhangelsk]
    while in Murmansk everything was in order:
    [image: normal Afisha block in Murmansk]

    This happened because the supplier did not send data for Arkhangelsk (or the import on our side did not update). Sometimes such a problem is a one-off; in other cases a KPI can be formulated as the percentage of available data and its freshness.
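Such a data-availability-and-freshness KPI can be sketched as a simple computation over the last-update timestamps of each region's supplier data. This is a minimal illustration, not our actual implementation; the function name, data shape, and example timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def freshness_kpi(updates, now, max_age):
    """Share of regions whose supplier data is present and fresh.

    `updates` maps a region name to its last-update timestamp,
    or None if the supplier sent no data at all.
    """
    fresh = sum(
        1 for ts in updates.values()
        if ts is not None and now - ts <= max_age
    )
    return fresh / len(updates)

# Hypothetical data: Arkhangelsk got nothing, Murmansk is up to date.
updates = {
    "Arkhangelsk": None,
    "Murmansk": datetime(2014, 5, 1, 12, 0),
}
now = datetime(2014, 5, 1, 12, 30)
print(freshness_kpi(updates, now, timedelta(hours=1)))  # 0.5
```

A monitoring check can then alert when this share drops below an agreed threshold.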

    Hardware problems


    Fault tolerance and performance play an important role in how our services run, so teams build services with distributed architectures and load-balancing mechanisms. The failure of individual pieces of hardware usually does not affect users, but large-scale problems with data centers, or with the routing between them, are visible in the end interfaces.
    Functional monitoring complements system monitoring here, making it possible to trace the connection between hardware problems and functionality.
    For example, in Yandex.Direct there was a case when a slowly "dying" server caused gradual degradation of the service and its unavailability from some regions. Functional monitoring served as the trigger for an emergency investigation and helped identify the root of the problem.
    Another interesting example is the drills held at our company. During a drill, one of the data centers is deliberately taken offline to make sure this does not affect the availability of the services and to catch potential problems in time. Taking one data center down does not harm the services, and functional monitoring helps keep an eye on the situation during such outages.

    Service degradation over time


    The actual operating conditions of an application in production sometimes create unforeseen situations. Problems may be caused by an unexpected combination of the volume, duration, and type of load, or, for example, by an accumulation of system errors that went undetected during testing. They can also be caused by infrastructure misconfiguration that slows the system down or brings it to failure.
    If such problems cannot be caught at the testing stage, they need to be recognized quickly when they occur in production. Here system and functional monitoring, complementing each other, can find problems and report them.

    So, functional monitoring is a functional autotest, tailored to look for specific problems and run continuously in production.

    What's inside


    There is a second component to functional monitoring: how the stream of results is processed.
    The large stream of results coming from tests constantly running in production needs to be organized and filtered. Problems must be reported promptly, while false positives are kept to a minimum. There is also the problem of integrating information from functional monitoring into the overall system for assessing service health.

    To prevent false positives, our system, built on the Apache Camel framework, can aggregate several consecutive results from one test into a single event. For example, you can configure 3-out-of-5 filtering, which raises an alert only if the test fails 3 times in 5 consecutive runs (you can also set, say, 2-out-of-2, or disable filtering with 1-out-of-1). How often the test runs also matters, so that the delay introduced by such filtering stays acceptable.

    The consumers of this monitoring differ from project to project: on some, the results stay with the testers; on others, managers want to know about particular checks; elsewhere, the results are integrated into a common system.

    Recipe


    The idea of functional monitoring is very simple, and such monitoring can be very effective.
    Recipe:
    1. Analyze which parts of your service can break in production and why.
    2. Write autotests for that functionality (or pick suitable ones from your existing suite).
    3. Run these tests in production as often as you need and as your test-running systems allow.
    4. Process the results and send notifications about breakages; compare them with other sources of information about the life of the service.

    P.S. For a long time we have wondered how widespread the idea of functional monitoring is and how it is applied at other companies. Some talk about this approach as a matter of course; some, on learning about it, decide to adopt it; and some consider such monitoring unnecessary given system monitoring.
    And how do you track the state of your services in production? What tools, and combinations of tools, do you use?
