
Kale is an open source tool for detecting and correlating anomalies
To monitor the IT infrastructure, we use many tools, including:
- Zabbix: many articles have been written about it here on Habr. We really like its low-level discovery capabilities, but its data visualization is weak.
- Graphite: a system that stores metrics and provides a convenient interface for displaying them. We currently import metrics from Zabbix into it and keep the history there.
- Shinken: a monitoring system based on Nagios and written in Python. We are currently evaluating it. We like that it is very easy to import data into it from the Netdot network documentation system (which I wrote about earlier), and that it integrates easily with Graphite.
One can discuss the advantages and disadvantages of various monitoring systems for a long time, but I want to focus on just one issue: anomaly detection. When the number of metrics in your monitoring system is in the hundreds, it is not hard to track abnormal behavior of one or a few of them. But when metrics number in the tens or hundreds of thousands, automatic anomaly detection becomes essential: no administrator, or group of administrators, can manually track the behavior of a complex system of hundreds of devices.
Etsy's engineers once ran into this problem and developed their own tool for detecting and correlating anomalies. It is called Kale and consists of two parts:
- Skyline, which detects anomalies;
- Oculus, which finds the metrics that correlate with them.
Let's consider each part in more detail.
Skyline
Skyline is a real-time anomaly detection system. It was created for passive monitoring of hundreds of thousands of metrics without the need to configure a threshold for each of them. In other words, you do not have to set trigger thresholds in your monitoring system manually: the system analyzes incoming data and, using several built-in algorithms, decides whether the data is anomalous.

Skyline consists of several components.
Horizon
Horizon is the component responsible for collecting data. It accepts data in two formats: pickle (TCP) and MessagePack (UDP).
Thus, you can configure the following chain: the monitoring system sends data in the format "metric_name value timestamp" to Graphite. On the Graphite side, carbon-relay forwards the data to Horizon in pickle format. Horizon then processes the received data, encodes it with MessagePack, and writes it to Redis. Alternatively, you can configure the monitoring system to send data to Horizon directly, pre-encoded in MessagePack; most modern programming languages have libraries for working with MessagePack.
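To make the first leg of that chain concrete, here is a sketch of how a batch of data points can be encoded for the carbon pickle protocol (the same wire format carbon-relay forwards to Horizon). The metric name, host, and port are illustrative placeholders, not values from Skyline's configuration.

```python
import pickle
import struct
import time

def encode_pickle_batch(metrics):
    """Encode [(name, (timestamp, value)), ...] data points in the
    carbon pickle protocol: a 4-byte big-endian length header
    followed by the pickled list."""
    payload = pickle.dumps(metrics, protocol=2)
    return struct.pack("!L", len(payload)) + payload

# One data point for a made-up metric name.
batch = [("servers.web01.cpu.user", (int(time.time()), 42.5))]
message = encode_pickle_batch(batch)

# Actually sending it is a matter of opening a TCP connection to the
# pickle listener (carbon-relay's, or Horizon's own; host and port
# here are placeholders) and writing the bytes:
# import socket
# with socket.create_connection(("horizon-host", 2024)) as sock:
#     sock.sendall(message)
```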
Horizon also regularly trims old data points and cleans up stale metrics; if this were not done, all free memory would soon be exhausted.
You can also specify a list of metrics to ignore in the settings.
Redis
Redis is a separate component, but Skyline will not work without it: this is where it stores all the metrics as encoded time series.
This solution has its pros and cons. On the one hand, Redis is fast because all data is kept in RAM. Data is stored as key-value pairs, where the key is the metric name and the value is the corresponding encoded time series. On the other hand, Redis handles very long strings poorly: the longer the string, the lower the performance. In practice, storing more than a day or two of data makes little sense. In most systems, data has hourly, daily, weekly, and monthly periodicity, but if you store several weeks or months of data, Redis performance becomes extremely low. To work around this, you can use alternative ways of storing time series, for example the redis-timeseries library or something similar.
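The storage pattern itself is easy to emulate. In this sketch a plain dict stands in for Redis, and fixed-size struct packing stands in for MessagePack to keep the example dependency-free; one ever-growing string per metric is exactly why long retention hurts Redis here.

```python
import struct

# Each point is a (timestamp, value) pair of big-endian doubles.
POINT = struct.Struct("!dd")

# Dict standing in for Redis: metric name -> one growing byte string.
store = {}

def append_point(name, timestamp, value):
    """Append one encoded point to the metric's value, as Horizon does."""
    store[name] = store.get(name, b"") + POINT.pack(timestamp, value)

def decode_series(name):
    """Decode the whole stored time series back into tuples."""
    raw = store.get(name, b"")
    return [POINT.unpack_from(raw, off)
            for off in range(0, len(raw), POINT.size)]
```

Every append makes the string longer, so without the trimming Horizon performs, both memory use and per-read decode time grow without bound.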
Analyzer
Analyzer is the component responsible for data analysis. It fetches the full list of metrics from Redis and spawns several processes, each of which is assigned its own subset of metrics. Each process analyzes its data with several algorithms. One by one, the algorithms examine a metric and report whether it is anomalous. If most of them report an anomaly, the metric is considered anomalous; if only one or a few algorithms "vote", the anomaly is not counted. The consensus threshold, i.e. the number of algorithms that must fire before a metric is classified as anomalous, can be set in the configuration; by default it is 6.
Currently, the following algorithms are implemented:
- mean absolute deviation;
- Grubbs' test;
- first hour average;
- standard deviation from average;
- standard deviation from moving average;
- least squares;
- histogram bins;
- Kolmogorov–Smirnov test.
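The voting scheme itself is simple. The sketch below uses two stand-in detectors, simplified from their Skyline counterparts (a three-sigma check and a mean-absolute-deviation check), and a correspondingly reduced consensus threshold:

```python
import statistics

def stddev_from_average(series):
    """Three-sigma rule: the latest point is anomalous if it deviates
    from the series mean by more than three standard deviations."""
    mean = statistics.mean(series)
    return abs(series[-1] - mean) > 3 * statistics.stdev(series)

def mean_absolute_deviation(series):
    """The latest point is anomalous if its deviation from the mean is
    more than three times the series' mean absolute deviation."""
    mean = statistics.mean(series)
    mad = statistics.mean([abs(x - mean) for x in series])
    return abs(series[-1] - mean) > 3 * mad

ALGORITHMS = [stddev_from_average, mean_absolute_deviation]
CONSENSUS = 2  # Skyline's default is 6, out of a larger algorithm set

def is_anomalous(series):
    """A metric is anomalous only if enough algorithms agree."""
    votes = sum(1 for algorithm in ALGORITHMS if algorithm(series))
    return votes >= CONSENSUS
```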
Most of them are based on control charts and the three-sigma rule. You can learn how some of the algorithms work from the talks whose videos are linked at the end of this article. I also recommend the blog of Anton Lebedevich, mabrek.github.io, who also contributes to Skyline's development.
The algorithms can be tuned, modified, or removed, and new ones can be added. They are all collected in a single file, algorithms.py, and use the SciPy and NumPy libraries in their calculations. There is a good article about the latter on Habr.
Besides anomalous, analysis can also assign a metric one of the following statuses:
- TooShort : the time series is too short to draw any conclusions;
- Incomplete : the length of the time series in seconds is less than the full period specified in the settings (usually 86400 s);
- Stale : the time series has not been updated for a long time (the time in seconds is set in the settings);
- Boring : the values of the time series have not changed for some time (set in the settings);
All anomalous metrics end up in a file, from whose data the picture in the web application is built.
Analyzer can also send notifications. Currently available targets are email, HipChat, and PagerDuty.
Flask webapp
Anomalous metrics are displayed by a small web application written in Python with the Flask framework. It is extremely simple: at the top it shows two graphs, for the past hour and the past day; below the graphs is a list of all anomalous metrics. Hovering over a metric changes the picture in the graphs; clicking opens the Oculus window, discussed next.
Oculus
Oculus searches for correlated anomalies and works in tandem with Skyline. When Skyline finds an anomaly and displays it in its web interface, we can click the name of the anomalous metric, and Oculus will show all the metrics that correlate with it.

Briefly, the search algorithm works as follows. First, the original series of values, for example [960, 350, 350, 432, 390, 76, 105, 715, 715], is normalized: the maximum is found and mapped to 25, the minimum to 0, and the rest of the data is distributed proportionally over the integers from 0 to 25. The result is a series of the form [25, 8, 8, 10, 9, 0, 1, 18, 18]. The normalized series is then encoded with five words: sdec (sharply down), dec (down), flat (no change), inc (up), sinc (sharply up). The result is a series of the form [sdec, flat, inc, dec, sdec, inc, sinc, flat].
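The article's example can be reproduced with a short sketch. The threshold separating "sharp" moves from ordinary ones is an assumption chosen for illustration; Oculus's actual cut-off may differ.

```python
def normalize(series, levels=25):
    """Proportionally map a series onto the integers 0..levels."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1  # avoid dividing by zero on a constant series
    return [round((v - lo) / span * levels) for v in series]

def encode(norm, sharp=5):
    """Encode consecutive deltas with the five shape words.
    `sharp` (the sdec/sinc cut-off) is a guessed value."""
    words = []
    for prev, cur in zip(norm, norm[1:]):
        delta = cur - prev
        if delta == 0:
            words.append("flat")
        elif delta > 0:
            words.append("sinc" if delta >= sharp else "inc")
        else:
            words.append("sdec" if -delta >= sharp else "dec")
    return words
```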
Then, using ElasticSearch, all metrics similar in shape to the original are searched for. Data in ElasticSearch is stored as:
{
fingerprint: dec sinc dec sdec dec sdec sinc
values: 13.18 12.72 14.8 14.43 12.95 12.13 6.87 9.67
id: mini.Shinken_server.shinken.CPU_Stats.cpu_all_sys
}
First, a search over the fingerprint is performed. This yields a sample whose number of metrics is an order of magnitude smaller than the total. Then the Fast Dynamic Time Warping (FastDTW) algorithm, which operates on the stored values, is applied for finer analysis. There is a good article about the FastDTW algorithm.
As a result, we obtain all the found metrics that correlate with the original.
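FastDTW is an approximation of plain dynamic time warping, which is easy to state exactly. The quadratic version below conveys the idea; it is a textbook sketch, not the approximation Oculus actually runs.

```python
def dtw_distance(a, b):
    """Exact dynamic time warping distance in O(len(a) * len(b)) time.
    FastDTW approximates this in roughly linear time."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # A point may be matched, repeated, or skipped:
            # take the cheapest warping path.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]
```

Two series of the same shape but stretched or shifted in time can still come out with distance 0, which is exactly why DTW suits correlating metrics that lag one another.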
Scripts written in Ruby are used to import data from Redis. They take all metrics with the "mini" prefix, decode, normalize, and export them to ElasticSearch. Updating and indexing reduce ElasticSearch's search speed, so to avoid long waits for results, two ElasticSearch servers in separate clusters are used, and Oculus regularly switches between them.
A web application built with the Sinatra framework is used to search for and display graphs. You can search by metric name, or simply by drawing a curve in a special field:

As a result, we will see a page that displays:
- information about search parameters;
- encoded representation of the original graph;
- a list of collections (see below) if the data correlate with previously stored data;
- a list of the graphs found, sorted by correlation score in ascending order (the lower the score, the stronger the correlation).

The data can be filtered, and also grouped into collections, given a description, and saved. Collections are used as follows: suppose we detect an anomaly and get a list of graphs confirming it. It is convenient to save these graphs together with a detailed description. When a similar problem occurs in the future, Oculus will find this collection, and the previously written description will help us understand the causes of the problem and how to fix it.
The figure below shows the Kale operation scheme:

I will not describe the installation and configuration process, since it is not complicated and is covered in the Skyline and Oculus documentation.
At the moment we have launched the system in test mode, with all components running on a single virtual machine. Even with this rather weak configuration (Intel Xeon E5440, 8 GB RAM), the system analyzes more than ten thousand metrics in real time without problems. The main operational difficulty is tuning the anomaly detection parameters. We have not yet seen errors of the second kind (false negatives), but we regularly encounter errors of the first kind (false positives) and keep adjusting the algorithms to our needs. The main problem is seasonality in the data. The Holt-Winters method takes seasonality into account, but it has a number of drawbacks:
- only one season is taken into account (real data can have more than one: hour, day, week, year);
- it needs data spanning several seasons (which makes it unsuitable for Skyline: the rows in the Redis database would become too large);
- when an anomaly does occur, it is carried into future seasons and decays only gradually.
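To make the method's shape concrete, here is a minimal additive Holt-Winters (triple exponential smoothing) sketch. It exhibits exactly the limitations listed above: a single season length and at least two full seasons of history. The smoothing coefficients are arbitrary illustrative values.

```python
def holt_winters_additive(series, season_len, alpha=0.5, beta=0.1, gamma=0.1):
    """One-step-ahead forecasts via additive Holt-Winters."""
    assert len(series) >= 2 * season_len, "needs two full seasons of data"
    # Initial level, trend, and seasonal components from the first two seasons.
    level = sum(series[:season_len]) / season_len
    second = sum(series[season_len:2 * season_len]) / season_len
    trend = (second - level) / season_len
    seasonal = [series[i] - level for i in range(season_len)]
    forecasts = []
    for i in range(season_len, len(series)):
        s = seasonal[i % season_len]
        forecasts.append(level + trend + s)
        # Update level, trend, and the seasonal component for this slot.
        last_level = level
        level = alpha * (series[i] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[i % season_len] = gamma * (series[i] - level) + (1 - gamma) * s
    return forecasts
```

On a perfectly periodic series the forecasts match the actual values exactly; an anomalous spike, however, is folded into the seasonal component and contaminates forecasts for the same slot in later seasons, which is the third drawback above.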
So the question of developing anomaly detection algorithms, in particular ones that account for seasonal data, remains open.
References
github.com/etsy/skyline
github.com/etsy/skyline/wiki
https://groups.google.com/forum/#!forum/skyline-dev
github.com/etsy/oculus
mabrek.github.io
Video
The video shows a talk by Abe Stanway, a developer of Kale.
A talk by Jon Cowie and Abe Stanway at the Velocity conference, in which they present their creation:
The second part of the talk.
A talk by Anton Lebedevich, "Statistics in practice: searching for anomalies in load testing and production".