AndreiYemelianov April 15, 2016 at 11:39

Introduction to Riemann: Event Monitoring and Analysis

Tutorial

In previous articles, we have repeatedly touched on monitoring and on collecting and storing metrics (see, for example, here and here). Today we would like to return to this topic and talk about an unusual but very interesting tool: Riemann.

Compared with other monitoring systems, it is more complex, but at the same time much more flexible and fault tolerant. On the Internet, we have seen publications where Riemann is described as "the most flexible monitoring system in the world." Riemann is well suited for collecting real-time information about complex high-load systems.

Strictly speaking, Riemann is not a monitoring system in the narrow sense. It would be more correct to call it an event processor.
It collects information about events from hosts and applications, combines the events into a stream and passes them to other applications for further processing or storage. Riemann also tracks the state of events, which allows you to create checks and send notifications.

Riemann is distributed free of charge under the Eclipse Public License. Most of the code was written by Kyle Kingsbury, also known under the pseudonym Aphyr (by the way, we recommend reading his blog: it often has interesting material).

Real-time event processing

The growing interest in monitoring and in collecting, storing and analyzing metrics that we have seen recently is easy to explain: computer systems are becoming more complex and more heavily loaded. For high-load systems, the ability to track events in real time is especially important. Riemann was created to solve exactly this problem.

The idea of processing events in near real time is not new: the first attempts to implement it were made back in the late 1980s. One example is the so-called Active Database Systems, which executed a specific set of instructions when the data coming into the database matched a given set of conditions.

In the 1990s, Data Stream Management Systems appeared, which could already process incoming data in real time, along with Complex Event Processing systems (abbreviated CEP). Such systems could either detect events based on external data and built-in internal logic, or perform certain analytical operations (for example, count the number of events over a certain period of time).

Examples of modern tools for complex event processing include Storm (see also the article about it in Russian) and Esper. They are focused on processing data without storing it. Riemann is a product of the same class. Unlike Storm, it is much simpler and more consistent: the entire event processing logic can be described in a single configuration file.
This feature may scare away many system administrators and practitioners: the configuration file is essentially Clojure code, the language in which Riemann itself is written.

Clojure is a functional (more precisely, Lisp-like) programming language, which in itself can be alarming. However, there is nothing to worry about: for all its originality, Clojure is not as complicated as it seems at first glance. Let us look at its features in more detail.

A bit about Clojure

Clojure is a functional language based on LISP. Programs written in Clojure run on the JVM platform. The first version of the language appeared in 2007. The latest version at the time of writing, 1.8.0, was released quite recently.

Clojure is used in projects of companies such as Facebook, Spotify, SoundCloud, Amazon and others (see the official website for a complete list).

Unlike other LISP implementations for the JVM (for example, ABCL or Kawa), Clojure is not fully compatible with either Common Lisp or Scheme, although it borrowed a lot from these languages. Clojure also has some features that are not found in other modern LISP dialects: data immutability, concurrent code execution, etc.

Since Clojure was originally designed to work with the JVM, it can use the many libraries that exist for this platform. Interaction with Java works in both directions. You can call code written in Java. It is also possible to implement classes that can be called from Java and other JVM-based programming languages, for example Scala. You can read more about Clojure and its capabilities in this article, as well as on the official Riemann website. We also recommend another brief but very informative introduction to Clojure.
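The handful of constructs that appear again and again in Riemann configuration files can be tried out in any Clojure REPL. The sketch below is only an illustration: the event map and the names total, describe and over-threshold? are made up for this example and are not part of Riemann.

```clojure
;; Prefix notation: the function comes first, then its arguments.
(def total (+ 1 2 3))

;; Keywords such as :host name fields; a map literal groups them.
(def event {:host "www" :service "load" :metric 0.42})

;; A keyword can be called like a function to look up its value in a map.
(def host (:host event))

;; let binds local names; fn creates an anonymous function.
(def describe
  (fn [e]
    (let [service (:service e)
          metric  (:metric e)]
      (str service "=" metric))))

;; #(...) is shorthand for fn; % stands for its single argument.
(def over-threshold? #(> (:metric %) 0.4))

(println total host (describe event) (over-threshold? event))
```

The let, fn and #(...) forms are exactly the ones used in the default riemann.config discussed later in the article.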

Installation and first start

To work with Riemann, we first need to install all the necessary dependencies: Java and Ruby (some additional components, which will be discussed below, are written in it):

$ sudo apt-get -y install default-jre ruby-dev build-essential

Next, download and install the latest version of Riemann:

$ wget https://aphyr.com/riemann/riemann-0.2.10_all.deb
$ sudo dpkg -i riemann-0.2.10_all.deb

Next, execute:

$ sudo service riemann start

To use Riemann fully, we will also need to install the components written in Ruby for collecting metrics:

$ gem install riemann-client riemann-tools

That's all. Everything is ready to start working with Riemann. Before moving on to the practical part, let us make a small theoretical digression and clarify the meaning of the most important concepts: events, streams, and the index.

Events, Streams, and Index

The basic concept in Riemann is an event. Events can be processed, counted, collected and exported to other programs. An event may look like this:

{:host riemann, :service riemann streams rate, :state ok, :description nil, :metric 0.0, :tags [riemann], :time 355740372471/250, :ttl 20}

The given event consists of the following fields:

:host - host name;
:service - the name of the observed service;
:state - event state (ok, warning, critical);
:tags - event tags;
:time - time of the event in Unix timestamp format;
:description - a free-form description of the event;
:metric - metric associated with the event;
:ttl - how long the event stays relevant (in seconds).

Some events may also have custom fields, which can be added either at creation time or during event processing (for example, fields with additional metrics).
All events are combined into streams. A stream is a function to which an event can be passed.
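To make "an event is a set of fields, a stream is a function" concrete, here is a minimal sketch in plain Clojure. It does not use Riemann itself: the :datacenter field, the fan-out helper and the seen atom are hypothetical names introduced only for illustration.

```clojure
;; Hypothetical event, shaped like the one shown above.
(def event {:host "riemann" :service "riemann streams rate"
            :state "ok" :metric 0.0 :tags ["riemann"] :ttl 20})

;; A custom field is just one more key in the map.
(def enriched (assoc event :datacenter "dc1"))

;; A stream is a function of one event. fan-out builds a parent stream
;; that hands each event to every one of its child streams.
(defn fan-out [& children]
  (fn [e]
    (doseq [child children]
      (child e))))

;; Record what each child receives, so the flow is easy to inspect.
(def seen (atom []))
(def log-host    (fn [e] (swap! seen conj [:host (:host e)])))
(def log-service (fn [e] (swap! seen conj [:service (:service e)])))

((fan-out log-host log-service) enriched)
(println @seen)
```

Riemann's own streams follow the same parent and child pattern: a stream decides whether and how to pass an event on to the streams nested inside it.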

You can create an unlimited number of streams. Events pass through streams but are not stored in them. However, very often there is a need to track the state of events, for example whether they are still relevant or not. For this, an index is used: a table of the states of tracked events. In the index, events are grouped by host and by service, for example:

:host www, :service apache connections, :state nil, :description nil, :metric 100.0, :tags [www], :time 466741572492, :ttl 20

This is an event that occurred on the host www in the apache connections service. For each host and service pair, the index always stores the most recent event. The index can be accessed from streams and even from external services.

We have already seen that each event contains a TTL (time to live) field. TTL is the time span during which an event is relevant. In the example just shown, the TTL of the event is 20 seconds. The index entry covers all events with :host www and :service apache connections. If no such events arrive within 20 seconds, a new event will be created with the value expired in the state field. It will then be added to the stream.
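The interplay between the index and the TTL can be sketched with a toy model in plain Clojure. This is not how Riemann implements its index; index-event, stale? and expire are hypothetical helpers, and the timestamps are small made-up numbers chosen to keep the arithmetic visible.

```clojure
;; Toy model only: the index is a map keyed by [host service].
(defn index-event [index event]
  (assoc index [(:host event) (:service event)] event))

;; An event is stale once more than :ttl seconds have passed since :time.
(defn stale? [event now]
  (> (- now (:time event)) (:ttl event)))

;; Drop stale entries and build the "expired" events to re-emit.
(defn expire [index now]
  (let [dead (filter #(stale? % now) (vals index))]
    {:index   (reduce (fn [idx e] (dissoc idx [(:host e) (:service e)]))
                      index dead)
     :expired (mapv (fn [e] {:host (:host e) :service (:service e)
                             :state "expired" :time now})
                    dead)}))

;; Two events for the same host and service: the newer one replaces the older.
(def idx
  (-> {}
      (index-event {:host "www" :service "apache connections"
                    :metric 90.0 :time 100 :ttl 20})
      (index-event {:host "www" :service "apache connections"
                    :metric 100.0 :time 110 :ttl 20})))

;; At time 125 the entry is still fresh; at time 131 it has expired.
(println (:expired (expire idx 125)))
(println (:expired (expire idx 131)))
```

The model shows the two properties described above: only the latest event per host and service is kept, and silence longer than the TTL produces a new event with the state expired.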

Configuration

Let's move from theory to practice and get down to configuring Riemann. Open the configuration file /etc/riemann/riemann.config. It is a Clojure program and looks like this by default:

; -*- mode: clojure; -*-
; vim: filetype=clojure
(logging/init {:file "/var/log/riemann/riemann.log"})
; Listen on the local interface over TCP (5555), UDP (5555), and websockets
; (5556)
(let [host "127.0.0.1"]
  (tcp-server {:host host})
  (udp-server {:host host})
  (ws-server  {:host host}))
; Expire old events from the index every 5 seconds.
(periodically-expire 5)
(let [index (index)]
  ; Inbound events will be passed to these streams:
  (streams
    (default :ttl 60
      ; Index all events immediately.
      index
      ; Log expired events.
      (expired
        (fn [event] (info "expired" event))))))

This file is divided into several sections. Each section begins with a comment, denoted, as is customary in Clojure, with a semicolon (;).

The first section indicates the file in which the logs will be written. Next comes the section with the interfaces. Riemann usually listens on the TCP, UDP, and web socket interfaces. By default, they are all bound to the local host (127.0.0.1).

The following section contains settings for events and index:

(periodically-expire 5)
(let [index (index)]
  ; Inbound events will be passed to these streams:
  (streams
    (default :ttl 60
      ; Index all events immediately.
      index

The first function, periodically-expire, removes all expired events from the index and assigns them the state expired. The cleanup runs every 5 seconds.

By default, Riemann copies the :service and :host fields to expired events. Other fields can be copied as well; to do this, use the :keep-keys option of the periodically-expire function. For example, we can tell Riemann to keep not only the host name and service name, but also the tags:

(periodically-expire 5 {:keep-keys [:host :service :tags]})

The following is a construction in which we define a symbol named index. The value of this symbol is the result of calling (index), i.e. a function that sends events to the index. It is used to tell Riemann when to index an event.

Using the streams function, we describe streams. Each stream is a function that takes an event as an argument. The streams function tells Riemann: "here is a list of functions that need to be called when new events arrive." Inside this function we set the TTL for events to 60 seconds. To do this, we use the default function, which takes a field name and a value and sets that value on events in which the field is missing. Events that do not have a TTL will receive the expired status.

Then the default configuration calls the symbol index. This means that all incoming events will be added to the index automatically.

The final section contains an instruction to log events with the status expired:

; Log expired events.
      (expired
        (fn [event] (info "expired" event))))))

We’ll make some changes to the configuration file. In the section on network interfaces, replace 127.0.0.1 with 0.0.0.0 so that Riemann can receive events from any host.

At the very end of the file, add:

;print events to the log
(streams
  prn
  #(info %))

Here prn prints every event to standard output, and #(info %) writes it to the log. After that, save the changes and restart Riemann.

If you have to monitor multiple servers, you can create not one common configuration file but a whole directory with separate files for each server or group of servers (see the recommendations in this article).
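One way to organize such a directory is sketched below. It relies on the include function of the Riemann configuration, which, to our understanding, accepts either a file or a directory; the conf.d path and the per-server file names are hypothetical and should be adapted to your layout.

```clojure
; /etc/riemann/riemann.config
; Assumption: the directory below exists and holds one file per server
; or server group (for example www.clj and db.clj, hypothetical names).
(logging/init {:file "/var/log/riemann/riemann.log"})

(let [host "0.0.0.0"]
  (tcp-server {:host host})
  (udp-server {:host host})
  (ws-server  {:host host}))

(periodically-expire 5)

; Load the stream definitions kept in separate files.
(include "/etc/riemann/conf.d")
```

Each included file can then contain its own (streams ...) block for one server or group, which keeps the main file short.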

Detailed instructions for writing the configuration file can be found here.

Sending data to Riemann

Now let's try to send data to Riemann. We will use the riemann-health client for this, which is part of the riemann-tools package that we already installed. Let's open one more tab of the terminal and execute:

$ riemann-health

This command transmits Riemann data on the state of the host (CPU load, amount of disk space used, amount of memory used).
Riemann will start accepting events. Information about these events will be written to the file /var/log/riemann/riemann.log. It is presented in the following form:

#riemann.codec.Event{:host "cs25706", :service "disk /", :state "ok", :description "8% used", :metric 0.08, :tags nil, :time 1456470139, :ttl 10.0}
INFO [2016-02-26 10:02:19,571] defaultEventExecutorGroup-2-1 - riemann.config - #riemann.codec.Event{:host cs25706, :service disk /, :state ok, :description 8% used, :metric 0.08, :tags nil, :time 1456470139, :ttl 10.0}
#riemann.codec.Event{:host "cs25706", :service "load", :state "ok", :description "1-minute load average/core is 0.02", :metric 0.02, :tags nil, :time 1456470139, :ttl 10.0}

riemann-health is just one of the utilities in the riemann-tools package. The package includes a fairly large number of utilities for collecting metrics: riemann-net (for monitoring network interfaces), riemann-diskstats (for monitoring the I/O subsystem), riemann-proc (for monitoring processes in Linux) and others. A complete list of utilities can be found here.

Create the first check

So, Riemann is installed and running. Now let's try to create the first check. Open the configuration file and add the following lines to it:

(let [index (index)]
  (streams
    (default :ttl 60
      index
      ;#(info %)
      (where (and (service "disk /") (> metric 0.10))
        #(info "Disk space on / is over 10%!" %)))))

The #(info %) function is preceded by a comment sign, a semicolon (;). This prevents Riemann from logging every event. Next, we describe the where stream. Events that match the given criteria fall into it. In our example, there are two such criteria:

the :service field must be set to disk /;
the value of the :metric field must be greater than 0.10, i.e. 10%.

Matching events are then passed to the child stream for further processing. In our case, information about such events will be written to the file /var/log/riemann/riemann.log.

Filtering: Quick Reference

It is impossible to use Riemann to the full without filtering events, so filtering deserves a few words of its own.

Let's start by filtering events using regular expressions. Consider the following example of a where stream description:

(where (service #"^nginx"))

In Clojure, a regular expression is written as a # sign followed by the pattern in double quotes. In our example, events whose :service field starts with nginx will get into the where stream.

Events in the where stream can be combined using logical operators:

(where (and (tagged "www") (state "ok")))

In this example, events with the www tag and the value ok in the state field will fall into the where stream. The tagged condition used here has a stream counterpart of the same name.
tagged is the short name of the tagged-all function, which passes on only the events that carry all of the specified tags. There is also a tagged-any function, which passes on events marked with at least one of the specified tags:

(tagged-any ["www" "app1"] #(info %))

In our example, events tagged with www or app1 will fall into the tagged-any stream.

You can also perform mathematical operations on event fields, for example:

(where (and (tagged "www") (>= (* metric 10) 5)))

This example selects events with the www tag for which the value of the :metric field, multiplied by 10, is greater than or equal to 5.
A similar syntax can be used to select events whose :metric value falls within a specified range:

(where (and (tagged "www") (< 5 metric 10)))

In the above example, events with the www tag for which the value of the :metric field is greater than 5 and less than 10 will fall into the where stream.
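The range condition works because Clojure's < accepts more than two arguments and checks that they are strictly increasing. The sketch below reproduces the same selection in plain Clojure, outside Riemann; the events vector and the helper names in-range? and matches? are made up for illustration.

```clojure
;; (< a b c) is true only when a < b and b < c, so both ends are excluded.
(defn in-range? [metric]
  (< 5 metric 10))

;; Hypothetical events; only :tags and :metric matter here.
(def events [{:tags ["www"] :metric 5}
             {:tags ["www"] :metric 7.5}
             {:tags ["db"]  :metric 7.5}
             {:tags ["www"] :metric 10}])

;; Plain-Clojure counterpart of (where (and (tagged "www") (< 5 metric 10))).
(defn matches? [e]
  (and (boolean (some #{"www"} (:tags e)))
       (in-range? (:metric e))))

(def selected (filter matches? events))
(println selected)
```

Only the second event is selected: the first and the last sit exactly on the boundaries, and the third does not carry the www tag.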

Set up notifications

Riemann can send notifications when the conditions of a check are met. Let's start by setting up email notifications. Riemann uses the email function for this:

(def email (mailer {:from "riemann@example.com"}))
(let [index (index)]
; Inbound events will be passed to these streams:
(streams
  (default :ttl 60
    ; Index all events immediately.
    index
    (changed-state {:init "ok"}
      (email "andrei@example.com")))))

Riemann sends notifications using Postal, a dedicated Clojure library. By default, the local mail server is used for delivery.
All messages will be sent from an address like riemann@example.com.

If the local mail server is not installed, Riemann will display error messages of the form:

riemann.email$mailer$make_stream threw java.lang.NullPointerException

In the code example above, we used the changed-state function and thus indicated that Riemann should track events whose state has changed. The :init option tells Riemann what the initial state of an event is assumed to be. All events whose state has changed from ok to any other will be passed to the email function. Information about such events will be sent to the specified email address.
For more detailed examples of setting up notifications, see the materials by James Turnbull, one of the developers of Riemann.
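A service that flaps between states can generate a lot of mail. A common way to limit this is to put Riemann's rollup stream in front of email. The fragment below is a sketch based on the configuration above; the limits of 5 messages per 3600 seconds are arbitrary values chosen for illustration.

```clojure
(def email (mailer {:from "riemann@example.com"}))

(let [index (index)]
  (streams
    (default :ttl 60
      index
      (changed-state {:init "ok"}
        ; Let at most 5 notifications per hour through immediately;
        ; further events are collected and sent together in one message
        ; when the hour is over.
        (rollup 5 3600
          (email "andrei@example.com"))))))
```

With this arrangement the first state changes still arrive right away, while a burst of later changes is summarized instead of flooding the mailbox.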

Metric Visualization: riemann-dash

Riemann has its own tool for visualizing metrics and building simple dashboards - riemann-dash. You can install it like this:

$ git clone git://github.com/aphyr/riemann-dash.git
$ cd riemann-dash
$ bundle

Riemann-dash starts with the command:

$ riemann-dash

The riemann-dash homepage is available in the browser at [server ip-address]:4567. Hover the cursor over the black Riemann caption in the very center, hold down the Ctrl key (Cmd on Mac) and click on it. The caption will turn gray. After that, press the E key to start editing.

In the title drop-down menu, select the Grid item, and in the query field write true.

After making the necessary settings, click the Apply button.

The dashboard is not very pretty or convenient, but it is quite clear. The inconvenience, however, is offset by the fact that Riemann can be used with third-party visualization tools, in particular Graphite and Grafana; the interested reader can easily find relevant publications on the Internet. The procedure for setting up the Riemann + InfluxDB + Grafana bundle is described in the next section.

Sending data to InfluxDB

The undoubted advantage of Riemann is its wide integration capabilities. Metrics collected using it can be sent to third-party stores. Below we show how to integrate Riemann with InfluxDB and customize data visualization using Grafana.

Install InfluxDB:

$ wget https://s3.amazonaws.com/influxdb/influxdb_0.9.6.1_amd64.deb
$ sudo dpkg -i influxdb_0.9.6.1_amd64.deb

You can read more about configuring InfluxDB in the official documentation, as well as in one of our previous articles.

After the installation is complete, run the command:

$ sudo /etc/init.d/influxdb start

Then create a database for storing data from Riemann:

$ sudo influx
>CREATE DATABASE riemann

Let's create a user for this database and set a password for it:

>CREATE USER riemann WITH PASSWORD 'riemann user password'
>GRANT ALL ON riemann TO riemann

That's it, the installation and basic configuration of InfluxDB is complete. Now you need to add the necessary settings to the Riemann configuration file (the code is taken from here and slightly modified):

; -*- mode: clojure; -*-
; vim: filetype=clojure
; load capacitor, a client for working with InfluxDB
(require 'capacitor.core)
(require 'capacitor.async)
(require 'clojure.core.async)
(defn make-async-influxdb-client [opts]
    (let [client (capacitor.core/make-client opts)
          events-in (capacitor.async/make-chan)
          resp-out (capacitor.async/make-chan)]
        (capacitor.async/run! events-in resp-out client 100 10000)
        (fn [series payload]
            (let [p (merge payload {
                    :series series
                    :time   (* 1000 (:time payload)) ;; s → ms
                })]
                (clojure.core.async/put! events-in p)))))
(def influx (make-async-influxdb-client {
        :host     "localhost"
        :port     8086
        :username "riemann"
        :password "riemann user password"
        :db       "riemann"
    }))
(logging/init {:file "/var/log/riemann/riemann.log"})
(let [host "0.0.0.0"]
  (tcp-server {:host host})
  (udp-server {:host host})
  (ws-server  {:host host}))
(periodically-expire 60)
(let [index (index)]
  (streams
        index
        (fn [event]
            (let [series (format "%s.%s" (:host event) (:service event))]
                (influx series {
                    :time  (:time event)
                    :value (:metric event)
                })))))

Save the changes and restart Riemann.

After that install Grafana:

$ wget https://grafanarel.s3.amazonaws.com/builds/grafana_2.6.0_amd64.deb
$ sudo dpkg -i grafana_2.6.0_amd64.deb

We won’t give detailed instructions on how to set up Grafana, and there’s no special need for it: the corresponding publications can be easily found on the Internet.

Grafana's homepage will be available in a browser at http://[Server IP]:3000. Next, you just need to add a new data source (InfluxDB) and create a dashboard.

Conclusion

In this article, we have provided a brief overview of the capabilities of Riemann. We covered the following topics:

Clojure language features;
installation and initial setup of Riemann;
structure of the configuration file and features of its syntax;
creating checks;
notification settings;
visualization of metrics with riemann-dash;
Riemann integration with InfluxDB and visualization of metrics using Grafana.

If you think that we have missed any important details - write to us and we will supplement the review. And if you use Riemann in practice, we invite you to share your experience in the comments.

If for one reason or another you cannot leave comments here, you are welcome to do so on our corporate blog.
