We collect and analyze logs using Lumberjack + Logstash + Elasticsearch + RabbitMQ

    Good day.

    Logs are often, and quite undeservedly, deprived of developers' attention. And when programmers do need to dig through log files, sometimes from several dozen servers at once, the task falls on system administrators and consumes a lot of their time and effort.

    So when something breaks and you need to find an error in the logs of some service, everything depends on how capable the admin is and on how well the programmers know what to look for. And also on whether logrotate has already run and wiped out the old data...

    In such situations Logstash helps a lot. It has been under active development over the past year, has already made quite a bit of noise, and Habr already has enough articles (for example, here and here) on how to install it and get it working on a single server. In this article I will cover Logstash as a processing server, because as an agent it did not suit us in some respects.

    Why? Maxifier is a SaaS product with customers in the USA, Brazil, several European countries and Japan, so we have about a hundred servers scattered around the world. For day-to-day operations we need convenient access to our applications' logs and a fast way to search them for errors when problems occur with third-party services/APIs, incorrect data appears, and so on. Incidentally, The Guardian (one of our customers) uses a similar log collection system.

    After collecting logs from many servers via rsync a few times, we started thinking about a less slow and laborious alternative, and recently we built our own log collection system for our various applications. I will share my experience and describe how it works.


    Our target applications for collecting logs:
    1. Java daemons (Jetty / native)
    2. Node.js daemons
    3. Dynamically launched Java applications (for example, cluster nodes)

    What matters most to programmers are errors and exceptions, which arrive with multi-line stack traces.
    Therefore, the main requirements were:

    1. Good support for multi-line messages.
    2. Ability to handle different message formats.
    3. Easy configuration of data sources.

    Without going deep into configuration details, I will describe the components and why they are part of the setup.
    In general, our log collection scheme looks like this:

    image

    As the diagram shows, we do not run Logstash on the application servers to watch the application files. We decided it was inefficient to keep a daemon that consumes 128 MB of memory and a significant share of CPU just for log shipping, and forcibly capping its resources would look like a crutch. So we went looking for other, less demanding solutions to the problem of transporting logs.

    Agents


    Rsyslog is a ready-made logging daemon that can seemingly do everything, but the devil is always in the details.

    Advantages:
    1. Fast.
    2. Able to track files.
    3. Can encrypt traffic or send it in the clear.
    4. Already present on the system.

    Disadvantages:
    1. Cannot forward multi-line messages (all our attempts to receive a multi-line message on the server side were in vain).
    2. Cannot track files by mask.
    3. Handles file truncation poorly.

    As already mentioned, multi-line message support was important to us, so we kept looking, but in the meantime we hooked up log collection via rsyslog so that we could at least search those logs later.

    Lumberjack (now known as logstash-forwarder) is an experimental daemon for shipping logs.

    Advantages:
    1. Fast.
    2. Minimalistic (Unix-way).
    3. Supports file masks.
    4. Consumes a minimum of resources (written in Go).

    Disadvantages:
    1. Experimental.
    2. Does not handle multi-line messages on its own.

    Logstash (only as an agent)

    Advantages:
    1. The same software for collecting logs everywhere.
    2. Reasonably fast.
    3. Does everything we need.

    Disadvantages:
    1. Requires Java on every server.
    2. High memory requirements, which can potentially affect other services running on the host.
    3. Difficult to configure.

    Agent Selection


    We decided to test Lumberjack and, if it did not work out, fall back to running Logstash on every server.
    Since Logstash and Lumberjack are developed by the same people, we got some nice perks as a bonus: for example, custom fields, which let us determine the type of an application's logs without parsing file names.

    We tested the daemon, and the only bug we found was that when disk space runs out it cannot write its state file, so it either resends messages from its previous position or fails to notice that a file has been truncated. We normally do not let that happen, and everything returns to normal once free space appears.

    Overall, we liked Lumberjack as an agent. Its configuration is extremely simple, it can match files by mask and it tracks a large number of files perfectly well. There is also a good Puppet module, which let us configure our servers without any problems (a sample agent config is sketched below). In parallel with choosing the agent we were testing Logstash itself, at that time version 1.1.12.
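
    For reference, here is a minimal Lumberjack (logstash-forwarder) configuration sketch showing file masks and the custom "type" field mentioned above; the server address and file paths are hypothetical, while the port and certificate follow the listener config in the Update section:

    {
      "network": {
        "servers": [ "logs.example.com:9999" ],
        "ssl ca": "/etc/ssl/changeme.crt",
        "timeout": 15
      },
      "files": [
        {
          "paths": [ "/opt/apps/*/logs/*/*.log" ],
          "fields": { "type": "custom" }
        },
        {
          "paths": [ "/var/log/nodejs/*.log" ],
          "fields": { "type": "nodejs" }
        }
      ]
    }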

    Logstash

    First of all, we tried to assemble multi-line messages, first from rsyslog and then from Lumberjack.
    Initially we tried the big regular expressions that Google turns up in GitHub gists, but in the end we settled on the simplest option: anything that does not start with our timestamp is treated as part of the previous message. Conveniently, Lumberjack passes messages through unmodified.

    The result is this simple pattern:

    filter {
      multiline {
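          # anything that does not start with a timestamp is glued onto the previous event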
          pattern => "(^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
          negate => true
          what => "previous"
       }
    }
    

    We set up a single Logstash instance to process the incoming logs, gave it several worker threads and started connecting servers. Everything seemed fine, but we began to notice that from time to time the server dropped connections, ran short of memory and periodically crashed. Even 16 GB of Xmx did not help.

    So we decided to split the load between two types of instances, and on the whole that did not let us down.

    The first type of instance (1) only listens for TCP connections, saves the raw messages to files (convenient for debugging filters), glues multi-line messages together and sends them to the bus. It copes with its task completely, loading two cores to 100% at peak times (our peak is 2500-3000 messages per second).

    The second type of instance (2) takes messages from the bus, splits them into fields, writes them to Elasticsearch and, just in case, to files as well. As it turned out, the second type was the most problematic: either it would run out of memory or Elasticsearch would start choking. But thanks to the bus no data was lost, and the first-type instances (1) kept collecting logs.

    Keeping the first-type instance lightweight means the monitoring server as a whole consumes a minimum of resources while still being able to withstand a fairly large number of messages per unit of time on its own. The filtering instances (the second type) can be limited in resources, and their performance is not really a concern.

    This setup worked for us for about a month, until we ran into very strange behavior. The Lumberjack agents on the servers started feverishly reconnecting to Logstash and immediately dropping off with a message that they could not reach the listening Logstash, reporting i/o timeouts, while the listening Logstash seemed alive and even kept sending messages to the bus.
    Zabbix reported that the process's PermGen was filling up; we increased the memory and restarted, and literally ten minutes later the situation repeated itself.

    We started figuring out which server was generating the messages that drove Logstash into a stupor. After a short search (incidentally, the heaviest sender is easy to find, since Lumberjack reports it to syslog), we found a ~20 GB log on one of the servers that contained several lines of 2-3 GB each. It is not hard to guess what was happening: Logstash slowly assembled such a line in memory until it ran out, and then tried to cope with the result. For such cases we added an extra separator to the filter to break up such messages, and asked the developers not to do that anymore.
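
    As an extra safeguard against such runaway messages, one could also drop oversized events at the filter stage. This is only an illustrative sketch, not the exact change we made; it assumes the ruby filter is available in your Logstash build and uses an arbitrary 1 MB threshold:

    filter {
      ruby {
        # drop any event whose assembled message exceeds ~1 MB; events that big
        # are almost certainly runaway output rather than useful log records
        code => "event.cancel if event['message'] && event['message'].bytesize > 1048576"
      }
    }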

    There were also problems with overly detailed logs, for example from the Node.js logger bunyan: its output had up to 30-40 (!) fields, which naturally looked awful in the interface. Here we reasoned simply (see the nodejs branch of the filter config in the Update below):
    1) removed the unnecessary fields;
    2) converted message priorities into a human-readable form.

    Elasticsearch

    Elasticsearch has not caused big problems during our period of use. An Elasticsearch instance with 3 GB of memory comfortably handles the indexing and 5-6 GB of archived data per day. The only thing that occasionally happens is an OutOfMemory error when someone tries to stuff in a huge message or runs an overly complex query. But JMX monitoring is there for a reason, and Zabbix duly notifies us when GC runs start taking suspiciously long and it is time to take a look at the application.
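
    If the 3 GB above is read as the JVM heap size, here is a minimal sketch of capping it, assuming a Debian-style install where the init script reads the standard ES_HEAP_SIZE variable (the file path is distribution-specific):

    # /etc/default/elasticsearch
    ES_HEAP_SIZE=3g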

    Interface


    The Kibana web interface in its stock form turned out to be, to put it mildly, difficult even for programmers, let alone for people far from IT (our analysts also look at the logs sometimes). We took the original dashboard as a basis, customized it to our needs and made it the default dashboard in Kibana. Now it looks like this:

    image

    In especially difficult cases you have to additionally fine-tune the filters and the multi-search, but that happens rarely and does not cause big problems.

    Brief summary


    1. Lumberjack is great in the role of a log collection agent.
    2. Logstash copes fine with stack traces and with completely different log formats, and flexibility is not a problem, but you do need to adapt it to your requirements.
    3. Kibana, again, has to be adapted to its consumers. Sometimes even a simple renaming of fields makes the work easier.

    Of course, there are cases this scheme does not cover, for example when you just need to look at a specific run of an application and see what happened; the web interface is probably not much help there, though that is a matter of taste.

    Result


    The log collection system described here has taken part of the load off our admins and significantly reduced the time and nerves they spend on logs. It is also much more convenient for programmers: now, when an error occurs, we simply give the developers links to the logs in the web interface instead of an unwieldy pile of log files. The system currently processes more than 1 million messages per day from more than 300 application instances without creating a noticeable load. That is not the limit of our needs; the figure will most likely at least double, and we are confident the system will cope with that load.

    Update


    Logstash listener
    input {
     lumberjack {
        charset => "CP1252"
        port => 9999
        ssl_certificate => "/etc/ssl/changeme.crt"
        ssl_key => "/etc/ssl/changeme.key"
    	}
    }
    filter {
      multiline {
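          # anything that does not start with a timestamp, an "N)" separator or a JSON "{" is glued onto the previous event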
          pattern => "(^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})|(^\d+\))|(^{\")"
          negate => true
          what => "previous"
       }
    }
    output {
    file {
      path => "/logs/raw/%{host}/%{+YYYY-MM-dd}.log"
    }
    rabbitmq {
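      # durable exchange and persistent delivery, so queued messages survive a broker restart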
      durable => true
      exchange => "logstash-exchange"
      exchange_type => "direct"
      host => "127.0.0.1"
      key => "logstash-routing-key"
      password => "changeme"
      persistent => true
      user => "logstash"
      vhost => "logstash"
     }
    }
    


    Logstash filter
    input {
     rabbitmq {
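      # consume from the same exchange the listener instance publishes to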
      auto_delete => false
      charset => "CP1252"
      durable => true
      exchange => "logstash-exchange"
      exclusive => false
      host => "127.0.0.1"
      key => "logstash-routing-key"
      password => "changeme"
      prefetch_count => 4
      queue => "logstash-indexer"
      type => "logstash-indexer-input"
      user => "logstash"
      vhost => "logstash"
     }
    }
    filter {
    #ugly hack
    date { match => ["@timestamp", "ISO8601"] }  
    mutate {
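      # strip the domain suffix so only the short host name remains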
      gsub => ['host', '.example.com', '']
    }
    # two types of applications: custom (Java) and nodejs
    if [type] == "custom" {
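    # application (Java) logs: extract date/time, level, message and the host/thread/class info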
    grok { match => ["message", "(?m)(?<date>\d{4}-\d{2}-\d{2}) (?<time>\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?) \[%{LOGLEVEL:level}\s?\] - (?<mxf_raw_message>.*)"]  }
    grok { match => ["message", "(?m)(?<mxf_datetime>\d{4}-\d{2}-\d{2} \d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?) "]  }
    mutate { replace => ["mxf_datetime", "%{date}T%{time}Z" ]   }
    date {
       match => ["mxf_datetime", "ISO8601"]
       target => "mxf_timestamp" 
    } 
    mutate { remove_field => ["mxf_datetime"] }
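    # messages containing '|' also carry an id; tag them so the grok below can extract it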
    grep {
        match => [ "message", "\|" ]     
        add_tag => "have_id" 
        drop => false
     }
    if"have_id" in [tags] { 
    grok {   match => ["message", "\[@%{HOSTNAME:mxf_host} %{GREEDYDATA:mxf_thread} (?<mxf_class>[a-z]{1}[.]{1}[a-z]{1}[.]{1}[a-zA-Z]{1}.*):%{NUMBER:mxf_classline}\|(?<mxf_id>.*):%{NUMBER:mxf_linenum}\]$"]  }
     #ugly hack for correct mapping
    mutate {     convert => [ "mxf_linenum", "integer" ]   } 
    } else {
     grok {   match => ["message", "\[@%{HOSTNAME:mxf_host} %{GREEDYDATA:mxf_thread} (?<mxf_class>[a-z]{1}[.]{1}[a-z]{1}[.]{1}[a-zA-Z]{1}.*):%{NUMBER:mxf_classline}\]$"]  }
    }
     grok {   match => ["message", "(?m).*\] - (?<mxf_message>.*)\[@.*"]  }
     grok {   match => ["file", "\/opt\/apps\/%{GREEDYDATA:mxf_version}\/logs\/%{GREEDYDATA:mxf_account}\/%{WORD:mxf_logfilename}.log$"]  }
     mutate { gsub  => ["mxf_aiid","\|",""] } 
    }
    if [type] == "nodejs" {
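    # bunyan JSON logs: parse the JSON, keep the useful fields under mxf_* names and drop the rest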
    json { source => "message"  }
    mutate { rename => ["msg","mxf_message"] }
    mutate { rename => ["args.0.code","mxf_response_code"] }
    mutate { rename => ["data.service","mxf_service"] }
    mutate { rename => ["data.method","mxf_method"] }
    mutate { rename => ["t","mxf_request_time"] }
    mutate { rename => ["pid","mxf_pid"] }
    mutate { rename => ["err.name","mxf_errname"] }
    mutate { rename => ["err.stack","mxf_stacktrace"]}
    mutate { rename => ["err.message", "mxf_errmsg"] }
    mutate { remove_field => ["0", "1", "2", "args", "http", "hostname", "msg", "v", "data", "err","t", "pid","io"] }
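    # map bunyan numeric levels to human-readable names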
    if [level] == 60 { mutate { replace => ["level","FATAL"] } } 
    if [level] == 50 { mutate { replace => ["level","ERROR"] } } 
    if [level] == 40 { mutate { replace => ["level","WARN"] } } 
    if [level] == 30 { mutate { replace => ["level","INFO"] } } 
    if [level] == 20 { mutate { replace => ["level","DEBUG"] } } 
    if [level] == 10 { mutate { replace => ["level","TRACE"] } } 
    }
    }
    output {
     elasticsearch_http {
      host => "127.0.0.1"
     }
    }
    


    Kibana config
