amarao February 25, 2015 at 13:24

Log processing based on previous messages in logstash / elasticsearch

I recently wrote about catching the nuclear MCE (machine check error) and other nasty things using netconsole . Extremely useful thing. One problem: throttling on the CPU due to local overheating (continuous load) is fixed as MCE. A backup happens - and the administrators receive a terrible message about MCE, which in practice means "a little overheated" and definitely does not require attention to itself at 3 a.m.

The ridicule of the problem is that Linux fixes the MCE after throttling has ended. That is, the mode is 'normal', but instead it turns into MCE. It looks like this:

CPU0: Core temperature above threshold, cpu clock throttled (total events = 40997)
CPU4: Core temperature above threshold, cpu clock throttled (total events = 40997)
CPU4: Core temperature / speed normal
CPU0: Core temperature / speed normal
mce: [Hardware Error]: Machine check events logged

In this case, we definitely want to respond to normal MCE. What to do?

As part of logstash, message processing is assumed to be stateless. You see the message - you react. To introduce for the sake of one type of message a more complex system - overkill.

It would seem that there is a filter (not to be confused with output) elasticsearch, which allows you to make queries. Unfortunately, he does not know how to do ifs, that is, remove_tag and add_tag will work regardless of whether the search succeeded or not.

Sad

Second problem: what should the request look like in elasticsearch? We need a time interval relative to the current one, and we need to filter by the current host (that is, the host to which MCE came).

Let's start deciding in turn.

First: request. query cannot contain a normal search query, but must contain a query string that is inside lucene query . Let's start with a fixed request:

“type: netconsole AND host: compute109 AND message: temperature”. (type is set in input when receiving a message from the kernel via netconsole).

Hooray, there are results, but for a large interval, which is fraught with false positives. But what about @ timestamp? The query language help is quite modest, and no math or special variables are provided there. But ranges are supported with the help of a record like "[from TO to]".

Then I had an attack of intuition and I wrote in the request@ timestamp:[now-2h TO now], although there is not a word about 'now' in the documentation. And they understood me. Checking such queries is better in kibana in the "query" field. A quick check showed that now-1m also rolls as time.

So, the request has become more enjoyable: type:netconsole AND host:compute109 AND message:temperature and @timestamp:[now-1m to now]...

It would seem that everything will be fine. But it didn’t work. Although it should have been. And then I send WTF rays to the creators of lucene, logstash and elasticsearch. Because and and AND are six different letters.

Correctly (and only so!) type:netconsole AND host:compute109 AND message:temperature AND @timestamp:[now-1m to now]. Otherwise, the request begins to search for messages (message) containing any words temperature, and, @ timestamp).

The question remains: where is the host coming from? Logstash already helps here - it can substitute variables into text strings from the fields of the processed message.

The query query in the logstash config starts to look like this:

query => "(type:netconsole) AND (host:%{[host]}) AND (@timestamp:[now-1m TO now]) AND (message:temperature)"

And there would be happiness if not for the absence of an if inside the elasticsearch filter. The result is a perverted construction: we want to use elasticsearch for if, but we cannot, and where we can use if, we cannot make requests "to the past."

After several dozen attempts to "get on the tree", the following construction turned out: We, having discovered MCE, make a request in elastic, and fill in the field (in the message about MCE) with "mce_temperature" with the message content of the found message about 'temperature' on the host from the message about MCE, in the last minute. That is, we add to the message about MCE the previous one, about the temperature. If we find, of course.

The rest is simple: we have an MCE and we have the contents of the mce_temperature field with the “previous” message. Outside of the elasticsearch block, we write:

        if [temperature_mce] = ~ / Core temperature.speed normal / {
                noop {
                        remove_tag => ["notify"]
                        add_tag => ["supressed"]
                }
          }

(Using the notify tag, we send the message to shinken, and supress is just for the person to see why the message was not sent).

Catching cockroaches

Everything was fine, except that at the moment when I was writing this article (assuming that the problem has been solved), I get a happy nagios and say in a human voice: ** PROBLEM: ALERT ... kernel: [8871838.807783] mce: [Hardware Error ]: Machine check events logged

Look:

[Tue Feb 24 15:51:40 2015] CPU0: Core temperature / speed normal
[Tue Feb 24 15:52:54 2015] mce: [Hardware Error]: Machine check events logged

Oh, thanks! And I have the interval [now-1m TO now], that is 60 seconds. And the events interval was 74 seconds. So in the final version, the viewing interval had to be increased to 3 minutes.

The second “funny” problem that came a bit later was problems with reverse DNS. It turned out that DNS does not keep up with monitoring at some points and does not resolve all IPs. We usually resolve the IP in the name and write it in the host field (dns filter). And it turns out that the 'temperature' was written with IP in the host field (failed to resolve), and MCE came with the host name. Or vice versa. In any case, the vile bzz-bzzzz from the phone at night without a good reason.

The solution has also become simple: we resolve the name, and save the ip in another field. And we do the search by IP.

Conclusion

Final config (snippet):

  if [message] = ~ / Machine check events / {
     elasticsearch {
        hosts => ["localhost"]
        query => "(type: netconsole) AND (source_ip:% {[source_ip]}) AND (@timestamp: [now-3m TO now]) AND (message: temperature)"
        fields => ["message", "temperature_mce"]
     }
        if [temperature_mce] = ~ / Core temperature.speed normal / {
                noop {
                        remove_tag => ["notify"]
                        add_tag => ["supressed"]
                }
          }
   }

If you want to disable throttling MCE completely, then replace / Core temperature.speed normal / with / Core temperature /

I'm not sure that the proposed solution is the best practice, but it works and is minimally intrusive into the configuration. This approach allows you to solve a whole class of problems associated with multiline messages and allows you to make retrospective situational decisions without having to breed bigdata in vain.

Tags:

Log processing based on previous messages in logstash / elasticsearch

Catching cockroaches

Conclusion

Also popular now: