Little admin story: how to catch OOM
An admin riddle: three OOM kills happened on a server, but monitoring reported only two of them. Why?
Configuration
For monitoring everything we run a ganglia-shinken-logstash-elasticsearch-kibana stack. The full description is rather lengthy, so I will limit myself to the part relevant to the problem. Logstash receives logs from all servers and stores them in elasticsearch. The logstash config is set up to react to all sorts of suspicious messages that indicate trouble. When such a message appears, it is forwarded to the event monitoring (shinken), which then starts pestering the administrators by various means.
In addition to syslog, which forwards messages from most applications, we also configured netconsole on all kernels. The technology itself is trivially simple: besides dmesg, the kernel sends its messages as UDP datagrams to a given IP and MAC address. The MAC address is needed because netconsole is very low-level and is not going to figure out how to turn an IP into a MAC (that is, do ARP). Thanks to being so low-level, the messages get through even in situations of complete cataclysm. For example, if the software switch has stopped working (and the network is unreachable), the messages will still be sent. They will be sent even if iptables says -j drop_vsyo_nafig ("drop everything"). And, most importantly and most valuably, they will be sent even if the disk subsystem is not working at all. In other words, for post-mortem research into "what exactly happened to the hung server", this is it.
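For reference, a minimal sketch of enabling netconsole on a sending host; the addresses, ports, interface and MAC below are placeholders, not our actual values:

# syntax: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
modprobe netconsole netconsole=6665@10.0.0.1/eth0,6666@10.0.0.2/00:11:22:33:44:55
# the same string can be made permanent via /etc/modprobe.d/ or the kernel command line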
An obvious candidate for a "bad" message is output from the oom-killer:
[517935.914380] ntpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[517935.914730] Call Trace:
[517935.914807] [] dump_header+0x83/0xbb
[517935.914877] [] oom_kill_process.part.6+0x55/0x2cf
...
with the final triumphant:
[517935.951044] Out of memory: Kill process 4550 (apache2) score 247 or sacrifice child
[517935.951203] Killed process 4550 (apache2) total-vm:2610268kB, anon-rss:2012696kB, file-rss:3928kB
So, back to the riddle. It is during commissioning, pre-production, when suddenly Apache (more precisely, a wsgi application) gorges itself on data and gets killed with the words "go be fat somewhere else". The admins receive a notification. Everything seems fine (well, "fine" in the admin sense). But...
Three OOMs happened, notifications came for two. Monitoring is OK, netconsole is OK. A riddle? A problem? Symptoms of some mysterious unknown crud? Call in the court shaman with his tambourine?
Data collection
First question: was there actually a "third" OOM? We recheck locally on the servers: there was. I write it out by hand: three processes were definitely killed.
Second question: the mail? We check: according to the shinken logs there were two alerts, and everything shinken generated was delivered.
Third question: Does netconsole not work?
Check:
# echo "performing tests" > /dev/kmsgIn kibana a message appeared. So sending and receiving works.
And what do the OOM messages look like in kibana? A naive query for "invoked" (not the most common word in logs) returns three messages.
How come three? Why only two alerts? We look at the filters: they are triggered not by the line "invoked oom-killer" but by the line "Out of memory: Kill process .+ (.+)".
We look. Oops: there are only two "Out of memory: Kill process" lines in kibana, not three. Three "invoked", two "Kill".
We tune the filtering in kibana so that only the netconsole output for this particular OOM is left. [Spoiler: how an admin survives kibana - in the next episodes]. We look. Everything seems to be there. We start comparing it line by line with what dmesg shows on the host. The trace is there, the rather ugly dump of free memory is there (it is sent in separate chunks, so in logstash it turns into a set of incoherent messages like "5040", "1*256kB (U)", etc.), there is the list of processes... Aha: the list of processes is incomplete, and in general the bottom half of the output is missing.
For the two previous OOMs everything is in place; for this one, only part of the messages made it.
So, concrete fact no. 1: an incomplete OOM log was recorded in elasticsearch. Since shinken is triggered the moment a message arrives, we can assume the log did not fully reach the filters either, which means elasticsearch and its interaction with logstash can be left out of the picture for now.
We narrow the focus of the investigation to the receiving of netconsole messages. Why? Well, on the one hand, UDP, no delivery guarantees and all that; on the other hand, it is our local network: there were no monitoring alerts about increased network load, SNMP from the switches did not complain about anything, and no network links were flapping either. The network load graphs are also smooth - no connectivity problems in sight.
A moment of intuition: how big is the record produced by an OOM kill, anyway? I write it to a file, wc -c. It turns out to be 12.5 KB. Hypotheses along the lines of "doesn't fit into UDP" don't hold, because each individual message is short - hundreds of bytes. But here is the question: who is faster - the kernel, which generates the UDP packets at a low level in a kernel thread, or the java application, which runs filters and regexps on the received messages, writes them to a database and sends events off to other servers? Obviously, the "special high-grade class of software that makes careful use of relevant software architecture design principles to build particularly customizable and extensible solutions to real problems" is faster than the dirty C code in which the word goto occurs 119,287 times (3.16-rc7).
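For reference, the measurement can be reproduced roughly like this (a sketch only; the exact start and end markers in dmesg vary between kernel versions):

# cut one OOM report out of the kernel log and count its bytes
dmesg | sed -n '/invoked oom-killer/,/Killed process/p' > /tmp/oom.txt
wc -c /tmp/oom.txt   # about 12.5 KB here, i.e. dozens of separate short UDP datagrams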
Question: what happens to packets that arrive faster than the enterprise-grade software can process them? A kind wizard puts them to sleep, right at the exit from the network interface, with a headshot.
We zoom in on this question. And suddenly - sensation, drama, scandal: the network interface on one of the logstash servers is screaming about universal injustice, and nobody hears it:
bond0     Link encap:Ethernet  HWaddr blah blah blah
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:455816267 errors:0 dropped:130361 overruns:0 frame:0
          TX packets:73379669 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
A hundred and thirty thousand packets, instead of living, rejoicing and bringing happiness, are sleeping at the network interface, and nobody can wake them.
We record fact no. 2: monitoring does not watch the drop counters on the network interfaces. Regardless of how the "can this be repaired?" investigation goes, we have a valuable observation, which immediately turns into a ticket in the task tracker: "monitor changes in the number of drops on network interfaces". Another run through Google suggests that, alas, the interface counters cannot be reset by any humane method, so we track the delta instead.
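A minimal sketch of such a check, in the nagios-plugin style that shinken consumes; the interface name and the state file path are made up for illustration:

#!/bin/sh
# Sketch: warn when rx_dropped has grown since the previous run.
IFACE=${1:-bond0}
STATE=/var/tmp/rx_dropped.$IFACE

CUR=$(cat /sys/class/net/$IFACE/statistics/rx_dropped)
PREV=$(cat "$STATE" 2>/dev/null || echo "$CUR")
echo "$CUR" > "$STATE"

if [ "$CUR" -gt "$PREV" ]; then
    echo "WARNING: rx_dropped on $IFACE grew by $((CUR - PREV)), now $CUR"
    exit 1
fi
echo "OK: rx_dropped on $IFACE unchanged ($CUR)"
exit 0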
Forensic system administration ends here. The reason monitoring failed is now completely clear: an OOM causes a large number of messages to be sent within an extremely short period of time, java cannot keep up, and the excess packets are dropped. Among them, in this case, was the packet carrying the message that the application had been killed - so no event was sent to shinken.
Mending
Part two: what do we do about it?
A short run through Google shows that the buffer size of the socket reading the UDP datagrams is controlled by the application (man setsockopt, SO_RCVBUF). We continue the obstacle course through Google: the logstash config has an option called buffer_size. Guided by common sense, we raised it to 1 MB. net.core.rmem_max also turns up: by default it is 200-something KB. Most likely it does not apply to our case (we only need 12 KB), but raising it to 2 MB will not hurt.
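A sketch of the resulting changes (the sysctl value is the one mentioned above; the port number is a placeholder, and buffer_size is the logstash option referred to above):

# raise the kernel-wide ceiling for socket receive buffers to 2 MB
sysctl -w net.core.rmem_max=2097152
echo 'net.core.rmem_max = 2097152' > /etc/sysctl.d/60-netconsole-rmem.conf

# and in the logstash udp input (logstash config, quoted here as a comment):
#   input {
#     udp {
#       port        => 6666       # netconsole target port, placeholder
#       buffer_size => 1048576    # raised to 1 MB, as described above
#     }
#   }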
Result
0. The reason monitoring failed has been established.
1. Now shinken monitors drops on all network interfaces.
2. logstash can now absorb a much larger burst of log messages without packet loss.
PS For those who suggest "TCP for syslog": netconsole is very low-level and cannot do anything of the sort; even its UDP is done in the style of "fill in the fields and push out an ethernet frame". And the kernel would probably look pretty funny if it started stalling while pondering how to write its logs over TCP...