vmarunin June 18, 2012 at 19:40

Steroids for Munin

Munin is a very good tool for monitoring servers, especially one or two. As the number of servers grows, though, it performs worse and worse. Below is the story of how I tuned it to monitor more than 1000 virtual machines (275K rrd files in the system).

Why munin

Munin is excellent:
- not demanding on resources (while there are few servers);
- simple to set up (convenient defaults, simple text config);
- plugins are easy to write (and there are plenty of ready-made ones).

Munin is terrible:
- the convenient defaults cannot be changed;
- integration with Nagios is useless;
- grouping of graphs is awkward;
- it "does not like" long-running plugins, so you need workarounds;
- the code inside is not pretty either.

It is built on rrd, which brings both pros and cons.

Munin's plugin model turned out to be convenient: a developer can add graphs to a role without waiting for someone to set something up in a central database. We pay for that later with the size of the config, but it is convenient.
Most importantly, Munin was already in place. Switching to another system means redoing existing software and retraining people.

Problems

- it started stepping on its own tail (a run does not finish within 5 minutes);
- it loads the disk very heavily;
- configs are inconvenient to edit (you constantly forget to add a new server or remove an old one);
- aggregated graphs are hard to write;
- there was no integration with Nagios.

In addition, aggregated graphs lie. If you sum the number of requests from 10 servers and then turn off 5 of them, the historical data drops by half! Naturally: the total graph is recalculated every time and never stored, so when the formula changes, the past graph changes with it.

And their solutions

Overclock

Be sure to generate graphs via CGI / FastCGI; it speeds things up, though not by much.

The only way to speed Munin up radically is an expensive one: put all rrd files into memory (tmpfs). Nothing else helps, alas, neither RAID nor SSD. 275K rrd files occupy 14GB, which is not that much; a server with 32GB of RAM is not uncommon (the Munin processes themselves will eat a few more GB). The disk, on the other hand, can be the most ordinary one.

Naturally, every couple of hours you need to pack the existing rrd files and flush them to disk, just in case. A packed archive writes to a SATA drive without any trouble.
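The periodic flush described above could look roughly like the sketch below. The function name flush_rrd and both paths are made up for illustration; the article does not show the actual script. The archive is written under a temporary name and renamed, so an interrupted run never leaves a truncated file under the final name.

```shell
# flush_rrd SRC_DIR DEST_ARCHIVE
# Hypothetical helper: pack everything under SRC_DIR into a gzip'ed tar
# archive at DEST_ARCHIVE (which should live on the real disk).
flush_rrd() {
    flush_src="$1"
    flush_dest="$2"
    flush_tmp="$flush_dest.tmp.$$"

    # Write to a temporary file first, then rename it into place.
    if tar -czf "$flush_tmp" -C "$flush_src" .; then
        mv -f "$flush_tmp" "$flush_dest"
    else
        rm -f "$flush_tmp"
        return 1
    fi
}

# Example call (paths are assumptions), e.g. from cron every two hours:
#   flush_rrd /mnt/ramdrive /var/backups/munin-rrd.tar.gz
```

Note that tar may exit with a non-zero status when a file changes while it is being read, which is likely on a live rrd tree; how to treat that case is left out of this sketch.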

Munin itself does not clean up rrd files, so you need to delete the unused ones yourself:
/usr/bin/find /mnt/ramdrive/ -type f -mtime +5 -delete

A small digression about the inevitability of memory usage

The problem with all such systems is that the data arrives "transversely" (one measurement from each metric) but is processed "longitudinally" (all measurements for one metric), and transposing this stream is very hard. You can write "longitudinally" right away, as rrd does, and wait on the writes, or you can write "transversely" into a database (just append records one at a time to a large table) and then pay for slow reads.
Either way, a large cache / index in memory is unavoidable.

Since we have plenty of memory, we run update without limiting the number of processes, which helps with slow plugins. I did, however, set up a timer that kills any munin-update process running longer than 3 minutes.
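A watchdog like the one just mentioned can be sketched as follows. This is only an illustration, not the author's script: the function names are invented, it assumes a Linux ps that supports the etimes column (elapsed seconds), GNU xargs for the -r flag, and that the workers appear under the command name munin-update.

```shell
# select_old_pids NAME LIMIT_SECONDS
# Reads "PID ELAPSED_SECONDS COMMAND" lines on stdin and prints the PIDs
# of processes called NAME that have been running longer than the limit.
select_old_pids() {
    awk -v name="$1" -v limit="$2" '$3 == name && $2 > limit { print $1 }'
}

# Hypothetical wrapper: kill munin-update processes older than 3 minutes.
kill_stale_updates() {
    ps -eo pid=,etimes=,comm= \
        | select_old_pids "munin-update" 180 \
        | xargs -r kill
}

# Example: run kill_stale_updates from cron once a minute.
```

Keeping the selection logic separate from the ps and kill calls makes it possible to check the filter on canned input before pointing it at real processes.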

Next up is graph drawing. Profiling showed that the more servers there are, the bigger the config, and it is parsed every time munin-graph-cgi starts. My config had grown to 64 megabytes, and parsing it took up to 7 seconds (at 100% CPU load). The solution is obvious: do not parse it again. But to put this workaround in place, you have to edit the Munin code.

munin-update reads and writes the config as usual, and additionally saves the config object using the Storable module. munin-graph reads the Storable file if it exists.
This speeds up graph drawing but eats memory: 64 megabytes of config turn into 375MB of virtual memory per process.

Oddly enough, this is not a big problem for munin-update, since it first loads the config into memory and only then forks. As a result, top shows 1000 processes with an RSS of 250MB each, yet only 21GB of RAM is in use (14 of which is rrd!). munin-graph is the bigger problem, since there each process honestly eats its own 400MB of memory, but so far there is enough memory.

The next problem was munin-html: it did not finish in time. The cure was to run it asynchronously and to add several forks in the code. HTML is now regenerated every 10 minutes rather than every 5, but that matters little, since it is really needed only when a new server is added.

And make it more convenient

Editing text configs by hand is inconvenient, but generating them with a script is very convenient. Fortunately, by that time we already had a database of servers, divided by system (virtual machines) and data center (physical hosts). A simple script rewrites the Munin config from this database and groups the servers by system (Domain in Munin terms). New servers appear automatically, and old ones disappear automatically. Beauty!
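As a rough illustration of such a generator, the sketch below turns an inventory listing into Munin host sections. The input format (one "system host address" triple per line) and the function name are assumptions; the real script reads from the server database, whose schema the article does not show.

```shell
# generate_munin_hosts
# Reads "SYSTEM HOST ADDRESS" lines on stdin (an assumed export format)
# and prints one Munin host section per line, grouped under SYSTEM.
generate_munin_hosts() {
    while read -r domain host address; do
        # Skip blank or incomplete lines.
        [ -n "$domain" ] && [ -n "$host" ] && [ -n "$address" ] || continue
        printf '[%s;%s]\n    address %s\n    use_node_name yes\n\n' \
            "$domain" "$host" "$address"
    done
}

# Example (file names are hypothetical):
#   generate_munin_hosts < servers.txt > hosts.conf.new && mv hosts.conf.new hosts.conf
```

Writing to a new file and renaming it keeps Munin from ever reading a half-written config.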

The next problem is drawing aggregated graphs.
The plan: a script selects the necessary rrd files, takes the last value from each and, for example, sums them.

The result is a Munin plugin like this one, which draws the total number of requests across servers:

#!/usr/bin/perl -w
use strict;
use warnings;
# Munin calls the plugin with "config" to get the graph definition.
if($ARGV[0] && $ARGV[0] eq 'config') {
    print <<'EOS';
host_name Aggregated
graph_title Total requests
graph_args --base 1000 -l 0
graph_vlabel requests
graph_category Nginx
graph_info Total count of requests
total.label total requests
total.info total requests
EOS
    exit(0);
}
# glob returns clean paths (no trailing newline, unlike `ls` in backticks).
my @rrd_files = glob('/mnt/ramdrive/munin/oMobile/*-nginx_log_access-total-d.rrd');
my $sum = 0;
foreach my $filename (@rrd_files) {
   next unless(-f $filename);
   # Take the last non-NaN AVERAGE value from the past 15 minutes.
   my $lastline = `/usr/bin/rrdtool fetch $filename AVERAGE -s -15M -e now | fgrep -v 'nan' | tail -n 1`;
   if($lastline =~ m/^\d+: ([-+.0-9e]+)\s*$/) {
      $sum+=$1;
   }
}
print "total.value $sum\n";
exit(0);

Never do it this way! This is for demonstration only: it spawns a cloud of unnecessary exec calls. In production you must use the RRD library bindings (such as the RRDs Perl module) instead.

Most often you want either a sum or several lines grouped on one graph (which makes it very easy to see when one server stands out from the crowd). But you can also calculate something more complex. For example, given the average query processing time on each server and the number of requests to each server, you can calculate the average query time across the whole system.
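The system-wide average in that example is a weighted one: each server's average time counts in proportion to its number of requests. A minimal sketch, assuming the per-server values have already been extracted into "average_time request_count" lines (the function name and input format are illustrative):

```shell
# weighted_avg
# Reads "AVG_TIME REQUEST_COUNT" lines on stdin and prints
# sum(avg_time * count) / sum(count), or "U" (Munin's marker for an
# unknown value) when there were no requests at all.
weighted_avg() {
    awk '
        NF >= 2 { weighted += $1 * $2; total += $2 }
        END {
            if (total > 0) printf "%.3f\n", weighted / total
            else print "U"
        }'
}

# Example: one server averaging 2 ms over 100 requests and another
# averaging 4 ms over 300 requests.
#   printf '2 100\n4 300\n' | weighted_avg
```

For the example input the result is 3.500, not the 3.000 a plain average of the two per-server figures would give, because the slower server handled three times as many requests.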

Pay attention to host_name: it lets you create virtual hosts and domains in Munin and group the interesting graphs under them.

If you read more than just the last value from the RRD, you can catch sharp rises and drops in the graphs. As a rule, a sharp change (20 percent, say) in the number of requests, response time, LA and so on is not good and requires attention. We take the data for the last hour, day and week and compare. If the discrepancy is large, we can send an alert (a passive check in Nagios, for example).
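The comparison itself can be as small as the sketch below. The function name and the way the two values are obtained are assumptions; only the 20 percent figure comes from the text above. It compares a current value against a baseline (for example, the same metric an hour, a day or a week ago) and succeeds when the relative change reaches the threshold.

```shell
# is_sharp_change CURRENT BASELINE PERCENT
# Exit status 0 if CURRENT differs from BASELINE by at least PERCENT
# percent of BASELINE (in either direction), 1 otherwise.
is_sharp_change() {
    awk -v cur="$1" -v base="$2" -v pct="$3" 'BEGIN {
        # With a zero baseline any non-zero value counts as a sharp change.
        if (base == 0) exit (cur != 0 ? 0 : 1)
        diff = cur - base
        if (diff < 0) diff = -diff
        if (base < 0) base = -base
        # Compare without dividing to avoid rounding surprises.
        exit (diff * 100 >= pct * base ? 0 : 1)
    }'
}

# Example: alert when requests moved 20% or more against last week.
#   if is_sharp_change "$now" "$week_ago" 20; then echo "ALERT"; fi
```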

Summary

- Any rrd-based system can be sped up by putting the rrd files in memory (tmpfs);
- Munin, convenient for small projects, can be stretched to a noticeable number of servers (in my case, more than 1500 munin-node instances);
- With a small script you can draw fairly complex graphs;
- With a small script you can analyze trends and send alerts based on them.
