collectd - collect system and user statistics

    Question number 0 - why?



    In a post about pnp4nagios I wrote: "Nagios / Pnp4Nagios is not a substitute for collecting statistics about the state of the system." Why do I think so? Because 1) system-state statistics are extensive and include many indicators, and 2) it does not always make sense to monitor them, or rather to generate alerts on them. For example, knowing how many I/O operations a disk performs, or how often context switches occur, is useful but almost never critical. And besides, Nagios is simply not intended for this. In this article I will not give a full description of the system; I will limit myself to the moments I find particularly interesting.

    Question number 1 - why collectd?



    The highlights of why I chose collectd over Munin, Cacti and the others:
    1. Scalability
    2. Lightness
    3. Concept - everything is plugins
    4. Data collection and recording are separated
    5. The number of indicators collected
    6. Extensibility




    The general scheme of how collectd works:
    [diagram: collectd architecture]

    Scalability

    Data is pushed to the central node(s), unlike the poll/pull model of Cacti / Munin. More than one node can handle data storage; moreover, you can split the data across different storage nodes. Data transfer is handled by a separate plugin, network.
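As a sketch, the push setup takes only a few lines on each side (the hostname is illustrative; 25826 is the plugin's default port):

```
# On every monitored node: push values to the collector
LoadPlugin network
<Plugin "network">
    Server "stats.example.com" "25826"
</Plugin>

# On the central node: accept values and hand them to the writers
LoadPlugin network
<Plugin "network">
    Listen "0.0.0.0" "25826"
</Plugin>
```

A node can list several Server lines, which is what makes splitting storage across nodes straightforward.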

    Lightness

    The main daemon and the plugins are written in C and easily handle a 10-second collection interval without loading the system.

    Everything is plugins

    The CPU utilization collector is a plugin. Process information is a plugin. Writing RRD / CSV files is a plugin.

    Data collection and recording are separated

    collectd divides plugins into "readers" and "writers". Those that collect information are readers. Once the data has been read, it is sent to all registered writers, which can in principle be anything. The most "popular" writers are the network plugin, which sends data to the central node, and RRDTool, since RRD files are what people usually mean by statistics. Thus you can keep RRD statistics on the node itself and at the same time send the data on for further processing.

    The number of indicators collected

    Currently there are more than 90 basic plugins for collecting information about the system and applications.

    Extensibility

    To add your own data sources, there are:
    1. The exec plugin, the generally standard extension method: a program is started and its output to stdout is parsed. collectd has a plus here: the program does not have to exit after printing its values. In fact, it is recommended to start once and print the data in a loop, which saves the startup cost on every interval and is especially relevant for scripts.
    2. Python / Perl / Java bindings, for both readers and writers; a more detailed description follows below.
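As a sketch of such a looping exec script (the plugin and type names and the load-average metric are my own illustration, not from the article; the PUTVAL line format and the COLLECTD_* environment variables are from the exec plugin's documentation):

```python
#!/usr/bin/env python
"""Hypothetical collectd exec-plugin script: loops instead of exiting."""
import os
import sys
import time


def putval_line(host, interval, value):
    # PUTVAL "<host>/<plugin>-<instance>/<type>-<instance>" interval=<n> <time>:<value>
    # "N" tells collectd to substitute the current time itself.
    return 'PUTVAL "%s/exec-loadavg/gauge-load1" interval=%d N:%s' % (
        host, interval, value)


def main():
    # The exec plugin exports these variables to its children.
    host = os.environ.get("COLLECTD_HOSTNAME", "localhost")
    interval = int(float(os.environ.get("COLLECTD_INTERVAL", "10")))
    while True:  # stay resident: no fork/exec cost on every interval
        with open("/proc/loadavg") as f:
            load1 = f.read().split()[0]
        print(putval_line(host, interval, load1))
        sys.stdout.flush()  # collectd reads the output line by line
        time.sleep(interval)
```

When deploying, call main() under an `if __name__ == "__main__":` guard and point the config at the script, e.g. `Exec "nobody" "/usr/local/bin/loadavg.py"` (path and user are illustrative).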


    Extensibility through bindings


    Bindings are essentially plugins that give other languages access to collectd's internal mechanisms, so plugins can be written in those languages. Java / Perl / Python are currently supported. For Python, for example, the interpreter is launched when collectd starts and kept in memory, which avoids the cost of spawning it every few seconds and gives scripts access to the API.

    A script can register as a data provider (reader) and/or as a writer; the registered procedure is then called at every interval specified in the configuration. Readers are self-explanatory, but writers deserve special attention: your script can easily be plugged in to process all the data passing through, so you can, for example, build your own store of collected values. A simple example of such a plugin in Python is in the project documentation.
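For illustration, a minimal Python reader might look like this (the plugin name and the /proc/uptime metric are my own example; the register_read / Values / dispatch API is from the Python binding's documentation):

```python
def parse_uptime(text):
    """Pure helper: the first field of /proc/uptime is uptime in seconds."""
    return float(text.split()[0])


# The 'collectd' module exists only inside collectd's embedded interpreter,
# so the registration is guarded to keep the file importable for testing.
try:
    import collectd

    def read_cb():
        with open("/proc/uptime") as f:
            v = collectd.Values(type="gauge", plugin="uptime_py")
            v.dispatch(values=[parse_uptime(f.read())])

    collectd.register_read(read_cb)
except ImportError:
    pass  # running outside collectd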

    Interesting and useful plugin features


    • First of all, I liked the disk plugin: out of the box it can measure the average disk response time:
      [graph: sdb disk access time]


    • The tail plugin reads files like tail(1) and extracts values with regexps. From the extracted values you can take the minimum / maximum / average, count the total number of records, or sum them up. For nginx, for example, you can collect statistics on response time and request counts like this:
      1. Add a record about the request service time to the nginx log:
        log_format maintime '$remote_addr - - [$time_local] reqtime=$request_time "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" "$host" upstream: $upstream_addr gzip:"$gzip_ratio"';

      2. Create an entry in the collectd config (the log path here is illustrative):
        LoadPlugin tail
        <Plugin "tail">
          <File "/var/log/nginx/access.log">
            Instance "nginxproxy"
            <Match>
              Regex "reqtime=([0-9]+\\.[0-9]+)"
              DSType "GaugeMax"
              Type "latency"
              Instance "max response time"
            </Match>
            <Match>
              Regex "reqtime=([0-9]+\\.[0-9]+)"
              DSType "GaugeAverage"
              Type "latency"
              Instance "avg response time"
            </Match>
            <Match>
              Regex ".*"
              DSType "CounterInc"
              Type "derive"
              Instance "requests"
            </Match>
          </File>
        </Plugin>


      We get graphs of the total number of requests, maximum response time and average response time, respectively:
      [graphs: total requests, maximum response time, average response time]
      The curl plugin has similar capabilities.
    • The processes plugin can collect information about the number of running processes that match a filter, the number of threads, memory usage, and I/O, for example:
      [graph: process I/O]
      In this case the data is not entirely correct: these are not just disk read/write operations but all I/O, i.e. socket traffic is counted too. I have informed the authors; perhaps they will fix it in the next version.
    • The unixsock plugin lets you exchange data over a simple text protocol; in particular, you can get or submit counter values. This plugin makes integration with Nagios possible.
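As a sketch, querying a value over that text protocol from Python (the socket path and identifier in the comment are illustrative; the GETVAL command and its reply format are from the unixsock plugin's documentation):

```python
import socket


def parse_getval_reply(lines):
    """First line is '<n> Value(s) found'; then n lines of 'name=value'."""
    n = int(lines[0].split()[0])
    return {k: float(v) for k, v in (l.split("=", 1) for l in lines[1:1 + n])}


def getval(sock_path, identifier):
    """Ask collectd for the current values of one metric over the unix socket."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(sock_path)
    f = s.makefile("rw")
    f.write('GETVAL "%s"\n' % identifier)
    f.flush()
    status = f.readline().strip()  # e.g. '3 Values found'
    n = int(status.split()[0])
    lines = [status] + [f.readline().strip() for _ in range(n)]
    s.close()
    return lines and parse_getval_reply(lines)


# e.g. getval("/var/run/collectd-unixsock", "myhost/load/load")
```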


    Other features


    Filters and chains

    Starting with version 4.6 there is a filter and chain mechanism, similar to the chains in iptables. With it you can filter data, for example drop values whose timestamp differs from the current time by more than N, which is useful when the clock has drifted on some server; otherwise RRD would receive timestamps from the future and the readings would be distorted.
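A sketch of such a rule in the config (the 300-second window is illustrative; the timediff match ships with collectd as match_timediff):

```
LoadPlugin match_timediff
<Chain "PreCache">
    <Rule "drop_bad_clock">
        <Match "timediff">
            # drop values whose timestamp is more than 300 s
            # ahead of or behind this node's clock
            Future 300
            Past 300
        </Match>
        Target "stop"
    </Rule>
    Target "write"
</Chain>
```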

    Notifications and thresholds

    A basic system of notifications and thresholds has existed since version 4.3. By analogy with readers and writers there are "producers" and "consumers": the former generate notifications, the latter process them. In particular, the exec plugin can both respond to notifications, for example by running a script, and forward notifications coming from scripts.

    By configuring a set of threshold values you can generate notifications when anomalies occur. It should be understood, however, that these basic features do not replace something like Nagios. For full integration with Nagios you can use the bundled collectd-nagios program, which queries the socket created by the unixsock plugin and returns the result in Nagios's standard plugin format.
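A sketch of a threshold block (the type, data source and limits are illustrative):

```
<Threshold>
    <Type "load">
        DataSource "shortterm"
        WarningMax 5.0
        FailureMax 10.0
    </Type>
</Threshold>
```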

    Disadvantages


    By and large, the only thing I can name here is the graph display system. Considering that a couple of hundred counters can be generated from a single host, visualization is far from the last concern. The standard collection3 interface is not bad, but far from perfect. Several independent graphing frontends are currently in development, but I cannot recommend any of them yet.

    Other


    One of the developers, Sebastian Harl (tokkee), is the main maintainer of the package in Debian, so backports almost always has the latest version.
