collectd - collect system and user statistics

    Question number 0 - why?



    In a post about pnp4nagios I wrote: "Nagios / Pnp4Nagios is not a substitute for collecting statistics about the state of the system." Why do I think so? Because 1) system-state statistics are extensive and include many indicators, and 2) it does not always make sense to monitor them, or rather to generate alerts on them. For example, knowing how many I/O operations a disk performs, or how often context switches occur, is useful but almost never critical. And besides, Nagios is simply not intended for this. In this article I will not give a full description of the system; I will limit myself to the moments I find particularly interesting.

    Question number 1 - why collectd?



    The highlights of why I chose collectd over Munin, Cacti and the others:
    1. Scalability
    2. Lightness
    3. Concept - everything is plugins
    4. Data collection and recording are separated
    5. The number of indicators collected
    6. Extensibility




    The general scheme of how collectd works:
    [diagram: collectd architecture]

    Scalability

    Data is pushed to the central node(s), unlike the poll/pull model of Cacti / Munin. More than one node can handle data storage; moreover, you can split the data across different storage nodes. Data transfer is handled by a separate plugin, network.
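As a sketch, the push setup takes only a few lines on each side (the hostname is illustrative; 25826 is the plugin's default port):

```
# On every monitored node: push values to the collector
LoadPlugin network
<Plugin "network">
    Server "stats.example.com" "25826"
</Plugin>

# On the central node: accept values and hand them to the writers
LoadPlugin network
<Plugin "network">
    Listen "0.0.0.0" "25826"
</Plugin>
```

A node can list several Server lines, which is what makes splitting storage across nodes straightforward.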

    Lightness

    The main daemon and the plugins are written in C and easily handle a 10-second collection interval without loading the system.

    Everything is plugins

    The CPU utilization collector is a plugin. Process information is a plugin. Writing RRD / CSV files is a plugin.

    Data collection and recording are separated

    collectd divides plugins into "readers" and "writers". Those that collect information are readers. Once the data has been read, it is sent to all registered writers, which can in principle be anything. The most "popular" writers are the network plugin, which sends data to the central node, and RRDTool, since RRD files are what people usually mean by statistics. Thus you can keep RRD statistics on the node itself and at the same time send the data on for further processing.

    The number of indicators collected

    Currently there are more than 90 basic plugins for collecting information about the system and applications.

    Extensibility

    To add your own data sources, there are:
    1. The exec plugin, the generally standard extension method: a program is started and its output to stdout is parsed. collectd has a plus here: the program does not have to exit after printing its values. In fact, it is recommended to start once and print the data in a loop, which saves the startup cost on every interval and is especially relevant for scripts.
    2. Python / Perl / Java bindings, for both readers and writers; a more detailed description follows below.
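As a sketch of such a looping exec script (the plugin and type names and the load-average metric are my own illustration, not from the article; the PUTVAL line format and the COLLECTD_* environment variables are from the exec plugin's documentation):

```python
#!/usr/bin/env python
"""Hypothetical collectd exec-plugin script: loops instead of exiting."""
import os
import sys
import time


def putval_line(host, interval, value):
    # PUTVAL "<host>/<plugin>-<instance>/<type>-<instance>" interval=<n> <time>:<value>
    # "N" tells collectd to substitute the current time itself.
    return 'PUTVAL "%s/exec-loadavg/gauge-load1" interval=%d N:%s' % (
        host, interval, value)


def main():
    # The exec plugin exports these variables to its children.
    host = os.environ.get("COLLECTD_HOSTNAME", "localhost")
    interval = int(float(os.environ.get("COLLECTD_INTERVAL", "10")))
    while True:  # stay resident: no fork/exec cost on every interval
        with open("/proc/loadavg") as f:
            load1 = f.read().split()[0]
        print(putval_line(host, interval, load1))
        sys.stdout.flush()  # collectd reads the output line by line
        time.sleep(interval)
```

When deploying, call main() under an `if __name__ == "__main__":` guard and point the config at the script, e.g. `Exec "nobody" "/usr/local/bin/loadavg.py"` (path and user are illustrative).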


    Extensibility through bindings


    Bindings are essentially plugins that give other languages access to collectd's internal mechanisms, so plugins can be written in those languages. Java / Perl / Python are currently supported. For Python, for example, the interpreter is launched when collectd starts and kept in memory, which avoids the cost of spawning it every few seconds and gives scripts access to the API.

    A script can register as a data provider (reader) and/or as a writer; the registered procedure is then called at every interval specified in the configuration. Readers are self-explanatory, but writers deserve special attention: your script can easily be plugged in to process all the data passing through, so you can, for example, build your own store of collected values. A simple example of such a plugin in Python is in the project documentation.
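For illustration, a minimal Python reader might look like this (the plugin name and the /proc/uptime metric are my own example; the register_read / Values / dispatch API is from the Python binding's documentation):

```python
def parse_uptime(text):
    """Pure helper: the first field of /proc/uptime is uptime in seconds."""
    return float(text.split()[0])


# The 'collectd' module exists only inside collectd's embedded interpreter,
# so the registration is guarded to keep the file importable for testing.
try:
    import collectd

    def read_cb():
        with open("/proc/uptime") as f:
            v = collectd.Values(type="gauge", plugin="uptime_py")
            v.dispatch(values=[parse_uptime(f.read())])

    collectd.register_read(read_cb)
except ImportError:
    pass  # running outside collectd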

    Interesting and useful plugin features


    • First of all, I liked the disk plugin: out of the box it can measure the average disk response time:
      [graph: sdb disk access time]


    • The tail plugin reads files like tail(1) and extracts values with regexps. From the extracted values you can take the minimum / maximum / average, count the total number of records, or sum them up. For nginx, for example, you can collect statistics on response time and request counts like this:
      1. Add a record about the request service time to the nginx log:
        log_format maintime '$remote_addr - - [$time_local] reqtime=$request_time "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" "$host" upstream: $upstream_addr gzip:"$gzip_ratio"';

      2. Create an entry in the collectd config (the log path here is illustrative):
        LoadPlugin tail
        <Plugin "tail">
          <File "/var/log/nginx/access.log">
            Instance "nginxproxy"
            <Match>
              Regex "reqtime=([0-9]+\\.[0-9]+)"
              DSType "GaugeMax"
              Type "latency"
              Instance "max response time"
            </Match>
            <Match>
              Regex "reqtime=([0-9]+\\.[0-9]+)"
              DSType "GaugeAverage"
              Type "latency"
              Instance "avg response time"
            </Match>
            <Match>
              Regex ".*"
              DSType "CounterInc"
              Type "derive"
              Instance "requests"
            </Match>
          </File>
        </Plugin>


      We get graphs of the total number of requests, maximum response time and average response time, respectively:
      [graphs: total requests, maximum response time, average response time]
      The curl plugin has similar capabilities.
    • The processes plugin can collect information about the number of running processes that match a filter, the number of threads, memory usage, and I/O, for example:
      [graph: process I/O]
      In this case the data is not entirely correct: these are not just disk read/write operations but all I/O, i.e. socket traffic is counted too. I have informed the authors; perhaps they will fix it in the next version.
    • The unixsock plugin lets you exchange data over a simple text protocol; in particular, you can get or submit counter values. This plugin makes integration with Nagios possible.
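As a sketch, querying a value over that text protocol from Python (the socket path and identifier in the comment are illustrative; the GETVAL command and its reply format are from the unixsock plugin's documentation):

```python
import socket


def parse_getval_reply(lines):
    """First line is '<n> Value(s) found'; then n lines of 'name=value'."""
    n = int(lines[0].split()[0])
    return {k: float(v) for k, v in (l.split("=", 1) for l in lines[1:1 + n])}


def getval(sock_path, identifier):
    """Ask collectd for the current values of one metric over the unix socket."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(sock_path)
    f = s.makefile("rw")
    f.write('GETVAL "%s"\n' % identifier)
    f.flush()
    status = f.readline().strip()  # e.g. '3 Values found'
    n = int(status.split()[0])
    lines = [status] + [f.readline().strip() for _ in range(n)]
    s.close()
    return lines and parse_getval_reply(lines)


# e.g. getval("/var/run/collectd-unixsock", "myhost/load/load")
```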


    Other features


    Filters and chains

    Starting with version 4.6 there is a filter and chain mechanism, similar to the chains in iptables. With it you can filter data, for example drop values whose timestamp differs from the current time by more than N, which is useful when the clock has drifted on some server; otherwise RRD would receive timestamps from the future and the readings would be distorted.
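A sketch of such a rule in the config (the 300-second window is illustrative; the timediff match ships with collectd as match_timediff):

```
LoadPlugin match_timediff
<Chain "PreCache">
    <Rule "drop_bad_clock">
        <Match "timediff">
            # drop values whose timestamp is more than 300 s
            # ahead of or behind this node's clock
            Future 300
            Past 300
        </Match>
        Target "stop"
    </Rule>
    Target "write"
</Chain>
```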

    Notifications and thresholds

    A basic system of notifications and thresholds has existed since version 4.3. By analogy with readers and writers there are "producers" and "consumers": the former generate notifications, the latter process them. In particular, the exec plugin can both respond to notifications, for example by running a script, and forward notifications coming from scripts.

    By configuring a set of threshold values you can generate notifications when anomalies occur. It should be understood, however, that these basic features do not replace something like Nagios. For full integration with Nagios you can use the bundled collectd-nagios program, which queries the socket created by the unixsock plugin and returns the result in Nagios's standard plugin format.
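A sketch of a threshold block (the type, data source and limits are illustrative):

```
<Threshold>
    <Type "load">
        DataSource "shortterm"
        WarningMax 5.0
        FailureMax 10.0
    </Type>
</Threshold>
```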

    Disadvantages


    By and large, the only thing I can name here is the graph display system. Considering that a couple of hundred counters can be generated from a single host, visualization is far from the last concern. The standard collection3 interface is not bad, but far from perfect. Several independent graphing frontends are currently in development, but I cannot recommend any of them yet.

    Other


    One of the developers, Sebastian Harl (tokkee), is the main maintainer of the package in Debian, so backports almost always has the latest version.
