binfini May 30, 2017 at 15:51

Once again about storing logs in Zabbix

The topic of collecting and storing logs in Zabbix has been raised more than once. And more than twice. Leaving aside that this is not an entirely proper approach and puts a serious load on the database, such methods have another significant drawback: they store the logs in full. With router or Linux logs you can somehow live with that, but the Windows event log starts to cause a lot of suffering both to the Zabbix server and to the system administrator who decides to collect and store it. Below is a solution to this problem.

Lyrical digression

All the techies are already waiting in the next office, while we pause for a moment, take a breath, step out of our cozy fluffy IT world with unicorns and try to decide whether the game is worth the candle. After all, there are specialized solutions, best practices and other words that completely kill imagination and any love for your own homestead with its fences made of crutches. I am sincerely happy for people who have a full ELK stack, shards, fault tolerance and millions of log lines per second. These demigods rarely descend from their shining peaks to us mere mortals, so they do not understand the simple desire to monitor your entire infrastructure without being strangled by your own stinginess. And monitoring is not only about numeric metrics. The Windows Server security subsystem, for example, works exclusively with logs. So if there is no need to shoot sparrows with a cannon and Zabbix-based monitoring is already in place, why not extend its capabilities? I may be stating the obvious, but infrastructure monitoring is the cornerstone of timely incident response, so the need for a log collector and analyzer that is compatible with Zabbix and consumes a minimum of resources is not up for debate.

Down to business!

Typically, when processing an Event log, the following scheme is used.

EvtSys is a tiny service that converts EventLog messages to the Syslog standard.
This combination has at least two drawbacks:

Microsoft is not particularly restrained about the size of log messages, which often exceed 4 KB and contain the error message, troubleshooting help, links to a detailed solution, horses, gypsies, bears... and so on. All this even though nothing but the message itself is needed. As a result, we store tons of useless information in the database, and disk space is not made of rubber.
For each new message, RSyslog forks a new zabbix_sender process, and with a large volume of incoming data you can easily kill your log collection server, which is also unpleasant.

That is why we take out our bicycle construction kit and start sculpting our brave new world.

General concept

Our new vehicle will have ~~a triangular wheel~~ the following scheme of operation.

Heka is the heart of our pepelats. It is almost like Logstash, only with a lower chimney and thinner smoke. But it has no JRuby, so it is less demanding on resources and, on top of that, amazingly fast: it calmly digests and processes several thousand log lines per second. Like this:

It is also easily and naturally extended with plugins, which is what we will do later. The general logic of the system is as follows. EvtSys converts incoming messages to the Syslog format and sends them to Heka for processing. Heka parses the syslog message, extracts the inner part of the message (the payload) and parses it further with a regular expression, forming a new log line. This line is passed directly to Zabbix by a custom plugin. Everything needed is installed on the Zabbix server, on a separate machine or, as in our case, on a Zabbix proxy. This approach eliminates both drawbacks mentioned above. Let's look in more detail at how it all works inside.

Configure EvtSys

I will not dwell on setting up EvtSys in detail; it has been described many times before me, for example, here. I will note only the points that matter for this solution. The EvtSys version must be at least 4.5, because the -t switch will be used during installation. It is recommended to take the LP version, because it can send messages longer than 2 KB, which is important for us. The EvtSys installation command:

evtsys -i -h  -p  -f 17 -t

For Windows logs I use facility local1 (-f 17), but you can use whatever suits you best. Pay particular attention to the -t option: it must specify the host name exactly as it is set in Zabbix. Heka can then use this field to send messages to the right hosts in Zabbix.

Configuring the message processing pipeline

So we have reached the most interesting and tasty part of our solution. I will not describe the Heka installation either; it is trivial to the point of horror and well covered in the official documentation. So we simply assume that Heka is already installed and go straight to configuring it. Like Logstash, Heka implements a label-based log transformation pipeline. The general path of a processed line is as follows:

Splitter and Filter may be absent from a pipeline, and in the end that will be the case for us. Which path a log line takes, which filters it passes through and what it looks like in the end is determined by conditions on the fields of Heka's internal message representation. You can read more about this in the documentation; for now, we need to make Heka perform the following algorithm.

Receive a syslog message on a UDP port.
Decode the syslog message and extract the additional tag.
Parse the text of the log line (payload) and extract the key parameters and values from it.
Generate a new informative log line based on the selected values.
Encode the log line as a json message for Zabbix.
Send encoded message to Zabbix.

To implement the algorithm, we build the functional blocks of the Heka pipeline as follows:

MultiDecoder deserves a separate explanation. It is a container for decoders, which can be executed either all in sequence (as configured in syslog-decoder) or until the first successful decoding (as in events-decoder). Now let's see how this is implemented in the Heka config. To avoid cluttering the main config, we move the whole configuration into a separate file. Mine is /etc/heka/conf.d/20-eventlog.toml.

Heka config

Consider the settings for each component individually.

[eventlog]
type = "UdpInput"
address = ":10514"
decoder = "syslog-decoder"

Here we specify that Heka listens on all interfaces and uses a port above 1024, to avoid problems with access rights to the port. The same port must be specified in EvtSys as the syslog server port. We also immediately name the decoder that will receive the incoming messages.
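To check that the input is listening before EvtSys is pointed at it, a test datagram can be sent by hand. The sketch below is a hypothetical helper, not part of the original setup. The layout of the line (PRI, timestamp, sending host, then the -t value followed by a colon) is an assumption derived from the rsyslog template used in this config, and the names build_test_datagram, send_test_datagram, "winbox" and "zbx-host" are made up for illustration. The PRI value is facility * 8 + severity, so facility 17 (local1) with severity 6 gives 142.

```python
import socket
from datetime import datetime


def build_test_datagram(zabbix_host, text, sender_host="winbox",
                        facility=17, severity=6, now=None):
    """Build one syslog line in the '<PRI>TIMESTAMP HOST TAG: MSG' layout."""
    now = now or datetime.now()
    pri = facility * 8 + severity  # 17 is local1, 6 is "informational"
    # Traditional syslog timestamps pad the day of the month with a space.
    stamp = f"{now.strftime('%b')} {now.day:2d} {now.strftime('%H:%M:%S')}"
    line = f"<{pri}>{stamp} {sender_host} {zabbix_host}: {text}"
    return line.encode("utf-8")


def send_test_datagram(payload, address=("127.0.0.1", 10514)):
    """Send the datagram to the UDP port configured in the [eventlog] section."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, address)
```

Calling send_test_datagram(build_test_datagram("zbx-host", "4624: test message")) on the Heka host should then produce one incoming message, which is easiest to observe with the debug file output described further down.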

[syslog-decoder]
type = "MultiDecoder"
subs = ["rsyslog-decoder", "events-decoder"]
cascade_strategy = "all"
#log_sub_errors = true

As mentioned above, this decoder is a container for other decoders with two kinds of behavior: run all decoders, or run until the first successful decoding. The behavior is set by the cascade_strategy parameter; here it says that all decoders must be executed in the order given in the subs parameter. For debugging a multi-decoder, the log_sub_errors = true option is useful: when it is set, Heka writes sub-decoder decoding errors to its log.

[events-decoder]
type = "MultiDecoder"
subs = ["event-4624-decoder", "event-4625-decoder", "event-4724-decoder", "event-4738-decoder", "event-4740-decoder"]
cascade_strategy = "first-wins"
#log_sub_errors = true

events-decoder is itself a multi-decoder, but its strategy is different: first-wins. It makes the decoder run its sub-decoders until one of them returns successfully decoded data. This lets us process data against several patterns, as will be shown below.
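The difference between the two strategies can be illustrated with a small model. This is a simplified sketch in Python, not Heka's implementation: it assumes a decoder is a function that returns a new message on success and None on failure, and that with "all" each decoder receives the result of the previous successful one. The decoder functions are hypothetical stand-ins.

```python
def run_all(decoders, message):
    """Model of cascade_strategy = "all": apply every decoder in order."""
    current, any_ok = message, False
    for decode in decoders:
        result = decode(current)
        if result is not None:  # a failed sub-decoder does not stop the chain
            current, any_ok = result, True
    return current if any_ok else None


def run_first_wins(decoders, message):
    """Model of cascade_strategy = "first-wins": stop at the first success."""
    for decode in decoders:
        result = decode(message)
        if result is not None:
            return result
    return None


# Hypothetical stand-ins for rsyslog-decoder and two event decoders.
def decode_syslog(msg):
    return {**msg, "Type": "syslog"}


def decode_4624(msg):
    return {**msg, "Key": "eventlog.4624"} if msg["Payload"].startswith("4624:") else None


def decode_4625(msg):
    return {**msg, "Key": "eventlog.4625"} if msg["Payload"].startswith("4625:") else None


def events_decoder(msg):
    return run_first_wins([decode_4624, decode_4625], msg)


def syslog_decoder(msg):
    return run_all([decode_syslog, events_decoder], msg)
```

In this model syslog_decoder({"Payload": "4625: ..."}) first passes through the syslog stand-in and then picks up the Key from the second event decoder, which mirrors the nesting of events-decoder inside syslog-decoder in the config.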

[rsyslog-decoder]
type = "SandboxDecoder"
filename = "lua_decoders/rsyslog.lua"
        [rsyslog-decoder.config]
        type = "RSYSLOG_TraditionalForwardFormat"
        template = '<%PRI%>%TIMESTAMP% %HOSTNAME% %HOSTNAME% %syslogtag:1:32%%msg:::sp-if-no-1st-sp%%msg%'
        tz = "Europe/Moscow"

This is one of the standard decoders shipped with Heka; according to its author, it supports RSyslog log templates, but it seemed to me that the support is incomplete. Note the doubled %HOSTNAME% parameter in the log template. This is not a typo: I simply could not configure the template so that the value passed to EvtSys via -t did not stick to the message text. This way we get almost what we need, except for one small detail: a colon at the end of the host name. We will remove it when encoding the message. You may ask how the value of the Hostname field is parsed then. Very simply: the new value just overwrites the old one.

The event decoders are as alike as twin brothers, so I will look at only one. The others differ only in the regexp used for parsing and in the output log line.

[event-4624-decoder]
type = "PayloadRegexDecoder"
match_regex = '^(?P<EventID>4624):(?:[^:]+:){10}\s(?P<Account>[^\s]+)[^:]+:\s(?P<Domain>[^\s]+)(?:[^:]+:){8}\s(?P<IP>[^\s]+)'
        [event-4624-decoder.message_fields]
        Type = "windows.eventlog"
        Key = 'eventlog.%EventID%'
        Value = 'Successful logon of user %Domain%\%Account% from address %IP%.'

This decoder runs the line through the regular expression and captures individual parts of it, which can be used to create additional message fields. We will need these fields later to build the output message. The message type (Type field) is also set here, so that messages are routed to the matching output (and its associated encoder).
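The capture-and-substitute step can be tried outside Heka. The following is a rough Python model, not Heka's code. It assumes the capture groups are named EventID, Account, Domain and IP, as the %...% placeholders in message_fields suggest, uses an English rendering of the Value template, and feeds the regex a made-up payload that only imitates the colon layout of a real event.

```python
import re

# Assumed reconstruction: group names taken from the placeholders in message_fields.
MATCH_REGEX = re.compile(
    r"^(?P<EventID>4624):(?:[^:]+:){10}\s(?P<Account>[^\s]+)[^:]+:"
    r"\s(?P<Domain>[^\s]+)(?:[^:]+:){8}\s(?P<IP>[^\s]+)"
)

MESSAGE_FIELDS = {
    "Type": "windows.eventlog",
    "Key": "eventlog.%EventID%",
    "Value": r"Successful logon of user %Domain%\%Account% from address %IP%.",
}


def decode(payload):
    """Return the new message fields, or None if the payload does not match."""
    match = MATCH_REGEX.search(payload)
    if match is None:
        return None
    captures = match.groupdict()

    def fill(template):
        # Replace each %Name% with the captured group of the same name.
        return re.sub(r"%(\w+)%",
                      lambda m: captures.get(m.group(1), m.group(0)),
                      template)

    return {name: fill(template) for name, template in MESSAGE_FIELDS.items()}


def fake_payload():
    """Made-up payload with the colon layout the regex expects, not a real event."""
    head = "".join(f" f{i}:" for i in range(10))
    tail = "".join(f" g{i}:" for i in range(8))
    return f"4624:{head} alice dom: CORP{tail} 10.0.0.5 port: 0"
```

Running decode(fake_payload()) yields the three fields with the placeholders filled in, while a payload that starts with a different event ID returns None, which is what lets the first-wins multi-decoder move on to the next pattern.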

[ZabbixEncoder]
type = "SandboxEncoder"
filename = "lua_encoders/zabbix_trapper.lua"

And now the pearl of our collection: a self-written Lua plugin. I will look at how it works in due course, further down. For now it is enough to say that it encodes a message for Zabbix.

[event-out-zabbix]
type = "TcpOutput"
message_matcher = "Type == 'windows.eventlog'"
address = "127.0.0.1:10051"
encoder = "ZabbixEncoder"
reconnect_after = 1

The output that sends messages to Zabbix. It does not pass everything through, only messages with the matching value in the Type field. Each message passes through the encoder described above.

I recommend paying special attention to the reconnect_after = 1 parameter; it is very important. The thing is that TcpOutput contains a bug and cannot keep a TCP session alive for long. It looks roughly like this: at some point Zabbix starts receiving scraps instead of complete log messages, and then nothing at all. If you watch the exchange between Heka and Zabbix in Wireshark, you can see that TcpOutput starts to mangle packets and the connection drops. To prevent this, the parameter above makes TcpOutput reconnect after every message sent to Zabbix. Yes, this creates overhead, but I did not notice any significant increase in load on either the Heka host or the Zabbix proxy. Maybe I just send too few messages? ;)

Debug message processing

To debug the pipeline I recommend two things: RstEncoder and output to a file. RstEncoder renders every field of a message as text, so you can see which fields the message consists of and what their values are. Output to a file helps to track what gets encoded, what gets skipped, and how. You can set up the output to a file like this:

[event-out-file]
type = "FileOutput"
message_matcher = "Type == 'windows.eventlog'"
path = "/tmp/events.log"
perm = "666"
flush_count = 100
flush_operator = "OR"
encoder = "RstEncoder"

All the parameters here are self-explanatory; I will explain only flush_operator = "OR". It defines how the flush conditions are combined: with "OR", buffered messages are written to the file as soon as either flush_count or flush_interval is reached. By changing the encoder you can check the format of the recorded messages, and message_matcher helps to make sure the messages have the correct type. If you relax the matcher, everything gets recorded; sometimes this helps to detect an error in the message_matcher filter or the fact that messages are not being generated at all.

Zabbix trapper

And now we come to how the custom encoder works. All additional plugins should live under /usr/share/heka/. Since ours is an encoder, we put it at /usr/share/heka/lua_encoders/zabbix_trapper.lua. Let's see what gears are spinning inside it.

Zabbix Trapper Module for Heka

require "os"
require "string"
require "table"
-- JSON library.
-- https://www.kyne.com.au/~mark/software/lua-cjson.php
local cjson = require("cjson")
function process_message()
    -- Host name in Zabbix.
    local host = read_message("Hostname")
    -- Item key.
    local key = read_message("Fields[Key]")
    -- Item value.
    local value = read_message("Fields[Value]")
    -- Start assembling the JSON message.
    local message = {}
    -- Cut off the stray trailing colon left over from parsing the syslog message.
    message["host"] = string.sub(host, 1, -2)
    message["key"] = key
    message["value"] = value
    local buffer = {message}
    local zabbix_message = {}
    zabbix_message["request"] = "sender data"
    zabbix_message["data"] = buffer
    -- Create a new message payload: the prepared data array encoded as JSON.
    inject_payload("json", "Payload", cjson.encode(zabbix_message))
    return 0
end

The CJSON library should be in a place accessible by Lua, so I put it in /usr/share/heka/lua_modules/cjson.so.
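To see what the encoder puts on the wire without running Heka, its logic can be mirrored in a few lines. This is only an illustrative Python sketch of the same transformation; the function name encode_for_zabbix and the sample values are made up. One deliberate difference is labeled in the code: the Lua version always cuts the last character of the host name, while this sketch cuts it only if it is a colon.

```python
import json


def encode_for_zabbix(hostname, key, value):
    """Build the same JSON structure the Lua encoder injects as the payload."""
    # Assumption: strip the trailing colon only when present
    # (the Lua encoder drops the last character unconditionally).
    host = hostname[:-1] if hostname.endswith(":") else hostname
    message = {"host": host, "key": key, "value": value}
    request = {"request": "sender data", "data": [message]}
    # ensure_ascii=False keeps non-ASCII log text readable in the output.
    return json.dumps(request, ensure_ascii=False)
```

For example, encode_for_zabbix("zbx-host:", "eventlog.4624", "some text") produces a "sender data" request with a single item for host zbx-host, which can be compared with what ends up in the debug file output.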

Template for Zabbix

Among the template's features, note that the items must be of the Zabbix trapper type, with the information type "Log", and keys of the form eventlog.1102.

Zabbix Template

Export version 3.0, dated 2016-10-01T13:30:30Z, group "Security templates". The XML markup of the export has been lost; the triggers it contained are listed below as expression; status and priority; name. The description of each trigger repeats its name, followed by {ITEM.LASTVALUE} on a new line.

{Template Windows Users Audit:eventlog.4732.nodata(120)}=0; status 0, priority 4; {HOST.NAME}: A member was added to a security-enabled local group.
{Template Windows Users Audit:eventlog.1102.nodata(120)}=0; status 0, priority 4; {HOST.NAME}: The audit log was cleared.
{Template Windows Users Audit:eventlog.4740.nodata(120)}=0; status 0, priority 3; {HOST.NAME}: A user account was locked out.
{Template Windows Users Audit:eventlog.4735.nodata(120)}=0; status 0, priority 4; {HOST.NAME}: A security-enabled local group was changed.
{Template Windows Users Audit:eventlog.4738.nodata(120)}=0; status 0, priority 4; {HOST.NAME}: A user account was changed.
{Template Windows Users Audit:eventlog.4781.nodata(120)}=0; status 0, priority 4; {HOST.NAME}: The name of an account was changed.
{Template Windows Users Audit:eventlog.4725.nodata(120)}=0; status 0, priority 4; {HOST.NAME}: A user account was disabled.
{Template Windows Users Audit:eventlog.4724.nodata(120)}=0; status 0, priority 4; {HOST.NAME}: An attempt was made to reset an account's password.
{Template Windows Users Audit:eventlog.4731.nodata(120)}=0; status 0, priority 4; {HOST.NAME}: A security-enabled local group was created.
{Template Windows Users Audit:eventlog.4720.nodata(120)}=0; status 0, priority 4; {HOST.NAME}: A user account was created.
{Template Windows Users Audit:eventlog.4734.nodata(120)}=0; status 0, priority 3; {HOST.NAME}: A security-enabled local group was deleted.
{Template Windows Users Audit:eventlog.4726.nodata(120)}=0; status 0, priority 3; {HOST.NAME}: A user account was deleted.
{Template Windows Users Audit:eventlog.4733.nodata(120)}=0; status 0, priority 3; {HOST.NAME}: A member was removed from a security-enabled local group.
{Template Windows Users Audit:eventlog.4624.nodata(10)}=0; status 0, priority 1; {HOST.NAME}: A user successfully logged on.
{Template Windows Users Audit:eventlog.4625.nodata(120)}=0; status 0, priority 2; {HOST.NAME}: An account failed to log on.

You have probably already noticed the strange construction of the triggers. Since we have no threshold value and the incoming data arrives irregularly, this construction seemed the most correct to me. A trigger stays on for 2 minutes (the nodata(120) period); with a high severity the corresponding notifications are sent and the event is not lost.
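The logic of an expression such as nodata(120)=0 can be modeled in a few lines. This is a simplified Python sketch of the idea, not Zabbix's evaluation code: it assumes timestamps in seconds, ignores how often Zabbix actually re-evaluates the expression, and treats the exact boundary as "no data", which is a simplification.

```python
def nodata(last_value_ts, now, period):
    """Return 1 if no value arrived within the last `period` seconds, else 0."""
    if last_value_ts is None:
        return 1
    return 0 if now - last_value_ts < period else 1


def trigger_is_on(last_value_ts, now, period=120):
    """Model of the expression nodata(period)=0: on while data is recent."""
    return nodata(last_value_ts, now, period) == 0
```

With a log line received at t=1000, trigger_is_on(1000, 1060) is true and trigger_is_on(1000, 1200) is false: the trigger lights up when an event arrives and goes out on its own once the period has passed without new events.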

And what is the result?

The system turned out to be of the set-up-and-forget kind. It has been working hard in production for several months without requiring any maintenance. Of course, like any good system, it has room for expansion. Plans for the future:

A wider range of events.
Template optimization.
Support for sending events in batches (it will be good with a large stream).
More optimal opening and closing of TCP connections (reduce overhead).
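Batching is only a plan here, so nothing below comes from the working setup. As a rough illustration of the idea, the "data" list of a "sender data" request, which the Lua encoder fills with a single item, could carry several items at once. The Python sketch below uses hypothetical names (batched_requests, batch_size) and assumes events arrive as (host, key, value) tuples.

```python
import json
from itertools import islice


def batched_requests(events, batch_size=100):
    """Yield one "sender data" JSON request per chunk of up to batch_size events."""
    iterator = iter(events)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        data = [{"host": host, "key": key, "value": value}
                for host, key, value in chunk]
        yield json.dumps({"request": "sender data", "data": data},
                         ensure_ascii=False)
```

Five events with batch_size=2 would turn into three requests instead of five, which also reduces the number of TCP connections that have to be opened and closed.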

And finally, a dash of analytics: graphs of the Zabbix server queue with direct logging to the database and after the optimization.

A miserable attempt to enable direct logging: you can see how the queue grows abnormally and then Zabbix cracks in half; the patient was barely saved. Logs were collected from a single host with a fairly heavily loaded security log.

This already looks more like the truth. Yes, a certain number of messages still sits in the queue, but that is not caused by the Windows logs, and overall you can judge the performance gain by the total length of the queue. Besides, two hosts are monitored here and there are no fewer messages, yet the load from this particular monitoring subsystem is lost against the general background.

P.S. I make no great claims; I simply wanted to share my solution. If you know an alternative, I will be glad to see it in the comments.
