santal December 23, 2014 at 11:09

How I patched Zabbix

The other day, I finally got my hands on the Zabbix update.

Since reading the article Zabbix 2.2 came out on the corresponding blog, I could not wait for Gentoo to unmask version 2.2. There was practically no such innovation in this version that I would not be interested in and useful in “everyday life”. This monitoring VMware, and system acceleration, and improvements in LLDP, in short almost every item.

Months passed, and version 2.2 was not even in disguised ones.

Sometimes I put things aside and do something “parallel” with respect to urgent and important tasks and work. This time I remembered the desire to upgrade Zabbix to version 2.2.

Checked in disguised *, well, finally, there is 2.2.5

Well, let's move on, a year has passed since the release, the stable version doesn’t exist, so no matter what happens, we’ll decide.

Unmask, collect wherever necessary (the main one is of course the server and proxy), restart. We reinstall the web interface of III ...
And nothing, the database is being updated.
The thing is not fast, my base is not small, and then the process generally hangs.

Well, I think it began, as they say, did not have time to start, but everything is already bad.

In mysql logs there was Error number 28 means 'No space left on device', there was more than enough space everywhere. As the saying goes, “if I have Google, I’m god-like” (c), not immediately, but I managed to find / guess that this device is ibdata1 and ibdata2, the size of which is regulated by the innodb_data_file_path parameter. After I changed max from 256M to 512M, the database update was successful and the server started.

There were problems with the proxy too, and also because of the database. Just sqlite is not updated, so we stop the proxy, delete the old database and start the proxy. As they say carefully read Upgrade notes

Of course, a lot of expired data has accumulated during the update, so we check that we can verify that everything is displayed in the interface, and wait until everything settles down and is updated.

After a couple of hours, we look at the schedule: The

day began to become languid.

It’s our turn. Where? .. The main expired data from one of the proxies. And what kind of data. And the data received on SNMPv3.

Excellent. I had questions to this functional before, but all hands did not reach, and there was a hope that the update would solve these issues. And then the system becomes practically unworkable. At one time, reading about how people using one server or proxy monitor hundreds of network devices, I could not understand what was the matter? I have several dozen devices, and everything works to the limit. Oh, and the database is completely optimized, and the server is put on a fast datastore, and a lot of memory has been given to it.

I did not want to roll back, so it was decided at all costs to try to deal with the problem.

I chose one of the proxies, behind which the most SNMP devices, and began to understand.

Here is what we have in the logs on the proxy **:

4447:20141218:124053.605 SNMP agent item "ifAdminStatus.["10130"]" on host "co-xx02" failed: first network error, wait for 15 seconds
4468:20141218:124108.270 resuming SNMP agent checks on host "co-xx02": connection restored

And so randomly across all SNMP hosts, with different items.

I definitely don’t have network problems, but for the sake of fidelity, of course, I quickly checked the connectivity, speed and logs of the switches. We check the SNMP itself using snmpwalk and pull the whole tree. No problem.
Google. Someone has pollers clogged, someone has incorrect timeouts, some bug related to this is fixed in version 2.2.3. Someone has a tricky network problem and UDP is lost. But this is not our case.
What then? ..

An interesting fact, if you restart the proxy

/etc/init.d/zabbix-proxy restart

You can see how the queue begins to decline, but then bang, again something happens and it instantly grows up ***.

What is happening? ..
Turn on the extended logging to zabbix-proxy
DebugLevel = 4
Restart zabbix-proxy and wait for errors to appear

Now instead of Network Error we see more complete information

It looks something like this

5414:20141218:125955.481 zbx_snmp_get_values() snmp_synch_response() status:1 errstat:-1 mapping_num:94
5414:20141218:125955.481 End of zbx_snmp_get_values():NETWORK_ERROR
5414:20141218:125955.481 End of zbx_snmp_process_standard():NETWORK_ERROR
5414:20141218:125955.481 In zbx_snmp_close_session()
5414:20141218:125955.481 End of zbx_snmp_close_session()
5414:20141218:125955.481 getting SNMP values failed: Cannot connect to "192.168.x.x:161": Too long.
5414:20141218:125955.481 End of get_values_snmp()
5414:20141218:125955.481 In deactivate_host() hostid:10207 itemid:43739 type:6
5414:20141218:125955.481 query [txnlev:1] [begin;]
5414:20141218:125955.481 query [txnlev:1] [update hosts set snmp_errors_from=1418896795,snmp_disable_until=1418896810,snmp_error='Cannot connect to "192.168.x.x:161": Too long.' where hostid=10207]
5414:20141218:125955.481 query [txnlev:1] [commit;]
5414:20141218:125955.481 SNMP agent item "ifOperStatus.["10143"]" on host "co-xx04" failed: first network error, wait for 15 seconds
5414:20141218:125955.481 deactivate_host() errors_from:1418896795 available:1
5414:20141218:125955.482 End of deactivate_host()

Status 1, error status -1, number of elements 94
Here is the conclusion that it is NETWORK_ERROR
And a little lower decryption Too long and deactivation of the host. It is clear that if the host is deactivated, data cannot be received from it, the data is queued, and this is the explanation of the queue.

Immediately interested in the errstat parameter.

Doing cat /var/log/zabbix/zabbix_proxy.log | grep errstat

5412:20141218:130351.410 zbx_snmp_get_values() snmp_synch_response() status:0 errstat:0 mapping_num:11
5433:20141218:130351.470 zbx_snmp_get_values() snmp_synch_response() status:1 errstat:-1 mapping_num:94
5430:20141218:130351.476 zbx_snmp_get_values() snmp_synch_response() status:1 errstat:-1 mapping_num:94
5417:20141218:130353.442 zbx_snmp_get_values() snmp_synch_response() status:0 errstat:0 mapping_num:5
5420:20141218:130353.534 zbx_snmp_get_values() snmp_synch_response() status:1 errstat:-1 mapping_num:94

Yeah, digging deeper

There should be a picture of We need to go deeper, but it will not. I think she got everyone.

Making cat /var/log/zabbix/zabbxi_proxy.log | grep errstat: -1

5416:20141218:130353.540 zbx_snmp_get_values() snmp_synch_response() status:1 errstat:-1 mapping_num:94
5412:20141218:130355.571 zbx_snmp_get_values() snmp_synch_response() status:1 errstat:-1 mapping_num:94
5417:20141218:130355.591 zbx_snmp_get_values() snmp_synch_response() status:1 errstat:-1 mapping_num:94
...
5420:20141218:130453.187 zbx_snmp_get_values() snmp_synch_response() status:1 errstat:-1 mapping_num:94
5412:20141218:130455.206 zbx_snmp_get_values() snmp_synch_response() status:1 errstat:-1 mapping_num:94
5413:20141218:130455.207 zbx_snmp_get_values() snmp_synch_response() status:1 errstat:-1 mapping_num:94

The time has come to turn off monitoring of all devices except for one proxy, because otherwise, the log is too difficult to understand. Anyway, the system as it is now is not suitable for monitoring.
Disable, do
cat /var/log/zabbix/zabbxi_proxy.log | grep mapping_num
and reload the proxy
No errors the first time, and mapping_num is slowly growing from 1 more and more (individual lines are cut to show the principle)

You can always watch how mapping_num grows

7876:20141218:131251.660 zbx_snmp_get_values() snmp_synch_response() status:0 errstat:0 mapping_num:4
7872:20141218:131251.681 zbx_snmp_get_values() snmp_synch_response() status:0 errstat:0 mapping_num:6
7872:20141218:131251.919 zbx_snmp_get_values() snmp_synch_response() status:0 errstat:0 mapping_num:8
7876:20141218:131251.919 zbx_snmp_get_values() snmp_synch_response() status:0 errstat:0 mapping_num:9
7868:20141218:131351.965 zbx_snmp_get_values() snmp_synch_response() status:0 errstat:0 mapping_num:13
10502:20141218:135237.884 zbx_snmp_get_values() snmp_synch_response() status:0 errstat:0 mapping_num:31
10507:20141218:135238.244 zbx_snmp_get_values() snmp_synch_response() status:0 errstat:0 mapping_num:62
12429:20141218:141637.942 zbx_snmp_get_values() snmp_synch_response() status:0 errstat:0 mapping_num:31
12429:20141218:141637.966 zbx_snmp_get_values() snmp_synch_response() status:0 errstat:0 mapping_num:31
12433:20141218:141651.142 zbx_snmp_get_values() snmp_synch_response() status:1 errstat:-1 mapping_num:94

And then oppa 94, and -1 and too long. Those. immediately after starting, the proxy tests the devices, sends them SNMP requests, increasing the number of items in one request. The queue begins to shrink rapidly. Then it (that is, the proxy) reaches the magic number 94, a failure occurs and the devices begin to turn off by the zabbix for 15 seconds, which in turn begins to sharply increase the queue.

As you can see, here is no longer Network error, here is Too long.

Okay, we’re trying to find something by zabbix snmp too long, nothing.

Again timeouts, overloading the poller ... In one interesting post there was information that such an error occurred when the OID for the item was incorrectly generated, so I double-checked all my OIDs, including through snmpget,

i.e. As a result, Google could not help me.

We will sort it out ourselves, this is useful.
So what do we have?
As soon as the number of items becomes 94 (i.e. large enough) something happens and the system goes astray.

Here again, a picture that will not be;)

It's time to get into the code. There is no need to download anything in gentoo, everything is already there, so I just unpacked everything into a working directory.

First we find where the error is displayed. We search by errstat
. Surprisingly there are only two such places in the file with the speaking name checks_snmp.c

these two places:
from line 745

/* communicate with agent */
status = snmp_synch_response(ss, pdu, &response);
zabbix_log(LOG_LEVEL_DEBUG, "%s() snmp_synch_response() status:%d errstat:%ld max_vars:%d",
                __function_name, status, NULL == response ? (long)-1 : response->errstat, max_vars);

and from line 938

status = snmp_synch_response(ss, pdu, &response);
zabbix_log(LOG_LEVEL_DEBUG, "%s() snmp_synch_response() status:%d errstat:%ld mapping_num:%d",
                __function_name, status, NULL == response ? (long)-1 : response->errstat, mapping_num);

While we are interested in the second piece (based on the fact that mapping_num is only in it),

even the programmer does not see that we have a response NULL, but why? ..

Recall that with errstat: -1, which is now clear where it comes from, we have status :1. Those. the snmp_synch_response function returns 1, but what does it mean? ..
And that means STAT_ERROR (1) (and also it can do STAT_TIMEOUT (2) and STAT_SUCCESS (0))

As they say, it’s not clear, but great ... Let's go
on the other hand, somewhere here but in this file should be return NETWORK_ERROR try to figure out where and why.

The first entry into the zbx_get_snmp_response_error function (which seems to hint)

View zbx_get_snmp_response_error code

static int zbx_get_snmp_response_error(const struct snmp_session *ss, const DC_INTERFACE *interface, int status,
                const struct snmp_pdu *response, char *error, int max_error_len)
{
  int     ret;
  if (STAT_SUCCESS == status)
  {
    zbx_snprintf(error, max_error_len, "SNMP error: %s", snmp_errstring(response->errstat));
    ret = NOTSUPPORTED;
  }
  else if (STAT_ERROR == status)
  {
    zbx_snprintf(error, max_error_len, "Cannot connect to \"%s:%hu\": %s.",
                interface->addr, interface->port, snmp_api_errstring(ss->s_snmp_errno));
    switch (ss->s_snmp_errno)
    {
      case SNMPERR_UNKNOWN_USER_NAME:
      case SNMPERR_UNSUPPORTED_SEC_LEVEL:
      case SNMPERR_AUTHENTICATION_FAILURE:
        ret = NOTSUPPORTED;
        break;
      default:
        ret = NETWORK_ERROR;
    }
  }
  else if (STAT_TIMEOUT == status)
  {
    zbx_snprintf(error, max_error_len, "Timeout while connecting to \"%s:%hu\".",
                interface->addr, interface->port);
    ret = NETWORK_ERROR;
  }
  else
  {
    zbx_snprintf(error, max_error_len, "SNMP error: [%d]", status);
    ret = NOTSUPPORTED;
  }
  return ret;
}

Yeah. Those. we with our STAT_ERROR enter switch where we do not fall under any of the above conditions, and thus we get NETWORK_ERROR by default.
We already realized that this default disorientates us, we need to find out what kind of error it really is. The error code is stored in ss-> s_snmp_errno, add the output of the variable to the log.

The programmer from me is so-so, so quickly with the help of a crowbar and someone’s mother (s) I patched a patch, like this:

diff -urN zabbix-2.2.5/src/zabbix_server/poller/checks_snmp.c zabbix-2.2.5.new/src/zabbix_server/poller/checks_snmp.c
--- zabbix-2.2.5/src/zabbix_server/poller/checks_snmp.c 2014-07-17 17:49:45.000000000 +0400
+++ zabbix-2.2.5.new/src/zabbix_server/poller/checks_snmp.c     2014-10-10 16:38:31.000000000 +0400
@@ -938,7 +938,7 @@
  status = snmp_synch_response(ss, pdu, &response);
  zabbix_log(LOG_LEVEL_DEBUG, "%s() snmp_synch_response() status:%d errstat:%ld mapping_num:%d",
-                   __function_name, status, NULL == response ? (long)-1 : response->errstat, mapping_num);
+                  __function_name, status, NULL == response ? (STAT_ERROR == status ? (long) ss->s_snmp_errno : (long)-1) : response->errstat, mapping_num);
  if (STAT_SUCCESS == status && SNMP_ERR_NOERROR == response->errstat)
  {

If the status STAT_ERROR is displayed ss-> s_snmp_errno I threw the

source of the zabbix into the local repository, quickly fixed the ebuild and go.
Compile, restart, wait.

And here she is our real mistake.

 11211:20141218:155253.362 zbx_snmp_get_values() snmp_synch_response() status:0 errstat:0 mapping_num:18
 11210:20141218:155253.393 zbx_snmp_get_values() snmp_synch_response() status:1 errstat:-5 mapping_num:94

Error -5
See Net-SNMP snmp_api.h

 #define SNMPERR_TOO_LONG                (-5)

Something similar was visible in the logs, but we could not find anything by the Too Long phrase, let's see what kind of error it was and when it occurs.
in snmp_api.c you can see it

some more code

/*
 * Make sure we don't send something that is bigger than the msgMaxSize
 * specified in the received PDU.
 */
if (pdu->version == SNMP_VERSION_3 && session->sndMsgMaxSize != 0 && length > session->sndMsgMaxSize) {
    DEBUGMSGTL(("sess_async_send",
                "length of packet (%lu) exceeds session maximum (%lu)\n",
                (unsigned long)length, (unsigned long)session->sndMsgMaxSize));
    session->s_snmp_errno = SNMPERR_TOO_LONG;
    SNMP_FREE(pktbuf);
    return 0;
}
/*
 * Check that the underlying transport is capable of sending a packet as
 * large as length.
 */
if (transport->msgMaxSize != 0 && length > transport->msgMaxSize) {
    DEBUGMSGTL(("sess_async_send",
                "length of packet (%lu) exceeds transport maximum (%lu)\n",
                (unsigned long)length, (unsigned long)transport->msgMaxSize));
    session->s_snmp_errno = SNMPERR_TOO_LONG;
    SNMP_FREE(pktbuf);
    return 0;
}

There are only two options:
1. The length of the data we want to send is longer than the msgMaxSize parameter defined in the received PDU
2. The underlying transport is not able to send a packet of this length

The question arises of how to fix this error. From the above it follows that we need to look for whether we get msgMaxSize, whether we handle it correctly, etc. etc. But I see the source code for zabbix for the first time, and C for the second (okay for the third).
In short, it does not cause enthusiasm ... Yes, and probably something can be broken.

Lyrical digression: I
must say that during the proceedings with this problem, I also came across information about mass SNMP processing. Those. zabbix can request multiple SNMP data items in a single request.

Details SNMP bulk processing

The bottom line is that zabbix can query up to 128 values in a single request, but not all devices can process these 128 values at a time. And zabbih uses a search strategy for the maximum value for each particular device. By the way, we saw this in the logs. Gradual increase of mapping_num. As soon as zabbix receives an error from the SNMPERR_TOO_BIG device, it searches for a maximum value that returns error-free results using a certain algorithm.

Why am I doing this.
There is a mechanism for handling the overflow error (let's call it that) in the zabbix, you just need to expand it by one more case.
The algorithm itself is painted under the conclusion of our error.

Again this code

else if (1 < mapping_num &&
        ((STAT_SUCCESS == status && SNMP_ERR_TOOBIG == response->errstat) || STAT_TIMEOUT == status))
{
  /* Since we are trying to obtain multiple values from the SNMP agent, the response that it has to  */
  /* generate might be too big. It seems to be required by the SNMP standard that in such cases the  */
  /* error status should be set to "tooBig(1)". However, some devices simply do not respond to such  */
  /* queries and we get a timeout. Moreover, some devices exhibit both behaviors - they either send  */
  /* "tooBig(1)" or do not respond at all. So what we do is halve the number of variables to query - */
  /* it should work in the vast majority of cases, because, since we are now querying "num" values,  */
  /* we know that querying "num/2" values succeeded previously. The case where it can still fail due */
  /* to exceeded maximum response size is if we are now querying values that are unusually large. So */
  /* if querying with half the number of the last values does not work either, we resort to querying */
  /* values one by one, and the next time configuration cache gives us items to query, it will give  */
  /* us less. */
  if (*min_fail > mapping_num)
      *min_fail = mapping_num;
  if (0 == level)
  {
    /* halve the number of items */
    int     base;
    ret = zbx_snmp_get_values(ss, items, oids, results, errcodes, query_and_ignore_type, num / 2,
            level + 1, error, max_error_len, max_succeed, min_fail);
    if (SUCCEED != ret)
        goto exit;
    base = num / 2;
    ret = zbx_snmp_get_values(ss, items + base, oids + base, results + base, errcodes + base,
            NULL == query_and_ignore_type ? NULL : query_and_ignore_type + base, num - base,
            level + 1, error, max_error_len, max_succeed, min_fail);
  }
  else if (1 == level)
  {
    /* resort to querying items one by one */
    for (i = 0; i < num; i++)
    {
      if (SUCCEED != errcodes[i])
        continue;
      ret = zbx_snmp_get_values(ss, items + i, oids + i, results + i, errcodes + i,
              NULL == query_and_ignore_type ? NULL : query_and_ignore_type + i, 1,
              level + 1, error, max_error_len, max_succeed, min_fail);
      if (SUCCEED != ret)
        goto exit;
    }
  }
}

That is, everything is simple, we need to add our condition, without disrupting the existing ones. For this, we have all the data:

status must be STAT_ERROR
ss-> s_snmp_errno must be SNMPERR_TOO_LONG

We also take into account that we have two such places (as well as two outputs to the log file) and the resulting patch will be like this:

At last

diff -urN zabbix-2.2.5/src/zabbix_server/poller/checks_snmp.c zabbix-2.2.5.new/src/zabbix_server/poller/checks_snmp.c
--- zabbix-2.2.5/src/zabbix_server/poller/checks_snmp.c 2014-07-17 17:49:45.000000000 +0400
+++ zabbix-2.2.5.new/src/zabbix_server/poller/checks_snmp.c     2014-10-10 16:38:31.000000000 +0400
@@ -746,10 +746,10 @@
                status = snmp_synch_response(ss, pdu, &response);
                zabbix_log(LOG_LEVEL_DEBUG, "%s() snmp_synch_response() status:%d errstat:%ld max_vars:%d",
-                               __function_name, status, NULL == response ? (long)-1 : response->errstat, max_vars);
+                               __function_name, status, NULL == response ? (STAT_ERROR == status ? (long)ss->s_snmp_errno : (long)-1) : response->errstat, max_vars);
                if (1 < max_vars &&
-                       ((STAT_SUCCESS == status && SNMP_ERR_TOOBIG == response->errstat) || STAT_TIMEOUT == status))
+                       ((STAT_SUCCESS == status && SNMP_ERR_TOOBIG == response->errstat) || STAT_TIMEOUT == status || (STAT_ERROR == status && SNMPERR_TOO_LONG == ss->s_snmp_errno)))
                {
                        /* The logic of iteratively reducing request size here is the same as in function */
                        /* zbx_snmp_get_values(). Please refer to the description there for explanation.  */
@@ -938,7 +938,7 @@
        status = snmp_synch_response(ss, pdu, &response);
        zabbix_log(LOG_LEVEL_DEBUG, "%s() snmp_synch_response() status:%d errstat:%ld mapping_num:%d",
-                       __function_name, status, NULL == response ? (long)-1 : response->errstat, mapping_num);
+                       __function_name, status, NULL == response ? (STAT_ERROR == status ? (long) ss->s_snmp_errno : (long)-1) : response->errstat, mapping_num);
        if (STAT_SUCCESS == status && SNMP_ERR_NOERROR == response->errstat)
        {
@@ -1001,7 +1001,7 @@
                }
        }
        else if (1 < mapping_num &&
-                       ((STAT_SUCCESS == status && SNMP_ERR_TOOBIG == response->errstat) || STAT_TIMEOUT == status))
+                       ((STAT_SUCCESS == status && SNMP_ERR_TOOBIG == response->errstat) || STAT_TIMEOUT == status || (STAT_ERROR == status && SNMPERR_TOO_LONG == ss->s_snmp_errno)))
        {
                /* Since we are trying to obtain multiple values from the SNMP agent, the response that it has to  */
                /* generate might be too big. It seems to be required by the SNMP standard that in such cases the  */

We compile, restart ...

And here is the result:

Network Error has disappeared from the logs.
Hurrah!..

Afterword

Of course, in reality, finding the error and the solution took longer. I had to pick the source code and zabbix and net-snmp more, in the end, to stop at two places in the code.
But the feeling of victory over “inert matter” is priceless.

* desire rolled on October 7, and then 2.2.5 was still masked. Coincidentally, she was masked on October 10;
** do not look at the time, for writing an article he imitated the situation later. During the showdown, there was absolutely no time to pull out the data from the logs, the flow where to go;
*** Yes, yes, I also modeled the picture. Imagine that where there is green in the beginning, everything is red;) And then I drank during restarts.

2014.12.31 UPD: Based on the discussion of the article, a ticket was opened (thanx to alexvl):
failure to send SNMPv3 requests that are “Too long” is not handled properly by SNMP bulk
It has been successfully closed (thanx to Aleksandrs Saveljevs) since versions 2.2.9rc1, 2.4.4rc1, 2.5.0

Tags:

How I patched Zabbix

Afterword

Also popular now: