OpenStack detective story, or where does the connection disappear? Part one

This story is about OpenStack + KVM. It all started when everything was working well. The "old" platform satisfied everyone. It had been set up before our time and was slightly outdated: it ran Juno. Still, it worked.

In principle, it was a test platform, until one day it became production. Back then we knew nothing about the problems we would run into later. Management, happily rubbing their hands, decided to upgrade the fleet of systems, including the OpenStack test platform.

We decided to deploy it manually, because at that time Fuel had no solution for the Mitaka release. So we deployed everything following the guides on the official site. Of course, we added a little of our own: for example, we replaced Memcached with Couchbase and used Percona in cluster mode as the database. And everything went well. Up to a certain point.

We began to lose packets. At first we thought the switch was to blame. It ran a rather old Junos version, 11, which has known bugs. And its console really did show messages confirming our guess. We replaced this hardware with another unit running the newer Junos 15 firmware.

Meanwhile, the problem did not go away; it only slowly grew. The typical symptom looks like this: pings are suddenly lost and connections keep dropping. Depressing for us and for our customers.

We have one client that consumes a lot of traffic and generates a great deal in return: it streams broadcasts from webcams. This client began to complain: the connection drops, and that's it.

Here is what we saw on monitoring:

[Figure: Traffic loss]

Indeed, the client is right: something is wrong. But where? At one of these moments we found a cause: an incorrect ARP entry was showing up on the network. Where was the culprit? The offending address was found on the egress firewall. There was a line entered by an administrator by mistake:

set security nat proxy-arp interface xxxx address yy.zz.tt.cc/32
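A stray proxy-ARP entry like this means two devices answer ARP requests for the same IP address. As a minimal sketch of how such a conflict could be spotted, assuming you have already collected (IP, MAC) pairs from ARP replies by some means (the sample addresses and MACs below are invented for illustration and are not from our network):

```python
from collections import defaultdict


def find_arp_conflicts(observations):
    """Return {ip: sorted list of MACs} for every IP claimed by more than one MAC."""
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalize case so the same MAC written two ways is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical observations: documentation-range IPs and made-up MACs.
observations = [
    ("192.0.2.10", "52:54:00:aa:bb:01"),  # the real host
    ("192.0.2.10", "00:1F:12:CC:DD:02"),  # a second device answering for the same IP
    ("192.0.2.11", "52:54:00:aa:bb:03"),
]

conflicts = find_arp_conflicts(observations)
print(conflicts)
```

Any IP that appears in the result is being claimed by more than one device and is worth checking against firewall and router configuration.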

Thank God we found it, was our first thought. But no such luck. The packet loss continued regardless of protocol: TCP, ICMP, or UDP.

We kept searching, and it became clear that the problem was somewhere inside OpenStack. When I started pinging a test virtual machine, I almost fell off my chair:

[Figure: Weird ping]

This meant that for some reason some of the packets were not being address-translated and went out with "gray" (private) source addresses! Naturally, these packets never reached anyone.
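To make the symptom concrete: a packet seen on the external side should never carry a private source address. Here is a minimal sketch of such a check, assuming "gray" means the RFC 1918 ranges and that the source addresses have already been extracted from a capture; the function names and sample addresses are hypothetical.

```python
import ipaddress

# RFC 1918 private ranges, i.e. the "gray" addresses.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_gray(ip):
    """True if the address falls inside one of the RFC 1918 ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in RFC1918_NETWORKS)


def untranslated_sources(source_ips):
    """Return the source addresses that escaped without translation."""
    return [ip for ip in source_ips if is_gray(ip)]


# Hypothetical source addresses seen on the external interface.
seen = ["10.0.0.5", "198.51.100.7", "192.168.1.20", "172.32.0.1"]
leaked = untranslated_sources(seen)
print(leaked)
```

In a healthy setup the resulting list would be empty; any entry in it points to a packet that left without being translated.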

We will share what we managed to dig up, but later. In the meantime, we would like to hear the opinion of the respected public: what did we do wrong, and where should we look?
