An OpenStack detective story, or where does the connection disappear? Part three
"Who is building like that ?!"
Which address should the router get by default on a network? A big question. In principle, nothing prevents it from being any address in the subnet. The OpenStack developers simply decided: let it be the first one, why agonize over it?
As a result, before you know it, everything goes down. Why? Because, unexpectedly for everyone, the default gateway address is sitting not on the router, where it belongs, but on one of your instances. Customers are calling, the boss is furious, and you are hunting for yet another cause of the outage. All that happened was that a colleague detached the existing address, intending to replace it, and OpenStack turned out to be cleverer...
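Had we pinned the gateway explicitly when creating the subnet, this particular surprise could not have happened. A minimal sketch with the openstack CLI (network and subnet names here are invented for illustration):

# Make the gateway the last address instead of the default first one
openstack subnet create \
    --network demo-net \
    --subnet-range 192.0.2.0/24 \
    --gateway 192.0.2.254 \
    demo-subnet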
Life goes on
In some cases the problem showed up immediately, in others it did not. Let me remind you: the long-standing problem was that some IP packets periodically disappeared.
I'll try to justify myself a little. Our problems often coincided with external attacks, and in many cases it looked as though the channels were simply congested. Sometimes we really did exceed the channel capacity, and packets genuinely dropped. This was aggravated by infected machines inside the platform generating an incredible amount of internal traffic, and by network equipment malfunctions in which, thanks to bugs in the device software, perfectly good packets were also killed. On top of all that, the configuration files are simply huge.
I am neither a robot nor a magician: with careful reading you can work out what each option does, but whether it is needed in a specific context was completely unclear. I had to guess intuitively and test the most plausible assumptions in practice.
Therefore, it was difficult for me and my colleagues to isolate and identify the problem. Even worse, the newly created farm had no problem at all. We spun up three hundred virtual machines, and everything worked like clockwork. Naturally, we immediately began preparing it for production, which meant introducing "torn", non-contiguous ranges of IP addresses. We cleaned the farm, removing those three hundred machines. And suddenly, with only three test virtual machines left, the same thing happened as on the old farm: packets began to disappear in large numbers. So we concluded that the problem was somewhere in the depths of OpenStack.
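"Torn" ranges in Neutron are expressed as several disjoint allocation pools on one subnet; a sketch of what that means (names and addresses invented):

# One subnet, several non-contiguous allocation pools
openstack subnet create \
    --network prod-net \
    --subnet-range 203.0.113.0/24 \
    --allocation-pool start=203.0.113.10,end=203.0.113.20 \
    --allocation-pool start=203.0.113.100,end=203.0.113.120 \
    prod-subnet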
Strange workarounds
On the old farm we had found a relatively simple way around the problem: detach the machine's internal IP address and assign a new one from a different subnet (we had to add new subnets all the time anyway). The problem would then go away for a while, and some of the machines worked fine.
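The workaround boiled down to moving the fixed IP on the instance's port to another subnet; roughly like this (server name, subnet names, addresses, and PORT_ID are placeholders):

# Find the instance's port, then move its fixed IP to another subnet
openstack port list --server my-vm
openstack port unset --fixed-ip subnet=old-subnet,ip-address=10.0.0.15 PORT_ID
openstack port set --fixed-ip subnet=new-subnet PORT_ID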
The solution is somewhere nearby
During a long investigation, interrupted by design work and by problems escalated from VIPs, we still managed to identify several errors. On top of that, these same config files differ depending on whether or not you use the controller as a compute node. In one of the first working configurations we did use it that way; later we abandoned the idea, but some of the settings remained. As a result, two out of nine machines had incorrect settings: the compute nodes had ended up with the dvr_snat agent mode instead of dvr. In the end I found the right parameter and put it in its place.
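For reference, the DVR agent mode lives in l3_agent.ini and must differ by node role; the correct split looks like this (file paths per a typical package layout):

# /etc/neutron/l3_agent.ini on a compute node
[DEFAULT]
agent_mode = dvr

# /etc/neutron/l3_agent.ini on the controller / network node
[DEFAULT]
agent_mode = dvr_snat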
Without really understanding how a virtual router works (where does it even get its settings from?), I had to configure it as well. In theory it should have one address and, accordingly, one MAC address. Logical, right? That is how a colleague and I reasoned, and we configured it accordingly.
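To see which addresses and MACs a router actually holds, you can list its ports (the router name is a placeholder):

# Show the router's ports with their MAC and IP addresses
openstack port list --router router1 -c "MAC Address" -c "Fixed IP Addresses"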
At some point, while investigating the DHCP problems (see part 2), I found duplicate MAC addresses. Not one or two, but far more. Quite a surprise!
We decided to change the base_mac and dvr_base_mac settings so that these parameters are now different on every compute node and every controller.
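A quick way to spot duplicates like these is to dump all port MACs and look for repeats (assuming the openstack CLI is available):

# Print every MAC address that occurs on more than one port
openstack port list -c "MAC Address" -f value | sort | uniq -d

The per-host base_mac and dvr_base_mac prefixes themselves are visible in the config files below.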
From the very beginning we had never gotten around to enabling l2population; our hands simply never reached it. On the new farm we turned it on. And behold: after all these changes, it worked! Better yet, packet loss disappeared entirely. Previously, every now and then a packet would vanish for no reason at all; at about 0.1% loss we considered things good, because it is far worse when a quarter or even half of the packets disappear.
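Enabling l2population touches both the ML2 plugin and the L2 agent configuration. A minimal sketch, assuming the Open vSwitch mechanism driver and stock file locations (your paths and driver list may differ):

# /etc/neutron/plugins/ml2/ml2_conf.ini
[ml2]
mechanism_drivers = openvswitch,l2population

# /etc/neutron/plugins/ml2/openvswitch_agent.ini
[agent]
l2_population = True

For completeness, here are the resulting neutron.conf files from two of our hosts (the hostname mama and the ZPASSWORDZ passwords are placeholders). First host: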
[DEFAULT]
bind_host = 192.168.1.4
auth_strategy = keystone
core_plugin = ml2
allow_overlapping_ips = True
service_plugins = router
base_mac = fa:17:a1:00:00:00
notify_nova_on_port_status_changes = true
notify_nova_on_port_data_changes = true
advertise_mtu = true
allow_automatic_dhcp_failover = true
dhcp_agents_per_network = 3
dvr_base_mac = fa:17:b1:00:00:00
router_distributed = true
allow_automatic_l3agent_failover = true
l3_ha = true
max_l3_agents_per_router = 3
rpc_backend = rabbit
[agent]
root_helper = sudo /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf
[database]
connection = mysql+pymysql://neutron:ZPASSWORDZ@mama/neutron
[keystone_authtoken]
auth_uri = mama:5000
auth_url = mama:35357
memcached_servers = mama:11230
auth_plugin = password
project_domain_name = default
user_domain_name = default
project_name = service
username = neutron
password = ZPASSWORDZ
[nova]
auth_url = mama:35357
auth_plugin = password
project_domain_name = default
user_domain_name = default
region
project_name = service
username = nova
password = ZPASSWORDZ
[oslo_messaging_rabbit]
rabbit_userid = openstack
rabbit_password = ZPASSWORDZ
rabbit_durable_queues = true
rabbit_hosts = mama:5673
rabbit_retry_interval = 1
rabbit_retry_backoff = 2
rabbit_max_retries = 0
rabbit_ha_queues = false
[quotas]
quota_network = 100
quota_subnet = 200
quota_port = -1
quota_router = 100
quota_floatingip = -1
quota_security_group = -1
quota_security_group_rule = -1
The same file on the second host:

[DEFAULT]
bind_host = 192.168.1.7
bind_port = 9696
auth_strategy = keystone
core_plugin = ml2
allow_overlapping_ips = True
service_plugins = router
base_mac = fa:17:c1:00:00:00
notify_nova_on_port_status_changes = true
notify_nova_on_port_data_changes = true
allow_automatic_dhcp_failover = true
dhcp_agents_per_network = 3
dvr_base_mac = fa:17:d1:00:00:00
router_distributed = true
allow_automatic_l3agent_failover = true
l3_ha = true
max_l3_agents_per_router = 3
rpc_backend = rabbit
[agent]
root_helper = sudo /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf
[database]
connection = mysql+pymysql://neutron:ZPASSWORDZ@mama/neutron
[keystone_authtoken]
auth_uri = mama:5000
auth_url = mama:35357
memcached_servers = mama:11230
auth_plugin = password
project_domain_name = default
user_domain_name = default
project_name = service
username = neutron
password = ZPASSWORDZ
[nova]
auth_url = mama:35357
auth_plugin = password
project_domain_name = default
user_domain_name = default
region
project_name = service
username = nova
password = ZPASSWORDZ
[oslo_messaging_rabbit]
rabbit_hosts = mama:5673
rabbit_userid = openstack
rabbit_password = ZPASSWORDZ
rabbit_durable_queues = true
rabbit_retry_interval = 1
rabbit_retry_backoff = 2
rabbit_max_retries = 0
rabbit_ha_queues = true
We waited patiently for a day (though I wanted to run around shouting "it all works!"), then applied the same changes on the old farm. Two weeks in, and it is still smooth sailing.
Conclusion
Of course, none of this would have happened if we had deployed through an automated installer instead of configuring everything by hand. Still, the experience we gained is invaluable.