Admin war stories: in pursuit of tunnel fragmentation in the overlay network

    Lyrical introduction

    When administrators encounter an unexpected problem ("it used to work, and suddenly, after the update, it stopped"), they have two possible behavioral patterns: fight or flight. That is, either dig into the problem to the bitter end, or run away from it without ever understanding its essence. In the context of a software update: roll back.

    Rolling back after an unsuccessful upgrade is, one might say, a sad best practice. There are whole manuals on how to prepare for a rollback, how to carry one out, and what to do if the rollback itself fails. An entire industry of cowardly behavior.

    The alternative is to understand the problem to the end. This is a very difficult path on which no one promises success, the effort expended will be incommensurate with the result, and the output will be only a little more understanding of what happened.

    The plot of the drama

    Webzilla's Instant Servers cloud. A routine update of a nova-compute host. New live image (we use PXE boot), a completed chef run. All is well. Suddenly, a complaint from a client: "one of the virtual machines behaves strangely; it seems to work, but as soon as real load begins, everything freezes." We migrate the client's instances to another node; the client's problem is solved. Our problem begins. We launch an instance on this node. The picture: ssh login to Cirros succeeds, to Ubuntu it hangs. ssh -v shows that everything stops at the "debug1: SSH2_MSG_KEXINIT sent" stage.

    All external debugging methods work: metadata is retrieved, the DHCP lease is renewed by the instance. There is a suspicion that the instance is not receiving the DHCP option carrying the MTU. Tcpdump shows that the option is being sent, but whether the instance accepts it is unknown.

    We really want to get into the instance. But on Cirros, where we can log in, the MTU is correct, and into Ubuntu, which we suspect of the MTU problem, we simply cannot get. But we really want to.

    If this is an MTU problem, then we have an unexpected helper: IPv6. Although we do not yet allocate public ("white") IPv6 addresses (sorry, it is not yet production-ready in OpenStack), link-local IPv6 works.

    We open two consoles: one on the network node, one on the compute node. On the first, we enter the network namespace:

    sudo stdbuf -o0 -e0 ip net exec qrouter-271cf1ec-7f94-4d0a-b4cd-048ab80b53dc /bin/bash

    (stdbuf disables buffering on ip net, so output appears on the screen in real time rather than with a delay; ip net exec executes a command in the given network namespace; bash gives us a shell).

    In the second console, on the compute node, we attach tcpdump to our Ubuntu instance's tap: tcpdump -ni tap87fd85b5-65.

    From inside the namespace, we send a request to the all-nodes link-local multicast (this article is not about IPv6, but the essence in brief: each node has an automatically generated IPv6 address starting with fe80::; in addition, each node listens on multicast addresses and responds to them - the list of multicast groups differs depending on the node's role, but every node, at minimum, answers on all-nodes, that is, on the address ff02::1). So, we do a multicast ping:

    ping6 -I qr-bda2b276-72 ff02::1
    PING ff02::1(ff02::1) from fe80::f816:3eff:fe0a:c6a8 qr-bda2b276-72: 56 data bytes
    64 bytes from fe80::f816:3eff:fe0a:c6a8: icmp_seq=1 ttl=64 time=0.040 ms
    64 bytes from fe80::f816:3eff:fe10:35e7: icmp_seq=1 ttl=64 time=0.923 ms (DUP!)
    64 bytes from fe80::f816:3eff:fe4a:8bca: icmp_seq=1 ttl=64 time=1.23 ms (DUP!)
    64 bytes from fe80::54e3:5eff:fe87:8637: icmp_seq=1 ttl=64 time=1.29 ms (DUP!)
    64 bytes from fe80::f816:3eff:feec:3eb: icmp_seq=1 ttl=255 time=1.43 ms (DUP!)
    64 bytes from fe80::f816:3eff:fe42:8927: icmp_seq=1 ttl=64 time=1.90 ms (DUP!)
    64 bytes from fe80::f816:3eff:fe62:e6b9: icmp_seq=1 ttl=64 time=2.01 ms (DUP!)
    64 bytes from fe80::f816:3eff:fe4d:53af: icmp_seq=1 ttl=64 time=3.66 ms (DUP!)

    The question arises: who is who? Logging into each one in turn would be inconvenient and slow.

    Next to it, in the adjacent window, tcpdump is listening on the interface of the instance we care about. And in it we see a reply from only one IP: the one we need. It turns out to be fe80::f816:3eff:feec:3eb.
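    As an aside, there is also a way to answer "who is who" without the tcpdump trick: these fe80:: addresses are EUI-64 derived, i.e. generated from the port's MAC address, so the MAC (which Neutron knows for every port) can be recovered from the address itself. A minimal sketch (the function name is mine, not part of any tool we used):

```python
import ipaddress

def linklocal_to_mac(addr: str) -> str:
    """Recover the MAC address from an EUI-64 derived link-local IPv6 address."""
    iid = ipaddress.IPv6Address(addr).packed[8:]   # low 64 bits: the interface ID
    if iid[3:5] != b"\xff\xfe":
        raise ValueError("interface ID is not EUI-64 derived")
    # drop the ff:fe filler and flip the universal/local bit back
    mac = bytes([iid[0] ^ 0x02]) + iid[1:3] + iid[5:]
    return ":".join(f"{b:02x}" for b in mac)

print(linklocal_to_mac("fe80::f816:3eff:feec:3eb"))  # fa:16:3e:ec:03:eb
```

    The fa:16:3e prefix is Neutron's default base MAC, which makes for a good sanity check.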

    Now we want to connect to this node via ssh. But anyone who tries the command ssh fe80::f816:3eff:feec:3eb is in for a surprise: "Invalid argument".

    The reason is that link-local addresses cannot be used "just like that"; they only make sense within a link (interface). But ssh has no option "use such-and-such outgoing IP/interface"! Fortunately, there is a way to specify the interface name right inside the IP address.

    We do ssh fe80::f816:3eff:feec:3eb%qr-bda2b276-72 and land on the virtual machine. Yes, yes, I understand your indignation and bewilderment (and if you feel none, either you are not a real geek, or you have many years of IPv6 work behind you). "fe80::f816:3eff:feec:3eb%qr-bda2b276-72" is an "IP address". I do not have language enough to convey the degree of sarcasm in those quotation marks. An IP address with a percent sign and an interface name inside it. It is interesting to imagine what happens if someone feeds a URL like http://[fe80::f816:3eff:feec:3eb%eth1]/secret.file, local to the web server, into some website...
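    The "Invalid argument" above is not ssh being capricious: a sockaddr_in6 carries a sin6_scope_id field, and connect() to a link-local address with scope_id 0 fails with EINVAL. The %interface suffix (RFC 4007 zone identifier) is the standard textual way to fill that field in. A quick illustration through Python's socket layer (using lo only because that interface exists everywhere; any interface name works):

```python
import socket

# getaddrinfo resolves the "%lo" zone suffix into a numeric sin6_scope_id;
# without it the scope_id stays 0 and connect() to an fe80:: address
# fails with EINVAL ("Invalid argument").
info = socket.getaddrinfo("fe80::f816:3eff:feec:3eb%lo", 22,
                          socket.AF_INET6, socket.SOCK_STREAM,
                          flags=socket.AI_NUMERICHOST)
addr, port, flowinfo, scope_id = info[0][4]
print(scope_id == socket.if_nametoindex("lo"))  # True
```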

    ... And we find ourselves on the virtual machine. Why does this work despite the broken MTU? Because IPv6 handles bad-MTU situations better than IPv4 does, thanks to mandatory PMTUD. So, we are on the virtual machine.

    I expect to see a wrong MTU value, planning to go through the cloud-init logs and figure out why. But here is the surprise: the MTU is correct. Oops.

    In the wilds of debugging

    Suddenly, the problem turns from a local and understandable one into a completely incomprehensible one. The MTU is correct, but packets are being dropped... And if you think about it carefully, the problem was not so simple from the very beginning: migrating the instance should not have changed its MTU.

    The agonizing debugging begins. Armed with tcpdump, ping, and two instances (plus the network namespace on the network node), we establish:

    • Locally, two instances on the same compute node ping each other with maximum-size pings.
    • An instance does not answer the network node (hereinafter: with maximum-size pings).
    • The network node pings instances on other compute nodes just fine.
    • Close attention to tcpdump inside the instance shows that when the network node pings it, the instance sees the pings and replies.

    Oops. A large packet arrives, but gets lost on the way back. I would say "asymmetric routing", but what the hell kind of routing is there when the two are in neighboring switch ports?

    Close attention to the reply: the reply is visible on the instance. The reply is visible on the tap. But the reply is not visible in the network namespace. And how are things with the MTU and packets between the network node and the compute node? (Internally I am already triumphant: found the problem!) And... the (large) pings go through just fine.

    What? (And a long puzzled pause.)

    What to do next is unclear. I come back to the original problem: this MTU is bad. What MTU is good? We begin to experiment. Bisection shows: the working value is exactly 14 bytes below the previous one. Minus fourteen bytes. Why on earth? After a software upgrade? I vimdiff the package lists and find the pleasant prospect of dealing with about 80 updated packages, including the kernel, OVS, libc, and a bunch more. So there are two ways to retreat: lower the MTU by 14 bytes, or roll back and tremble over every future update.

    Let me remind you that the client reported the problem, not monitoring. Since the MTU is a client setting, "large packets with the DF flag do not pass" is not quite an infrastructure problem. Or rather, not an infrastructure problem at all. That is, if it is caused not by the upgrade but by the coming solar eclipse and yesterday's rain, we will not even find out that the problem has returned until someone complains. Tremble over every update and fear an unknown that cannot be known in advance? Thank you, exactly the prospect I have been dreaming of all my professional life. And even if we lower the MTU, why by fourteen bytes? What if tomorrow it is twenty? Or oil drops to 45? How is one supposed to live with that?

    However, we check. Indeed: with a slightly lower MTU in the DHCP options, a rebooted instance works just fine. But this is not a solution. WHY?

    We start all over again. We return the old MTU and trace the packet with tcpdump once more: the reply is visible on the instance's interface, on the tap... We look at tcpdump on the node's physical network interface. A flood of small annoying noise, but with grep we can see that requests arrive (inside GRE) while the replies do not go back.


    At least we can see that it is lost somewhere along the way. But where? I decide to compare the behavior with a healthy node. But here is the trouble: on the "healthy" node tcpdump shows us packets. Thousands of them. Per millisecond. Welcome to the ten-gigabit-Ethernet era. Grep lets us catch something in this flood, but a normal dump is no longer possible, and the performance of this construction raises questions.

    We focus on the problem: I do not know how to filter this traffic with tcpdump. I know how to filter by source, destination, port, protocol, and so on, but how to filter a packet by an IP address inside GRE I do not know at all. Moreover, Google knows it rather poorly too.

    Until a certain point I ignored this gap, believing that repairs were more important, but the lack of knowledge started to bite very painfully. A colleague (kevit), whom I pulled into the question, dealt with it and sent the answer: tcpdump -i eth1 'proto gre and ( ip[58:4] = 0x0a050505 or ip[62:4] = 0x0a050505 )'.

    Wow. Hardcore hex offsets in my web-two-point-oh cloud singularity. Well, one can live with that.

    Unfortunately, the rule either did not work correctly or did not work at all. Picking up the idea, I brute-forced the required offsets: 54 and 58 for the source and destination IPs. Even though kevit had shown where he got his offsets from, and it looked damn convincing: IP header, GRE, IP header.

    An important achievement: I now had a tool for precision examination of individual packets in a multi-gigabit flood. We look at the packets... And still nothing is clear.

    Tcpdump is our friend, but wireshark is more convenient. (I know about tshark, but it is inconvenient too.) We make a packet dump (tcpdump -w dump, now we can), drag it to our own machine, and begin to sort things out. Out of general thoroughness, I decided to work out the offsets for myself. We open the dump in wireshark and see...

    We look at the sizes of the headers and find that the correct offset of the beginning of the inner IP packet is 42, not 46. Having attributed this discrepancy to someone's carelessness, I decided to continue figuring it out the next day, and went home.

    Somewhere near home it dawned on me. If the initial assumptions about the structure of the headers are wrong, then the overhead GRE adds when tunneling is different too.

    Ethernet header, VLAN tag, IP header, GRE header, encapsulated IP packet...

    Stop. But the picture shows a completely different layout. GRE in Neutron does not encapsulate IP packets at all; it encapsulates Ethernet frames. In other words, the initial assumptions about how much of the MTU GRE eats were wrong. GRE "takes" 14 bytes more than we expected.

    Neutron builds its overlay network over IP using GRE, and that overlay is an L2 network. Of course there are encapsulated Ethernet headers in there.

    That is, the MTU should have been 14 bytes lower. From the very beginning. When we planned the network and made assumptions about the MTU reduction due to GRE, we made a mistake. A fairly serious one, since it caused packet fragmentation.
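    The arithmetic, for the record. This is a sketch; the 4-byte GRE key field is my assumption, based on how Neutron uses tunnel keys to separate tenant networks:

```python
OUTER_IP  = 20     # outer IPv4 header
GRE       = 4 + 4  # base GRE header plus the key field
INNER_ETH = 14     # the encapsulated Ethernet header we forgot about

assumed = OUTER_IP + GRE              # overhead we planned for: 28 bytes
actual  = OUTER_IP + GRE + INNER_ETH  # overhead GRE really eats: 42 bytes
print(actual - assumed)               # 14 - the mysterious missing bytes

# The same layout explains both the wireshark observation and the
# brute-forced tcpdump offsets (tcpdump's ip[] counts from the start
# of the outer IP header):
inner_ip = OUTER_IP + GRE + INNER_ETH  # 42: where the inner IP header begins
print(inner_ip + 12, inner_ip + 16)    # 54 58 - source and destination IP
```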

    Okay, the mistake is clear. But why did things stop working after the update? From the investigation so far it is clear that the problem is related to the MTU, the miscounted GRE header, and fragmentation of GRE packets. So why did fragmented packets stop passing?

    A careful, close tcpdump gave the answer: GRE packets began to be sent with the DF (don't fragment) flag. The flag appeared only on those GRE packets that encapsulated IP packets carrying the DF flag themselves; that is, the flag was copied into the GRE packet from its payload.

    To be sure, I looked at the old nodes: they were fragmenting GRE. There was a main packet and a tail with 14 bytes of payload. What a blunder...

    It remains to find out why this started after the upgrade.

    Reading the documentation

    The most suspicious candidates for the regression were the linux and openvswitch packages. The README / changelog / NEWS clarified nothing in particular, but an inspection of the git history (here is the answer to why we need open source: to have access to the Documentation) revealed something extremely interesting:

    commit bf82d5560e38403b8b33a1a846b2fbf4ab891af8
    Author: Pravin B Shelar 
    Date: Mon Oct 13 02:02:44 2014 -0700
        datapath: compat: Fix compilation 3.11
        Kernel 3.11 is only kernel where GRE APIs are available but
        not vxlan. Add check for vxlan xmit to detect this case.

    The patch itself is not particularly interesting and has no direct bearing on the matter, but it gives a hint: there is a GRE API in the kernel. And we had just upgraded from 3.8 to 3.13. We google it in Bing... We find the patch in openvswitch (the datapath module), in the kernel: . In other words, as soon as our kernel starts providing GRE services, the openvswitch kernel module hands GRE processing over to the ip_gre kernel module. We study the ip_gre.c code; thanks for the comments in it - yes, we all "love" Cisco.

    Here is the coveted line:

    static int ipgre_fill_info(struct sk_buff *skb, const struct net_device *dev)
    {
            struct ip_tunnel *t = netdev_priv(dev);
            struct ip_tunnel_parm *p = &t->parms;

            if (nla_put_u32(skb, IFLA_GRE_LINK, p->link) ||
                nla_put_u8(skb, IFLA_GRE_PMTUDISC,
                           !!(p->iph.frag_off & htons(IP_DF))))

    In other words, the kernel copies IP_DF from the header of the encapsulated packet.
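    For the curious, the check the kernel performs on frag_off can be reproduced on a raw header. A sketch with a hand-built minimal IPv4 header (the addresses and field values are throwaways for illustration):

```python
import struct

IP_DF = 0x4000  # "don't fragment" bit within the IPv4 flags/fragment-offset field

def df_set(ip_header: bytes) -> bool:
    # the flags + fragment-offset word occupies bytes 6-7 of the IPv4 header
    frag_off = struct.unpack("!H", ip_header[6:8])[0]
    return bool(frag_off & IP_DF)

# minimal 20-byte IPv4 header (version/IHL 0x45, proto 47 = GRE) with DF set
hdr = bytes([0x45, 0, 0, 20, 0, 0, 0x40, 0, 64, 47, 0, 0]) + bytes(8)
print(df_set(hdr))                          # True
print(df_set(hdr[:6] + b"\x00" + hdr[7:]))  # False: same header, DF cleared
```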

    (A sudden interesting offtopic: Linux also copies the TTL from the original packet; that is, the GRE tunnel "inherits" the TTL from the encapsulated packet.)

    The dry summary

    The plane crashed because the Earth turned in the direction of flight.

    During the initial setup of the installation, we set the MTU for virtual machines based on an erroneous assumption. Thanks to the fragmentation mechanism, we got away with only a slight performance degradation. After the kernel upgrade from 3.8 to 3.13, OVS switched to the ip_gre.c kernel module, which copies the don't-fragment flag from the original IP packet. Large packets that no longer "fit" into the MTU after the headers were prepended were no longer fragmented, but dropped. Because it was the GRE packet being dropped, and not the packet enclosed in it, neither party to the TCP session received ICMP "fragmentation needed" notifications, and so could not adapt to the smaller MTU. IPv6, in turn, never expects en-route fragmentation (there is none in IPv6) and handled the loss of large packets the right way: by reducing the packet size.

    Who is to blame and what to do?

    We are to blame: we set the MTU incorrectly. A barely noticeable change in software behavior turned this mistake into something that disrupted the operation of IPv4.

    What to do? We corrected the MTU in the dnsmasq-neutron.conf settings (the dhcp-option-force=26, option), asked the clients to renew their DHCP leases (picking up the new option along the way), and the problem was completely resolved.
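    For reference, the relevant dnsmasq-neutron.conf fragment looks roughly like this. The value 1458 here is illustrative, not the number from our installation: a 1500-byte underlay minus the 42 bytes of GRE overhead computed above; substitute your own underlay MTU.

```
# dnsmasq-neutron.conf
# DHCP option 26 = interface MTU; force-push it to instances.
# Example: 1458 = 1500 (underlay) - 42 (outer IP + GRE with key + inner Ethernet)
dhcp-option-force=26,1458
```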

    Can this be detected proactively by monitoring? Honestly, I do not see reasonable options: the diagnostics are too delicate and complicated, and they require too much cooperation from client instances (which we cannot rely on: what if someone, for their own purposes, configures something strange with iptables?).

    Lyrical conclusion

    Instead of cowardly rolling back to the previous software version and adopting the position "it works, don't touch it" and "I don't know what will change if we update, so we will never update again", about two person-days of debugging not only resolved the visible local regression, but also found and fixed an error in the existing configuration that had been inflating network overhead. Beyond eliminating the problem, we significantly improved our understanding of the technologies in use and developed a technique for debugging network problems (filtering traffic in tcpdump by fields inside GRE).

    Comments are power

    Suddenly, in the comments, ildarz suggested a great idea for how to catch this kind of thing: look at the IP statistics and react to a growing number of fragments (/proc/net/snmp, netstat -s). I have not yet studied this question, but it looks very promising.
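    A sketch of what such a check could look like. The header/value line-pair layout is the real /proc/net/snmp format, but the field list in the sample is trimmed for brevity, and the function name and the alerting idea are mine:

```python
def ip_counters(snmp_text: str) -> dict:
    """Parse the Ip: header/value line pair of /proc/net/snmp into a dict."""
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Ip:")]
    return dict(zip(rows[0], map(int, rows[1])))

# trimmed sample; the real file has more fields and more protocol sections
sample = ("Ip: Forwarding DefaultTTL FragOKs FragFails FragCreates\n"
          "Ip: 1 64 120 0 9000\n")
c = ip_counters(sample)
print(c["FragCreates"], c["FragFails"])  # 9000 0
# monitoring idea: alert when FragCreates grows abnormally between two samples
```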
