The architecture of the network load balancer in Yandex.Cloud


    Hi, I’m Sergey Elantsev, and I develop the network load balancer in Yandex.Cloud. Previously, I led the development of the L7 balancer for the Yandex portal; my colleagues joke that no matter what I do, I end up with a balancer. I will tell Habr readers how to manage load in a cloud platform, how we see the ideal tool for this, and how we are moving toward building that tool.

    First, we introduce some terms:

    • VIP (Virtual IP) - balancer IP address
    • Server, backend, instance - a virtual machine with an application running
    • RIP (Real IP) - server IP address
    • Healthcheck - server availability check
    • Availability Zone, AZ - isolated infrastructure in a data center
    • Region - a union of several AZs

    Load balancers solve three main tasks: they perform the balancing itself, improve the fault tolerance of the service, and simplify its scaling. Fault tolerance is ensured by automatic traffic control: the balancer monitors the state of the application and excludes from balancing any instances that fail the liveness check. Scaling is ensured by distributing the load uniformly across instances and by updating the list of instances on the fly. If the balancing is not uniform enough, some instances will receive more load than they can handle, and the service will become less reliable.

    Load balancers are often classified by the OSI protocol layer they operate at. The Cloud balancer operates at the TCP level, which corresponds to the fourth layer, L4.

    Let's move on to a review of the Cloud balancer architecture. We will gradually increase the level of detail. We divide the balancer components into three classes. The config plane class is responsible for user interaction and stores the target state of the system. The control plane stores the current state of the system and manages the systems of the data plane class, which are directly responsible for delivering traffic from clients to your instances.

    Data plane


    Traffic first arrives at expensive devices called border routers. To increase fault tolerance, several such devices operate simultaneously in each data center. From there, traffic goes to the balancers, which announce an anycast IP address to clients via BGP from all AZs. 



    Traffic is distributed via ECMP, a routing strategy under which there can be several equally good routes to a destination (in our case, the destination IP address), and packets can be sent along any of them. We also support operation in several availability zones using the following scheme: we announce the address in each zone, traffic enters the nearest one and does not leave it. Later in the post we will examine in more detail what happens to the traffic.
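    The key property of ECMP is that the route is chosen by hashing the flow identifier, so all packets of one connection take the same path. A minimal sketch in Go (all names here are illustrative, not the actual router implementation):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickNextHop sketches ECMP next-hop selection: hash the flow
// identifier and use the result to pick one of several equally good
// routes. Real routers hash header fields in hardware.
func pickNextHop(flow string, nextHops []string) string {
	h := fnv.New32a()
	h.Write([]byte(flow))
	return nextHops[h.Sum32()%uint32(len(nextHops))]
}

func main() {
	hops := []string{"balancer-1", "balancer-2", "balancer-3"}
	// Packets of the same flow always hash to the same next hop,
	// so a TCP connection is not split across balancers.
	fmt.Println(pickNextHop("10.0.0.1:34567->198.51.100.1:80/tcp", hops))
}
```

    Because the choice depends only on the flow hash and the set of routes, it is stateless: any router holding the same route list makes the same decision.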

    Config plane

     
    A key component of the config plane is the API, through which the basic operations on balancers are performed: creating and deleting them, changing the composition of instances, obtaining healthcheck results, and so on. On the one hand, it is a REST API; on the other, inside the Cloud we very often use gRPC, so we "translate" REST into gRPC and work only with gRPC after that. Any request leads to the creation of a series of asynchronous idempotent tasks that are executed on a shared pool of Yandex.Cloud workers. Tasks are written so that they can be suspended at any time and then restarted. This provides scalability, repeatability, and logging of operations.
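    The idea of a suspendable, restartable task can be sketched as follows (a simplified illustration; the real workers persist progress in a database, and all names here are ours):

```go
package main

import "fmt"

// A resumable task records which step it has completed, so a worker
// that picks it up after an interruption re-runs from that point.
type task struct {
	steps []func() error
	done  int // persisted progress marker
}

// run executes the remaining steps. Each step is idempotent, so
// re-running a step after a crash between "execute" and "persist
// progress" is safe.
func (t *task) run() error {
	for ; t.done < len(t.steps); t.done++ {
		if err := t.steps[t.done](); err != nil {
			// The task is suspended; t.done is saved and the task
			// is retried later from this step.
			return err
		}
	}
	return nil
}

func main() {
	t := &task{steps: []func() error{
		func() error { fmt.Println("allocate VIP"); return nil },
		func() error { fmt.Println("configure backends"); return nil },
	}}
	_ = t.run()
	fmt.Println("completed steps:", t.done)
}
```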



    As a result, a task from the API makes a request to the balancer service controller, which is written in Go. It can add and remove balancers and change the composition of backends and settings. 



    The service stores its state in Yandex Database, a distributed managed database that you will soon be able to use too. In Yandex.Cloud, as we have already said, the dogfooding concept applies: if we use our own services ourselves, our customers will also be happy to use them. Yandex Database is an example of this concept in action. We store all our data in YDB without having to think about maintaining and scaling the database: these problems are solved for us, and we use the database as a service.

    Back to the balancer controller. Its task is to save information about the balancer and to send tasks for checking the readiness of virtual machines to the healthcheck controller.

    Healthcheck controller


    It receives requests to change check rules, saves them in YDB, distributes tasks among the healthcheck nodes, and aggregates the results, which are then saved to the database and sent to the loadbalancer controller. That controller, in turn, sends a request to change the composition of the cluster in the data plane to loadbalancer-node, which I will discuss below.



    Let's talk more about healthchecks. They can be divided into several classes with different success criteria. TCP checks need to successfully establish a connection within a fixed time. HTTP checks require both a successful connection and a response with status code 200.

    Checks also differ in how they act: they are either active or passive. Passive checks simply observe what happens to the traffic without taking any special action. This does not work very well on L4, because it depends on the logic of higher-level protocols: at L4 there is no information about how long an operation took or whether the connection was good or bad. Active checks require the balancer to send requests to each server instance.

    Most load balancers perform liveness checks themselves. At Cloud, we decided to separate these parts of the system to increase scalability. This approach allows us to increase the number of balancers while keeping the number of healthcheck requests to the service constant. Checks are performed by separate healthcheck nodes, across which the check targets are sharded and replicated. Checks cannot be performed from a single host, since it may fail, and then we would lose the state of the instances it was checking. So we check every instance from at least three healthcheck nodes, and shard the check targets between the nodes using consistent hashing algorithms.
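    One simple way to deterministically assign a check target to three nodes is rendezvous (highest-random-weight) hashing, a close relative of the ring-based consistent hashing described later in this post. This is only a sketch of the idea; the actual Cloud sharding algorithm may differ:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// pickCheckers assigns a check target to n healthcheck nodes: each
// node gets a score from hashing (node, target), and the n
// highest-scoring nodes win. Adding or removing one node moves only
// the targets that node won or loses.
func pickCheckers(target string, nodes []string, n int) []string {
	type scored struct {
		node  string
		score uint32
	}
	s := make([]scored, 0, len(nodes))
	for _, node := range nodes {
		h := fnv.New32a()
		h.Write([]byte(node + "/" + target))
		s = append(s, scored{node, h.Sum32()})
	}
	sort.Slice(s, func(i, j int) bool { return s[i].score > s[j].score })
	out := make([]string, 0, n)
	for i := 0; i < n && i < len(s); i++ {
		out = append(out, s[i].node)
	}
	return out
}

func main() {
	nodes := []string{"hc-1", "hc-2", "hc-3", "hc-4", "hc-5"}
	// Three independent nodes check this target.
	fmt.Println(pickCheckers("10.0.0.7:80", nodes, 3))
}
```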



    Separating balancing and healthchecking can lead to a problem. If a healthcheck node makes requests to an instance while bypassing the balancer (which does not currently serve traffic), then a strange situation arises: the resource seems to be alive, yet traffic will not reach it. We solve this problem by guaranteeing that healthcheck traffic passes through the balancers. In other words, the path of packets carrying client traffic and packets carrying healthchecks differs minimally: in both cases, packets reach the balancers, which deliver them to the target resources.

    The difference is that clients make requests to VIPs, while healthchecks address each individual RIP. An interesting problem arises here: we give our users the ability to create resources in gray (private) IP networks. Imagine two different cloud owners who have hidden their services behind balancers. Each of them has resources in the 10.0.0.1/24 subnet, with exactly the same addresses. You need a way to distinguish them, and here you have to dive into the structure of the Yandex.Cloud virtual network. The details are in the video from the about:cloud event; what matters for us now is that the network is multi-layered and uses tunnels that can be distinguished by subnet id.

    Healthcheck nodes reach the balancers using so-called quasi-IPv6 addresses. A quasi-address is an IPv6 address with an IPv4 address and the user's subnet id embedded inside it. Such traffic reaches the balancer, which extracts the resource's IPv4 address from it, rewrites IPv6 to IPv4, and sends the packet into the user's network.

    Reverse traffic follows the same path: the balancer sees that the destination is the gray network of the healthcheckers and converts IPv4 back to IPv6.

    VPP - the heart of the data plane


    The balancer is built on Vector Packet Processing (VPP), a framework from Cisco for batch processing of network traffic. In our case, the framework runs on top of a user-space network device management library, the Data Plane Development Kit (DPDK). This provides high packet processing performance: far fewer interrupts reach the kernel, and there are no context switches between kernel space and user space. 

    VPP goes even further and squeezes additional performance out of the system by combining packets into batches. The performance gains come from aggressive use of the caches of modern processors. Both the data caches (packets are processed in "vectors", so the data sits close together) and the instruction caches are exploited: in VPP, packet processing follows a graph whose nodes are functions that each perform one task.

    For example, the processing of IP packets in VPP takes place in the following order: first, packet headers are parsed in the parsing node, and then they are sent to the node that forwards the packets further according to the routing tables.

    A bit of hardcore. The VPP authors do not compromise on the use of processor caches, so typical vector packet processing code contains manual vectorization: there is a processing loop that handles the case "we have four packets in the queue", then the same for two, then for one. Prefetch instructions are often used to load data into caches ahead of the following iterations.

    n_left_from = frame->n_vectors;
    while (n_left_from > 0)
    {
        vlib_get_next_frame (vm, node, next_index, to_next, n_left_to_next);
        // ...
        while (n_left_from >= 4 && n_left_to_next >= 2)
        {
            // processing multiple packets at once
            u32 next0 = SAMPLE_NEXT_INTERFACE_OUTPUT;
            u32 next1 = SAMPLE_NEXT_INTERFACE_OUTPUT;
            // ...
            /* Prefetch next iteration. */
            {
                vlib_buffer_t *p2, *p3;
                p2 = vlib_get_buffer (vm, from[2]);
                p3 = vlib_get_buffer (vm, from[3]);
                vlib_prefetch_buffer_header (p2, LOAD);
                vlib_prefetch_buffer_header (p3, LOAD);
                CLIB_PREFETCH (p2->data, CLIB_CACHE_LINE_BYTES, STORE);
                CLIB_PREFETCH (p3->data, CLIB_CACHE_LINE_BYTES, STORE);
            }
            // actually process data
            /* verify speculative enqueues, maybe switch current next frame */
            vlib_validate_buffer_enqueue_x2 (vm, node, next_index,
                    to_next, n_left_to_next,
                    bi0, bi1, next0, next1);
        }
        while (n_left_from > 0 && n_left_to_next > 0)
        {
            // processing packets by one
        }
        // processed batch
        vlib_put_next_frame (vm, node, next_index, n_left_to_next);
    }

    So, healthcheck traffic arrives at VPP over IPv6, and VPP converts it to IPv4. This is done by a graph node that we call algorithmic NAT. The reverse traffic (the conversion from IPv4 back to IPv6) goes through the same algorithmic NAT node.



    Direct traffic from the balancer's clients goes through the graph nodes that perform the balancing itself. 



    The first node handles sticky sessions. It stores a hash of the 5-tuple for established sessions. The 5-tuple includes the address and port of the client that sends the data, the address and port of the resource that receives the traffic, and the network protocol. 

    The 5-tuple hash lets us perform less computation in the subsequent consistent hashing node and also handle changes to the list of resources behind the balancer better. When a packet for which there is no session arrives at the balancer, it is sent to the consistent hashing node. This is where balancing happens using consistent hashing: we select a resource from the list of available "live" resources. The packets are then sent to the NAT node, which actually replaces the destination address and recalculates the checksums. As you can see, we follow the rules of VPP: like with like, grouping similar computations to increase the efficiency of the processor caches.

    Consistent Hashing


    Why did we choose it, and what is it all about? To begin with, consider the previous task: selecting a resource from the list. 



    With non-consistent hashing, a hash is computed from the incoming packet, and a resource is selected from the list by taking the remainder of dividing this hash by the number of resources. As long as the list stays unchanged, this scheme works well: we always send packets with the same 5-tuple to the same instance. But if, for example, some resource stops responding to healthchecks, the choice changes for a significant share of the hashes. The client's TCP connections break: a packet that previously went to instance A may start arriving at instance B, which knows nothing about the session that packet belongs to.

    Consistent hashing solves the described problem. The easiest way to explain the concept is this: imagine a ring onto which you distribute resources by hash (for example, by IP:port). Selecting a resource is then a rotation of the wheel by an angle determined by the hash of the packet.



    This minimizes the redistribution of traffic when the composition of resources changes. Deleting a resource affects only the part of the consistent hash ring on which that resource was located. Adding a resource also changes the distribution, but we have the sticky sessions node, which lets us avoid switching already established sessions to new resources.

    We have examined what happens to direct traffic between the balancer and the resources. Now let's deal with reverse traffic. It follows the same pattern as the check traffic: through algorithmic NAT, that is, through reverse NAT 44 for client traffic and through NAT 46 for healthcheck traffic. We adhere to our own scheme: we unify healthcheck traffic and real user traffic.

    Loadbalancer-node and components assembly


    The composition of balancers and resources is delivered to VPP by a local service, loadbalancer-node. It subscribes to the stream of events from the loadbalancer-controller and can compute the difference between the current VPP state and the target state received from the controller. This gives us a closed loop: events from the API reach the balancer controller, which gives the healthcheck controller tasks to check the "liveness" of resources. That controller, in turn, assigns tasks to the healthcheck nodes and aggregates the results, after which it sends them back to the balancer controller. The loadbalancer-node subscribes to events from the controller and changes the state of VPP. In such a system, each service knows only what it needs to know about neighboring services. The number of connections is limited, and we can operate and scale the different segments independently.
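    Computing the difference between the current and the target state is the heart of such a reconciliation loop. A minimal sketch (the real loadbalancer-node diffs full VPP configuration, not just a set of backends; names are illustrative):

```go
package main

import "fmt"

// diff computes which backends must be added to and removed from the
// current state to reach the target state; applying exactly this
// difference makes the data plane converge.
func diff(current, target map[string]bool) (add, remove []string) {
	for b := range target {
		if !current[b] {
			add = append(add, b)
		}
	}
	for b := range current {
		if !target[b] {
			remove = append(remove, b)
		}
	}
	return add, remove
}

func main() {
	current := map[string]bool{"10.0.0.1:80": true, "10.0.0.2:80": true}
	target := map[string]bool{"10.0.0.2:80": true, "10.0.0.3:80": true}
	add, remove := diff(current, target)
	fmt.Println("add:", add, "remove:", remove)
}
```

    A nice property of diff-based reconciliation is idempotence: re-applying the same events produces an empty diff, so a restarted loadbalancer-node does no redundant work.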



    Which problems we avoided


    All of our control plane services are written in Go, scale well, and are reliable. Go has many open-source libraries for building distributed systems. We actively use gRPC; all components include an open-source implementation of service discovery, so our services monitor each other's health and can change their composition dynamically, and we have wired this into gRPC balancing. For metrics we also use an open-source solution. In the data plane we got decent performance and a large headroom of resources: it turned out to be very difficult to assemble a test stand where the bottleneck would be the performance of VPP rather than the physical network card.

    Problems and Solutions


    What didn't work so well? Go has automatic memory management, but memory leaks still happen. The easiest way to cause them is to launch goroutines and forget to terminate them. Conclusion: monitor the memory consumption of your Go programs. Often a good indicator is the number of goroutines. There is an upside to this story: in Go it is easy to obtain runtime data on memory consumption, the number of running goroutines, and many other parameters.
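    Obtaining those two indicators takes just a few lines of standard library code, which is what makes this kind of monitoring cheap to add:

```go
package main

import (
	"fmt"
	"runtime"
)

// snapshot reports the runtime indicators mentioned above: the number
// of live goroutines is often the first signal of a leak, and the
// heap allocation figure shows memory consumption.
func snapshot() (goroutines int, heapAlloc uint64) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return runtime.NumGoroutine(), m.HeapAlloc
}

func main() {
	g, heap := snapshot()
	fmt.Printf("goroutines=%d heapAlloc=%d bytes\n", g, heap)
}
```

    In a real service these values would be exported as metrics and alerted on when the goroutine count grows without bound.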

    In addition, Go may not be the best choice for functional tests. They are quite verbose, and the standard "run everything in the package in CI" approach does not suit them well. The thing is that functional tests are more resource-demanding and involve real timeouts. Because of this, tests may fail simply because the CPU is busy with unit tests. Conclusion: where possible, run "heavy" tests separately from unit tests. 

    A microservice event architecture is more complicated than a monolith: collecting logs from dozens of different machines is not very convenient. Conclusion: if you are building microservices, think about tracing right away.

    Our plans


    We will launch the internal balancer and an IPv6 balancer, add support for Kubernetes scenarios, continue to shard our services (currently only healthcheck-node and healthcheck-ctrl are sharded), add new healthchecks, and implement smart check aggregation. We are also considering making our services even more independent, so that they communicate not directly with each other but through a message queue. An SQS-compatible service, Yandex Message Queue, has recently appeared in the Cloud.

    Yandex Load Balancer has recently been publicly released. Study the service documentation, manage balancers in whatever way is convenient for you, and increase the fault tolerance of your projects!
