MPLS is everywhere. How the Yandex.Cloud network infrastructure works

    The post was prepared by: Alexander Virilin xscrew - author, head of the network infrastructure service, Leonid Klyuyev - editor

    We continue to introduce you to the internals of Yandex.Cloud. Today we'll talk about networks: how the network infrastructure works, why it uses MPLS, a paradigm that is unpopular in data centers, what other difficult decisions we had to make while building the cloud network, how we manage it, and what kind of monitoring we use.

    The network in the Cloud consists of three layers. The bottom layer is the infrastructure already mentioned: the physical hardware network inside data centers, between data centers, and at the points of connection to external networks. A virtual network is built on top of the network infrastructure, and network services are built on top of the virtual network. This structure is not monolithic: the layers intersect, and the virtual network and network services interact directly with the network infrastructure. Since the virtual network is often called the overlay, we usually call the network infrastructure the underlay.

    The Cloud infrastructure is currently based in the Central region of Russia and includes three availability zones, that is, three geographically distributed, independent data centers. Independent here means that they do not depend on each other in terms of networking, engineering and electrical systems, and so on.

    A few words about the characteristics. The data centers are located such that the round-trip time (RTT) between them is always 6–7 ms. The total capacity of the channels has already exceeded 10 terabits per second and is constantly growing, because Yandex has its own fiber-optic network between the zones. Since we do not lease communication channels, we can quickly increase the bandwidth between the DCs: each of them uses wavelength-division multiplexing equipment.

    Here is the most schematic representation of the zones:

    The reality, in turn, is slightly different:

    Here is the current Yandex backbone network in the region. All Yandex services run on top of it, and part of the network is used by the Cloud. (This picture is for internal use, so service information is deliberately hidden. Nevertheless, the number of nodes and connections can be estimated.) The decision to use the backbone network was logical: we did not have to invent anything and could reuse the existing infrastructure, which had been hard-won over years of development.

    What is the difference between the first picture and the second? First of all, the availability zones are not connected directly: technical sites are located between them. These sites contain no server equipment, only the network devices that provide connectivity. The points of presence, where Yandex and the Cloud connect to the outside world, are attached to the technical sites. All points of presence serve the entire region. It is also important to note that, from the point of view of external access from the Internet, all Cloud availability zones are equivalent. In other words, they provide the same connectivity: the same speed and throughput, and equally low latency.

    In addition, the points of presence host equipment to which customers can connect over a guaranteed channel if they have on-premise resources and want to extend their local infrastructure with cloud capacity. This can be done through partners or on your own.

    The core network is used by the Cloud as an MPLS transport.


    Multiprotocol Label Switching (MPLS) is a technology widely used in our industry. For example, when a packet travels between availability zones, or between an availability zone and the Internet, the transit equipment looks only at the top label and does not care what lies beneath it. In this way, MPLS lets us hide the complexity of the Cloud from the transport layer. In general, we in the Cloud are very fond of MPLS. We even made it part of the lowest level and use it directly in the switching fabric inside the data center:

    (Actually, there are a lot of parallel links between Leaf switches and Spines.)
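    The "top label only" behaviour of transit equipment described above can be illustrated with a minimal sketch. The `Packet` class, the `transit_forward` function, and all label values and device names are hypothetical and chosen for illustration; this is not the forwarding code of any real switch.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    labels: list   # MPLS label stack, top label first (hypothetical values)
    payload: str   # opaque to transit equipment


def transit_forward(packet, lfib):
    """Forward a packet the way a transit label switch router would.

    `lfib` maps an incoming top label to (outgoing label, next hop).
    Only the top label is read and swapped; the labels underneath and
    the payload are copied through untouched.
    """
    top = packet.labels[0]
    out_label, next_hop = lfib[top]
    swapped = Packet([out_label] + packet.labels[1:], packet.payload)
    return swapped, next_hop
```

    In this sketch, a packet with the stack `[100, 7001]` that hits the entry `{100: (200, "spine-1")}` leaves with the stack `[200, 7001]`: the inner label, which here stands in for the Cloud-specific part, is never inspected.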

    Why MPLS?

    Admittedly, MPLS is rarely found in data center networks. Completely different technologies are usually used there.

    We use MPLS for several reasons. First, we found it convenient to unify the control plane and data plane technologies. That is, instead of one set of protocols in the data center network, another set in the backbone network, and glue between the two, there is a single MPLS. This unified the technology stack and reduced the complexity of the network.

    Second, the Cloud uses various network appliances, such as Cloud Gateway and Network Load Balancer. They need to communicate with each other and carry traffic to and from the Internet. These appliances scale horizontally as load grows, and since the Cloud follows a hyperconverged model, they can be launched anywhere in the data center network, that is, in a common resource pool.

    Thus, these appliances can start behind any port of the rack switch to which their server is connected and begin communicating over MPLS with the rest of the infrastructure. The only problem in building such an architecture was signaling.


    The classic MPLS protocol stack is quite complex. This, by the way, is one of the reasons why MPLS has not become widespread in data center networks.

    We, in turn, use neither an IGP (Interior Gateway Protocol), nor LDP (Label Distribution Protocol), nor any other label distribution protocol. Only BGP (Border Gateway Protocol) Labeled Unicast is used. Each appliance, which runs, for example, as a virtual machine, establishes a BGP session with the rack Leaf switch.

    The BGP session is established to an address known in advance, so there is no need to reconfigure the switch automatically each time an appliance starts. All switches are preconfigured and configured uniformly.

    Within the BGP session, each appliance advertises its own loopback and receives the loopbacks of the other devices with which it needs to exchange traffic. Examples of such devices are several types of route reflectors, border routers, and other appliances. As a result, the devices learn how to reach each other. A Label Switched Path (LSP) is built from the Cloud Gateway through the Leaf switch, the Spine switch, and the network to the border router. The switches are L3 switches that act as Label Switch Routers and know nothing about the complexity around them.
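    To make the idea of a hop-by-hop label switched path more concrete, here is a simplified sketch. It assumes that every device on the path allocates its own local label for the advertised loopback and passes that label to its upstream neighbour, which is one common way labeled routes propagate; the article does not describe the exact label allocation scheme, so the function `build_lsp`, the label numbers, and the device names are all hypothetical.

```python
import itertools


def build_lsp(path, prefix, first_label=1000):
    """Build a per-hop label table for `prefix` along `path`.

    `path` lists devices from ingress to egress. Labels are handed out
    from egress towards ingress: each device allocates a local label and
    advertises it upstream, so a device's outgoing label is always the
    incoming label of the next device on the path.
    """
    labels = itertools.count(first_label)
    advertised = None   # label received from the downstream neighbour
    hops = []
    for position, device in enumerate(reversed(path)):
        is_ingress = position == len(path) - 1
        # The ingress only pushes a label, so it needs no incoming one.
        local = None if is_ingress else next(labels)
        hops.append({
            "device": device,
            "prefix": prefix,
            "in_label": local,
            "out_label": advertised,   # None at the egress: nothing to push
        })
        advertised = local
    return list(reversed(hops))
```

    Calling `build_lsp(["cloud-gateway", "leaf", "spine", "border-router"], "10.0.0.1/32")` yields one entry per device, and each entry's `out_label` equals the `in_label` of the next device, which is the property that lets the switches forward on labels alone.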

    Among other things, having MPLS at every level of our network has allowed us to follow the "Eat your own dogfood" principle.

    Eat your own dogfood

    From a network point of view, this concept means that we live in the same infrastructure that we provide to the user. Racks in the availability zones are shown schematically here:

    A Cloud host carries the user's load and contains the user's virtual machines. The host right next to it in the rack may carry infrastructure load from the network point of view, including route reflectors, management servers, monitoring servers, and so on.

    Why was this done? There was a temptation to run route reflectors and all other infrastructure elements in a separate fault-tolerant segment. Then, if the user segment broke down somewhere in the data center, the infrastructure servers would continue to manage the entire network infrastructure. But this approach seemed flawed to us: if we do not trust our own infrastructure, how can we provide it to our customers? After all, the entire Cloud, all virtual networks, and all user and cloud services run on top of it.

    Therefore, we abandoned the idea of a separate segment. Our infrastructure elements run in the same network topology and with the same network connectivity. Naturally, they run in three instances, just as our customers launch their services in the Cloud.

    IP/MPLS fabric

    Here is an example diagram of one of the availability zones:

    Each availability zone has about five modules, and each module has about a hundred racks. Leaf switches are the rack switches; within a module they are connected by the Spine level, and connectivity between modules is provided by the network Interconnect. This is the next level, which includes the so-called Super-Spines and the Edge switches that connect the availability zones to each other. We deliberately abandoned L2: everything here is L3 IP/MPLS connectivity. BGP is used to distribute routing information.

    In reality, there are many more parallel connections than in the picture. Such a large number of ECMP (equal-cost multi-path) connections imposes special requirements on monitoring. In addition, the equipment has limits that are unexpected at first glance, for example on the number of ECMP groups.
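    As an illustration of why ECMP matters for both monitoring and hardware limits, here is a minimal sketch. It assumes flow-hash based path selection and assumes that prefixes sharing the same set of next hops share one ECMP group, which is how many switches behave but is not something the article states about this particular equipment. The function names and the hash choice are hypothetical.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Choose one of several equal-cost next hops for a flow.

    `flow` is a tuple such as (src_ip, dst_ip, protocol, src_port,
    dst_port). Hashing it keeps every packet of a flow on the same link,
    which also means one probe flow exercises only one of the parallel
    paths.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


def count_ecmp_groups(routes):
    """Count the distinct multi-member next-hop sets in a routing table.

    `routes` maps a prefix to its next hops. Under the assumption above,
    each distinct set of two or more next hops consumes one ECMP group.
    """
    groups = set()
    for next_hops in routes.values():
        members = frozenset(next_hops)
        if len(members) > 1:
            groups.add(members)
    return len(groups)
```

    Because a single flow always hashes to the same link, a monitoring system that wants to cover every parallel path has to vary the flow fields (for example, the source port) across its probes.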

    Server connection

    Thanks to substantial investment, Yandex builds its services so that the failure of a single server, a server rack, a module, or even an entire data center never brings a service to a complete stop. If we have a network problem, say a rack switch fails, external users never see it.

    Yandex.Cloud is a special case. We cannot dictate to customers how to build their own services, so we decided to eliminate this possible single point of failure. Therefore, all servers in the Cloud are connected to two rack switches.

    We also do not use any L2 redundancy protocols; we went straight to L3 with BGP, again for the sake of protocol unification. This connection gives each service both IPv4 and IPv6 connectivity: some services work over IPv4, and some over IPv6.

    Physically, each server is connected by two 25-gigabit interfaces. Here is a photo from the data center:

    Here you can see two rack switches with 100-gigabit ports. The breakout cables fanning out from them are visible: each one splits a 100-gigabit switch port into four 25-gigabit ports for the servers. We call these cables "hydra".

    Infrastructure management

    The Cloud network infrastructure does not contain any proprietary management solutions: all systems are either open source with customization for the Cloud, or completely self-written.

    How is this infrastructure managed? Logging in to a network device and making changes by hand is not exactly forbidden in the Cloud, but it is strongly discouraged. There is the current state of the system, and we need to apply changes to reach a new target state. Running a script across all the devices and changing something in their configuration is not the way to do it. Instead, we change the templates and the single source of truth system, and commit the change to the version control system. This is very convenient, because you can always roll back, look at the history, find out who is responsible for a commit, and so on.

    Once the changes are made, configs are generated and rolled out to a lab test topology. From a network perspective, this is a small cloud that fully replicates production. We see right away whether the changes break anything: first from monitoring, and second from the feedback of our internal users.

    If monitoring says that everything is calm, we continue the rollout, but apply the change only to part of the topology (two or more availability zones must never be allowed to break for the same reason). Meanwhile, we keep a close eye on monitoring. This is a rather complicated process, which we discuss below.

    Once we are sure that everything is fine, we apply the change to all of production. At any moment we can roll back to the previous state of the network, then quickly track down and fix the problem.
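    The rollout sequence described above (lab topology first, then part of production, then everything, with a rollback available at any point) can be sketched as a small loop. This is only an illustration under stated assumptions: the stage names and the `apply`, `healthy`, and `rollback` callbacks are hypothetical placeholders, and the article does not say how the real tooling is implemented.

```python
def staged_rollout(stages, apply, healthy, rollback):
    """Apply a change stage by stage and stop at the first unhealthy one.

    `stages` is an ordered list such as ["lab", "partial", "production"].
    `apply(stage)` pushes the generated configs to that stage,
    `healthy(stage)` asks monitoring whether the stage looks fine, and
    `rollback(stage)` restores the previous configs there.
    Returns (succeeded, stages_that_were_touched).
    """
    touched = []
    for stage in stages:
        apply(stage)
        touched.append(stage)
        if not healthy(stage):
            # Undo in reverse order so the most recent change goes first.
            for done in reversed(touched):
                rollback(done)
            return False, touched
    return True, touched
```

    Keeping the health check as a callback reflects the point made in the text that the decision to continue rests on monitoring rather than on the person pushing the change.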


    We need several kinds of monitoring. One of the most important is end-to-end connectivity monitoring. At any given time, each server should be able to communicate with any other server. If there is a problem somewhere, we want to learn exactly where as early as possible (that is, which servers have trouble reaching each other). Ensuring end-to-end connectivity is our primary concern.

    Each server has a list of all the servers with which it should be able to communicate at any given time. The server takes a random subset of this list and sends ICMP, TCP, and UDP packets to all the selected machines. This checks whether there is packet loss on the network, whether latency has increased, and so on. The whole network is probed in this way, both within each availability zone and between zones. The results are sent to a centralized system that visualizes them for us.

    Here is what the results look like when things are not going well:

    Here you can see between which network segments there is a problem (in this case, A and B) and where everything is fine (A and D). The view can show specific servers, rack switches, modules, and entire availability zones. If any of these becomes the source of a problem, we see it in real time.
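    The two halves of this scheme, choosing a random subset of peers to probe and then aggregating the results by network segment, can be sketched as follows. The sketch does not send any packets: probe results are passed in as plain tuples. The function names, the sample size, and the segment labels are hypothetical and serve only to illustrate the idea.

```python
import random
from collections import defaultdict


def choose_targets(all_peers, sample_size, rng=random):
    """Pick a random subset of the peers this server should reach."""
    peers = sorted(all_peers)
    return rng.sample(peers, min(sample_size, len(peers)))


def loss_by_segment(results, segment_of):
    """Aggregate probe results into a loss ratio per segment pair.

    `results` is an iterable of (source, destination, succeeded) tuples.
    `segment_of` maps a server name to its segment (rack, module, zone).
    Returns {(source_segment, destination_segment): loss_ratio}.
    """
    sent = defaultdict(int)
    lost = defaultdict(int)
    for source, destination, succeeded in results:
        pair = (segment_of[source], segment_of[destination])
        sent[pair] += 1
        if not succeeded:
            lost[pair] += 1
    return {pair: lost[pair] / sent[pair] for pair in sent}
```

    Sampling keeps the per-server probing cost bounded, while the aggregation step turns individual failures into the kind of segment-to-segment picture described above.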

    In addition, there is event monitoring. We closely monitor all links, signal levels on transceivers, BGP sessions, and so on. Suppose three BGP sessions are established from a network segment, and one of them goes down at night. If monitoring is configured so that the loss of one BGP session is not critical and can wait until morning, it will not wake up the network engineers. But if a second of the three sessions goes down, an engineer is called automatically.
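    The escalation rule from this example can be written down as a small function. This is a sketch of the policy as described in the text, not of the actual alerting system: the function name, the returned action names, and the default threshold of two are assumptions made for illustration.

```python
def bgp_alert_action(total_sessions, down_sessions, page_threshold=2):
    """Decide how to react to BGP sessions going down in one segment.

    Returns "ok" when nothing is down, "wait_until_morning" when the
    loss is tolerable, and "call_engineer" once `page_threshold` or more
    sessions are down.
    """
    if not 0 <= down_sessions <= total_sessions:
        raise ValueError("down_sessions must be between 0 and total_sessions")
    if down_sessions == 0:
        return "ok"
    if down_sessions >= page_threshold:
        return "call_engineer"
    return "wait_until_morning"
```

    With three sessions, one failure maps to `"wait_until_morning"` and a second failure maps to `"call_engineer"`, matching the scenario above.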

    In addition to end-to-end and event monitoring, we collect logs centrally and analyze them both in real time and after the fact. This lets us see correlations, identify problems, and find out what was happening on the network equipment.

    Monitoring is a big topic with huge scope for improvement. We want to bring the system to greater automation and true self-healing.

    What's next?

    We have many plans. We need to improve the management systems, the monitoring, the IP/MPLS switching fabric, and much more.

    We are also actively looking at white box switches. A white box switch is a ready-made hardware device on which you can install your own software. First, if everything is done correctly, it will be possible to treat switches the same way as servers: build a really convenient CI/CD process, roll out configs incrementally, and so on.

    Second, if problems do arise, it is better to keep a team of engineers and developers who will fix them than to wait a long time for a fix from the vendor.

    To make this work, we are working in two directions:

    • We have significantly reduced the complexity of the IP/MPLS fabric. On the one hand, the virtual network level and the automation tools have become a bit more complex as a result. On the other hand, the underlay network itself has become simpler. In other words, there is a certain amount of complexity that cannot be eliminated. It can only be moved from one level to another, for example between network levels or from the network level to the application level. What we can do is distribute this complexity sensibly, and that is what we are trying to do.
    • And of course, we are refining our set of tools for managing the entire infrastructure.

    This is all we wanted to talk about our network infrastructure. Here is a link to the Cloud Telegram channel with news and tips.
