What is EVPN / VXLAN?

    In this article I will explain what EVPN/VXLAN is and why the features of this technology look attractive to me for use in the data center. I will not immerse you deeply in technical details, dwelling on them only to the extent necessary to get acquainted with the technology. Almost everything I touch on in this article relates in one way or another to the transfer of OSI Layer 2 traffic between devices in the same broadcast domain. Many applied tasks can be comfortably solved with this capability; one of the most familiar examples is the migration of virtual machines within one or several data centers. And if some time ago such a conversation inevitably drifted into a discussion of the problems and inconveniences of a shared broadcast domain, now, on the contrary, it can focus on how those problems are solved.

    The data center industry is characterized by a particularly high density of applications multiplied by consumer expectations, so the choice of network subsystem architecture is especially important. In conversations with customers I usually suggest translating the abstract characteristics of the network subsystem into an assessment of their impact on the client's business tasks and the risks of events in the network subsystem's life cycle. This makes it possible to view familiar values from the perspective of the customer's subject area: for example, an abstract 2 seconds of packet loss during topology reconvergence, switchover, or routine maintenance turns into quite tangible 2.5 gigabytes of lost data on each 10G port.

    In a traditional OSI Layer 2 network with multiple paths between switches, we must use STP, MLAG, or stacking. I will briefly describe the relevant features of these technologies so that the prerequisites for the emergence of a new data center network architecture become clear. STP reduces effective throughput by simply disabling ports, and there are serious complaints about its convergence time. STP has other drawbacks as well, and many people know them no worse than I do, so no STP implementation has a place in the data center under any pretext.

    From the point of view of development prospects, using MLAG as the core of the data center raises many questions. You need to clearly understand that an MLAG topology, by definition, cannot contain more than two aggregation switches and, ideally, requires a horizontal inter-switch link of doubled capacity between them. Due to the lack of segmentation and traffic control capabilities within and between the chassis, using this technology in a geographically distributed environment can be difficult. And determining the resources needed to prevent a "dual master" event, based on an objective analysis of failure scenarios, can heavily burden the solution on the hardware side. Although MLAG aggregation switches continue to function as separate devices, they remain extremely tightly coupled in the physical topology.

    Using a stack as the core of the data center carries the risk of putting all your eggs in one basket: despite the attractiveness and apparent simplicity of a single control plane per stack, software contains bugs, and you are not protected from one appearing in your deployment. The worst part is that such a bug usually cannot be contained within an acceptably sized fault domain; a stack is not just one large switch, but also one large fault domain. Nor should you overlook the fact that stack capabilities are often limited by the weakest member, since by definition many switching chip tables must be identical on all devices. A stack operating in a distributed environment faces the same set of fault tolerance issues briefly described above for MLAG. The ring topology, to which some manufacturers of stackable switches are limited, is characterized by potentially more congested stack ports closer to the stack boundary. Depending on your traffic profile, this may turn into a problem that you also need to account for when calculating actual oversubscription.


    The term VXLAN stands for Virtual eXtensible Local Area Network, and you can find many similarities and think of VXLAN much as you think of VLAN. The difference is that VXLAN is a tunnel (a one-way virtual connection between two switches); more specifically, it is an Ethernet-over-UDP encapsulation.
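The encapsulation itself is compact: RFC 7348 defines an 8-byte VXLAN header that sits inside UDP, in front of the original Ethernet frame. A minimal sketch of building that header (the function name is illustrative):

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header from RFC 7348.

    Layout: flags (1 byte, 0x08 = "VNI present"), 3 reserved bytes,
    VNI (3 bytes), 1 reserved byte.
    """
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI is a 24-bit value")
    return struct.pack("!II", 0x08 << 24, vni << 8)

hdr = vxlan_header(vni=5010)
assert len(hdr) == 8
# On the wire this header is followed by the original Ethernet frame,
# and the whole thing rides inside UDP (destination port 4789).
```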


    Conventional transfer of Layer 2 traffic between hosts is based on information about the location of MAC addresses, tables of which are kept by intermediate switches; they update their knowledge of the network by blindly learning from passing traffic. In the VXLAN world much changes: to transfer Layer 2 traffic, intermediate switches are not obliged to operate at Layer 2 within the broadcast domain at all. Tunneling Ethernet traffic lets transit switches abstract away from MAC addresses and perform their functions within an IP fabric of arbitrary topology, according to routing rules. Each switch in the data center operates completely independently of its neighbors, resembling the distributed operation of the Internet, since routine maintenance on each device is strictly local in nature. A flexible physical topology allows you to adapt precisely to the traffic profile of a particular data center, and the free choice of the types and number of interconnecting ports allows you to build networks with a given level of oversubscription, up to completely non-blocking.


    When traffic from host A to host B arrives at an edge switch as a Layer 2 frame, that frame is packaged inside an IP packet and sent over the IP network to the edge switch on the other side, where the unpacking takes place. The era of separating the network into transport and service components is considered to have begun with the introduction of a BGP Free Core based on MPLS technologies in the networks of telecom operators. In data centers this trend materialized relatively recently as a way to solve problems of scalability and operational simplicity; the components of this architecture are usually called the Underlay (the transport network) and the Overlay (the service, or virtual, network). The Network Virtualization Overlays architecture is described in more detail in [1, 2]. Separating the functions of the data center network subsystem lets you configure services independently of the transport component. You are no longer required to define a VLAN on each transit switch in a chain; instead, you describe Overlay networks by configuring the service only where it is actually delivered, while Overlay traffic within the service travels through tunnels built on top of the Underlay network. Conversely, it is quite expected that Underlay activities such as changing the physical topology or adding and removing switches or links have zero effect on the configuration of the service component. At the same time, the core, or transport, component of the data center network is freed from processing and storing service state, which greatly improves its scalability.
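The Underlay/Overlay split can be pictured as two independent data sets: the transport layer knows only switch loopbacks and routes, while the service layer maps each segment to the switches where it is actually provisioned; tunnel endpoints are derived from the two, never configured hop by hop. A toy model, with all names illustrative:

```python
# Toy model of the Underlay/Overlay split (all names illustrative).

underlay = {                      # transport: one loopback per switch
    "leaf1": "10.0.0.1",
    "leaf2": "10.0.0.2",
    "leaf3": "10.0.0.3",
}

overlay = {                       # service: VNI -> switches hosting it
    5010: {"leaf1", "leaf3"},
    5020: {"leaf2", "leaf3"},
}

def tunnel_endpoints(vni: int, local: str) -> set:
    """Remote VTEP addresses a switch needs for a given VNI --
    derived from the service definition, not configured per transit hop."""
    return {underlay[sw] for sw in overlay[vni] if sw != local}

# leaf1 only needs a tunnel to leaf3 for VNI 5010; adding a spine switch
# or a link to the underlay changes nothing here.
assert tunnel_endpoints(5010, "leaf1") == {"10.0.0.3"}
```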

    Around the same time, major telecom operators and leading manufacturers of network equipment were completing the development of a new standard for transmitting Layer 2 traffic over MPLS networks. VPLS technology had accumulated a critical mass of unmet requirements in tasks such as:

    • balancing flows when a client is connected to two PEs;
    • balancing flows between PEs;
    • redundancy in geographically distributed configurations;
    • fast convergence;
    • optimized traffic delivery;
    • reducing the level of BUM traffic.

    These requirements were implemented in a new standard called Ethernet VPN; the full text of the proposals can be found in [3].

    A little later it was proposed to adapt the operator EVPN-over-MPLS model for use in IP data center networks. To do this, some standard procedures had to be reconciled with the specifics of VXLAN encapsulation and the hop-by-hop routing paradigm, and the taxonomy of PE device types had to be somewhat extended: the MPLS implementation assumes that all PEs support routing of virtual networks, while the VXLAN implementation introduces a simplified PE type that only switches within a virtual network [4].

    VXLAN and EVPN are standards [5, 6], so you can count on consistent operation in a multi-vendor network [7]. The first standard is interesting in that it describes the details of the Ethernet traffic encapsulation; it covers what we see "on the wire." The second is much more extensive: it describes the rules for exchanging MAC address information in the same manner as IP prefixes and proposes a model for supporting multiple tenants, or services, on a shared network. No separate protocol was invented for all this; the authors decided to extend BGP with new entities. So, to put it simply: VXLAN is the encapsulation on the wire, the data plane; EVPN is the set of rules by which switches transmit Layer 2 traffic over an IP network, the control plane.
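The central such entity is the EVPN Type-2 (MAC/IP Advertisement) route from RFC 7432, which carries a MAC address, optionally an IP address, and the VTEP to tunnel toward. A heavily simplified sketch of one as a plain record (real routes are BGP UPDATE messages; the field names here are illustrative):

```python
# Simplified sketch of EVPN Type-2 routes; real ones are BGP UPDATEs
# carrying the EVPN NLRI. Field names are illustrative.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MacIpRoute:
    route_type: int      # 2 = MAC/IP Advertisement (RFC 7432)
    vni: int             # segment the address belongs to
    mac: str             # the address being announced
    ip: Optional[str]    # optional: lets EVPN carry L3 state too
    next_hop: str        # VTEP loopback -- where to tunnel the traffic

rib = [
    MacIpRoute(2, 5010, "00:50:56:aa:bb:01", "192.0.2.11", "10.0.0.1"),
    MacIpRoute(2, 5010, "00:50:56:aa:bb:02", None, "10.0.0.3"),
]

# A remote switch builds its forwarding table from the control plane,
# not from flooded traffic:
fdb = {(r.vni, r.mac): r.next_hop for r in rib if r.route_type == 2}
assert fdb[(5010, "00:50:56:aa:bb:01")] == "10.0.0.1"
```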

    With only VLAN tags at your disposal, the number of services or segments you can operate is bounded above by a 12-bit number, that is, 4096. In VXLAN, 24 bits are allocated for segment identification, so the maximum number of VNIs (the VXLAN analog of a VLAN tag) is about 16 million. Do not think of the first number as unattainable and the second as uselessly large: in a data center network with multiple services, the estimated number of required segments may well approach the upper limit of the VLAN range. In terms of this quantitative characteristic, the comparison resembles IPv4 versus IPv6.
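The two limits above are simple powers of two:

```python
# 12-bit VLAN ID versus 24-bit VNI, as plain arithmetic:
vlan_ids = 2 ** 12      # 4096 segments on a classic Ethernet trunk
vni_ids = 2 ** 24       # 16,777,216 segments in VXLAN

assert vlan_ids == 4096
assert vni_ids == 16_777_216
assert vni_ids // vlan_ids == 4096   # each VLAN's worth, 4096 times over
```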

    EVPN / VXLAN supports transmitting Layer 2 traffic over multiple paths (ECMP). If you remember that VXLAN is a tunnel, it becomes clear that this property is inherited quite naturally. It is enough to indicate the tunnel destination in the IP header of the outer packet, and transit switches can then balance traffic flows across all available paths. VXLAN encapsulation is specifically designed so that in-order, per-flow transmission does not require deep packet inspection in transit. The transport layer header is used for this purpose: the UDP source port serves as a field for increasing traffic entropy and as a flow identifier. This field is filled in at the edge switches, so transit switches do not have to look deep into the packet contents, which reduces the requirements for their switching chips. This means you can build high-performance IP fabrics from general-purpose switches.
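A sketch of how an edge VTEP might fill in that entropy field: hash a few fields of the inner frame into the outer UDP source port, keeping it in the ephemeral range as RFC 7348 recommends. The hash choice here (CRC32) is purely illustrative, not any vendor's actual algorithm:

```python
import zlib

def outer_udp_src_port(inner_src_mac: str, inner_dst_mac: str,
                       ethertype: int) -> int:
    """Derive the outer UDP source port from a hash of the inner frame,
    as an edge VTEP might. RFC 7348 recommends the 49152-65535 range,
    so one inner flow always maps to one outer flow."""
    key = f"{inner_src_mac}{inner_dst_mac}{ethertype:04x}".encode()
    return 49152 + zlib.crc32(key) % 16384

port = outer_udp_src_port("00:50:56:aa:bb:01", "00:50:56:aa:bb:02", 0x0800)
assert 49152 <= port <= 65535
# Transit switches hash only the outer IP/UDP headers for ECMP, so
# packets of one flow stay in order while different flows spread out.
```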


    I have already written about how the use of Overlay networks affects the scalability of the core of the data center network subsystem; let's look at another aspect of this concept: reducing the level of broadcast traffic. This part of EVPN functionality causes the most misunderstanding, but in fact everything is simple. EVPN switches learn MAC addresses in software, through BGP message exchange; once a new MAC address appears on an access port, the switching tables are synchronized throughout the network. By the time a newly connected server starts sending traffic, the switching tables already contain the relevant information, so the remote side does not need to replicate a frame with an unknown unicast destination address to all access ports of the data center. This is a significant difference from the classical method of "blindly" filling the tables from received traffic, which inevitably relies on flooding.
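A toy simulation of this control-plane learning, with class and variable names invented for illustration: when a switch learns a MAC on an access port, it advertises it to its EVPN neighbours, so remote tables are populated before any traffic has to be flooded.

```python
# Toy control-plane MAC learning: a local learn event is pushed to all
# peers, standing in for the BGP EVPN advertisement. Names illustrative.

class EvpnSwitch:
    def __init__(self, name, vtep_ip, neighbours):
        self.name, self.vtep_ip = name, vtep_ip
        self.neighbours = neighbours       # shared list of peers
        self.fdb = {}                      # (vni, mac) -> next hop

    def learn_local(self, vni, mac, port):
        self.fdb[(vni, mac)] = port        # locally: a plain access port
        for peer in self.neighbours:       # remotely: advertised via BGP
            if peer is not self:
                peer.fdb[(vni, mac)] = self.vtep_ip

switches = []
a = EvpnSwitch("leaf1", "10.0.0.1", switches)
b = EvpnSwitch("leaf2", "10.0.0.2", switches)
switches.extend([a, b])

a.learn_local(5010, "00:50:56:aa:bb:01", "eth1")
# leaf2 can already forward to the new MAC without any flooding:
assert b.fdb[(5010, "00:50:56:aa:bb:01")] == "10.0.0.1"
```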


    Software learning also makes MAC addresses more mobile: the standard provides a mechanism for preparing the switching tables for the move of a MAC address from one access port to another without resorting to broadcast replication, as is usually needed during a live migration of a virtual machine.
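RFC 7432 implements this with a sequence number carried in the MAC Mobility extended community: when a MAC re-appears behind a new switch, that switch re-advertises it with a higher sequence number, and the highest number wins everywhere. A minimal sketch of the tie-break, with illustrative field names:

```python
# MAC mobility sketch: the advertisement with the highest mobility
# sequence number wins (RFC 7432), so tables converge to the new
# location without broadcast relearning. Field names illustrative.

def best_route(advertisements):
    """Pick the winning MAC advertisement by mobility sequence number."""
    return max(advertisements, key=lambda adv: adv["seq"])

advs = [
    {"mac": "00:50:56:aa:bb:01", "next_hop": "10.0.0.1", "seq": 0},  # old host
    {"mac": "00:50:56:aa:bb:01", "next_hop": "10.0.0.3", "seq": 1},  # after move
]
assert best_route(advs)["next_hop"] == "10.0.0.3"
```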

    Convergence time and reaction to topology changes must be considered separately for the Overlay and Underlay components. For the Underlay network, try to adhere to generally accepted practices for building routed networks; applied to an IP fabric, they come down to a few simple recommendations:

    • use BFD so that transit switches can detect a failure in less than a second;
    • try to adhere to symmetric topologies with parallel paths; this allows the switch at the failure boundary to act as a traffic recovery point even before information about the failure reaches the traffic source;
    • choose topologies that allow alternative routes, with paths pre-installed in the switching chip, without the risk of micro-loops.
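The sub-second figure for BFD is simple arithmetic: per RFC 5880, a failure is declared after the negotiated transmit interval times the detect multiplier with no packets received. The timer values below are common examples, not recommendations for any particular platform:

```python
# BFD detection time = transmit interval x detect multiplier (RFC 5880).

def bfd_detection_time_ms(tx_interval_ms: int, multiplier: int) -> int:
    """Time until a neighbour is declared down after the last packet."""
    return tx_interval_ms * multiplier

# e.g. 300 ms x 3 missed packets: failure declared within 900 ms
assert bfd_detection_time_ms(300, 3) == 900
# aggressive timers: 50 ms x 3 -> 150 ms
assert bfd_detection_time_ms(50, 3) == 150
```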

    Of particular interest to the topic of this article is how the software model of MAC learning in an Overlay network handles enabling or disabling an access port. The access switches to which servers are connected exchange not only MAC address information but also topological information. A set of access ports on two or more switches connected to the same server is assigned an Ethernet Segment Identifier (ESI), unique within the broadcast domain. This allows remote switches to associate a destination MAC address with an Ethernet segment and send traffic to any of the switches that declared themselves connected to that segment, even those that did not explicitly advertise the MAC address learned on the access port.


    In the event of a port failure or disconnection, the switch signals the failure with a single message withdrawing itself from the Ethernet segment, so remote switches can update their switching tables without waiting for the generation and processing of a separate withdrawal message for every MAC address.
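This "mass withdrawal" behaviour can be sketched as one pass over the table: every MAC tied to the failed segment loses the failed next hop at once, with data structures and names here purely illustrative:

```python
# Mass-withdrawal sketch: one per-segment withdraw invalidates a next
# hop for every MAC on that ESI, instead of one message per MAC.

fdb = {
    "00:50:56:aa:bb:01": {"esi": "ESI-1", "next_hops": {"10.0.0.1", "10.0.0.2"}},
    "00:50:56:aa:bb:02": {"esi": "ESI-1", "next_hops": {"10.0.0.1", "10.0.0.2"}},
    "00:50:56:cc:dd:03": {"esi": "ESI-2", "next_hops": {"10.0.0.3"}},
}

def withdraw_segment(table, esi, failed_vtep):
    """Process a single segment withdraw: drop one next hop for every
    MAC on the segment in one pass over the table."""
    for entry in table.values():
        if entry["esi"] == esi:
            entry["next_hops"].discard(failed_vtep)

withdraw_segment(fdb, "ESI-1", "10.0.0.1")
assert fdb["00:50:56:aa:bb:01"]["next_hops"] == {"10.0.0.2"}
assert fdb["00:50:56:cc:dd:03"]["next_hops"] == {"10.0.0.3"}   # untouched
```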


    The preceding text mainly concerned the transmission of OSI Layer 2 traffic, but that is far from the only area where EVPN applies; now I will explain why this technology is spoken of as a method of integrating routing and switching. Let's think about what prevents the presence of two or more routers in one Layer 2 network segment. There can be many answers, but one of them sounds something like this: because at OSI Layer 2 there is no way to transmit traffic along multiple paths. A switch needs to know exactly which port leads to the default router's MAC address, since in a classic Ethernet network this address cannot be active on different ports at the same time. But, as we saw earlier, EVPN developers implemented Active-Active traffic transmission based on software distribution of MAC information, so two or more switches can announce to their EVPN neighbors the local presence of the default router's MAC address within the VXLAN network. Inter-network routing can then be performed in Active-Active mode on multiple switches that have declared themselves default routers. These devices behave like a distributed router, each part of which has the same IP and MAC addresses, i.e. an Anycast Gateway; the problem turns into an advantage. EVPN provides mechanisms for exchanging not only MAC but also IP information, so the table data and behavior of the distributed router are identical on all of its components.
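A toy view of the distributed Anycast Gateway, with the addresses chosen for illustration only: every leaf presents the same gateway MAC and IP for a segment, so a host's default gateway is always the locally attached leaf, whichever leaf that happens to be.

```python
# Anycast Gateway sketch: one shared gateway identity, instantiated
# on every leaf. Addresses are illustrative.

GATEWAY = {"ip": "192.0.2.1", "mac": "00:00:5e:00:01:01"}

leaves = {
    "leaf1": dict(GATEWAY),
    "leaf2": dict(GATEWAY),
    "leaf3": dict(GATEWAY),
}

# Active-Active: any leaf routes for the segment; a VM keeps the same
# default gateway after migrating, and traffic never trombones to a
# single remote router.
assert all(gw == GATEWAY for gw in leaves.values())
```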


    In a symmetric data center topology, routing is usually implemented in one of four ways:

    • on all Spine switches;
    • on all Leaf switches;
    • on dedicated Leaf switches;
    • on a router outside the Overlay domain.

    The first option is characterized by a rather simple implementation of mechanisms for monitoring application activity (Security Insertion) and building processing chains (Service Chaining) [8], since the service block of the data center network subsystem interacts with a relatively small number of Spine switches, where traffic delivery policies are implemented.


    On the other hand, routing at the Leaf level looks more attractive in terms of overall port utilization, but symmetrically connecting such blocks of the data center network subsystem as the service and border blocks can cause unjustified difficulties. Therefore, a so-called service rack (or racks) is often set apart, whose Leaf switches work entirely as a bridge between the Overlay network and the remaining blocks of the network subsystem. Expectations for Spine-level switches are then significantly reduced; in essence they only need to forward IP traffic, although in practice these switches usually also function as BGP route reflectors (BGP-RR).


    Technically, traffic between VXLAN networks can also be routed outside the Overlay domain: a classic IP router may well be connected to an EVPN / VXLAN network access port as a client. But this method lacks such useful qualities as Active-Active transmission to an Anycast Gateway and does not satisfy redundancy requirements, so in practice it is used only as an intermediate step, a method for smoothly migrating the transport component of a data center network from Ethernet to EVPN / VXLAN.

    [1] RFC7364 “Problem Statement: Overlays for Network Virtualization”

    [2] RFC7365 “Framework for Data Center (DC) Network Virtualization”

    [3] RFC7209 “Requirements for Ethernet VPN (EVPN)”

    [4] draft-ietf-bess-evpn-overlay “A Network Virtualization Overlay Solution using EVPN”

    [5] RFC7348 “Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks”

    [6] RFC7432 “BGP MPLS-Based Ethernet VPN”

    [7] EANTC “Multi-Vendor Interoperability Test”, 2017

    [8] draft-ietf-bess-service-chaining-03 “Service Chaining using Virtual Networks with BGP VPNs”
