I am a network architect and it bothers me
100G with a 4x25 breakout in our data center
A network architect's work on *aaS projects is like constructing a building that keeps evolving. It started as a five-story building; by the time four floors were up, another 21 were needed; then houses connected by underground tunnels had to be attached; and eventually all of this has to become one huge residential complex with a covered courtyard. All the while there are residents inside, and you cannot shut off their sewer, water supply, or entrances.
And then there are the current problems with network standards, which lag about ten years behind real requirements. Most often this means reinventing the wheel in clever ways instead of applying seemingly obvious solutions. But reinvented wheels are everywhere, of course.
There is a cloud platform, and next to it there are services, all interconnected. Cloud servers plug into the network. The devices that provide the network itself plug into the same network. Cloud customers' equipment sometimes plugs into it as well. And a lot more is plugged in, for example, Internet uplinks. So there has to be a network, and someone has to be responsible for it. Since the project is rather large, with all sorts of complex relationships (there is an OpenStack part and a VMware part, or some other *aaS), simply setting it up is not enough. We need someone who will look after all of this and be accountable. Of course, there is the operations team, and they are great. But their main task is to operate.
My task is to know what will happen in a couple of years and what the other architects and product experts want, and to build a network in which all their pieces of the puzzle fit together properly. I also meet with vendors, contractors, and customers. I listen to everyone and, taking all of that information into account, work out how to make the best use of our resources.
Why do we need a network architect rather than a general systems architect? In fact, he can be a systems architect, just one highly specialized in network technologies. It is only a title that emphasizes the specialization. A network architect translates business tasks into technology and money into processes. He reports his decision "upward" in detail, and if the business likes it, everything is fine. Then I write the documentation and hand the system over to operations.
What is special about a network architect? In my opinion, he should:
- have a number of years of experience with complex, geographically distributed networks, both inside enterprises and data centers and between them. The number of years is hard to pin down: one person can see, feel, and do more in 2 years than another does in 10;
- know the key network protocols. Which ones are key depends on where the architect works;
- understand how systems and applications work and what role the network plays in them, and how each affects the other;
- be able to work with people at different levels, from managers through other architects to technical support, often acting as a translator, a "bridge" between frequently opposing goals (for example, cutting staff and not buying hardware / software / services vs expanding staff and buying spare parts / hardware / software / services). You need to be able to write, speak, and hold a conversation without irritating or provoking people. It is not enough just to know the protocols; you need to be able to explain them.
What does the job look like?
We had a network architecture that had grown historically. We once built the cloud the way we thought was right for the realities of the time, and that became the foundation. Over time the infrastructure grew and changed, and after a while it stopped satisfying us as internal customers. We began to picture differently what our network should look like in a couple of years, and to see what would become a bottleneck and how. Simply put, at some point it became obvious that a number of the network's properties had to change. Preferably without raising the cost.
A committee of specialists convened. The systems architects gave their input and wishes, were presented with options and caveats, and together we discussed limitations and future development. Then I was left to work out the solution and to support it afterwards.
New technologies had appeared while the infrastructure was growing, so I evaluated them, assessed our accumulated experience, worked out which overheads were unnecessary, and proposed what should be changed and how. The systems architects sent over their vision in terms of performance, speed, and scalability, and I took it into account. Operations added their wish list, and I took that into account as well.
One of the requirements set by the systems architects was the ability to offer 25 Gbit/s Ethernet ports. Previously everything was built on N x 10 Gbit/s: a 40 Gbit/s interface was a set of 4 lanes at 10 Gbit/s, and 100 Gbit/s was 10 lanes at 10 Gbit/s. Now 100 Gbit/s can be built as 4 x 25, which simplifies operation and lowers the requirements for structured cabling (SCS). I also wanted to rework a few things in the protocols. Here is the target:
25G Ethernet links do not differ much in price from 10GbE links, in either network or server equipment. Per port the price is almost the same, but the speed gain is very welcome. And customers will definitely need it in the coming years. At the same time, we moved the links between switches to 100 Gbit/s. Bandwidth inside a data center is much easier to provide than between data centers: optics and transceivers are cheaper.
10 and 25 Gbit/s DAC cables: impossible to tell apart
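The lane arithmetic behind these port speeds can be checked with a small Python sketch. The layouts (4 x 10, 10 x 10, 4 x 25) come from the text; the helper `breakout_ports_needed` and the figure of 48 servers are hypothetical and only illustrate how a 4x25 breakout reduces the number of switch ports to cable.

```python
import math

# Lane layouts named in the text: port speed in Gbit/s -> (lanes, Gbit/s per lane).
OLD_LAYOUT = {40: (4, 10), 100: (10, 10)}
NEW_LAYOUT = {100: (4, 25)}

def lanes_ok(layout):
    """Check that lanes x per-lane speed adds up to the port speed."""
    return all(lanes * per_lane == speed
               for speed, (lanes, per_lane) in layout.items())

def breakout_ports_needed(server_links, lanes_per_port):
    """Hypothetical helper: switch ports needed when each port
    is broken out into `lanes_per_port` server links."""
    return math.ceil(server_links / lanes_per_port)

print(lanes_ok(OLD_LAYOUT), lanes_ok(NEW_LAYOUT))
# Assumed example: 48 server links at 25 Gbit/s behind 100G ports in 4x25 mode.
print(breakout_ports_needed(48, 4))
```

With ten 10 Gbit/s lanes, a single 100G port needs ten lanes of cabling; with 4 x 25 it needs four, which is one way to read the "lower requirements for SCS" remark.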
The target architecture is built on a Clos (Leaf & Spine) topology. Here is the old design, with switches joined using Spanning Tree:
Here’s the new one:
As you can see, each switch is independent. There is a control plane and a data plane. The control plane is defined by the switch OS and by the routing and signaling protocols. The data plane is defined by what the switch does with the data it forwards from port to port.
Historically, to increase port capacity, and mainly to be able to build a LAG (link aggregation group, a fault-tolerant bundle of two or more physical links between different switches), people used either stand-alone switches, or switches (interface cards) in a single chassis, or separate switches combined into a virtual chassis. A virtual chassis is convenient in every way: a single control plane brings the different switches under common management, and the engineer sees the individual switches as interface cards in one chassis. But it has its drawbacks too: the tightly coupled control plane can cause all switches in the virtual chassis to fail at once. This does not mean there is no place for a virtual chassis in this world, but for our tasks and processes it is less convenient and less reliable.
The Spine layer only forwards traffic from one node to another; spine switches have no client connections.
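The Leaf & Spine idea can be illustrated with a minimal Python sketch. It assumes a two-tier fabric in which every leaf is connected to every spine; the switch names and the sizes (4 leaves, 2 spines) are made up for the example and are not the sizes of the network described here.

```python
from itertools import product

def build_leaf_spine(num_leaves, num_spines):
    """Build a two-tier Clos fabric: every leaf links to every spine."""
    leaves = [f"leaf{i}" for i in range(1, num_leaves + 1)]
    spines = [f"spine{i}" for i in range(1, num_spines + 1)]
    links = set(product(leaves, spines))
    return leaves, spines, links

def paths_between(src_leaf, dst_leaf, spines, links):
    """List the leaf -> spine -> leaf paths between two different leaves."""
    return [(src_leaf, spine, dst_leaf) for spine in spines
            if (src_leaf, spine) in links and (dst_leaf, spine) in links]

leaves, spines, links = build_leaf_spine(num_leaves=4, num_spines=2)
paths = paths_between("leaf1", "leaf3", spines, links)
print(len(links), paths)

# Losing one spine removes one path; the remaining paths stay usable.
links_after_failure = {link for link in links if link[1] != "spine1"}
print(paths_between("leaf1", "leaf3", spines, links_after_failure))
```

Every leaf-to-leaf path has the same length (two hops), and the number of parallel paths equals the number of spines, so no link has to sit blocked in reserve.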
Spanning Tree is an old, time-tested protocol whose main purpose is to prevent loops in ring topologies. Where do the loops come from? Look at the picture above. Loops are the result of link redundancy. Links that could form a ring are blocked; they are used only after a failure, once the tree has been rebuilt. You can tune the STP timers and use additional mechanisms that speed up the rebuild, but blocked links will still remain. The other problem with rebuilding the tree is what is lost while it happens: for the same rebuild time, even the minimum one, a 100 Gbit/s link loses far more data than a 100 Mbit/s link. Using Per-VLAN Spanning Tree or MSTP regions does not solve our problems either.
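The point about rebuild time and link speed is simple arithmetic, sketched below in Python. The one-second outage is an illustrative assumption, not a measured STP convergence time, and the result is an upper bound for a fully loaded link.

```python
def data_lost_mb(link_gbps, outage_seconds):
    """Upper bound on the data (in megabytes) that a fully loaded
    link cannot deliver while the tree is being rebuilt."""
    bits = link_gbps * 1e9 * outage_seconds
    return bits / 8 / 1e6

OUTAGE_SECONDS = 1.0  # assumed value for illustration only

for gbps in (0.1, 100):  # 100 Mbit/s and 100 Gbit/s
    lost = data_lost_mb(gbps, OUTAGE_SECONDS)
    print(f"{gbps} Gbit/s: up to {lost:.1f} MB lost")
```

The loss scales linearly with link speed, so the same outage costs a thousand times more data at 100 Gbit/s than at 100 Mbit/s.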
The very use of VLANs for segmentation also becomes a scaling problem for us: there can be no more than about 4,000 of them in one domain. Customers often build hybrid systems, partly on their own premises and partly in the cloud. In 95% of cases the junction between the customer and the cloud infrastructure is made with VLANs. A couple of them at first. Then a couple more. Then one more system moves to the cloud... and another... Overlapping VLAN IDs at the edge are not a problem: you can always remap them (change the tag at the junction with the customer). But the number of VLANs and the way they load the links under STP is a problem.
And there is no real alternative. You could take a large chassis with many ports, enough for everyone. A pair of them, for redundancy. But we have many different segments that are physically spread across different rooms of one data center (this happened for various reasons, both architectural and "historical"), so physically connecting all the hosts to this pair would be difficult, and operating it even more so. Also, if this pair goes down, in the worst case we lose the entire data center: all hosts lose each other. Different segments also require different latency and buffering parameters, which is harder to achieve inside one chassis. And redundant host links to different switches cannot be done without MLAG if we want redundancy with VLANs. So segmenting traffic with VLANs alone is not enough for our tasks.
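The remapping and the scaling limit can be sketched together in Python. The class `VlanRemapper`, the customer names, and the choice to start allocating at VLAN 2 are hypothetical; the only fixed fact is that the VLAN ID is a 12-bit field with 4094 usable values, which is where the "about 4,000" limit comes from.

```python
MAX_VLAN_ID = 4094  # 12-bit VLAN ID field; 0 and 4095 are reserved

class VlanRemapper:
    """Hypothetical per-domain table: (customer, customer VLAN) -> internal VLAN."""

    def __init__(self):
        self._table = {}
        self._next_id = 2  # assumption: VLAN 1 is left as the default VLAN

    def remap(self, customer, customer_vlan):
        """Return the internal tag for a customer's tag, allocating one if needed."""
        key = (customer, customer_vlan)
        if key not in self._table:
            if self._next_id > MAX_VLAN_ID:
                raise RuntimeError("VLAN ID space in this domain is exhausted")
            self._table[key] = self._next_id
            self._next_id += 1
        return self._table[key]

remapper = VlanRemapper()
# Two customers both hand off VLAN 100; inside the cloud they get distinct tags.
tag_a = remapper.remap("customer-a", 100)
tag_b = remapper.remap("customer-b", 100)
print(tag_a, tag_b)
```

Remapping resolves the overlap at the edge, but every customer segment still consumes one ID from the same shared pool, so the pool eventually runs out.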
Consider IEEE 802.1aq, also known as SPB, and its "killer", TRILL. Both are open standards, and we all love open standards! But network equipment vendors do not really support these protocols; they do not like them. That immediately limits us to 2-3 suppliers. And in a couple of years we may lose support on existing platforms, or may not get it on new platforms even from the same vendors. Cisco FabricPath is close to TRILL, but it is a complete vendor lock-in.
Something smart, powerful, and self-contained: SDN! Big Switch, Plexxi, or OpenFlow-based solutions are very elegant and have undeniable advantages. But we saw in them a complete vendor lock-in, which is unacceptable for us as a service provider. We rejected them for this project.
So we came to Leaf & Spine on IP with EVPN and VXLAN.
Here we need to digress and say that there are two ways to build Leaf & Spine: an L2 fabric, or an L3 fabric with overlays that carry L2 across the fabric (alas, we need overlays, because we do not run only our own applications that could work purely over IP). If you build an L2 fabric with redundant fabric nodes, you have to use VLANs, whose number is limited to about 4K, and you cannot use different VLAN domains inside the fabric, so the IDs must be unique. For an active-active topology you also have to use Multi-Chassis Link Aggregation (MC-LAG) on all nodes. An L3 fabric with overlays gives more flexibility and does not depend on the number or uniqueness of VLANs.
As an overlay you can use whatever is supported by the ASICs of the network equipment (and, of course, by the switch software); today that is MPLS and VXLAN (RFC 7348). VXLAN needs to be signaled, and for that you can use static configuration, multicast, MP-BGP EVPN (Multiprotocol Border Gateway Protocol Ethernet Virtual Private Network), or controllers.
We chose EVPN as the more scalable and cheaper option. It is an open industry standard, so in theory it allows cross-vendor interoperability between hardware and software (as happened, for example, with "plain" BGP). The protocol does not introduce fundamentally new signaling (unlike SPB or TRILL) and does not constrain the data plane: it does not require a different encapsulation, since VXLAN or MPLS is suitable, unlike, for example, Geneve. It also does not require additional features or settings such as multicast.
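To show why VXLAN is not bound by the VLAN limit, here is a minimal Python sketch of the 8-byte VXLAN header from RFC 7348: one flags byte with the "I" bit set, 24 reserved bits, a 24-bit VXLAN Network Identifier (VNI), and 8 more reserved bits. The function names and the VNI value 10100 are made up for the example; a real switch does this in the ASIC, not in software like this.

```python
import struct

VXLAN_UDP_PORT = 4789      # destination UDP port assigned for VXLAN (RFC 7348)
VNI_FLAG = 0x08            # "I" flag: the VNI field is valid
MAX_VNI = (1 << 24) - 1    # 24-bit VNI, compared with the 12-bit VLAN ID

def pack_vxlan_header(vni):
    """Build the 8-byte VXLAN header for a given VNI."""
    if not 0 <= vni <= MAX_VNI:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + 24 reserved bits.
    # Word 2: VNI (24 bits) + 8 reserved bits.
    return struct.pack("!II", VNI_FLAG << 24, vni << 8)

def unpack_vni(header):
    """Extract the VNI from the first 8 bytes of a VXLAN packet payload."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VNI_FLAG:
        raise ValueError("VNI flag is not set")
    return vni_word >> 8

header = pack_vxlan_header(10100)  # hypothetical VNI for one customer segment
print(header.hex(), unpack_vni(header))
print(MAX_VNI + 1, "possible VNIs")
```

The customer's original Ethernet frame follows this header inside a UDP datagram, so the fabric itself only has to route IP between leaves.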
The EVPN address family carries both layer 2 (MAC) and layer 3 (IP) information. It also has ARP suppression mechanisms which, together with localized MAC/ARP learning, minimize flooding on the network. EVPN is decentralized: each network device builds its view of the topology from the data received from its neighbors. This is a debatable point, of course: for some, centralized management and decision-making (for example, SDN-style controllers) is a plus. EVPN also provides multihomed LAG: end hosts will "think" they are connected to a single switch.
Given that EVPN has not yet been around for even five years, it is unclear how solid its future is. In principle, we can change the vendor without changing the protocols. What matters to us is that part of the standard is not finalized: it has been in draft for five years now. Even the parts that are written down have been implemented by vendors each in its own way, to its own extent and for its own scenarios. No vendor has the complete feature set of the new technology in one place; for example, the supported set of route types differs. And the code has not yet been polished by many years of operation and a large installed base.
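The ARP suppression idea mentioned above can be sketched as a toy model in Python. It assumes that a leaf learns IP-to-MAC bindings from EVPN advertisements and answers ARP requests for known addresses itself; the class `LeafArpCache`, its method names, and the addresses are hypothetical and do not correspond to any vendor's implementation.

```python
class LeafArpCache:
    """Toy model of a leaf's MAC/IP table filled from EVPN advertisements."""

    def __init__(self):
        self._ip_to_mac = {}
        self.flooded_requests = 0

    def learn_from_evpn(self, ip, mac):
        """Record a binding advertised by another leaf."""
        self._ip_to_mac[ip] = mac

    def handle_arp_request(self, target_ip):
        """Answer locally if the binding is known, otherwise count a flood."""
        mac = self._ip_to_mac.get(target_ip)
        if mac is not None:
            return mac  # answered by the leaf, nothing crosses the fabric
        self.flooded_requests += 1  # unknown target: fall back to flooding
        return None

leaf = LeafArpCache()
leaf.learn_from_evpn("10.0.0.5", "aa:bb:cc:00:00:05")
print(leaf.handle_arp_request("10.0.0.5"))
print(leaf.handle_arp_request("10.0.0.99"), leaf.flooded_requests)
```

Only requests for addresses that nobody has advertised yet are flooded, which is the sense in which flooding is minimized rather than eliminated.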
Someday the zoo will end
If you look at all the solutions above, you can see that, in principle, all of this will someday be handled by one more or less coherent standard. The problem is that there are many vendors, each wants its own way, and networking as a field develops slowly. One day we architects may be replaced by a simple script, but that is still very, very far away.
So if you want to dig into the undocumented capabilities of hardware and software, learn a lot about what is not written in the manuals, and be a free tester for a vendor, become a network architect. To avoid hearing "what idiot configured this?", you need to play as a team with the business, the architects, the product experts, and operations. For example, if you do not talk to operations, who in theory are not needed in the design process, you may never learn about a number of quirks that only become visible when you turn the nuts by hand. All in all, a network architect is an architect, a translator, and a person who rebuilds a ship while it is afloat. The only difference is that it is his own ship, and if you work together with everyone, it will be cosmically cool.
The text was prepared by Oleg Alekseenko, network architect of Technoserv Cloud.