
Is there stacking on Cisco Nexus switches?

When it comes to Cisco Nexus switches, one of the first questions I am asked is whether they support stacking. Hearing that they do not, people logically ask “Why?”
The answer is that a switch stack can become a single point of failure, while Nexus is positioned as a data center switch, where fault tolerance is among the top priorities.
“But you yourself wrote (part 1, part 2, VSS / IRF) that a fault-tolerant infrastructure can be built on a stack! So you deceived us?” Not at all. Each technology is appropriate where its drawbacks are not critical to the network and its advantages bring tangible benefits. The same is true of stacking.
Stacking has two main advantages:
- a single management point for all switches (management plane),
- the ability to aggregate links connected to different devices in the stack (Multi-Chassis Link Aggregation - MC-LAG).
All switches in the stack are configured and maintained through one common interface.
MC-LAG support (a port channel, in Cisco terms) allows you to:
- minimize the use of Spanning Tree Protocol (STP) on the network,
- use the aggregated bandwidth (all links are active),
- provide fault-tolerant connection of devices (switches, servers, etc.).
In a stack, MC-LAG is possible because of the common control plane. One of the switches becomes the master. It runs the control plane, which coordinates all the other devices. The management plane also runs on it. A stack always has exactly one “brain”. The hardware resources of the switches, however, are independent: if one switch fails, the rest keep working.

Thus, stacking implies a common control plane and a common management plane. For all its advantages, this is a potential point of failure. Although the switches are independent in hardware, there are failures not related to hardware after which the stack may stop working correctly. For example, the control plane on the master switch may “freeze” because of a memory leak. The consequences vary: loss of control over the stack, or various protocols (for example, LACP) ceasing to run. The stack may keep forwarding traffic in this case, since the ASICs are already programmed with the necessary data and hardly depend on the control plane. But all dynamic aggregates (MC-LAGs) will fall apart, because LACP packets are no longer sent to the neighboring device.
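To put rough numbers on the LACP example, here is a minimal Python sketch. It is not a model of any particular switch: it only assumes the standard LACP timeouts from IEEE 802.1AX (3 seconds in short-timeout mode, 90 seconds in long-timeout mode), and every name in it is invented for illustration. It estimates when the neighbor stops trusting a link after the master's control plane has sent its last LACPDU.

```python
# Toy model: the stack master's control plane freezes, the ASICs keep
# forwarding, but LACPDUs are no longer sent to the neighbor.
# Timeout values are the IEEE 802.1AX defaults; all names are hypothetical.

# Seconds the neighbor waits without receiving an LACPDU before it
# expires the link, depending on its timeout mode.
LACP_TIMEOUT_S = {"short": 3, "long": 90}


def expiry_time(last_lacpdu_at: float, timeout_mode: str) -> float:
    """Moment (in seconds) at which the neighbor expires the link."""
    return last_lacpdu_at + LACP_TIMEOUT_S[timeout_mode]


def link_still_bundled(now: float, last_lacpdu_at: float, timeout_mode: str) -> bool:
    """True while the neighbor still keeps the link in the aggregate."""
    return now < expiry_time(last_lacpdu_at, timeout_mode)
```

With the last LACPDU sent at t = 100 s, `link_still_bundled(102.0, 100.0, "short")` is still true, while `link_still_bundled(103.0, 100.0, "short")` is not. In long-timeout mode the aggregate survives the freeze for up to a minute and a half, which only delays the problem.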
Another possible problem is “split brain”, when several switches decide at once that they are active. Since their configuration is identical, two devices with the same addressing appear on the network. This happens when the control channel between the switches breaks. Of course, there are technologies that counter this: the switches use additional mechanisms to monitor the state of their neighbors. And on some types of stack the control channel is hard to break. Still, this scenario should not be discounted.
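The split-brain logic can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not any vendor's algorithm: there are only two switches, the "additional mechanism" is modeled as a second keepalive path, and the policy that the former standby keeps its ports suspended is an arbitrary choice (real implementations differ in which side backs off). All names are hypothetical.

```python
from typing import Tuple


def roles_after_failure(control_link_up: bool,
                        master_alive: bool,
                        second_path_works: bool) -> Tuple[str, str]:
    """Return the roles of (former master A, former standby B)."""
    if not master_alive:
        # A really died: B has to take over.
        return ("down", "active")
    if control_link_up:
        # Normal operation, nothing changes.
        return ("active", "standby")
    if second_path_works:
        # The control channel is broken, but B still hears A over the
        # second path, so B knows A is alive and backs off (assumed policy).
        return ("active", "suspended")
    # B hears nothing from A and wrongly concludes that A is dead.
    return ("active", "active")


def is_split_brain(roles: Tuple[str, str]) -> bool:
    """Two active switches with identical configuration: split brain."""
    return roles == ("active", "active")
```

The only input combination that yields two active switches is a live master, a broken control channel and no working second path, which is exactly the case the extra monitoring mechanisms exist to rule out.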
Thus, a stack is a good solution for networks where its failure is not fatal. It cannot be called a fully fault-tolerant solution, but the probability of a critical situation is fairly low, and the benefits it provides can more than pay off.
Nexus switches are positioned as a solution for environments (primarily data centers) where fault tolerance is very important. Therefore, these devices do not support stacking at all. I should note that Nexus is not limited to data centers: the switches can also be used, for example, to build a corporate network.
But stacking has significant advantages. Therefore, Nexus switches support a number of technologies that provide those advantages without combining the switches into a stack.
Virtual Port Channel (vPC) technology implements the MC-LAG functions. Each Nexus keeps its own independent control and management plane, yet we can aggregate links distributed between the two switches. Of course, the devices are not completely independent: while operating, the switches synchronize the information needed for aggregation (MAC addresses, ARP and IGMP entries, port states). But from a fault tolerance point of view this is still better than a single control and management plane. This design is more reliable: even if vPC malfunctions, the consequences for the infrastructure are less severe.
However, vPC has its own operational nuances. It can only be configured between two Nexus switches, and a set of settings must be identical on both. Some functions need small extra settings that a regular stack does not require. For example, correctly routing traffic between two vPC ports requires the peer-gateway command; otherwise you can run into the loop prevention mechanism that applies to traffic crossing vPC. Dynamic routing over vPC requires “layer3 peer-router”. These seem like trifles, but they can fray your nerves. Not every technology is compatible with vPC, and much depends on the Nexus model, so read the configuration guide carefully. In short, as usual, everything has its pros and cons.
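Why peer-gateway matters is easier to see in a toy model. The Python sketch below is a deliberate simplification, not NX-OS behavior in full: it assumes a pair of switches "A" and "B" with the vPC member ports up on both, and it applies just one rule, namely that a frame which crossed the peer link must not leave through a vPC member port. The function and parameter names are invented for illustration.

```python
def frame_delivered(router_mac_owner: str,
                    ingress_switch: str,
                    egress_is_vpc_port: bool,
                    peer_gateway: bool) -> bool:
    """Toy decision for one routed frame entering a vPC pair ("A" or "B").

    router_mac_owner: the switch whose router MAC the frame is addressed to.
    ingress_switch:   the switch that the port channel hash delivered it to.
    """
    if router_mac_owner == ingress_switch or peer_gateway:
        # The ingress switch routes the frame itself; the peer link is unused.
        return True
    # Otherwise the frame is bridged over the peer link to the MAC owner,
    # which routes it. Loop prevention: a frame that crossed the peer link
    # must not leave through a vPC member port.
    return not egress_is_vpc_port
```

In this model a frame addressed to B's router MAC that happens to land on A, and that must leave through another vPC port, is dropped without peer-gateway (`frame_delivered("B", "A", True, False)` is false) and delivered with it.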
vPC+, vPC in ACI
vPC in FabricPath is called vPC+.
In classic vPC, synchronization takes place over a dedicated link, the peer link. For vPC within an ACI fabric, no peer link is required: all synchronization goes through the fabric.

As for a single management point for all switches, the lack of stacking is compensated by the following:
- Use of Nexus fabric extenders (Fabric Extender - FEX). These are specialized switches whose control and management plane functions, and part of the data plane, are moved to the main (parent) switch.
Since all FEX logic is implemented on the parent Nexus, the FEX and the parent switch form a single point of failure. A FEX has no local switching: packets between neighboring ports pass through the parent device, which increases the load on the link between them.
- Ability to synchronize configuration between two switches (Configuration Synchronization). In this case, the control and management planes remain independent.
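The cost of having no local switching on a FEX can be estimated with simple arithmetic. The sketch below is only an illustration with made-up names and a made-up flow format: each flow is a tuple of source location, destination location and rate in Mbit/s, where the location "fex" means a host port on the FEX and anything else is somewhere behind the parent switch.

```python
from typing import Iterable, Tuple

# (source location, destination location, rate in Mbit/s); hypothetical format
Flow = Tuple[str, str, float]


def fex_uplink_load(flows: Iterable[Flow]) -> Tuple[float, float]:
    """Return (to_parent, from_parent) load on the FEX uplink in Mbit/s."""
    to_parent = 0.0
    from_parent = 0.0
    for src, dst, mbps in flows:
        if src == "fex":
            to_parent += mbps    # every frame from a FEX port goes up first
        if dst == "fex":
            from_parent += mbps  # and comes back down to reach a FEX port
    return to_parent, from_parent
```

A single 1000 Mbit/s flow between two ports of the same FEX, `fex_uplink_load([("fex", "fex", 1000.0)])`, loads the uplink with 1000 Mbit/s in each direction, whereas a switch with local switching would not touch the uplink for that flow at all.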
To sum up: Nexus has no stacking. This is partially offset by other technologies. But you need to use them deliberately, as they introduce certain risks into the network design.
It is worth remembering: for a solution to be fault-tolerant, its parts should not depend on each other. Any technology or protocol that provides fault tolerance can itself cause failures. Moreover, because of them, problems can spread from one device to another. Nothing is perfect. Therefore, if fault tolerance is crucial, try to build the network so that the devices influence each other as little as possible.
But this is a completely different story.
Useful links: