Cisco VSS: Fear and Loathing at Work
I wrote this post in a fit of indignation and bewilderment at how network equipment from a world leader in this segment can suddenly and thoroughly ruin the life of a production network and of us, its network admins.
I work for a government organization. The core of our network infrastructure is a VSS pair built from two Cisco Catalyst 6509E switches with VS-S720-10G-3C supervisors running IOS 12.2(33)SXI6 (s72033-adventerprisek9_wan-mz.122-33.SXI6.bin). The network is fully in production and has to be available essentially 24x7x365. Any maintenance that involves even the slightest interruption of the services we provide must be coordinated in advance and performed at night, most often as quarterly overnight maintenance windows. I want to tell you about one of those windows, and I sincerely hope my story never repeats itself for you.
Right away, here is the hardware configuration of one switch of the VSS pair (the second switch is identical):

The operations planned for this maintenance looked completely routine. We were allocated a window from 2 a.m. to 8 a.m., meaning that by 8 in the morning everything absolutely, come hell or high water, had to be working. Here is a chronology of the events of that romantic night:
1. Enabling jumbo frame support on the Cisco Catalyst 6509E VSS pair. All commands were taken from the official Cisco documentation:

# conf t
# system jumbomtu 9216
! the command requires a size argument; 9216 bytes is the maximum the platform accepts
# wr mem
The Cisco website notes that some line cards either do not support jumbo frames at all or support only a limited frame size:

My line cards did not appear on this list. Excellent.
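Just in case, after applying the change it is worth spot-checking that the new MTU really took effect on a port; something along these lines (the interface name is only an example):
# show running-config | include jumbomtu
# show interfaces TenGigabitEthernet1/5/4 | include MTU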
2. Jumbo frame support was also enabled on a few more stacks (Catalyst 3750E, 3750X) facing the VSS; a rough sketch of that change is below. But this, in my opinion, has little to do with the situation described further on.
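For reference, on the 3750-series stacks the jumbo MTU is set globally and only applies after a reload; roughly like this (9198 bytes is the usual maximum there, check your model):
# conf t
# system mtu jumbo 9198
# end
# wr mem
! the stack must be reloaded for the new jumbo MTU to take effect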
At this stage everything was fine. The flight was going smoothly. On we go.
3. Next on the plan was the scheduled cleaning of the Catalyst 6509E switches from dust, vacuum-cleaner style. We do this regularly (once a quarter) and expected nothing out of the ordinary. We decided to start with the switch that at that moment held the active virtual switch member role, so that we could also verify that switchover executes correctly. I will call it switch (1). So, switch (1) was powered off. Switchover happened correctly: the second switch (2) reported that it had become active, and only a single ping packet was lost. Next, the fan tray and both power supplies were removed, vacuumed out and pushed back in. At this point monitoring reported that the network was working properly and all the boxes were reachable. Great, I thought, powered switch (1) back on and went to my computer to wait, since a 6509 takes about 7-8 minutes to boot. 10, 15, 20 minutes pass, and switch (1) still has not rejoined the VSS (show switch virtual redundancy). I powered switch (1) off and on again. Another 20 minutes or so pass, and the situation repeats itself. At that moment I did not have a laptop with a console cable at hand to see what was going on. The network was still "flying on one wing", switch (2).
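For reference, whether a chassis has rejoined the virtual switch can be checked from the active one with the standard status views, something along these lines:
# show switch virtual
# show switch virtual redundancy
# show module switch all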
At that point the right decision would have been to grab a laptop, connect to the console of switch (1) and see what was going on there. But, worried about a conflict between the nodes, I decided to power down switch (2), connect the laptop to the console of switch (1) and try to bring it up alone. In the console I saw the following: switch (1) was dropping into rommon. Without a single error message. Just rommon. Powered it off and on once more; straight into rommon again, and that was it. It was nearly 7 a.m., with less than two hours before the working day begins. The situation was heating up. I decided that experimenting with booting IOS from rommon was not worth it, powered switch (1) off and went to turn the second switch back on, thinking to myself: fine, we will fly on one wing for the day, and the next night I will deal with the problem. I plugged into the console, powered it on and... something went wrong. In the console I watched the switch, after what looked like a full and correct boot, frantically start shutting down all of its ports one by one and go into a reboot. This repeated three times, and only on the third attempt did it barely manage to boot and offer a login prompt. Logged in. The network seemed to be working. I exhaled. But no such luck: several access switches started flapping in monitoring, going down and coming back up; some servers were flapping too; various resources in different VLANs were partially unreachable; DNS stopped working correctly in neighboring networks. I could not understand what was happening, and no pattern emerged. The clock was approaching 8. I sat down to read the logs, and there, in all their glory, was a flurry of repeated errors of the form:
%DIAG-SW2_SP-3-MAJOR: Switch 2 Module 5: Online Diagnostics detected a Major Error. Please use 'show diagnostic result' to see test results.
%DIAG-SW2_SP-3-MINOR: Switch 2 Module 2: Online Diagnostics detected a Minor Error. Please use 'show diagnostic result' to see test results.
Here, Module 5 is the SUP-720-10G-3C supervisor, and Module 2 is the WS-X6724-SFP line card.
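The messages themselves point at the next step. In VSS mode the per-module diagnostic results can be pulled up with something along these lines (syntax from memory, double-check it for your release):
# show diagnostic result switch 2 module 5 detail
# show diagnostic result switch 2 module 2 detail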
We immediately sent all the logs, along with show tech-support and show diagnostic output, to the company we hold a service contract with; they, in turn, opened a case with Cisco TAC. Three hours later the answer came back from Cisco TAC: BOTH (!!!) supervisors, as well as the line card, were faulty and had to be replaced. Bingo! After analyzing the logs, the Cisco TAC engineers reported that the supervisors had already been faulty before the reboot, and during the reboot those faults kept them from passing their self-tests and resuming correct operation. Our question about how we could have detected the supervisor faults in advance, and thus avoided restarting them, was never answered.
So much for fault tolerance: two pieces of hardware costing $42,000 apiece (per the autumn Cisco GPL) dying simultaneously on a chassis restart, and that is not even counting the dead line card.
The contractor's engineer came out to us with a single replacement sup and a line card (at that moment they simply had no more in stock).
Meanwhile, after analyzing the situation, we realized that the clients experiencing network problems were the ones plugged into that line card. The clients were moved to other ports, and we somehow limped through to the end of the day. After the working day ended, the config, vlan.dat and the new IOS version 12.2(33)SXJ6 were copied onto the replacement SUP-720-10G-3CXL from a flash card. We decided to start with chassis (1), the one that was currently powered off. We replaced the sup, pulled out all the transceivers (just in case) and brought the chassis up: no errors. After that we shut down the working chassis (2) and moved the transceivers into chassis (1), and the network came back to life on one patched-up wing with the temporarily loaned SUP-720-10G-3CXL.
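Roughly speaking, staging the replacement sup boils down to copying the image, config and vlan.dat onto its flash and pointing the boot variable at the new image. A minimal sketch, assuming the files are already on disk0: (the SXJ6 file name below is illustrative):
# conf t
# boot system flash disk0:s72033-adventerprisek9_wan-mz.122-33.SXJ6.bin
# config-register 0x2102
# end
# wr mem
! verify with: show bootvar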
Note that VSS will not come up between two different sups, a SUP-720-10G-3CXL and a SUP-720-10G-3C, complaining with:
02:16:21 MSK: %PFREDUN-SW1_SP-4-PFC_MISMATCH: Active PFC is PFC3CXL and Other PFC PFC3C
So this was a temporary crutch until two identical SUP-720-10G-3Cs could be delivered to us.
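For completeness: had running a 3C and a 3CXL together been unavoidable, the VSS configuration guide describes pinning the whole system to the lower PFC mode. A sketch from memory (verify the exact command for your release; it takes effect only after a reload):
# conf t
# platform hardware vsl pfc mode pfc3c
# end
# wr mem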
Since leaving a production network on "one wing" for any length of time is not an option at all, another night window was arranged in the near future to replace the sup and the line card in chassis (2). By then, two new SUP-720-10G-3Cs and a WS-X6708-10GE line card had been delivered to the site. Following the earlier scenario, the firmware, vlan.dat and a fresh config taken from the already running chassis (1) were loaded onto the new sups. All the transceivers were pulled out of chassis (2), just to be on the safe side. We made sure chassis (2) booted on the new hardware without errors, shut it down, cabled up the VSL links and powered it on. Everything went smoothly: VSS formed, we plugged the transceivers back in, and the network came back to life on both wings. Our joy knew no bounds. HURRAH! The nightmare was finally over.
But the joy did not last long. A few days after this work, at the height of the working day, the VSL link between the sups suddenly went down. What the hell? On brand-new hardware? Fortunately, the second VSL link between the core line cards held (on each chassis the overall VSL to the other chassis is built as an LACP port-channel of two ports), and Dual Active Detection passed us by. By swapping the 10GBASE-LRM transceivers around and moving them to adjacent slots, we determined that it was the transceiver slot on the supervisor of chassis (1) that had died. ACHTUNG!
In the logs:

Then came the procedure of opening a case with Cisco TAC, followed by the Cisco engineer's response:

For those who struggle with English, I will briefly summarize his answer. The Cisco TAC engineer suggests enabling diagnostic bootup level complete, reseating the supervisor in the chassis and watching how events develop. If the cause is a faulty fabric channel on the supervisor module or on a line card, the problem should reappear within a fairly short time. If it does not reappear within 48 hours, it can be "... termed to be a one time" event, i.e. a one-off.
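The knob itself is a one-liner; what the engineer asks to have enabled looks roughly like this:
# conf t
# diagnostic bootup level complete
# end
# wr mem
! the current level can be checked with: show diagnostic bootup level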
We are now coordinating the next maintenance window for this work. I will post an update with the results here.
I hope this article helps someone out of a difficult situation, or helps them avoid one altogether. I will gladly answer any questions in the comments.
Strong nerves and sturdy supervisors to everyone in our sometimes breathtaking profession :)
UPD: Last night we pulled the sup out of the active chassis (per the Cisco TAC engineer's recommendation above); one ping was lost and the second chassis became active. We put the sup back and checked the allegedly faulty slot with a 10GBASE-LRM transceiver: it works. We fly on.