Service outage on September 24-25
First of all, we want to apologize for the largest downtime in the history of Selectel. Below we reconstruct the chronology of events in detail, describe what has been done to prevent such situations in the future, and explain the compensation for customers affected by the outage.
The problems began on Monday evening, September 24 (downtime 22:00 - 23:10). From the outside, it looked like a loss of connectivity with the St. Petersburg segment of the network. The failure disrupted all of our Internet services in St. Petersburg; the Moscow network segment and the local server ports continued to operate. The DNS servers located in St. Petersburg (ns1.selectel.org and ns2.selectel.org) were also unavailable, while the Moscow DNS server (ns3.selectel.org) was not affected. Because connectivity was lost, the website and the control panel became inaccessible, so the main load fell on the phone lines, and many customers could not get through.
While analyzing the situation, we quickly established that the problem was caused by incorrect operation of the aggregation-level switches: two Juniper EX4500 units combined into a single virtual chassis. Visually, everything looked functional, but the console showed a large number of messages, which nevertheless did not reveal the exact cause of the problem.
Sep 24 22:02:02 chassism : CM_TSUNAMI: i2c read on register (56) failed
Sep 24 22:02:02 chassism : cm_read_i2c errno: 16, device: 292
In fact, all optical 10G Ethernet ports on the aggregation level switch chassis stopped working.
Sep 24 22:01:49 chassisd : CHASSISD_IFDEV_DETACH_PIC: ifdev_detach_pic (0/3)
Sep 24 22:01:49 craftd : Minor alarm set, FPC 0 PEM 0 Removed
Sep 24 22:01:49 craftd : Minor alarm set, FPC 0 PEM 1 Removed
After a reboot, everything worked stably. Since the network configuration had not changed for a long time and no work had been carried out before the failure, we considered this a one-time problem. Unfortunately, only 45 minutes later the same switches stopped responding again and were rebooted once more (23:55 - 00:05).
Lowering the switch priority in the virtual chassis
Since in both cases one of the two switches in the virtual chassis failed first and the second stopped working only after it, we assumed that the first switch was at fault. The virtual chassis was reconfigured so that the second switch became the master and the first remained only as a backup. During this work, the switches were rebooted again (00:40 - 00:55).
Dismantling the virtual chassis and moving all links to one switch
About an hour later, another failure showed that these actions were not enough. After freeing up and consolidating part of the port capacity, we decided to disconnect the failed device from the virtual chassis completely and move all the links to the "healthy" switch. This was done by about 4:30 a.m. (02:28 - 03:01, 03:51 - 04:30).
Replacing the switch with a spare
However, an hour later this switch also stopped working. While it was still running, an identical, brand-new switch was taken from the reserve, installed, and configured. All traffic was moved to it. Connectivity was restored and the network came back up (05:30 - 06:05).
About 3 hours later, at around 9 a.m., the failure repeated. We decided to install a different version of the operating system (JunOS) on the switch. After the update, everything worked (08:44 - 09:01).
Fiber break between data centers
By about 12:00, all cloud servers had been started. But at 12:45, the optical signal was lost in the cable connecting the network segments in different data centers. At that point, because one of the two backbone switches had been taken out of service, the network was running over only one main route, and the backup route was disconnected. This led to a loss of connectivity in the cloud between the host machines and the storage system (the storage network), and made servers located in one of the St. Petersburg data centers unreachable.
When the emergency crew arrived at the site of the damage, it turned out that the cable had been shot with an air rifle by hooligans, who were detained and handed over to the police.
The obvious step was to switch to the second channel without waiting for the fibers on the first channel to be repaired. This was done fairly quickly, but as soon as it started working, the switch hung again (12:45 - 13:05).
Optical SFP+ transceivers
This time, with the new JunOS version, intelligible messages appeared in the logs, and we found a complaint about a failure to read the service information of one of the SFP+ modules:
Sep 25 13:01:06 chassism : CM_TSUNAMI [FPC: 0 PIC: 0 Port: 18]: Failed to read SFP+ ID EEPROM
Sep 25 13:01:06 chassism : xcvr_cache_eeprom: xcvr_read_eeprom failed - link: 18 pic_slot: 0
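As an illustration of how such messages point to a specific port, here is a minimal Python sketch that extracts the FPC, PIC, and port numbers from the two log lines above. The regular expression and the function name `suspect_ports` are our own assumptions for this example and are not a documented JunOS log grammar.

```python
import re

# Log lines copied from the post. The pattern below is an illustrative
# assumption, not an official description of the JunOS message format.
LOG_LINES = [
    "Sep 25 13:01:06 chassism : CM_TSUNAMI [FPC: 0 PIC: 0 Port: 18]: "
    "Failed to read SFP+ ID EEPROM",
    "Sep 25 13:01:06 chassism : xcvr_cache_eeprom: xcvr_read_eeprom failed "
    "- link: 18 pic_slot: 0",
]

# Tolerates optional whitespace around the colons and the plus sign.
EEPROM_FAILURE = re.compile(
    r"FPC:\s*(?P<fpc>\d+)\s+PIC:\s*(?P<pic>\d+)\s+Port:\s*(?P<port>\d+)\]:\s*"
    r"Failed to read SFP\s*\+\s*ID EEPROM"
)


def suspect_ports(lines):
    """Return sorted unique (fpc, pic, port) tuples named in EEPROM read failures."""
    found = set()
    for line in lines:
        match = EEPROM_FAILURE.search(line)
        if match:
            found.add(
                (int(match["fpc"]), int(match["pic"]), int(match["port"]))
            )
    return sorted(found)


print(suspect_ports(LOG_LINES))  # [(0, 0, 18)]
```

Only the first line matches the pattern. The second line reports the same link number (18) in a different format and is ignored by this sketch.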
After this module was removed, the network recovered. We assumed that the problem lay in this transceiver and in the switch's reaction to it, since this transceiver had been installed in each of the 3 switches that we had replaced one after another.
However, 3 hours later the situation repeated. This time the messages did not point to a failed module. We immediately decided to replace all the transceivers with new ones from the reserve, but this did not help either. We then started checking the transceivers in turn, pulling them out one at a time, and found another faulty transceiver, this time from the new batch. After making sure that the problem with the switches was resolved, we re-patched the internal network connections to return to the normal operating scheme (16:07 - 16:31, 17:39 - 18:04, 18:22 - 18:27).
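For reference, the outage windows given in parentheses throughout the chronology can be added up with a short script. This is a rough sketch under two assumptions of ours: the intervals are local-time connectivity outage windows, and each one is shorter than 24 hours, so an end time earlier than the start time means the window crossed midnight. The names `WINDOWS` and `window_minutes` are illustrative.

```python
from datetime import datetime, timedelta

# Outage windows as listed in the chronology (start, end), local time.
WINDOWS = [
    ("22:00", "23:10"), ("23:55", "00:05"), ("00:40", "00:55"),
    ("02:28", "03:01"), ("03:51", "04:30"), ("05:30", "06:05"),
    ("08:44", "09:01"), ("12:45", "13:05"), ("16:07", "16:31"),
    ("17:39", "18:04"), ("18:22", "18:27"),
]


def window_minutes(start, end):
    """Length of one window in minutes; handles windows that cross midnight."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return int(delta.total_seconds() // 60)


total = sum(window_minutes(start, end) for start, end in WINDOWS)
print(f"{total} minutes = {total // 60} h {total % 60} min")
# 293 minutes = 4 h 53 min
```

Under this reading, the listed windows add up to just under five hours of lost connectivity. This does not include the time cloud servers needed to recover after each failure.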
Cloud Server Recovery
Since the scale of the problem was initially unclear, we tried several times to bring the cloud servers back up. Machines located on the new storage (SR UUIDs beginning with d7e ... and e9f ...) experienced the first failures only as a loss of Internet access. Cloud servers on the old storage, unfortunately, received I/O errors on their disks. Very old virtual machines switched to read-only mode, while newer machines have the errors=panic option set in fstab, which halts the machine when an error occurs. After several restarts, we unfortunately reached a point where preparing the hosts to launch VMs took an unacceptably long time (a mass of I/O errors on LVM is rather unpleasant; in some cases a dying virtual machine turns into a zombie, and catching and killing each one requires manual work). We decided to power-cycle the hosts. This caused a reboot of the virtual machines on the new storage, which we very much wanted to avoid, but it cut the startup time for all the others by at least a factor of three. Throughout this time, the storage systems themselves had no network activity and their data remained intact.
Although there was a reserve of equipment in the data centers, the network was built with redundancy, and a number of other measures were in place to ensure stable, uninterrupted operation, this situation caught us by surprise.
As a result, we decided to take the following measures:
- Strengthen the testing of optical transceivers and network equipment in a test environment;
- Use armored, Kevlar-reinforced fiber optic cable in places where there is a risk of damage from vandalism;
- Accelerate the completion of the cloud server infrastructure upgrade.
Now for the question that interests everyone. The table below shows the compensation for each type of service as a percentage of the monthly cost of the service, in accordance with the SLA.
Although the downtime of the services was formally less than shown in the table (connectivity came and went), we decided to round the downtime up.
|Service|Downtime|Compensation|
|---|---|---|
|Virtual dedicated server|11 hours|30%|
|Dedicated server / arbitrary configuration server|11 hours|30%|
|Equipment placement|11 hours|30%|
|CMS Hosting|11 hours|30%|
|Cloud servers|24 hours|50%|
|Cloud storage|11 hours|50%|
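As a worked example of how the percentages in the table translate into an amount, here is a small Python sketch. The percentages come from the table; the function name `compensation` and the sample monthly costs are hypothetical and chosen only for illustration. How the compensation is actually credited is governed by the SLA, not by this sketch.

```python
# Compensation as a percentage of the monthly cost, taken from the table.
COMPENSATION_PERCENT = {
    "Virtual dedicated server": 30,
    "Dedicated server / arbitrary configuration server": 30,
    "Equipment placement": 30,
    "CMS Hosting": 30,
    "Cloud servers": 50,
    "Cloud storage": 50,
}


def compensation(service, monthly_cost):
    """Return the compensation amount for one service, in the cost's currency."""
    return monthly_cost * COMPENSATION_PERCENT[service] / 100


# Hypothetical monthly costs, for illustration only.
print(compensation("Cloud servers", 1000))  # 500.0
print(compensation("CMS Hosting", 200))     # 60.0
```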
Once again, we apologize to everyone who was affected by this incident. We are well aware of how badly network unavailability hurts our customers, but we could not resolve the problems any faster because the situation was so unusual. We took every possible action to fix the problems as quickly as we could, but unfortunately we were unable to pinpoint the cause analytically and had to find it by working through all the possible options one by one, which took a lot of time.