Murphy's Laws in IT

Original author: Ivan Pepelnjak
  • Translation
Not so long ago I had a chance to talk with a developer who did not understand why a completely redundant connection between data centers cannot guarantee 100% service availability.

The client had an idyllic setup: L3 connectivity between the sites, with two core routers at each end and two circuits from different carriers, which supposedly left the building at different points and never crossed paths anywhere after that. And despite all this, we were telling the developers that they could not spread the components of business-critical applications across the two locations. Amazing, right?
Well, what could go wrong with a fully redundant design?

Every component you use has a non-zero probability of failure. Consequently, the probability of several components failing at once is also non-zero. But if each component is sufficiently reliable (say, 99.9% availability), then the likelihood of a simultaneous failure is vanishingly small, right? Wrong. Components that are connected to each other tend to fail at the same time.
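The naive redundancy math behind that assumption can be sketched in a few lines (the 99.9% figure comes from the text above; the independence assumption is exactly what the rest of the article demolishes):

```python
# Sketch: combined availability of two redundant links, under the
# naive assumption that their failures are independent.
def combined_availability(a1: float, a2: float) -> float:
    """Probability that at least one of two links is up,
    assuming independent failures."""
    return 1 - (1 - a1) * (1 - a2)

single = 0.999                       # 99.9% availability per link
both = combined_availability(single, single)
print(f"{both:.6f}")                 # 0.999999 -- "six nines" on paper
```

In practice failures are correlated (shared software bugs, shared maintenance windows, shared conduits), so the real combined availability ends up far closer to the single-link figure than this product suggests.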

An obvious example: if the core router's software is vulnerable to a killer packet (one that, say, crashes and reboots the device), the neighboring router will most likely receive the same packet next.

There was a real case: a bug was discovered in Cisco routers that was triggered by processing exceptionally long AS paths. That was when many people first understood the importance of the "bgp maxas-limit" command. The culprits turned out to be MikroTik routers. Their configuration syntax is very similar to that of IOS routers, except that instead of entering an AS path you enter the number of times the local AS should be prepended, which some administrators did not realize, because why read the documentation? The routers did not validate the value entered, and as a result the lower 8 bits of the local AS number were taken as the prepend count. For some, that number was quite high.
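The defense the story refers to is the IOS `bgp maxas-limit` command, which makes the router discard BGP updates whose AS path exceeds a configured length. A minimal sketch (the AS number and the limit of 50 are illustrative, not from the original incident):

```
router bgp 64500
 ! Discard any BGP update whose AS_PATH is longer than 50 entries,
 ! shielding the router from absurdly long prepends like the ones
 ! the misconfigured MikroTik boxes produced.
 bgp maxas-limit 50
```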

A less obvious example. Almost everyone performs maintenance within the same service window. The provider can start preventive work on one of your circuits at the very moment you are upgrading the router that carries the second one (yes, this happened to me; they forgot to warn us because they assumed everything on our side was redundant).

The developer still did not believe me, so I told him another story.

Some time ago we were notified that the data center would be cut off from external power for a couple of hours due to preventive maintenance. Not a problem: there is a diesel generator set and there are high-capacity UPSs. But when the power went off, the diesel did not start, and there was no one to repair it in the middle of the night. Fortunately, there was enough time to shut all systems down cleanly, but an hour later our data center was dead, despite triple redundancy.

To this the developer replied, "Now I understand." After that it was much easier to agree on what was needed to make their data centers and services truly disaster-tolerant.

Other examples from the comments.
Here is another story about fault tolerance. All external circuits of one large site were concentrated in a single building. Full redundancy was provided: two internal routers, two external ones, two VPNs, backup power from a diesel generator. The DS3 circuits each terminated on their own external router, used separate media converters, and left the building in different directions.

The fiber went around the building and eventually converged at the provider's site. There it was plugged into two media converters that turned it back into DS3. Both media converters were lying on a shelf, plugged into an ordinary household surge protector, which in turn was plugged into a single outlet.

Another generator story. At one data center the utility feeds went down and the diesels started up. But someone had left a pile of wooden boards on one of them, and an hour later the diesel caught fire.

More about diesels; this was 12-13 years ago. I worked at a large British provider (not BT), and one hot day (yes, that happens here too) I was doing some work at one of our large data centers. I arrived early and saw a huge container being delivered. When I asked what it was, I was told it was a generator that would provide a little extra power: the cooling systems were running at their limit and there was not enough capacity. I thought "cool" and got to work.

Late in the morning a fire alarm went off and the entire data center was de-energized; only the emergency lighting stayed on. I went outside and realized what had happened: the new generator had been installed close to the air intakes of the central ventilation system, and when the diesel was started it spat out a huge cloud of smoke that was sucked into the ventilation, setting off the smoke detectors inside the building.

When I arrived there the next day, a huge chimney had been installed on the diesel exhaust, diverting the smoke far away from the data center building.

That site had plenty of fault tolerance, but in the end none of it helped...

At our .edu, everything was redundant. But there was one thing we could not fix: the machine room was directly under the art department's toilets. Sure enough, one day we were quite literally flooded with sewage. Did you know you can order a visit from Sun Microsystems field engineers to clean storage equipment with alcohol-soaked cotton swabs?

At one of the universities in my city, the main data center is located in the basement of one of the central buildings. They had just finished construction of a neighboring building and needed to test its water supply system. During the test they opened a drain outside the building, but forgot to close it for the night. In the end, all the water ran down to the basement entrance, and the entire basement ended up under 30 centimeters of water.

Once a client was about to move part of its server hardware to another building, to free up space and add a bit of fault tolerance. Connectivity ran over two OC-3s from a single PKO, but along two independent routes. We had a detailed migration plan, every little detail was accounted for, and when zero hour came, I shut down the ports, the equipment was powered off, and the move to the other building began. An engineer stood ready to pull the fiber, and the provider was given the green light to cut the now-unused circuit. Except that... someone, somewhere, back when the circuits were first put into service, had mixed up the circuit identifiers. So while half of our data center was in transit from one place to the other, the other half had its only external link cut. Not very nice.

A few years ago (around 2005-2007), one of the major trunk links connecting Queensland to the rest of Australia (and, at the time, to the rest of the world) went down. If I remember the sequence of events correctly, it went like this.

The trunk ran along two different paths, one along the coast, the other inland. At about 3 a.m. the line card terminating the coastal fiber started spewing errors and then failed completely. Not a problem: all traffic was rerouted through the inland path. The engineers were told to arrive at 9 a.m. to replace the card... But at about 6 a.m. an excavator cut the fiber running across the continent.

About 10 years ago (I hope hardware vendors have gotten wiser since) I lost a RAID 5 array. On ten disks. It all started when disk number "3" failed. The engineer walks up to the array, pulls out disk three, and the array goes down. It turned out that the management interface numbered the disks 0 through 9, while the labels on the front panel ran from 1 to 10, so the engineer had pulled out a healthy disk.
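The mismatch in that story is a classic off-by-one between 0-based and 1-based numbering; a few hypothetical lines make the trap explicit (slot and label values are illustrative):

```python
# Sketch: the controller firmware numbers disks 0..9, while the
# chassis front panel labels them 1..10. Converting a controller
# slot to its panel label requires adding 1 -- forget that, and
# you pull a healthy disk instead of the failed one.
def panel_label(slot: int) -> int:
    """Front-panel label for a 0-based controller slot number."""
    return slot + 1

failed_slot = 3                      # controller reports slot "3" failed
print(panel_label(failed_slot))      # the disk to pull is labeled 4, not 3
```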

A large logistics center with every kind of redundancy: UPS (batteries plus diesel) and everything else. One day the entire external power supply fails. No matter: the batteries pick up the load, the diesels start, and the office keeps working.

Power is restored. The whole block lights up, but the office stays dark.
True to Murphy's laws, the diesel was shut down correctly; it was only the relay switching the load from the diesel back to the utility feeds that failed to operate...

The data center I used to work in had utility feeds and UPSs. The UPSs were sized for 6 hours of runtime; in the event of an outage we were supposed to migrate the virtual servers to another site. Not the best solution, but it seemed workable.
Then one day our data center really did lose external power, and we discovered that the air conditioners were not on the UPS. The machine halls overheated almost instantly, and within half an hour all the systems began shutting down.

In one third-world country there was a different incident. When the building lost power, the diesels would not start. It turned out the diesel fuel had leaked out of them.

We are customers of a large data center where everything is redundant: battery power, diesels, duplicated fiber along diverse paths. Idyllic. The machine halls were being expanded, and some brave fellows knocked down a couple of walls, having first screened off the equipment from the dust. Then these two clowns had the brilliant idea of cleaning the hall floor. Unfortunately, they chose a bucket of water and a rag, as in the good old days. Naturally, one of them knocked the bucket over.

Share your own "despite all the redundancy" incidents in the comments.
