The main cause of data center accidents sits between the keyboard and the chair
Our first article on major accidents in modern data centers left some questions unanswered, so we decided to explore the topic further.
According to Uptime Institute statistics, most incidents in data centers are connected with power supply failures, which account for 39% of the total. The human factor comes next, with another 24% of accidents. The third most common cause (15%) is failure of the air conditioning system, and fourth place (12%) goes to natural disasters. Other troubles together make up only 10%. Without questioning the data of a reputable organization, let us look at what these different accidents have in common and try to understand whether they could have been avoided. Spoiler: in most cases, they could.
Put simply, there are only two problems with power supply: either there is no contact where there should be, or there is contact where there should not be. One can talk at length about the reliability of modern uninterruptible power supply systems, but they do not always save the day. Take, for example, the widely publicized failure of a data center used by British Airways and owned by its parent company, International Airlines Group. There are two such facilities near Heathrow Airport - Boadicea House and Comet House. At the first of them, on May 27, 2017, an accidental power outage led to an overload and failure of the UPS system. As a result, some IT equipment was physically damaged, and recovering from the accident took three days.
The airline had to cancel or reschedule more than a thousand flights, and about 75,000 passengers could not fly on time. Compensation alone cost $128 million, not counting the expense of restoring the data center to working order. The story behind the blackout remains murky. According to the results of an internal investigation, announced by International Airlines Group CEO Willie Walsh, it was caused by an engineering error. Nevertheless, the uninterruptible power supply system was supposed to withstand such an outage - that is exactly what it was installed for. The data center was managed by specialists from the outsourcing company CBRE Managed Services, so British Airways sought to recover the damages through a London court.
Power outages tend to follow similar scenarios: first the power goes out, through the fault of the utility, bad weather, or internal problems (including personnel errors); then the uninterruptible power supply system either cannot handle the load, or a brief break in the sine wave knocks out numerous services whose recovery devours time and money. Can such accidents be avoided? Of course - if the system is designed correctly. However, even the builders of large data centers are not immune to errors.
When the direct cause of an incident is incorrect action by data center personnel, the problems most often (though not always) affect the software side of the IT infrastructure. Such accidents happen even at large corporations. In February 2017, some Amazon Web Services servers were taken offline because a member of one data center's technical maintenance team mistyped a command. The error occurred while debugging the billing process for Amazon Simple Storage Service (S3) cloud customers. The employee intended to remove a small number of virtual servers used by the billing system, but instead hit a much larger cluster.
As a result of the engineer's error, servers running important Amazon cloud storage software modules were deleted. The first casualty was the indexing subsystem, which holds the metadata and location of all S3 objects in the US-EAST-1 region. The incident also affected the subsystem used to store data and manage available storage space. After the virtual machines were removed, both subsystems required a complete restart, and then Amazon's engineers were in for an unpleasant surprise: for a long time the public cloud storage was unable to service client requests.
The effect was widespread, since many large resources use Amazon S3. The malfunctions affected Trello, Coursera, IFTTT and, most unpleasantly, the services of major Amazon partners from the S&P 500 list. Damage in such cases is not easy to calculate, but it was on the order of hundreds of millions of US dollars. As you can see, a single wrong command is enough to take down a service of the largest cloud platform. Nor is this an isolated case: on May 16, 2019, during maintenance work, Yandex.Cloud deleted users' virtual machines in the ru-central1-c zone that had ever been in the SUSPENDED status. Here customer data was affected, and some of it was irretrievably lost. People are imperfect, of course, but modern information security systems have long been able to vet the actions of privileged users before the commands they enter are executed. Had Yandex or Amazon deployed such solutions, these incidents could have been avoided.
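The idea behind such safeguards is simple: a destructive command should only run against an explicitly pre-approved scope, for example one tied to a change ticket. A minimal, hypothetical sketch (the command pattern, target names, and approval set are invented for illustration; real privileged access management products are far more elaborate):

```python
import re

# Hypothetical pre-approved scope for this maintenance window,
# e.g. taken from a change ticket. Names are invented.
APPROVED_TARGETS = {"billing-subset-a"}

# Commands considered destructive and therefore subject to scope checks.
DESTRUCTIVE = re.compile(r"^(remove|terminate|delete)\b")

def check_command(command: str, target: str) -> bool:
    """Allow non-destructive commands; allow destructive ones
    only if the target is inside the pre-approved scope."""
    if not DESTRUCTIVE.match(command):
        return True
    return target in APPROVED_TARGETS

print(check_command("remove-servers", "billing-subset-a"))  # True: in scope
print(check_command("remove-servers", "index-cluster"))     # False: blocked
print(check_command("list-servers", "index-cluster"))       # True: read-only
```

A guard like this would have stopped the mistyped Amazon command at the moment it strayed outside the billing subsystem, instead of letting it reach a larger cluster.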
In January 2017, a major accident occurred at Megafon's Dmitrov data center. The temperature in the Moscow region had dropped to −35 °C, which caused the facility's cooling system to fail. The operator's press service said little about the causes of the incident - Russian companies are extremely reluctant to discuss accidents at their facilities; in terms of openness, we lag far behind the West. A version circulating on social networks blamed the freezing of coolant in pipes laid outdoors and a leak of ethylene glycol. If that version is true, the operations service, hampered by the long holidays, could not promptly obtain 30 tons of coolant and improvised instead, setting up makeshift free cooling in violation of the system's operating rules. The severe cold aggravated the problem - winter suddenly arrived in Russia in January, although no one seemed to be expecting it. As a result, the staff had to power down some of the server racks, leaving some of the operator's services unavailable for two days.
One might call this a weather anomaly, but such frosts are not unusual for the capital region. Winter temperatures around Moscow can drop even lower, which is why data centers there are built for stable operation at −42 °C. Most often, cooling systems fail in cold weather because of an insufficient concentration of glycols and excess water in the coolant solution. Problems also arise from pipe installation mistakes or from miscalculations in the design and testing of the system, stemming mainly from the desire to save money. The result is a serious accident out of the blue - one that could easily have been prevented.
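The link between glycol concentration and freezing point can be illustrated with a rough lookup. The figures below are approximate, rounded values for ethylene glycol/water mixtures from common engineering tables (exact numbers vary by source and glycol type), so this is a sketch, not a design tool:

```python
# Approximate freezing points of ethylene glycol/water mixtures:
# (volume %, freezing point in °C), rounded figures from public tables.
FREEZE_POINTS = [(0, 0.0), (20, -8.0), (30, -14.0),
                 (40, -23.0), (50, -36.0), (60, -48.0)]

def min_glycol_concentration(target_temp_c: float):
    """Return the smallest tabulated glycol concentration (vol %)
    whose freezing point is at or below target_temp_c, or None
    if the target is colder than the table covers."""
    for conc, freeze in FREEZE_POINTS:
        if freeze <= target_temp_c:
            return conc
    return None

print(min_glycol_concentration(-35))  # 50: enough for the Megafon frost
print(min_glycol_concentration(-42))  # 60: the Moscow design temperature
```

The point of the exercise: a mixture specified for a milder winter (say, 40% glycol, freezing near −23 °C) would indeed freeze at −35 °C, which is consistent with the "too much water in the coolant" explanation.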
Most often it is thunderstorms and/or hurricanes that disrupt a data center's engineering infrastructure, leading to service outages and/or physical damage to equipment. Incidents caused by bad weather happen quite regularly. In 2012, Hurricane Sandy swept along the eastern coast of the US with heavy rain. The Peer 1 data center, located in a high-rise in Lower Manhattan, lost external power after saltwater flooded the basements. The facility's emergency generators were on the 18th floor, and their fuel supply was limited: rules introduced in New York after the 9/11 attacks prohibit storing large amounts of fuel on upper floors.
The fuel pump also failed, so for several days the staff hauled diesel for the generators by hand. The team's heroism saved the data center from a serious accident - but was such heroism really necessary? We live on a planet with a nitrogen-oxygen atmosphere and plenty of water; thunderstorms and hurricanes are commonplace here, especially in coastal areas. Designers should probably account for the associated risks and build an appropriate uninterruptible power supply system - or at least choose a more suitable site for a data center than a high-rise on an island.
Everything else
The Uptime Institute lumps a variety of incidents into this category, and it is hard to single out a typical one. Theft of copper cables; cars crashing into data centers, transmission towers and transformer substations; fires; excavators cutting through fiber optics; rodents (rats, rabbits and even wombats, which are actually marsupials); and enthusiasts who practice shooting at wires - the menu is extensive. Power failures can even be caused by an illegal marijuana plantation stealing electricity. In most cases, the perpetrators of an incident are specific people - that is, we are again dealing with the human factor, where the problem has a first and last name. Even when an accident at first glance seems tied to a technical malfunction or a natural disaster, it can be avoided if the facility is properly designed and properly operated. The only exceptions are cases of critical damage to data center infrastructure, or the destruction of buildings and structures by natural disasters. Those are genuine force majeure; all other problems are caused by that layer between the keyboard and the chair - perhaps the most unreliable part of any complex system.