Major accidents in data centers: causes and effects

    Modern data centers are reliable, but any equipment fails from time to time. This short note collects the most significant incidents of 2018.



    The impact of digital technologies on the economy is growing, the volume of processed information is increasing, and new facilities are being built. All of this is good as long as everything works. Unfortunately, the economic impact of data center failures has also grown since companies began placing business-critical IT infrastructure in them; this is an inevitable consequence of digitalization. Below is a small selection of the most notable accidents that occurred in different countries last year.



    USA


    This country is a recognized leader in data center construction. The United States has the largest number of large commercial and corporate data centers serving global services, so the consequences of incidents there are the most significant. At the beginning of March, a powerful cyclone caused power outages at four facilities of the operator Equinix. The sites hosted Amazon Web Services (AWS) equipment, and the accident made many popular services unavailable: GitHub, MongoDB, NewVoiceMedia, Slack, Zillow, Atlassian, Twilio and Capital One were affected, as was the Amazon Alexa virtual assistant.


    In September, weather anomalies hit the Microsoft data centers in Texas: a thunderstorm disrupted the power supply of the whole region, and in a data center that had switched to diesel generator power, the cooling shut down for reasons that remain unknown. It took several days to eliminate the consequences of the accident, and although load balancing kept the failure from being critical, users around the world noticed a slight slowdown in Microsoft cloud services.


    Russia


    The most serious accident occurred on August 20 in one of Rostelecom's data centers. It took the servers of the Unified State Register of Real Estate down for 66 hours, so they had to be moved to a backup site. Rosreestr was only able to resume processing applications received through all channels on September 3. The state agency is now trying to recover a large sum from Rostelecom for violating the service level agreement.


    On February 16, problems in the Lenenergo grid caused the backup power supply system to switch on in the data center of the Xelnet company (St. Petersburg). A brief break in the AC sine wave disrupted many services: in particular, the large cloud provider 1cloud was affected, but the most noticeable problem for the Russian Internet audience was that the VKontakte social network became inaccessible. Most remarkably, it took about 12 hours to fully eliminate the consequences of a short-term power failure.


    The European Union


    Several serious incidents were recorded in the EU in 2018. In March, a failure occurred in the data center of the air carrier KLM: the power supply was cut off for 10 minutes, and the diesel generator sets lacked the capacity to keep the equipment running. Some of the servers shut down, and the airline had to cancel or postpone several dozen flights.


    This was not the only accident affecting air transport: in April, the power supply system of the Eurocontrol data center failed. The organization coordinates air traffic in the European Union, and during the 5 hours that specialists spent eliminating the consequences of the accident, passengers again had to put up with delayed and rescheduled flights.


    Accidents in data centers serving the financial sector cause very serious problems. Interruptions in transaction processing are usually costly here, and the facilities are built to a correspondingly high level of reliability, but that does not rule out incidents. On April 18, the NASDAQ Nordic stock exchange (Helsinki, Finland) was unable to conduct trading throughout Northern Europe for the day: a gas fire suppression system was triggered without authorization in the DigiPlex commercial data center, and the facility was accidentally de-energized.


    On June 7, data center problems forced the London Stock Exchange (LSE) to delay the start of trading by an hour. Also in June, a data center malfunction took the services of the international payment system Visa offline in Europe for an entire day; the details of the incident were not disclosed.


    Japan


    In the summer of 2018, a fire broke out on the underground levels of an Amazon data center under construction in Tokyo; 5 workers died and at least 50 were injured. The fire damaged about 5,000 m² of the facility. The investigation found that human error was the cause: careless handling of acetylene torches ignited the insulation.


    Causes of Failures


    The above list of incidents is far from complete: data center accidents affect customers of banks and telecom operators, take cloud providers' services offline and even disrupt the work of emergency services. Even a short interruption in service can lead to serious losses. According to the Uptime Institute, the largest share of failures (39%) is associated with the power supply system. In second place (24%) is the human factor, and in third (15%) is the air conditioning system. Only 12% of data center accidents can be attributed to natural phenomena, and only 10% occur for reasons other than those listed.
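
    As a quick sanity check on these figures, the breakdown can be tallied in a few lines of Python. This is only an illustrative sketch: the percentages are the ones quoted above, while the dictionary keys and the helper name combined_share are our own labels, not Uptime Institute terminology.

```python
# Shares of data center failures by cause, in percent, as quoted above
# from the Uptime Institute. The keys are our own labels (an assumption).
failure_shares = {
    "power supply": 39,
    "human factor": 24,
    "air conditioning": 15,
    "natural phenomena": 12,
    "other": 10,
}


def combined_share(shares, causes):
    """Sum the percentage shares of the given causes."""
    return sum(shares[cause] for cause in causes)


# Iterating over a dict yields its keys, so this sums every category.
total = combined_share(failure_shares, failure_shares)
power_and_human = combined_share(failure_shares, ["power supply", "human factor"])

print(f"all causes: {total}%")
print(f"power supply + human factor: {power_and_human}%")
```

    The five categories add up to 100%, and power supply problems together with the human factor account for 63% of failures, which is why these two causes deserve attention first.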


    Despite strict reliability and safety standards, no facility is immune to incidents. Most are caused by power failures or personnel errors. Owners of data centers and server rooms should pay attention to these two factors first, and customers should understand that even market leaders cannot guarantee absolute reliability. If equipment or a cloud service supports business-critical processes, consider a backup site.


    Photo source: telecombloger.ru

