Cloud outages

    While preparing to announce ETegro Therascale, our new integrated solution for data centers focused on cloud services (we will try to cover it in detail in the near future), we became interested in outages at the largest cloud services. The resulting collection of facts seemed interesting enough that we decided to share it with you. It contains no discoveries or secrets, and the list does not claim to be complete, but it may make you think twice about cloud services.

    We will start, however, not with the largest provider but with one well known to Habr readers: Selectel. On the evening of September 24, problems with switches led to a complex failure that lasted 11 hours. We will not go into details here, since they are laid out thoroughly in the company's blog.

    A little earlier, Windows Azure had a somewhat anecdotal incident. On August 2, the cloud service was unavailable to users in Western Europe for two and a half hours. The cause was a "safety valve" mechanism designed to prevent cascading failures in the network infrastructure, which misbehaved when capacity was being increased.

    In June, Amazon suffered from power problems and generator failures. On the 29th, this led to a 20-minute server shutdown followed by an hour and a half of recovery. The incident affected 7% of instances in one availability zone (AZ) of the US-East-1 region. Well-known companies such as Netflix and Instagram were among the victims. Interestingly, the failure exposed a bug in Elastic Load Balancing (ELB) that significantly slowed the shifting of load to other AZs.

    On February 29, Windows Azure was unavailable for approximately 7 hours. The problem in this case was the date itself, a leap day, which triggered an error in security certificate handling (a case of "the Y2K problem strikes back").

    And on January 20, problems at the Equinix data center in the well-known Silicon Valley spoiled life for 5 million users of Zoho services for several hours. Power in the data center was lost for only a few seconds, but repairing the databases took disproportionately longer.

    And that is just this year. Plenty more comes to mind from 2011.

    For example, there was the August 7 failure involving a 10-kW generator in the Irish data center, at first mistakenly attributed to a lightning strike. It took down Microsoft Business Productivity Online Suite and Amazon EC2 for 3 hours, and Amazon needed more than a day to recover. The problems that followed the next day occurred in a US region and were caused by trouble with network links.

    Before that came the 13-hour outage of the same Amazon EC2 in the US-East-1 region, caused by problems with EBS (Elastic Block Store). An added irony is that it happened on April 21, 2011, the very day on which, in one well-known film, Skynet declared war on humanity. Artificial intelligence had nothing to do with it, of course, but the instances in Northern Virginia were fully restored only after 3 days.

    But enough about Amazon. In September 2011, a day apart, first Google Docs was unavailable for half an hour, and then almost all Microsoft cloud services went down for several hours: SkyDrive, Hotmail, Office 365.

    Gmail is also worth recalling: in the last days of February 2011, 0.02% of its users found their mailboxes empty. Fortunately, nothing was lost, and the data was restored within 30 hours. But the incident reminded the IT world once again that software errors can affect several copies of the same data at once, and that tape backups, because of the way they operate, can save the day even then.
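    To put the outage durations quoted above into perspective, they can be converted into yearly availability figures. The sketch below is only an illustration: it assumes each incident was the service's only downtime in a 365-day year, and the function name `availability_percent` is our own, not part of any provider's tooling.

```python
# Illustrative only: converts an outage duration into the yearly availability
# a service would show if that outage were its only downtime (an assumption).
HOURS_PER_YEAR = 365 * 24  # 8760 hours, leap years ignored


def availability_percent(downtime_hours: float) -> float:
    """Return yearly availability in percent for the given total downtime."""
    if not 0 <= downtime_hours <= HOURS_PER_YEAR:
        raise ValueError("downtime must be between 0 and one year of hours")
    return 100.0 * (1 - downtime_hours / HOURS_PER_YEAR)


# Outage durations in hours, as quoted in the text above.
outages = {
    "Selectel, September 24": 11,
    "Windows Azure, August 2": 2.5,
    "Windows Azure, February 29": 7,
    "Amazon EC2 (EBS), April 21, 2011": 13,
}

for name, hours in outages.items():
    print(f"{name}: {hours} h down -> {availability_percent(hours):.3f}% per year")
```

    For reference, 99.9% yearly availability allows about 8.76 hours of downtime, so under these assumptions a single 11-hour outage is already enough to fall below that mark.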

    All of this is far from a complete list and covers only the largest cases. Looking at the failure statistics, it is easy to see that most incidents come down to two causes: power problems or software errors. We are glad that hardware, which is what we actually deal with, does not appear in these "reports", and that everything passed without serious data loss, although, of course, hardly anyone can estimate the losses from downtime. We deliberately refrain from drawing our own conclusions, however, and instead ask you a question. You personally, and your company: how much do you trust cloud services, and are you ready to use them, or are you using them already?

