Clodo.ru and another mysterious fall
I have the server from 13:30 there is no response from tech support.
Dear clodo.ru employees, where are you? Where is the ticket letter / message / etc your appeal why and how much should we lie?
UPDATE June 15th. Compensation.
Today, one habr user sent a very interesting message about his attempt to receive compensation:
11:55:22 06/14/2011 I write:
VPS 11111-1 does not work for us, for 4 days already.
What is the problem? When will the server work ?!
11:56:55 06/14/2011 Support service (Alexandra M * shuk ***) writes: Has the
server been in the "request processing" state for four days already?
12:01:42 06/14/2011 Support service (Alexandra M * shuk ***) writes:
Check now. By ssh is available, it works.
14:23:25 06/14/2011 I write:
Now available. What was the problem?
14:27:27 06/14/2011 Support service (Alexandra M * shuk ***) writes:
The problem was on our part. We apologize for the inconvenience.
Then I learned about compensation:
According to the Public Offer Agreement as amended on May 23, compensation is paid only upon conclusion of a fault-tolerant service agreement with Clodo. Unfortunately, we are forced to refuse to pay you compensation. We are very sorry for the incident. We did everything so that nothing like this would happen again in the future. If you are interested in a high-availability solution deployed in two data centers, we can provide it within a month.
Claude, Claude, you have everything in his repertoire, anyone anything, we should not have to all have
excuses excuses ... By the way tell your lawyer the law changed the conditions of the provision of your oforty services must be with notice of not less than 5 working days, unless otherwise stated in the contract. Employees clod you yourself is not ashamed to work in such a company?
UPDATE June 14th: 4 days have passed.
Explanations for the accident 06/10/2011
As you know, Clodo is working to restructure the cluster management structure. Given all the sad experience, we lead them at night and breaking them into atomic non-destructive actions.
On the night before the incident, our engineers worked to exclude one of the InfiniBand switches from the cluster. On the one hand, this action was subjected to preliminary testing, and on the other hand, once it was completed, it was once again verified that nothing had been violated. After that, no work was done.
However, after a very long time, the fall of virtual machines began. The problem arose due to the failure of the IP over Infiniband driver (IPoIB) in working with Suse Linux Enterprise Server installed on our XEN-nodes, cluster controller and relays. Unfortunately, the failure was quite fatal and the virtual machines did not rise automatically, but manually. Moreover, emergency start scripts had to be made, so the rise of virtual machines did not happen as fast as we would like. As a result of a failure, a small part (10-15) of virtual machines lost the connection between the virtual machine and the disk. The performance of these machines had to be restored longer.
The failure occurred due to our fault. Its main components:
insufficient testing before the operation (delayed errors were not excluded);
incompletely tested interaction between Suse Linux Enterprise Server and Infiniband;
incomplete startup scripts in case of an accident.
All these errors are a consequence of the human factor. Perpetrators are barred from any action on production servers.
Why did the support service respond so slowly?
The answer is simple: at the time of the accident, the load on the technical support service increases many times. The support service was connected to the elimination of the accident: compiled lists of victims, monitored the launch after troubleshooting, helped the system administrators in every way.
Why during the accident they didn’t communicate with me on the forum / Habrahabr / Twitter and didn’t tell what was happening.
All the people who were able to accurately understand and explain what was happening were engaged in eliminating the consequences. Spending time on people solving the problem was extremely impractical.
Will something like this happen again?
At the moment, all work is suspended. The only thing that is done is the restoration of functionality disabled for the duration of the work. A new regulation of preparation for work has been approved, which includes testing the performance of the entire system as a whole after any planned changes to its nodes.
On my own behalf, I want to apologize to those to whom I did not answer the messages that you sent to my inbox. I do not have time to physically process such a stream of letters.
I want to say on my own behalf, you really Maxim think that this is a worthy explanation ... Although what I’m talking about, how you conduct business, is explained. They themselves have only done worse.
UPDATE 19:45: I have the remaining 2 servers running on different accounts.
UPDATE 19:15: The story continues. I already moved out as the main project, but judging by the comments, there are still problems. At the moment, the Claude forum looks like this:
And the work schedules of one of my servers over the past 6 hours are:
Do you leave and do not even say goodbye?
UPDATE 07:10: more than 18 hours. All servers have risen. It's time to move! Apparently the employees of Claudia have already come to work, slept well and turned on the server :)
UPDATE 04:30: more than 15 hours.Then I would like to see the surnames responsible for the fall. Almost 5 o’clock in the morning, I wait until the servers are raised to pick up the latest current backups. But something tells me that the Claude employees have long gone to sleep at home ...
UPDATE 02:00: 13 hours. Dear Claude employees, it’s just cuddling and disrespect to customers! I hope after this fall all or most of you will leave. For all your downtime on all servers, more money is lost than your entire hosting costs! Moreover, the reputation of one project, which now cannot be restored for any money, was undermined, I made compensations and bonuses as much as I could to my users for your every drop. I can’t imagine what to do now.
Your cloud service costs nothing morewith such uptime. To say that I hate you is to say nothing. Thank you again for your quality and attitude .
UPDATE 23:00: 10 hours lying. No, well this is too much gentlemen. Everything is so fault-tolerant that you cannot raise it for 10 hours.