Pausing the cloud for new users

    To begin with, we are closing off the ability to create new machines. We have already stopped accepting new customers.

    Existing customers' virtual machines will continue to be serviced as before. Also, please do not create "spare" machines to hold in reserve - we did not stop accepting new customers because things are going well.

    The reason is that we have exceeded our planned capacity, and rewriting the architecture "on the fly" is terrible practice. So it was decided to take a timeout and stop trying to keep pace with the marketing department (incidentally, that is why we went quiet on Habr - we hoped to reduce the inflow of visitors a little). People kept coming anyway, and it reached the point of absurdity: in one long and carefully written component we had assumed a ceiling of about 10k connections. Testing and fixes (the pre-production process) dragged on for a month... and by the time we rolled the component out, it turned out to already be at its limit (6-9k connections per second). And we had spent several months writing it!

    And it became obvious that we simply could not keep up. The decision to stop accepting new customers was not an easy one (you know, arguments in the style of "then what are we paying your salary for?", etc.), but common technical sense won out over a healthy, greedy desire for the company's success.

    How long will the rework take? The planned period is about 2-3 months; how long it will really take, I do not know. First of all because the architecture has to be seriously redone: the centralized databases will be removed for good, and decentralizing everything and anything is an extremely non-trivial task.

    Most likely we will not be able to change the existing installation in place, so a second copy of the cloud will be launched. What the migration of clients from the first to the second will look like - again, I don't know (I haven't even thought about it yet).


    Now about the outages. Yes, we were unlucky enough to have three unrelated failures in a row: one on Sunday, the second on Tuesday, the third on Friday. Who is to blame? Well, that depends on who is asking, but in essence - we are. All the failures were related to software (not ours); we can't even nod toward crooked electricians, cleaners and other scapegoats.

    For those who are interested in what it looked like (sorry for the quality, we weren't exactly set up for high-quality filming):

    Failure 1 - 150 clients:
    Uptime at the time of the failure - 4 months 24 days. The first failure since the system went into production.

    Failure 2 - 391 clients:
    Uptime - 6 months 4 days (counted from the previous failure; back then, because of a bug in the NFS server, we had to forcibly reboot all virtual machines and ask people to remove the NFS entries from /etc/fstab).

    Failure 3 - 398 clients.
    The same storage; uptime at the time of the failure - 2 days 4 hours.

    Eliminating bottlenecks like these is the second task we will be solving during this timeout.

    Our model for storing client data did not anticipate a complete and unconditional halt of the system core. We had planned for controlled reboots, failures of individual services, dead disks in a multiply redundant RAID (and even a dead SAS controller). But we did not expect this kind of "gift".

    That was our mistake, and I am the one responsible for it, since I assumed that we would at least be able to find out why the service had stopped. In the further work on the cloud, this will be one of the main problems I will be working on.

    What's the problem?


    When a failure happens, customers start taking a lot of actions: restarting machines, trying to turn them off and on again, over and over.

    Visually nothing seems to happen, but inside, the system remembers everything. As a result, the task queue for some machines grows to 50-100 tasks. And while we have learned to coalesce identical tasks (if a client asked for a reboot three times, only one reboot actually needs to happen), different tasks are still executed exactly as requested. Yes, if you told a machine to reboot, shut down, power on, reboot, shut down and power on, that is exactly what will be done.
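
    For illustration only, here is a minimal sketch of the kind of per-machine coalescing described above (this is not our actual code, and all names are hypothetical): consecutive identical tasks collapse into one, while a sequence of different tasks is kept and executed in order.

```python
from collections import deque

class VmTaskQueue:
    """Hypothetical per-VM queue that collapses consecutive duplicate tasks.

    Three "reboot" requests in a row become a single reboot, but a mixed
    sequence like reboot, shutdown, start is preserved and run in order.
    """

    def __init__(self):
        self._tasks = deque()

    def submit(self, task: str) -> None:
        # Coalesce only if the new task repeats the most recently queued one.
        if self._tasks and self._tasks[-1] == task:
            return
        self._tasks.append(task)

    def drain(self):
        # Yield tasks in the order they should be executed.
        while self._tasks:
            yield self._tasks.popleft()


q = VmTaskQueue()
for request in ["reboot", "reboot", "reboot", "shutdown", "start", "reboot"]:
    q.submit(request)

print(list(q.drain()))  # ['reboot', 'shutdown', 'start', 'reboot']
```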

    And when there are several hundred such customers... it gets unpleasant, especially when all the requests arrive almost simultaneously. The pool master simply ran out of resources: 800% CPU load and a queue of several hundred jobs.

    But we are simply not ready to split the pool across several pool masters. For now. This is one of the problems we will be thinking through.

    upd: the article was published without my participation; the pictures will appear tomorrow.
