Supporting 10 data centers around the world: my experience and the rakes I stepped on
That is 2 petabytes of backups.
We have 14 data centers around the world, and I look after ten of them. About five years ago I imagined that over there, abroad, everything shines: support is attentive and polite and errs only in the smallest details. My illusions were quickly dispelled.
Here is an example. Our racks hold servers that are, in essence, disk shelves for "slow" backup data. We ran out of space on them. Each server had 24 disks in 36 slots, so we decided to add another 12 HDDs apiece. I filed tickets, explained what we were doing and why, and added that the new disks should go into the unlit (empty) slots.
Ten minutes later, monitoring showed that a disk had dropped out of the first server. "Wow, our colleagues are on fire," we thought; maybe one disk got damaged or something. But almost immediately the second and third disks dropped out too. I started calling German support, and a colleague from India answered me.
By the time we managed to stop his Greek counterpart, this "terminator" had pulled 12 disks out of each of five servers and was about to start on the sixth. The system went into a frantic rebuild. When the support engineer realized what exactly had gone wrong, he started inserting the disks back into the servers, and got the order just slightly mixed up, which added to the madness of the rebuild. Fortunately, thanks to detailed explanations, we managed to avoid restoring from a backup at another site, which would have interrupted the service for half an hour.
That is how I found out who exactly works in support, and how. Part of the blame is mine: I was counting on typical second-line support that would understand me from a few words, and I did not account for cultural and linguistic differences. Since then we write extremely detailed step-by-step instructions in the spirit of:
- Go to server such-and-such.
- Make sure it is that server by checking serial number such-and-such.
- Count down to the fourth disk from the top.
- Find the eighth disk from the bottom.
- If they are the same drive, carefully remove it.
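The point of the double count above is that the two directions must agree before anything is pulled. A minimal sketch of that cross-check (the slot counts and the 11-slot column are hypothetical, not from a real ticket):

```python
# Cross-check a disk's position the way the instruction does: the engineer
# counts from the top AND from the bottom, and only proceeds if both counts
# land on the same physical slot.

def same_slot(from_top: int, from_bottom: int, total_slots: int) -> bool:
    """True if 'Nth disk from the top' and 'Mth disk from the bottom'
    point at the same slot in a column of total_slots slots."""
    return from_top + from_bottom == total_slots + 1

# In a hypothetical column of 11 slots, the 4th from the top is
# also the 8th from the bottom: safe to pull.
assert same_slot(4, 8, 11)
# Any mismatch means "stop and ask", not "pull the nearest disk":
assert not same_slot(4, 8, 12)
```

The redundancy is deliberate: either count alone can be off by one, but it is much harder for two independent counts to be wrong in a way that still agrees.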
In general, we assume that any place where something can be done wrong, or understood wrong, will be exploited like a vulnerability in code. Colleagues still occasionally show us new, unconventional readings of routine actions, and we extend the standard forms accordingly. For every routine operation I already have a two-page instruction. It helps a lot.
St. Louis, USA
Our first data center is in St. Louis, USA. This is where we hosted users' cloud backups at the very start. Given the service's popularity at the time, and the general lack of understanding that backups should not only be made but also stored outside the home (Dropbox had appeared only a year earlier; advanced users were still burning backups to discs), we did not think much about architecture and scaling. As it turned out, in vain: the load grew faster than we expected, and our hoster, PlusServer AG, could not deliver hardware at the required pace.
In general, we have two types of data centers: those where we rent space (they provide racks, cooling, power and security) and those where we effectively take a very large colocation (they give us a section of the hall, connectivity and support). In the first case our own engineers work on site; in the second there is no direct access to the hardware, and the data center's support team does the hands-on work. PlusServer AG is a kind of intermediate scenario, and we mostly use the services of their engineers. I do not recall any difficulties or embarrassments with them. Knock on wood...
Now our section of the St. Louis data center is half idle and waiting for migration; there is a lot of old hardware there that only the testers use.
Strasbourg, France
This is our second-oldest data center, and it also has plenty of "grown-up" hardware; there even seem to be a couple of Core i3 machines that the testers took from the main infrastructure to "torture" in crash tests.
It is the same PlusServer, yet communication with support is surprisingly difficult. Sometimes it is very hard to explain anything to them. As you read above, when something needs explaining, it takes half an hour to cover every possible scenario. An instruction with fewer than 30 steps for restarting a server will most likely be misunderstood.
During tests of a 10G switch, when we asked them to configure the network on a new server, the entire data center dropped off monitoring the moment the ticket was executed. It turned out the person doing the setup had mixed up the server's IP address and the gateway, so all the other servers were trying to reach the network through one server.
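A mix-up like that is cheap to catch before the config is applied. A sketch of such a sanity check using Python's standard ipaddress module (the addresses are made-up documentation ranges, not ours):

```python
import ipaddress

def check_interface(ip: str, prefix: int, gateway: str) -> None:
    """Refuse obviously broken configs before they are applied."""
    iface = ipaddress.ip_interface(f"{ip}/{prefix}")
    gw = ipaddress.ip_address(gateway)
    if gw == iface.ip:
        raise ValueError("gateway equals the host address: fields likely swapped")
    if gw not in iface.network:
        raise ValueError(f"gateway {gw} is not inside {iface.network}")

check_interface("203.0.113.10", 24, "203.0.113.1")       # sane config, passes
try:
    check_interface("203.0.113.10", 24, "198.51.100.1")  # gateway from another subnet
except ValueError as err:
    print("refused:", err)
```

It will not catch every field swap (a swapped pair can still be two valid addresses in the same subnet), but a checklist step like this is exactly the kind of "no decisions on the spot" rule we ended up writing into tickets.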
Tokyo, Japan
Our third data center is in Tokyo; the facility provider is Equinix, the Internet provider is Level 3. At the start, the two of them could not coordinate the cross-connects. We needed a burstable channel: a line we would use at about 10% of capacity for now, with plans to grow tenfold within two years. At the meet-me room (MMR, the point where the providers' channels enter the data center) there was no connectivity at all.
Level 3 said they had done everything correctly and had no plans to redo anything. It took me a week and a half: first to find out what exactly was wrong, and then to persuade both companies to fix it, gathering representatives of their various divisions on conference calls. Each side had done "the right and exact thing" by its own convictions and did not want to admit a mistake, so I asked them "to meet us halfway and do a little more." Done.
The best part about working with Japanese support is that they are incredibly diligent. This has a flip side: the instructions have to be almost as detailed as for Strasbourg, because they are incredibly pedantic and ask a great many questions. Once they spent 12 hours (!) installing controllers into a single server. If a situation arises with two possible answers, and the support engineer knows the first option is 95% likely to be correct, he will simply do it. Almost anywhere in the world. Except Japan. In Japan he will stop, describe the dilemma in detail and patiently wait for your reply. For example, if there is more than one free slot inside a server, they always stop the process and ask which one the card should go into.
Frankfurt am Main, Germany
Equinix again, with full support from them. The data center was planned as a small auxiliary site for the CDN but grew into a serious one. These are the guys who pulled the 12 disks out of our servers on a Friday evening.
The resulting checklist for requests is:
- Break instructions into short steps.
- Use simple, single-clause sentences.
- Try not to leave room for decisions "on the spot"; that is, spell out all the options in detail.
Then everything works just fine. I must say there have been no other incidents since we introduced these rules.
Well, actually, there was one more story. We were running out of space and bought a whole pile of boxes of disks at once. Not from the local supplier (he could not deliver that fast) but from London, shipped from a warehouse in the Netherlands. The truck arrived the same day. Then a letter came from the supplier: we delivered the disks, the recipient refused them, we are taking them back. It turned out the brave guys from security could not find, right on the boxes, whom the disks were for, and turned them away. Since then we always ask for the boxes to be labeled properly when shipping to full-service data centers.
Seagate, by the way, is impressively efficient: out of kindness they decided to return the disks to the sender's warehouse as quickly as possible, reasoning that the customer had clearly gotten the city wrong and the disks were urgently needed somewhere else on the planet. We caught the shipment when it was already on a plane; it took another flight to get it back. The second delivery was accepted successfully.
Our fifth data center is also full service, only the provider is Softlayer. In all this time there has not been a single story there, not a trace of a misunderstanding. Not a single problem at all, except the price.
Working with them is very simple: you say what you need, they bill you and provide the infrastructure and hardware. Their prices are among the highest, but you can and should bargain; different resellers may offer different terms for the same services, for example. Judging by the ticket responses, the staff is large, and absolutely everyone is competent.
Sydney, Australia
We wanted to run the sixth data center with our own engineer, but it turned out that finding a specialist of the required level in Australia is rather difficult. Roughly speaking, we needed a quarter-time freelance admin who would come in a couple of times a month, do the routine work and also show up for emergencies. We usually look for such candidates through specialized agencies, which send us three or four dozen specialists already prepared for this kind of work. We then select up to 10 profiles and hold Skype interviews. One person ends up working for us, with another one or two waiting in the wings so that if anything happens, an illness for example, they can replace him.
The problem in the Sydney data center was the servers with 72 disks. They required a hell of a lot of power: there were six of these servers per rack, each drawing 0.9 kW, with the rack as a whole consuming 6 to 8 kW. A colleague says that if you stand behind one after a shower, your clothes are completely dry within 10 minutes.
London, UK
In London we share a site with Acronis Disaster Recovery. It is the most boring data center: nothing has ever happened there. A year ago the hardware was delivered, and since then, nothing. Knock on wood.
Boston, USA
Our largest data center is in Boston. The plan is to move it.
In Boston we experimented with 72-disk servers in a 4U chassis. There were plenty of problems, because the server is simply magical. We have a guru admin in the Boston office, but he actually has a different job, and calling him across the city every time a drive needs replacing is somehow not quite right. And it is expensive.
So we write tickets to local support. Only it is not the data center's own staff but a third-party company that services the racks in this huge facility, and nobody else is allowed onto the floor: either our guru or their support. What they can do themselves is insert a USB drive into a machine for initialization, change failed drives and reboot a server. That is all. Once we needed specific disks pulled out of a 72-disk server. The sleds there are double, two disks one behind the other; it is hard to figure out which is which, so they still sometimes touched the wrong drives. Someone had to go in person.
At some point after launch, an electrician came running to us with a long letter. The gist was that seven 0.9 kW servers per rack was a bit much: we were drawing 115% of the rated load and had to offload. On top of that, the racks were standard local ones, with two blocks of power sockets at the back, and our servers are 20 centimeters longer than usual; those exact 20 centimeters blocked the power sockets. We diluted the rack with "short" servers. I remember playing Tetris there, swapping the 60-kilogram machines, hauling them together and improving our health.
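The electrician's arithmetic is easy to reproduce. A sketch with the numbers from the story; the 5.5 kW rack rating is an assumption chosen so that seven 0.9 kW servers come out to roughly his 115%:

```python
def rack_utilization(servers: int, draw_kw: float, rated_kw: float) -> float:
    """Percentage of the rack's rated power actually being drawn."""
    return servers * draw_kw / rated_kw * 100

# Seven 0.9 kW servers (6.3 kW total) against a hypothetical
# 5.5 kW rack rating give roughly the 115% he complained about:
print(f"{rack_utilization(7, 0.9, 5.5):.0f}%")  # -> 115%
```

The same calculation, done before racking instead of after, is now part of our capacity planning: the number of "long" servers per rack is fixed up front so the power strips are never the thing that discovers the overload.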
Moscow, Russia
We store Russians' data strictly in the Russian Federation. If you are waiting for jaw-dropping stories about the support here, don't. Although, of course, there are a couple of surprises from DataPro in the collection. In the USA the usual practice for maintenance is a letter in the spirit of: "We have planned such-and-such work; it is necessary and useful for the whole data center and for you; it will take place in such-and-such a window and will not affect you." In Russia the notifications are a bit different, but you probably know that yourselves. I must say, though, there has never been a service interruption.
Before that we were hosted in Tver. While moving the hardware from Tver to Moscow over two weeks, we relocated 15 servers "hot." We wanted to do it in one pass but decided not to risk downtime, so we moved two servers a day. I got up at 6 in the morning and gave the go-ahead for packing; the servers were driven over; at 11 in the evening the crew wrote that they had arrived; we installed and checked them. We brought up a virtual network between Moscow and Tver over a good link, so the servers thought they were on the same physical network with the same addresses as before. And so we shuffled along, two machines at a time: recovery, rebalance, verification, then the next two servers.
Ashburn, USA
The Boston hardware is only on its way to this site, so for now there is not much to tell.
In Ashburn we are moving everything from Boston using the scheme worked out on the Tver example. We again raised a 10G link, and the machines stay on the same network with their addressing preserved. The idea is that you bring the hardware up in the new location and wait for the rebalance: if, say, half the disks fail in transit, you have to wait quite a while for their recovery before bringing in the next batch.
In Europe there are sometimes quirks with customs. When we urgently needed to replace one burned-out server, we shipped the machine from Boston: that is 2 days instead of three weeks from the supplier's warehouse in Europe. But we did not account for the customs officers. A VAT number was missing, and because of the language barrier (French) they could not sort it out with the accounting department in the USA. They sent everything back. Since then we order hardware for Europe from within Europe.
In Boston we ran into the problem that a 36-disk server, which is 200 terabytes, fills up in a week and a half, while ordering one takes two weeks or more. So on the wave of the more-than-successful launch of one of our products, we simply could not order servers in time. We then adopted new data-packing principles and partially spread the load across other data centers, changing a lot in the architecture. It affected me directly: I had to rework the procurement procedures and the arrangements with suppliers. Since then we can buy larger batches, get them faster under preliminary agreements, and pay later.
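The mismatch above is a simple inequality: if a server fills faster than a replacement can be delivered, one order at a time is never enough. A back-of-envelope sketch with the numbers from the text (200 TB per server, about 1.5 weeks to fill, 2-week lead time):

```python
import math

def orders_in_flight(fill_weeks: float, lead_weeks: float) -> int:
    """How many server orders must already be in the pipeline so that
    new capacity arrives at least as fast as the old fills up."""
    return math.ceil(lead_weeks / fill_weeks)

# A 200 TB server fills in ~1.5 weeks while delivery takes 2+ weeks,
# so at least two orders must be in flight at any moment:
print(orders_in_flight(1.5, 2.0))  # -> 2
```

This is exactly what the preliminary agreements with suppliers buy us: the ability to keep several orders in flight (and pay later) instead of discovering, mid-launch, that the pipeline is empty.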
Once we took servers for tests in an incomplete configuration for "slow" data: one processor instead of two, with a disk shelf and a 10G network card inside. We powered them on, and the card did not work: the server simply did not see it. We read the manual and found a note in fine print: the PCIe slots are split between the processors, odd ones on the first, even ones on the second. All slots work only when both processors are present, but power is supplied to every slot regardless. So the card blinked and its LEDs were lit, yet the server did not see it. We moved it to another slot, although, it seems, the manufacturer should have caught this in its own tests.
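The fine-print rule is easy to encode once you know it. A sketch (the odd/even split is as that server's manual describes it; slot numbering starting at 1 is an assumption):

```python
def slot_active(slot: int, installed_cpus: set) -> bool:
    """Odd-numbered PCIe slots hang off CPU 1, even ones off CPU 2
    (the fine-print rule from that server's manual). Power reaches
    every slot regardless, which is why the card's LEDs were lit
    while the host saw nothing."""
    owning_cpu = 1 if slot % 2 else 2
    return owning_cpu in installed_cpus

# With only CPU 1 installed, every even-numbered slot is dead:
assert slot_active(3, {1})       # odd slot, CPU 1 present: visible
assert not slot_active(4, {1})   # even slot needs CPU 2: invisible
assert all(slot_active(s, {1, 2}) for s in range(1, 9))  # both CPUs: all slots work
```

The lit LEDs are the trap here: power and PCIe lanes are wired separately, so a glowing card proves nothing about whether its slot is actually connected to an installed processor.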
At DataPro in Tver, trees once fell onto the optics: the public network disappeared, and they explained over the phone what had happened. The line, it turned out, ran right through a couple of poplars, and the reserve did not save us: the network equipment failed to switch to the backup channel.
In Germany in 2015, Level 3 reconfigured their routing and slightly touched our equipment a couple of levels down. Connectivity dropped for half an hour. At that moment the European data center was the primary one, which meant a service outage in part of Germany. We have changed the architecture since then, but my colleagues can tell that story better.
And there was a case in the USA, probably the funniest thing I have ever seen. A server needed service, so the manufacturer's engineers were called in to replace the motherboard and power supply. 72 disks, 80 kilograms gross weight. These bunnies started pulling it out of the rack with all its innards inside. They got it about halfway out when it began to tilt and fall. They tried to hold it and pull it out anyway, but bent the rails. They tried to shove it back, but the bent rails would not let it in. So they left it in that state and said they would come back in a week with a replacement.
As you can see, in five years there has been only one conditionally dangerous situation, when the question of restoring a backup from another site came up, and even then nothing happened. Everything else was solved locally, and solved fairly well. The minor rough edges are the usual human factor, and it would probably be strange if we had fewer such stories.