How the CROC engineering service works, and what happens when a cluster breaks somewhere far away at 3 a.m.
The DL360 is a Pentium I server kept for hot replacement. Somewhere far away in Siberia, its twin has been running under constant load for many years. If it fails, we have a replacement that lets the customer simply keep working without a major reconfiguration.
And this is the kind of view that a morning on the road often starts with.
Good morning! My name is Alexander, and I head the CROC service team.
All over the country there are many facilities where a cluster failure immediately lands the local boss on television. These are research institutes, industrial enterprises, bank and insurance company sites, oil company facilities, airports and so on. We install hardware and software there and support all of it.
To begin with, almost no installation goes without adventures. It is fine if the customer merely forgot to provide power or a network. It is worse when the server rack is left standing outside the building because someone gave the wrong door dimensions. There are also moments like: “Guys, we prepared everything and connected it, there is just one nuance: your server was dropped during unloading. Well, only a couple of times.” Now I will tell and show you what our work looks like.
What the work is about
While working at CROC, I have traveled almost the whole country for installations and support. Now I run the department, so I travel very rarely.
My workplace. As you can see, there are more folders than pieces of hardware.
The usual scenario for a duty shift is this: we sit and wait for a call. When something breaks, we have fairly strict standards for how quickly the breakdown must be fixed. For example, at critical facilities in Moscow, the hardware replacement time is 4 hours from the moment the request is received. There are also especially important facilities in Novosibirsk and other cities; fortunately, booking tickets is not a problem these days.
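The replacement window above can be sketched as a simple deadline calculation. This is only an illustration: the `REPLACEMENT_SLA` table, the `moscow_critical` key and both function names are hypothetical, and the only figure taken from the text is the 4-hour window for critical facilities in Moscow.

```python
from datetime import datetime, timedelta

# Hypothetical SLA table. Only the 4-hour Moscow figure comes from the text;
# other facility classes would be added with their own contractual windows.
REPLACEMENT_SLA = {
    "moscow_critical": timedelta(hours=4),
}


def replacement_deadline(opened_at: datetime, facility_class: str) -> datetime:
    """Return the moment by which the failed part must be replaced."""
    return opened_at + REPLACEMENT_SLA[facility_class]


def time_left(opened_at: datetime, facility_class: str, now: datetime) -> timedelta:
    """Return the time remaining until the deadline (negative once it is missed)."""
    return replacement_deadline(opened_at, facility_class) - now


# A request received at 3 a.m. must be closed by 7 a.m.
opened = datetime(2024, 1, 15, 3, 0)
print(replacement_deadline(opened, "moscow_critical"))
print(time_left(opened, "moscow_critical", now=datetime(2024, 1, 15, 5, 30)))
```

Counting from the moment the request is received, rather than from the moment an engineer picks it up, matches the wording of the standard described above.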
The team waiting for a call is required to be in place and on duty. As a rule, during this time the engineers either tinker with new hardware and study it or work on self-training. In general, we practice and improve our qualifications.
Sometimes we get our eye on new solutions and order them ourselves, just “to take a look”. Many interesting projects come out of this, from an office lighting system that adjusts to the weather and open windows to various solutions for our security team.
Another group of engineers does planned installation and maintenance. They do not have to drop everything and run to a terminal or rush to the airport. They know in advance what, where, how and when. That does not make it any easier because, I repeat, each installation is a separate adventure. It also has to be prepared for carefully, which in practice is much more nerve-racking than rushing to the rescue like Chip and Dale.
Outside a duty shift we still work with our hardware, but that can be done away from the office.
Another important aspect is our engineers. These are people with a great deal of practical experience, and some of them often speak both at internal trainings and at various technical conferences. Except for the engineers who are on duty under a service contract, of course. In theory, if several critical situations happen at the same time, an engineer doing planned work may also have to break off his talk mid-sentence and run. But in my memory that has happened only once.
The mugs are not mine. But they are very handy, for example, for holding all sorts of small parts so that they do not get lost.
Going out for an installation
A standard cluster installation, for example, usually needs more than one specialist. One handles the operating systems and the actual cluster configuration, the second is a storage specialist, and the third is an application specialist, depending on whether the customer installs the application software or not. Sometimes two of us are enough. Network engineers are often available on site, but sometimes there is no IT staff at a particular location at all.
It starts with unloading. Sometimes the hardware gets damaged. We take photos in case we need to prove a malfunction (for example, that the equipment arrived broken through the fault of the transport company). Then it takes a long time to sort out.
Suppose everything arrived as it should. We are installing a system, that same cluster. Everything looks fine: there is a specification, the equipment, the software, we are doing the configuration, and the managers have their agreements. Everything has been discussed a hundred times, and all the moments that experience says are difficult have been agreed on. The engineer arrives and realizes that this is not an ideal world.
He goes to, say, the network engineer and says: “I need eight interfaces allocated on the switch.” And the answer is: “I only have six, and two more will be here tomorrow or the day after. We have to order them from the warehouse.” The engineer runs around asking everyone for something. By the time he has been given everything, shown a place in the rack, connected to power and had the cables pulled, a couple of days may have passed.
Then he starts calling the administrators, who register him in the domain, then the DBMS specialists, who begin to explain how everything is arranged, and the administrators also add him to their own systems. Each time he works with someone new, and there is no guarantee that this person is prepared. But the system is in production, and the engineer does not know the passwords, which means an admin has to sit next to him and type them in for him. That is not much fun for them either. And people can be very different. For example, the SQL specialist likes a drink, and someone else walks around in a Simpsons shirt at minus thirty because his wife left him. You have to find an approach to everyone.
Of course, all these people help, because there is a common task, but there is still a peculiar kind of fun in having to learn something from each of them in order to finish your own job. Everyone has to explain to you how things are arranged. Very often the documentation is somewhat at odds with reality, and the installation concept may change. Or it suddenly turns out that a certain type of network packet is prohibited by Moscow’s policy (and the time zone is different, and in Moscow it is the middle of the night, so there is no one to call).
At about this stage, it may turn out that there has been no backup for a year. Ha-ha. And so begins, once again, a long series of painful adventures. Of course, we could install without a backup; formally, it has nothing to do with us. But a bad impression would remain: some ... showed up, broke everything and left.
Our warehouse also deserves a mention. We have about eighty thousand hot-swap items in stock. Clearly, when you have a 4-hour replacement SLA, the warehouse must hand over the part before you have even ridden the elevator down. So our storekeepers methodically keep accurate records and check everything.
The accounting system says: “Your part is in box such-and-such in block such-and-such.” Regardless of whether it is small or large.
When you walk up, you can see right away what is stored here.
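The kind of lookup described above can be sketched as a mapping from a part number to its storage locations. This is a minimal illustration under assumptions, not CROC's actual accounting system: the `Location` and `SpareStock` names, the part number and the block and box labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Where a spare part physically lies (hypothetical labels)."""
    block: str
    box: str


class SpareStock:
    """Toy stock register: part number -> list of locations holding that part."""

    def __init__(self) -> None:
        self._by_part: dict[str, list[Location]] = {}

    def put(self, part_number: str, location: Location) -> None:
        """Record that one unit of the part was placed at the given location."""
        self._by_part.setdefault(part_number, []).append(location)

    def take(self, part_number: str) -> Location:
        """Return the location of one unit and remove it from the register."""
        locations = self._by_part.get(part_number)
        if not locations:
            raise LookupError(f"no spare in stock for {part_number}")
        return locations.pop(0)


stock = SpareStock()
stock.put("EXAMPLE-PSU-01", Location(block="B", box="17"))  # made-up part number
where = stock.take("EXAMPLE-PSU-01")
print(f"Your part is in box {where.box} in block {where.block}")
```

Removing the unit from the register at the moment it is issued keeps the records in step with the shelves, which is the property the storekeepers' checks are meant to guarantee.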
In one section of the warehouse we have a “museum”, a place where exhibits like these are kept. They really work and are really needed for hot replacements. When a system is complex, critical and of the “don’t touch it while it works” kind, it is easier to swap the failed node for an identical one than to reconfigure and rebuild everything. That is why we keep spares worthy of a museum.