How to manage an infrastructure of hundreds of servers and avoid problems: 5 tips from King Servers engineers



    We write a lot on our Habr blog about building IT infrastructure - for example, we have covered how to choose data centers in Russia and the USA. Today, hundreds of physical and thousands of virtual servers run within King Servers, and our engineers share their tips for managing infrastructure of this size.

    Automation is very important


    One of the most important elements of successfully managing infrastructure of this scale is well-established monitoring. Manually watching this many servers is physically unrealistic even with a large staff of engineers, so automation is essential: the system itself must find possible problems and report them. For such monitoring to be as effective as possible, all possible points of failure must be thought through in advance. Problems should be detected automatically at an early stage, so that engineers have time to act calmly instead of restoring services that have already gone down.
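    As a minimal sketch, this kind of monitoring can start from a periodic reachability check that mails every engineer on duty. The host list, port and addresses below are illustrative assumptions rather than our production configuration, and the script assumes a local mail relay is available.

```python
#!/usr/bin/env python3
"""Minimal sketch of an automated availability check with e-mail alerts.
Hosts, port and recipients are made up for illustration."""
import smtplib
import socket
from email.message import EmailMessage

HOSTS = ["10.0.0.11", "10.0.0.12"]    # hypothetical server list
CHECK_PORT = 22                        # e.g. verify that SSH answers
ALERT_RCPTS = ["oncall@example.com"]   # more than one engineer should be here

def is_alive(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def notify(down_hosts: list) -> None:
    """Send one e-mail alert listing every unreachable host."""
    msg = EmailMessage()
    msg["Subject"] = f"ALERT: {len(down_hosts)} host(s) unreachable"
    msg["From"] = "monitoring@example.com"
    msg["To"] = ", ".join(ALERT_RCPTS)
    msg.set_content("\n".join(down_hosts))
    with smtplib.SMTP("localhost") as smtp:   # assumes a local mail relay
        smtp.send_message(msg)

if __name__ == "__main__":
    down = [h for h in HOSTS if not is_alive(h, CHECK_PORT)]
    if down:
        notify(down)
```

    Run from cron every few minutes, a check like this already gives the team an early warning instead of a phone call from an angry customer.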

    It is also important to automate routine activities as much as possible. Installing support access keys on 10 servers is one task; covering three hundred is a completely different job (a sketch of this kind of mass rollout follows below).
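    Here is a minimal sketch of such a rollout. It assumes an existing admin key already allows non-interactive SSH to every host; the host file name, user name and key path are made up for illustration.

```python
#!/usr/bin/env python3
"""Minimal sketch of rolling out a public key to many servers in parallel."""
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

HOSTS = Path("hosts.txt").read_text().split()        # one hostname per line
PUBKEY = Path("support_key.pub").read_text().strip()
REMOTE_CMD = (
    "mkdir -p ~/.ssh && chmod 700 ~/.ssh && "
    f"grep -qxF '{PUBKEY}' ~/.ssh/authorized_keys 2>/dev/null || "
    f"echo '{PUBKEY}' >> ~/.ssh/authorized_keys"
)

def install_key(host: str):
    """Append the key on one host (idempotently) and report success."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", f"admin@{host}", REMOTE_CMD],
        capture_output=True, timeout=30,
    )
    return host, result.returncode == 0

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=20) as pool:
        for host, ok in pool.map(install_key, HOSTS):
            print(f"{host}: {'ok' if ok else 'FAILED'}")
```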

    It is important to understand that infrastructure tends to grow over time, while the number of engineers in a company usually does not grow at the same pace. To avoid problems, implement automation right away, even if at the moment it seems you can cope by hand. A very common mistake is to assume that an action repeated only once a week is not worth automating. In fact, the time spent writing a script pays off quickly by cutting the time spent on repetitive work. With this approach you also gradually build up a library of scripts tailored to your infrastructure, and ready-made code fragments can be reused in new scripts, saving even more time.

    And, of course, all automatically generated notifications should go to more than one engineer. Even with sophisticated automation, a team of one or two engineers cannot effectively administer hundreds of servers. The right option is to always have one of the engineers online; in our experience, this requires a minimum of four engineers.
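    One simple way to combine an on-call rotation with team-wide visibility is sketched below. The names and addresses are made up, and the rotation is a plain weekly round-robin rather than anything we actually run.

```python
#!/usr/bin/env python3
"""Minimal sketch of a weekly on-call rotation where the whole team still
receives every notification. All addresses are illustrative."""
from datetime import date

ENGINEERS = ["alice@example.com", "bob@example.com",
             "carol@example.com", "dave@example.com"]  # at least four people

def primary_oncall(today: date) -> str:
    """Rotate the primary on-call engineer once per ISO week."""
    week = today.isocalendar()[1]
    return ENGINEERS[week % len(ENGINEERS)]

def recipients(today: date) -> list:
    """Primary engineer first, but every alert still goes to the whole team."""
    first = primary_oncall(today)
    return [first] + [e for e in ENGINEERS if e != first]

if __name__ == "__main__":
    print("Primary on-call:", primary_oncall(date.today()))
    print("Alert recipients:", recipients(date.today()))
```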

    Redundancy and backups save you in critical situations


    With a large infrastructure, redundancy becomes even more important, because in the event of a failure a large number of users will suffer. It is worth building redundancy into:

    • Server monitoring - it is not enough to build a failure monitoring system; you also need tools that monitor the monitoring itself (see the heartbeat sketch after this list).
    • Management tools.
    • Communication channels - if your data center has only one provider, then in the event of a failure all your hardware is completely cut off from the world. We recently faced channel failures in our data center in the Netherlands; had the channel not been redundant, service for hundreds of our customers would have stopped, with corresponding consequences for their businesses.
    • Power lines - modern data centers can bring two independent power lines to the racks. Do not neglect this or skimp on additional power supplies for servers. If you have already bought many servers that cannot take a second power supply unit, you can install an automatic transfer switch, which connects equipment with a single power supply to both inputs. Any equipment requires periodic maintenance, so the likelihood that the data center will at some point operate with one of the power lines shut down is 100%. If you are lucky, the work will not touch your line, but it is not worth the risk: not only can services be unavailable for several hours, some servers may also fail to come back up after a sudden power loss and require the intervention of engineers.
    • Hardware - even high-quality equipment can fail, so the most important elements should be duplicated at the hardware level. Disks fail most often, so redundant RAID arrays are a good idea, as is network-level redundancy where a server is connected to two different switches, so traffic is not lost when one device fails. It is also necessary to keep at least a minimal stock of spare parts: failure of quality components is unlikely, but not impossible.
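    For the first item on the list, "monitoring the monitoring" can be as simple as a heartbeat check run from a second, independent host. In the sketch below the main monitoring server is assumed to push a heartbeat file to that host (for example via rsync); the file path and threshold are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Minimal sketch of 'monitoring the monitoring': run from cron on an
independent host and check that the main monitoring keeps reporting in."""
import time
from pathlib import Path

HEARTBEAT_FILE = Path("/var/run/monitoring.heartbeat")  # pushed here by the monitoring server
MAX_SILENCE = 300  # seconds without a heartbeat before raising the alarm

def heartbeat_age() -> float:
    """Seconds since the monitoring system last reported it was alive."""
    return time.time() - HEARTBEAT_FILE.stat().st_mtime

if __name__ == "__main__":
    try:
        age = heartbeat_age()
    except FileNotFoundError:
        print("ALERT: no heartbeat file found - monitoring may never have started")
    else:
        if age > MAX_SILENCE:
            print(f"ALERT: monitoring silent for {age:.0f}s - check the monitoring server itself")
        else:
            print(f"OK: last heartbeat {age:.0f}s ago")
```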

    Regular backups, together with test restores of those backups, are also an absolutely essential part of managing any infrastructure. When you back up a lot of data and do it often, do not forget to monitor the server where the backups are stored: it can easily turn out that there is so much data that one backup has not finished before the next one starts (one lock-based way to prevent overlapping runs is sketched below).
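    A minimal sketch of such a guard, assuming a Linux host: the lock path and the rsync-based backup command are made up for illustration.

```python
#!/usr/bin/env python3
"""Minimal sketch of preventing overlapping backup runs with a lock file."""
import fcntl
import subprocess
import sys

LOCK_PATH = "/var/lock/nightly-backup.lock"
BACKUP_CMD = ["rsync", "-a", "/data/", "backup-host:/backups/data/"]  # hypothetical command

if __name__ == "__main__":
    with open(LOCK_PATH, "w") as lock:
        try:
            # Non-blocking exclusive lock: fails if the previous run is still going.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print("Previous backup is still running - skipping this run", file=sys.stderr)
            sys.exit(1)
        subprocess.run(BACKUP_CMD, check=True)
```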

    Maintain documentation and logs


    When a project is supported by several engineers, it is very important to document working processes as the infrastructure evolves. From the very beginning, build your own knowledge base and, as new components are introduced, write documentation on working with them, even if everyone currently supporting the project already knows what works and how. As the team of engineers grows, well-written documentation greatly speeds up bringing a new team member into the project.

    Do not forget that scripting is also development: when several people work on a script, use a version control system and track changes, so that if problems arise you can quickly roll back to a previous version.

    When several people work on the same task, it is important to pass on information about what has already been done. Keeping detailed logs of the work automatically and storing them on a central server significantly reduces the time this takes: handing the task over to another person becomes simple, because there is no need to describe which actions were performed - everything is in the log file.
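    One way to duplicate every record both locally and on a central server is the standard syslog mechanism. The sketch below uses Python's logging module; the central host name and log paths are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Minimal sketch of writing maintenance logs locally and to a central syslog server."""
import logging
import logging.handlers

def make_work_logger(central_host: str = "log.example.com") -> logging.Logger:
    """Logger that duplicates every record locally and on the central server."""
    logger = logging.getLogger("maintenance")
    logger.setLevel(logging.INFO)
    fmt = logging.Formatter("%(asctime)s %(name)s: %(message)s")
    for handler in (
        logging.FileHandler("/var/log/maintenance.log"),
        logging.handlers.SysLogHandler(address=(central_host, 514)),
    ):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger

if __name__ == "__main__":
    log = make_work_logger()
    log.info("Replaced failed disk in slot 3 on server web-07, RAID rebuild started")
```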

    It is important to analyze the response time of data center engineers


    Any adequate modern data center provides a remote hands service: if problems arise, the server owner can ask the on-site engineers to perform the necessary actions with the equipment. However, the quality of this service varies. There can be many reasons for that; among the most frequent are the high workload of the specialists or the absence of engineers on site around the clock (one of the data centers in the USA offered us exactly such an option, with an engineer present only during the working day). In critical situations reaction time matters a great deal, and problems do not arise only during working hours.
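    It is therefore worth measuring reaction times from your own ticket history before committing more equipment to a site. A minimal sketch is below; the CSV format with "opened" and "closed" timestamp columns is an assumption about how such ticket data might be exported.

```python
#!/usr/bin/env python3
"""Minimal sketch of analysing remote hands reaction times from exported ticket data."""
import csv
import statistics
from datetime import datetime

def response_minutes(path: str) -> list:
    """Minutes between opening a remote hands request and its completion."""
    fmt = "%Y-%m-%d %H:%M"
    times = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):   # expected columns: opened, closed
            opened = datetime.strptime(row["opened"], fmt)
            closed = datetime.strptime(row["closed"], fmt)
            times.append((closed - opened).total_seconds() / 60)
    return times

if __name__ == "__main__":
    t = response_minutes("remote_hands_tickets.csv")
    print(f"median: {statistics.median(t):.0f} min, worst: {max(t):.0f} min")
```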

    Therefore, you should not transfer the entire infrastructure to one data center at once. It is better to carry out the move in stages: this lets you identify possible problems before the moment when it is already too difficult and expensive to take any measures.

    It may also turn out that after you have moved the equipment and neatly laid all the cables, a couple of years and several dozen remote hands requests later the picture looks like this:



    During maintenance this causes problems: because of the abundance of cables, one server cannot be removed from the rack without disrupting the operation of other servers. It is important to periodically check the condition of the racks and ask the data center staff to take photos.

    Do not chase savings, and experiment carefully


    It is necessary to study the characteristics of the equipment carefully - this makes it possible to predict problems more accurately. For example, with SSDs, a little extra time spent comparing endurance specifications can get you hardware that will live much longer than drives bought in a hurry.
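    A back-of-the-envelope comparison is often enough. The sketch below estimates drive lifetime from the rated TBW and the expected daily write volume; the models and figures are made up for illustration, not vendor data.

```python
#!/usr/bin/env python3
"""Minimal sketch of comparing SSD endurance before buying. All numbers are illustrative."""

def expected_years(tbw: float, daily_writes_gb: float) -> float:
    """Rough drive lifetime in years from rated TBW and expected daily writes."""
    return (tbw * 1024) / daily_writes_gb / 365

# Hypothetical candidates: (model, rated TBW)
CANDIDATES = [("budget-480G", 150), ("datacenter-480G", 1300)]
DAILY_WRITES_GB = 200   # estimated from the current workload

for model, tbw in CANDIDATES:
    print(f"{model}: ~{expected_years(tbw, DAILY_WRITES_GB):.1f} years of writes")
```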

    Be prepared for the fact that this approach will not let you save money here and now. In the long run, however, skimping on hardware leads to losses: a lower price usually comes with lower reliability, and repairing and replacing hardware often ends up costing more than buying more expensive equipment that lives longer.

    In our project we have been using SuperMicro hardware for many years, and these servers have had no serious unexplained failures. Not long ago we also began, very cautiously, to work with Dell equipment. The products of this company have an excellent reputation, but sharply increasing the volume of work with new hardware is always risky, so we are moving gradually.

    As for components, experiments are also out of place here: it is better to be paranoid and trust only proven suppliers, even if this does not let you save money here and now. Likewise, desktop components must never be used in a server - they are a completely different class of device, designed for radically different loads.



    Nothing good will come of it, especially if external customers use the infrastructure. In the hosting industry it is not uncommon for a user to buy a virtual server for their business but forget to make backups or never test restoring them (this is how small businesses save on administrators). If, on top of that, the physical server is half-assembled from desktop components, a failure with complete data loss is only a matter of time, and that can simply kill someone else's business. This must not be allowed to happen.



