Data center operation: what you need to do yourself
Using a checklist to verify the UPS maintenance carried out by the contractor.
Hello, Habr! My name is Cyril of Shad. I now design and build data centers and server rooms. Before that, I headed the DataLine data center operations service for a long time (about 3,000 racks at the time). With my team I passed the Uptime Management and Operations audit with a score of 92 out of 100, and with my colleagues I took part in the NORD 4 certification. Today I want to explain how to divide the operation of a data center or server room sensibly between your own team and contractors.
It is hard to run a data center entirely with your own staff or entirely through a contractor. In my experience I have never seen either option in its pure form; it is usually some kind of hybrid. What the in-house team does and what goes to contractors is something each company decides for itself, based on finances, convenience, the availability of qualified engineers (try finding a diesel rotary UPS specialist in Tula), and sometimes politics. However wonderful your contractor is, there are things that are best kept in-house. Those are what we will talk about below.
What data center / server room operation consists of
Before dividing the work between your own team and the contractor, let's recall what the process includes. I will not go into detail on each item, since entire books could be written on the subject. I will single out only the main points, which can be loosely divided into technical and organizational.
Technical points:
- maintenance of engineering equipment and systems;
- repairs;
- replacement / modernization;
- monitoring and walk-rounds / inspections of equipment and systems.
Organizational points:
- record keeping (instructions, regulations);
- collection and analysis of statistics on equipment breakdowns and repairs;
- purchase, storage of spare parts and consumables;
- control over the installation of IT equipment;
- maintenance planning, assignment of work orders;
- staff training and education.
What cannot be handed over to the contractor
Everything in the technical list can be outsourced, and sometimes has to be. In that case you are left only with managing and supervising the contractors. Who should do this on your side is covered a little further down.
The organizational part is harder. Almost everything on that list will have to be done in-house. Let's see why.
Record keeping. Regulations and instructions are needed so that the whole operations team shares the same understanding of the processes and procedures (for example, how to test the diesel generator set), and so that the “sacred knowledge” does not disappear when engineer Vasya falls ill or quits. In theory, writing the documentation can also be entrusted to the contractor, especially since not every server room engineer can or wants to deal with paperwork. But the truth is that nobody knows your processes better than you do, and tracking every change and keeping the documentation up to date without working at the facility day to day is a “mission impossible”. As a compromise, you can develop the documentation together with the contractor and then keep it up to date yourself on site.
Collection and analysis of statistics. The situation is much the same as in the previous paragraph, so we pick up a pen or keyboard and methodically write down the “medical history” of each air conditioner, diesel generator set and everything else on the equipment list. Once a quarter, every six months, or at least once a year, we look through it to understand what breaks down and how often. This information comes in handy when preparing the operating budget and planning spare parts, and it also helps to spot equipment that repairs will no longer save and that needs to be replaced outright.
List of breakdowns and types of repairs for one of the air conditioners.
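The “medical history” described above can be kept in any form, from a spreadsheet to a ticket system. As a minimal sketch of the periodic review, assuming a hypothetical `RepairRecord` structure, invented unit names and invented sample data, the following counts breakdowns per unit over a period and flags units that exceed a threshold you choose yourself:

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RepairRecord:
    unit: str    # equipment ID, e.g. "AC-03" (hypothetical naming scheme)
    day: date    # when the breakdown was recorded
    fault: str   # short description of what failed


def failures_per_unit(log, since):
    """Count recorded breakdowns per unit on or after the date `since`."""
    return Counter(r.unit for r in log if r.day >= since)


def replacement_candidates(log, since, max_failures):
    """Units that broke down more than `max_failures` times since `since`."""
    counts = failures_per_unit(log, since)
    return sorted(unit for unit, n in counts.items() if n > max_failures)


# Invented sample data, for illustration only.
LOG = [
    RepairRecord("AC-03", date(2019, 1, 14), "compressor"),
    RepairRecord("AC-03", date(2019, 3, 2), "fan"),
    RepairRecord("AC-03", date(2019, 5, 20), "compressor"),
    RepairRecord("AC-07", date(2019, 4, 11), "humidifier"),
    RepairRecord("DGS-1", date(2018, 11, 5), "starter battery"),
]

print(failures_per_unit(LOG, since=date(2019, 1, 1)))
print(replacement_candidates(LOG, since=date(2019, 1, 1), max_failures=2))  # ['AC-03']
```

The threshold is a judgment call rather than a rule from this article; in practice repair cost and equipment age would probably be weighed alongside the raw count.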
Control over the installation of IT equipment and power management. Many people forget about this, and they should not. An IT specialist sees a free unit and slots the equipment in without checking whether the rack has enough power and cooling, or whether the equipment is mounted correctly at all. Afterwards all the complaints land on the operations engineer: power blips (because a server with a single power supply is connected without an ATS, or both power supplies are plugged into the same PDU) or equipment slowing down because of local overheating.
To reduce the number of such problems, write clear instructions and checklists for those who install equipment, and periodically check how IT equipment has been installed (especially carefully once the room load exceeds 50%). The frequency of these checks depends on how often new equipment appears in the server room.
An algorithm for processing a request for the installation of new equipment.
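Part of such a checklist can be automated. The sketch below is only an illustration of the power-related checks mentioned above (rack power budget, a single power supply without an ATS, both power supplies on one PDU); the `Rack` and `Server` structures, field names and figures are assumptions, not the author's actual process, and cooling and mounting checks are left out.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    power_limit_w: int   # power allocated to the rack, in watts
    used_w: int          # power already drawn by installed equipment


@dataclass
class Server:
    name: str
    power_w: int             # expected power draw, in watts
    psu_feeds: list          # PDU each power supply is plugged into
    behind_ats: bool = False # True if fed through an automatic transfer switch


def check_installation(rack, server):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if rack.used_w + server.power_w > rack.power_limit_w:
        problems.append("not enough power in the rack")
    if len(server.psu_feeds) == 1 and not server.behind_ats:
        problems.append("single power supply without an ATS")
    if len(server.psu_feeds) > 1 and len(set(server.psu_feeds)) == 1:
        problems.append("all power supplies on the same PDU")
    return problems


# Invented example values.
rack = Rack(power_limit_w=5000, used_w=4200)
good = Server("srv-a", power_w=600, psu_feeds=["PDU-A", "PDU-B"])
bad = Server("srv-b", power_w=900, psu_feeds=["PDU-A", "PDU-A"])

print(check_installation(rack, good))  # []
print(check_installation(rack, bad))
```

Returning a list of problems rather than a single yes/no lets the person handling the installation request see everything that needs fixing in one pass.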
Work planning (maintenance and work orders). Together with the contractor, we agree on a work schedule that takes staff workload into account (work on all systems should not fall in the same week). We also issue the work orders and agree with the contractor on how completed work is accepted (acceptance certificate, checklist, etc.).
Budgeting. Better to do it yourself. Depending on how your company works, the budget may be drawn up monthly, quarterly or for the whole year at once, and may be operational or investment. I will write about budgeting separately. If you hand it over to the contractor, guess what will happen to the budget? Right: most likely it will grow. This need not come from any self-interest on the contractor's part; he simply will not care about saving money as much as you do.
Even if you somehow manage to hand all of the above over to the contractor, you still cannot sit with your feet on the table and just pay the bills: contractors need to be trained and supervised.
Contractors need to be taught, first of all, the rules of work in the data center or server room. Besides “no drinking, no smoking, no rowdiness”, there are technical nuances. For example, the contractor should learn from you that during air conditioner maintenance no more than one unit may be switched off at a time, and that before switching one off you must check that the remaining air conditioners are working properly.
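The air conditioner rule above is simple enough to write down as a check. This is a minimal sketch, assuming hypothetical unit names and a three-state status model ("ok", "fault", "maintenance") that is not taken from the article:

```python
def may_take_offline(units, target):
    """Decide whether `target` may be switched off for maintenance.

    `units` maps a unit name to its state: "ok", "fault" or "maintenance".
    The unit may go offline only if no other unit is already off for
    maintenance and every remaining unit is working properly.
    """
    if target not in units:
        raise KeyError(f"unknown unit: {target}")
    others = [state for name, state in units.items() if name != target]
    return all(state == "ok" for state in others)


# Invented example states.
hall = {"AC-01": "ok", "AC-02": "ok", "AC-03": "ok"}
print(may_take_offline(hall, "AC-02"))  # True

hall["AC-01"] = "maintenance"
print(may_take_offline(hall, "AC-02"))  # False

hall["AC-01"] = "fault"
print(may_take_offline(hall, "AC-03"))  # False
```

A real site would also have to consider how much cooling capacity the remaining units provide; that is outside what the rule in the text states.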
Control over access to the facility also stays on your shoulders. Checking that the access lists are up to date, that the access schedule is correct (round the clock or business days only), and that people hold valid electrical safety certificates and other required permits is your task and yours alone.
In general, remember that you, not the contractor, are ultimately responsible for keeping the server room or data center running.
Excerpt from the rules of work in our data centers for contractors.
"Chief Engineer" - responsible for everything
The number of people in your operations service depends on the SLA you have committed to, the amount of infrastructure, and how much you plan to do yourself. I cannot give you a universal formula, but here is what you can go by.
In what mode do we provide services? If 24x7, you need a round-the-clock duty service of at least four people working in four shifts: one 24-hour shift followed by three days off. If 8x5, you will need half as many people.
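The arithmetic behind these numbers is easy to check. The sketch below assumes a single duty post and makes no allowance for vacations or sick leave; the function name is made up for this example.

```python
HOURS_PER_WEEK = 24 * 7  # 168 hours of coverage for a 24x7 service


def weekly_hours_per_person(covered_hours_per_week, people):
    """Average weekly load per person when one post is shared equally."""
    return covered_hours_per_week / people


# 24x7, four people on a one-day-on, three-days-off rota.
print(weekly_hours_per_person(HOURS_PER_WEEK, 4))  # 42.0

# 8x5 with half as many people.
print(weekly_hours_per_person(8 * 5, 2))  # 20.0
```

With four people, each averages 42 hours a week, which is close to a normal working week; with fewer, the load per person climbs quickly (three people would average 56 hours).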
How many engineers do you need? A lot depends on their duties. If they only need to watch the monitoring, one is enough; if they need to do walk-rounds, at least two. If they have to do hands-on work (run cross-connects, mount equipment, change air conditioner filters), you will need three.
Do you keep spare parts and consumables on site? If you stock almost everything, you will need a storekeeper or a purchaser who tracks stock levels and orders replacements.
This is what the team of our 2,720-rack NORD site looks like.
The job titles and headcount will differ from case to case, but one function must be present in every situation: the person who is responsible. I loosely call this position “chief engineer”. In our hierarchy, this is the head of operations. His main function is to make final decisions that are not up for debate: whether to call the contractor out on an emergency, whether the repair of the backup air conditioner can be postponed. He also gives the command to switch off equipment during maintenance, approves urgent repair work and unscheduled purchases, and runs the rescue operation when the data center has an accident. You can turn to him as an arbiter if the operations engineer or the contractor cannot agree with the power engineer on test runs of the diesel generator set.
In general, the “chief engineer” is ultimately accountable to the business or to customers for the entire operation and the engineering infrastructure.
To summarize. The minimum program for the data center or server operation service is as follows:
- training and supervising contractors;
- regulation of access to the facility;
- assignment of work orders;
- coordination of maintenance schedules;
- record keeping and accounting;
- collection and analysis of statistics;
- budgeting.
If you have questions, send me a private message or come to my next seminar on July 4, where you can ask about everything in person.
Other articles on managing the engineering infrastructure of the data center and server:
→ The path of electricity in the data center
→ Errors in the design of the data center that you will experience only during the operation phase
→ About the relevant operation of the data center
→ How to test diesel generator sets in the data center
→ Monitoring of engineering infrastructure in the data center. Part 1. Highlights
→ Monitoring of engineering infrastructure in the data center. Part 2. Power supply system
→ Maintenance of data center engineering systems: what should be in the work contract
→ Dumb ways to die, or why data centers “fall”