Operating practice: 1000 days without downtime of the TIER-III data center

Published on October 16, 2014



    Oxidation of the battery jumper contacts caused heating. An external examination showed no signs of oxidation, since it occurred between the battery terminal and the jumper lug.

    A couple of weeks ago my colleagues and I had a small celebration: 1000 days of continuous data center operation without service downtime. That is, without any impact on customers' equipment, though with regular (and sometimes not so regular) work on the systems.

    Below I will talk about how my colleagues and I maintain this high-responsibility data center, and what the pitfalls are.

    Maintenance work


    At the beginning of each year, a schedule of routine maintenance and preventive repairs for the following year is drawn up. This is similar to car maintenance: the work, the components, the frequency, and who is needed for each job are all specified. Component after component must be inspected, checked, cleaned, and continuity-tested. The biggest job we did during such regular work in almost three years was replacing heat exchangers on the chillers and parts of the compressors. We have N+1 redundancy there, so the shift went to work, made sure everything was fine, shut down one unit, carried out the replacement, then tested the unit and returned it to operation.
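    The N+1 logic behind taking a unit offline can be sketched as follows. This is a hypothetical illustration, not our actual tooling; the function name and unit counts are invented.

```python
# Hypothetical sketch: before taking a chiller offline for maintenance,
# verify that the remaining healthy units still carry the required load.
# With N+1 redundancy, servicing one unit must leave N units running.

def can_take_offline(units_healthy: int, units_required: int, offline: int = 1) -> bool:
    """True if the units left running after taking `offline` units out of
    service still meet the required count."""
    return units_healthy - offline >= units_required

# Example: 4 chillers installed, 3 needed to carry the load (N = 3, N+1 = 4).
print(can_take_offline(units_healthy=4, units_required=3))  # True: safe to service one
print(can_take_offline(units_healthy=3, units_required=3))  # False: postpone maintenance
```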

    Among the smaller jobs, it is worth noting the preventive replacement of UPS batteries, fans, and various capacitors. Working with capacitors is very convenient on our site (as you can see above, we can simply point a thermal imager at a board and immediately see what is heating up). In the photo above, we continuity-tested the circuit and found that a capacitor had dropped to half its rated capacitance; it was replaced on the spot.
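    The acceptance check behind that replacement decision can be sketched like this. The threshold and values are purely illustrative assumptions; real replacement criteria depend on the component and the manufacturer's tolerances.

```python
# Hypothetical sketch: compare a capacitor's measured capacitance against
# its rated value and flag it for replacement when it has drifted too far.
# The 0.8 ratio is an invented threshold, not a real acceptance limit.

def needs_replacement(measured_uf: float, rated_uf: float, min_ratio: float = 0.8) -> bool:
    """Flag the capacitor if measured capacitance falls below min_ratio of rated."""
    return measured_uf < rated_uf * min_ratio

# A capacitor like the one in the photo: rated 470 uF, measured at roughly half.
print(needs_replacement(measured_uf=235.0, rated_uf=470.0))  # True: replace
print(needs_replacement(measured_uf=460.0, rated_uf=470.0))  # False: within tolerance
```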


    The culprit of the occasion

    The thermal imager is a great help: here, during charging, a faulty battery's temperature rose above normal.

    During routine maintenance on critical systems, we notify customers. Strictly speaking, we do not have to (TIER-III redundancy means no impact on their equipment), but ours is a high-responsibility data center, so we consider it good form to give warning. At the appointed time, the redundant unit is switched off; specialists inspect it, check it, clean it if necessary, change the lubricant, and carry out other work.

    This is done by the operations team, which has received special training specifically for our data center. The team consists of shift specialists (dispatchers) as well as engineers working a normal schedule with weekends and holidays off. Everyone has been trained: some on diesel systems, some on UPS work, some on ventilation. The team may temporarily include contractors' specialists, but always accompanied by one of our engineers (for example, from the customer data center field service group) who has the appropriate training to supervise the work on site.

    A predetermined maintenance schedule can change if components fail: for example, after a replacement, the next inspection is postponed until the new component has accumulated the corresponding operating hours. But in our practice at the Compressor site, such schedule changes have not been needed.

    The team regularly recertifies on electrical safety and other industry regulations. We regularly run training drills "on paper," or bring people into the hall, say, "Here is the situation: what do you do?", and time them. Our colleagues from the 3D team have already built a full simulator of the data center from photographs; soon we will be able to use it for training drills. Or maybe play Counter-Strike in it, we have not decided yet.

    A monitoring system has been deployed in the data center; it connects to all nodes and reports their status to the dispatcher. In addition, a physical walk-through and visual inspection of the equipment is required 4 times a day. If the monitoring system fails, there is an instruction to increase the number of walk-throughs (this was useful once during routine maintenance).
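    The walk-through rule can be sketched as a simple schedule generator. This is a hypothetical illustration; the degraded-mode count of 8 rounds is an invented assumption, since the post only says the number of walk-throughs increases.

```python
# Hypothetical sketch: evenly spaced physical inspection rounds over 24 hours,
# with more rounds when the monitoring system is down. The 'degraded' count
# is illustrative, not the site's actual instruction.
from datetime import time

def tour_schedule(monitoring_ok: bool, normal: int = 4, degraded: int = 8) -> list[time]:
    """Return the start times of the day's inspection rounds."""
    n = normal if monitoring_ok else degraded
    step = 24 // n
    return [time(hour=h) for h in range(0, 24, step)]

print(tour_schedule(True))   # four rounds: 00:00, 06:00, 12:00, 18:00
print(tour_schedule(False))  # eight rounds, every three hours
```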

    Emergency response


    In case of emergency, there are several packages of instructions:

    1. The dispatcher in the control room has a step-by-step emergency plan describing what to do. It is formulated as simply and unambiguously as possible. For example: switch this, make sure the green lamp lights up, switch that, check something over there.
    2. The same plan is posted right next to the node it describes. In theory, even an administrator (who is not part of the maintenance team) could execute the instructions in a critical situation, but in practice administrators usually do not have access to the engineering rooms, and they are not authorized to do the switching anyway. The dispatcher can see the instructions both at his workstation and next to the failed node. Part of a dispatcher's training is knowing by heart where every switch is located. Still, if he gets confused, there is always a diagram nearby.
    3. The fire shift has its own instructions. They also have regular training, but the main thing is that there are always two firefighters at the facility with oxygen masks and special suits that allow them to move around the machine rooms in case of fire, smoke, or a gas discharge. Firefighters and other specialists outside the dispatch shift also have special instructions covering interaction with other services: IT specialists, security personnel, and so on (who runs where, who talks to whom). For example, during a fire everyone must leave the hall, because the gas of the fire suppression system effectively displaces oxygen, and you can move around the hall only in breathing apparatus.
    4. The dispatcher also has an escalation scheme in case of an accident: who must be notified, how quickly, in what order, and whom to call if contractors are needed.
    5. A short list of telephone numbers of specific specialists to call with questions or in emergency situations is also always at the dispatcher's disposal. We do not include escalation schemes and phone numbers in the regular emergency instructions, to keep those as short as possible; everything else is arranged in separate "emergency envelopes."
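    An escalation scheme like the one in point 4 can be sketched as an ordered list of deadlines. The roles and timings below are entirely invented for illustration; the post does not disclose the actual scheme.

```python
# Hypothetical sketch of an accident escalation scheme: who must have been
# notified by a given number of minutes after the incident. All names and
# deadlines are illustrative assumptions.

ESCALATION = [
    ("shift engineer",       0),   # immediately
    ("chief engineer",      10),   # within 10 minutes
    ("service contractor",  30),
    ("data center director", 60),
]

def notifications_due(minutes_since_incident: int) -> list[str]:
    """Everyone whose notification deadline has already been reached."""
    return [who for who, deadline in ESCALATION if minutes_since_incident >= deadline]

print(notifications_due(15))  # ['shift engineer', 'chief engineer']
```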


    Cases from practice


    People often try to get into our data center with food or a bottle of mineral water. According to the rules, we let customers and contractors into the hall and other critical rooms only when accompanied by our specialists. About once a month we confiscate an apple or a sandwich, or have an argument about outerwear (despite the cold, the rules allow at most a sweater, so that nothing sticks out or flaps around). Fortunately, people usually understand and comply. If something out of the ordinary happens (for example, a customer tries to bring in a very dusty circuit board, or a customer's employee arrives with loose hair down to the floor), the dispatcher calls the person in charge and clarifies what to do under the emergency rules.

    Once there was a case like this. A telecom operator's installers were pulling cable across the city through utility wells. It started to rain just then, and two lumps of mud in boots arrived at our facility. These fine people entered the controlled zone and began leaving behind a generous trail of ectoplasm containing full details of the cable-laying route. The work, of course, had to be postponed: they simply had no clean work clothes.

    Everyone entering is briefed. Customer specialists, as a rule, only on conduct at the facility. Engineering staff get an additional briefing on the nodes and rooms they are going to, and in particular on how to evacuate.

    There have been very few contingencies at the Compressor site in all this time, and we are proud of that. Two cases are worth recalling.

    The first time, there were problems with a contractor pulling cable. From the experience of about a hundred data centers built and maintained across the country, we know there are no ideal installers from a provider. It does not happen every time, but sooner or later there is a risk of damaging neighboring cables while laying your own. At the Compressor site, separate entry points were built so that each telecom operator could lay a small ring through different cable channels (independent routes). Once we realized we had insured ourselves for good reason: insufficiently trained installers carelessly nicked someone else's cable, but nothing happened.

    The second time, racks from a fire were brought to us: covered in soot, with a characteristic smell. The dispatcher treated it as an emergency, and we did not let the racks into the hall. Firstly, the dirt; secondly, the smell is potentially dangerous because it is misleading: visiting admins would merely get worried, but our own team could get used to it, and that is extremely undesirable. Gas analyzers, by the way, do not react to smell, only to really small trace amounts of smoke, so there would have been no problems on their side.

    Routine work


    The premises must be cleaned regularly. Even with positive pressure, cleaning is sacred. There is a schedule specifying the room, the type of work (dry, damp, or wet cleaning), and the frequency. Depending on the type of room, cleaning is done either by a cleaner accompanied by an engineer or a dispatcher, or by our own specialist with clearance. In the white space, cleaning is done once a week and strictly with responsible persons present. At the engineering levels, equipment is not opened during cleaning; it is cleaned during scheduled maintenance.

    Once a week the diesels are started: no-load runs. There are also runs under full load. There is no fuel replacement procedure: the fuel is simply used up. By the way, we always fill with winter-grade fuel. There are regular checks for water in the fuel (a special water-finding paste is used), and fuel separation is also monitored.

    Bringing equipment in or taking it out under the standard procedure takes 1 day for approvals. But in case of a failure we shorten this process: we do not get in the way of fixing critical systems.

    Racks and installation have their own internal requirements. For example, there is control over installation neatness (it is important that cables do not hang out of the rack, otherwise the likelihood of snagging increases even inside a fenced enclosure). Such requirements usually do not raise questions.

    We lay the cable when rack space is ordered, once it is clear what power is needed. The cable is checked before and after installation. Once, at our other site, an ordered reel arrived, and already while unwinding it the installers began to suspect something was wrong. We checked: indeed, the insulation resistance was below spec. We had to return the reel and wait for a new one. In general, such situations are not uncommon; cable must be checked immediately upon receipt.
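    The incoming check can be sketched as a simple pass/fail rule on insulation resistance. This is a hypothetical illustration: the threshold below is invented, and real acceptance limits depend on the cable type, test voltage, and applicable electrical codes.

```python
# Hypothetical sketch: each core's insulation resistance is measured
# (e.g. with a megohmmeter) and compared against a minimum acceptable
# value. The 0.5 megohm floor is illustrative, not a real code limit.

def cable_passes(core_megohms: list[float], min_megohms: float = 0.5) -> bool:
    """The reel is rejected if any core's insulation resistance is too low."""
    return all(r >= min_megohms for r in core_megohms)

print(cable_passes([120.0, 95.0, 110.0]))  # True: accept the reel
print(cable_passes([120.0, 0.2, 110.0]))   # False: return it to the supplier
```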

    CCTV


    The data center uses both our own standard video surveillance and customers' cameras. Given that we host banks, insurance companies, and retail, it happens that a separate block of racks is enclosed by a metal cage and locked. You can only get inside with a customer representative. Therefore, all our systems have been moved outside the boundaries of such cages.

    Most often, customers place their cameras on racks, but sometimes they ask to attach one to a cable tray, for example. We evaluate the location, in particular checking that no one else's racks fall into the frame. As a rule, we allow it, sometimes with minimal corrections to the placement.

    We set up our own surveillance in the hall in advance. The racks differ, but not so much as to break the rows (the hot and cold aisles at our site are dictated by the structure of the building). In general, when planning equipment placement, a calculation and several approvals are mandatory for all subsystems. The equipment itself is also checked: for example, whether a rack blows air in the right direction, whether it takes cold air in where it should and does not exhaust hot air downward.

    References


    Photo tour of our data center
    About infrastructure
    About construction

    And I hope the old superstition, that as soon as you celebrate 1000 days without a glitch and tell someone about it a breakdown happens immediately, will not come true this time. It shouldn't)