Independent Acceptance Testing of a Data Center



    Hello! My name is Kirill Shadsky, and I head the department that manages external data centers at DataLine.
     
    This article covers the most important aspects of acceptance testing, as well as the problems and pitfalls that can fray the nerves of novice “testers”.

    So, imagine: a pleased contractor reports, in the spirit of “the five-year plan in four years”, that there are no problems and the facility (a data center or a private hall) is ready for operation. It would seem that now is the time to begin testing, but... in fact, we are already late. Acceptance tests should be planned as early as the design stage.

    The very first question: whom should we entrust with the tests? The builders, of course! After all, that is much easier than checking every unit yourself or hiring an independent commission. Just in case, let me clarify: this is a joke. If everything were that simple, this article would not have been written.
     
    Any contractor will be happy to check what he built himself. It is very convenient to find your own defects and then hide them somewhere else.

    Remember: even the best, most proven contractor is an interested party, and everything he hides may become a problem in the future. Therefore, always either carry out the acceptance tests yourself or engage an independent organization.
     
    If you are experienced and the tests do not scare you, you can conduct them yourself. I will try to describe in detail how acceptance tests are organized at our company and what problems we face at different stages.
     
    DataLine has a capital construction directorate, which builds new halls and data centers. After construction, everything becomes the responsibility of the operations service, for which it is important that everything is built to a high standard. Our technical director, Sergey Mishchuk, acts as a kind of “justice of the peace” between these two divisions of the company.
     
    Despite all our experience, every time we test we find a variety of defects, both serious and minor. This is absolutely normal. It is better to catch them during the tests than to wait until they become problems. Here are some examples.
     
    In 99% of cases there are complaints about the sealing of openings between walls or rooms. This is quite understandable: first you need to lay the structured cabling, power cables, refrigerant lines and other pipes, so the sealing is postponed until the last moment. Therefore, be sure to check that it is finished before the tests begin.
     
    We make a point of building sealed containment zones. All data center rooms are located in separate containment areas, a “house within a house”.
     

     Top view of the containment area

    If your data center has sealed zones, you must hose them down with water and make sure that nothing leaks through.
     
    There is no escaping debris. Under the raised floor you will inevitably come across cable offcuts, screws, bolts and even files forgotten by the workers. No matter how many inspections I conduct, there are always findings.
     
    If you do not make the workers tidy up immediately, everything will still be lying there when the equipment is brought in and installed. What do you think is easier: to clean up on the spot, or to sweat with a flashlight under working racks, picking out construction debris?
     
    And all this is just the tip of the iceberg: these problems are cited only to give a general picture. Now we will go through each test stage in detail, starting from the “zero mark”, namely planning.

    Preparing for the test



     
    In almost every article we talk about the importance of advance planning, and today we will not break this glorious tradition. Moreover, planning should be your first (if not “zeroth”) step in testing.
     
    The Uptime Institute recommends starting the planning and forming the acceptance commission at the draft design stage of the data center, and beginning the verification work itself at the design stage.

    We start with acceptance of the design; there is no way around it. It is best to do this before construction, at the design stage. Remember: it is always easier to correct what is “on paper” than a facility that has already been built. In some cases it is not possible to “slightly tweak” a finished data center at all.

    Your test plan should also cover the following items:

    • What will be tested?
    • When will the tests be conducted?
    • Who will conduct the tests?
    • Which of the company's employees will be involved?
    • What tools and equipment will be needed (clamp meters, vibration meters, thermal imagers, anemometers, and many other obscure but necessary things)?

    For each set of tests, we compile a list of the systems to be tested, since in different data centers each department is responsible for its own equipment. In one place we will check only the power supply and air conditioners. In another, other systems may be added, for example automatic gas fire suppression, video surveillance, and access control (as agreed with the security staff).

    We pay special attention to the building itself. As a rule, the grade of concrete and the way the floors were poured are outside our domain and expertise, but we must check the raised floor, doors, water supply and sewerage.

    In other words, before testing, you need to clearly know what and where we will be testing in order to avoid overlaps and confusion.

    Important note: when you check a particular system, the person who built it or another responsible person should be at your side. This applies to all stages.

    In general, the acceptance tests include the following steps:

    • Design verification
    • Documentation check
    • Stand-alone testing
    • Comprehensive checks

    Let's look at each of them in turn.

    Verification of documents




    This stage must never be skipped, let alone carried out in parallel with stand-alone testing. Even if time is running out, you have to be sure that every piece of equipment and every system matches what is stated in the design. Without checking the documentation you will not be able to perform the subsequent tests properly, not to mention the legal side of the issue.

    The full list of documents to be checked is individual and depends on your configuration.

    Here is an example of the documents that must be checked during the tests:

    • as-built documentation for each system;
    • equipment passports (data sheets);
    • commissioning (start-up) certificates;
    • measurement and test reports;
    • pressure test reports for the piping systems;
    • laboratory report on resistance measurements of the grounding loop and other cable lines;
    • equipment installation manuals.

    There is also operational documentation. It is not always specified in the construction contract; if it is not, request it from the contractor under a supplementary agreement. The operational documentation should contain instructions and procedures for the basic switching operations, but we will return to this in the section on comprehensive tests.

    In addition to all of the above, it is highly desirable, I would even say mandatory, to prepare load tables. Unfortunately, they are not always produced, yet this is an important and convenient document.

    What is it for?

    Typically, redundancy in a data center is organized around two power feeds (“beams”), and you need to understand what load will fall on one feed when power is completely lost on the other.

    It would seem that the general diagram is quite enough for this. But your specialists will find tables much more convenient to work with: there is less chance of missing something or getting confused.
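
    As a rough illustration of what such a load table captures, here is a minimal Python sketch. The consumer names, loads and feed labels are invented for the example and do not come from any real project. It also assumes, as a simplification, that a consumer's load is shared equally between the feeds that remain available to it.

```python
from dataclasses import dataclass


@dataclass
class Consumer:
    name: str
    load_kw: float   # design load of the consumer (hypothetical values)
    feeds: tuple     # power feeds it can draw from, e.g. ("A", "B")


def load_after_failure(consumers, failed_feed):
    """Return (load per surviving feed in kW, names of consumers left without power)."""
    per_feed = {}
    lost = []
    for c in consumers:
        alive = [f for f in c.feeds if f != failed_feed]
        if not alive:
            # The consumer was connected only to the failed feed.
            lost.append(c.name)
            continue
        # Simplifying assumption: equal sharing between the remaining feeds.
        share = c.load_kw / len(alive)
        for f in alive:
            per_feed[f] = per_feed.get(f, 0.0) + share
    return per_feed, lost


if __name__ == "__main__":
    table = [
        Consumer("rack-01", 5.0, ("A", "B")),
        Consumer("rack-02", 4.0, ("A", "B")),
        Consumer("crac-01", 10.0, ("A",)),  # single-fed unit, on purpose
    ]
    for feed in ("A", "B"):
        print("feed", feed, "fails:", load_after_failure(table, feed))
```

    Even a toy table like this makes single-fed consumers visible immediately, which is much harder to spot on a general diagram.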

    Of course, we cannot reconcile every report with reality, but we must make sure that all the reports exist.

    Stand-alone checks




    Stand-alone checks are the next step in data center acceptance testing. Here you need to check every piece of equipment by hand: operability, settings, operation at maximum load and, of course, labeling. Where would we be without it :) It is important that the labeling matches the design. But it is equally important that it matches reality.


    Example of glycol circuit marking

    For example, for the power distribution system, we apply a test load and physically switch each circuit breaker in the switchboard on and off. Then, starting from the IT equipment side, we go through each rack in turn, compile a table and make sure that when a breaker is switched off, the corresponding equipment loses power.
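
    The table from this walk-through can then be compared with the design. Below is a minimal Python sketch of such a comparison. The breaker and rack names are hypothetical, and both inputs are assumed to be simple mappings from a breaker label to the equipment that should lose (or actually lost) power when it is switched off.

```python
def compare_breaker_map(project, observed):
    """Compare the designed breaker-to-equipment map with what was seen on site.

    Both arguments map a breaker label to a list of equipment labels.
    Returns a list of human-readable findings (empty if everything matches).
    """
    findings = []
    for breaker in sorted(set(project) | set(observed)):
        expected = set(project.get(breaker, []))
        actual = set(observed.get(breaker, []))
        if breaker not in project:
            findings.append(f"{breaker}: breaker is not in the project")
        for item in sorted(expected - actual):
            findings.append(f"{breaker}: {item} should have lost power but did not")
        for item in sorted(actual - expected):
            findings.append(f"{breaker}: {item} lost power unexpectedly")
    return findings


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    project = {"QF1": ["rack-01/PDU-A"], "QF2": ["rack-02/PDU-A"]}
    observed = {"QF1": ["rack-01/PDU-A"], "QF2": ["rack-03/PDU-A"], "QF9": []}
    for line in compare_breaker_map(project, observed):
        print(line)
```

    The output doubles as a list of remarks for the installers: mislabeled circuits and breakers that exist on site but not on paper.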

    Of course, sometimes breakers that were not in the design magically appear in the switchboards. That is fine, as long as the load does not exceed the rated limit and the change is recorded in the documentation.


    Correct switchboard

    For equipment such as air conditioners, diesel generator sets (DGU) and UPS, we carry out simple stand-alone checks: on/off, operating modes, settings, etc. Oddly enough, it is also important to check how securely the equipment is fastened. We have had cases when important nuts could be unscrewed almost with a finger.

    The first round is over, we give the installers time to correct the defects, then we come back, and everything starts again for a second round.

    It is said that the workers call these rounds the circles of installation hell: very often during the second inspection we find defects that we did not notice before. And then it begins: “Why didn't you say so right away?”

    You can understand people, but here it is almost like in the film “Beware of the Car”: you chase, and I run. Only the other way around: you fix, and I find.

    Below is a list of the most important stand-alone tests that we carry out.
    Cooling:
    • visual inspection of the equipment for compliance with the installation manual;
    • check that pipelines are securely fastened, insulated and joined;
    • check that the electrical components in the electrical panel (circuit breakers, magnetic starters, terminal blocks) are securely fastened;
    • check that the control panel works;
    • check of the equipment software's operating algorithm: switchover from the duty unit to the standby unit after a simulated failure, and time-based rotation (if available).

    Power supply:
    • visual inspection of the equipment and check for compliance with the installation manual;
    • check that the system and its components match the single-line diagram;
    • spot non-contact temperature measurements (with the measurement locations recorded).

    DGU:
    • check of the monitoring and control panel;
    • check that the light and sound indication works correctly;
    • check for problems during a test start of the DGU in automatic and manual modes;
    • check of DGU operation for 6 hours at 30% of the design load.

    UPS:
    • check of UPS autostart after the batteries are discharged to the minimum permissible level, and check of battery runtime (at 100% of the design load);
    • verification of the main UPS parameters at 100% load;
    • check of UPS transfer to bypass in automatic and manual modes at 100% of the design load.


    When everything works as it should, the stand-alone tests are complete, and the fun begins: comprehensive tests.

    Comprehensive tests




    Let me make a small digression here and talk about what a data center is and what is important for its operation.

    First of all, a data center is a single system, almost a living organism. Its “health” as a whole depends on how all its organs interact.

    For example, the air conditioning specialists often tell us: “What don't you like? Look, it blows and it cools! Everything is fine!”

    The DGU specialists echo them: “Look, everything starts and even produces electricity!” And indeed, each piece of equipment works well (we checked that in the stand-alone tests), but only on its own. Start everything together, and the system falls apart. Comprehensive checks exist precisely to identify problems in the joint operation of the equipment.

    The amount of testing may vary depending on the level of redundancy: the more interconnected systems there are, the more operating scenarios need to be tested and debugged.

    For example, if we build a Tier III data center, every element of the infrastructure, including cable routes and distribution paths, must be able to be safely taken out of service for replacement or repair. Accordingly, the number of required tests grows. We shut down or take out of service different pieces of equipment one after another while the data center is under load. Changes in one system must not lead to failures in adjacent ones.
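
    Before doing this on live equipment, the same idea can be rehearsed on paper. The Python sketch below treats the power scheme as a directed graph and reports every element whose removal would cut the load off from all sources. The node names and topology are invented for the illustration, and the model ignores capacity entirely, so it is only a rough aid for building the list of shutdown tests, not a replacement for them.

```python
def reachable(graph, sources, target, removed):
    """Depth-first search: can power get from any source to target without `removed`?"""
    stack = [s for s in sources if s != removed]
    seen = set(stack)
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def single_points_of_failure(graph, sources, target):
    """Return the elements that cannot be taken out of service without dropping the load."""
    nodes = set(graph) | {n for children in graph.values() for n in children}
    return sorted(
        n for n in nodes
        if n != target and not reachable(graph, sources, target, n)
    )


if __name__ == "__main__":
    # Hypothetical two-feed scheme: each node lists the elements it feeds.
    scheme = {
        "DGU-1": ["MSB-A"], "DGU-2": ["MSB-B"],
        "MSB-A": ["UPS-A"], "MSB-B": ["UPS-B"],
        "UPS-A": ["rack"], "UPS-B": ["rack"],
    }
    print(single_points_of_failure(scheme, ["DGU-1", "DGU-2"], "rack"))
```

    Every node in the graph is also a candidate line in the comprehensive test plan: switch it off under load and confirm that reality agrees with the model.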

    Important clarification No. 1: all comprehensive tests are carried out under load. In 99% of cases, heat guns are installed right in the server hall and the data center is “burned in”: this is how we check the quality of the engineering systems.

    Important clarification No. 2: the main power source of a data center is its diesel generator sets. The utility grid is an alternative, “cheap” source, so we carry out all comprehensive checks on diesel.

    One of the key systems in any data center is the automation in the main switchboard and the DGU. This system should be checked very carefully. A typical defect: the transfer to the DGU does not happen when the utility input is disconnected. This occurs because the DGU is installed by one team and the automation by another, and the equipment does not work together.

    When the system is debugged, you need to prepare a table of settings and write down the automatic transfer (ATS) algorithms. If you have a very good and responsible contractor (designer, builder) who documents everything himself, all the better. Otherwise, do not be lazy and write down the following points yourself:

    1. after how many seconds the command to start the DGU is issued;
    2. after how many seconds the transfer to the DGU takes place;
    3. items 1 and 2 in the reverse direction (return to the utility grid).
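
    Once these points are written down, each test run can be checked against them. Here is a minimal Python sketch of such a settings table. All parameter names, delay values and the tolerance are made-up placeholders; the real numbers come from your own project and the contractor's documentation.

```python
# Hypothetical automatic transfer settings, in seconds (placeholders only).
SETTINGS = {
    "utility_loss_to_dgu_start_command": 5.0,
    "utility_loss_to_transfer_to_dgu": 30.0,
    "utility_return_to_transfer_back": 60.0,
    "transfer_back_to_dgu_stop_command": 300.0,
}


def check_timings(settings, measured, tolerance_s=1.0):
    """Return {parameter: True/False} for each documented delay.

    A parameter passes if the measured delay is within tolerance_s
    of the documented setting; a missing measurement counts as a failure.
    """
    result = {}
    for name, expected in settings.items():
        value = measured.get(name)
        result[name] = value is not None and abs(value - expected) <= tolerance_s
    return result


if __name__ == "__main__":
    # Values written down with a stopwatch during a test run (invented here).
    measured = {
        "utility_loss_to_dgu_start_command": 6.5,
        "utility_loss_to_transfer_to_dgu": 30.4,
        "utility_return_to_transfer_back": 60.0,
    }
    for name, ok in check_timings(SETTINGS, measured).items():
        print(("OK  " if ok else "FAIL"), name)
```

    Treating a missing measurement as a failure is deliberate: a delay that nobody measured has not been accepted.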

    Below is an approximate algorithm for one of the checks used by us and by the Uptime Institute.
    1. Transfer from the utility grid to the DGU group and measure the parameters.
    2. Transfer back.
    3. Completely disconnect one of the DGUs (switch off its links and breakers) and see how the system starts without the redundant diesel. Problems caused by incorrectly configured automation may show up here.
    4. Once the diesel generators have been checked, keep running on them and carry out the remaining power tests.
    5. Switch off one UPS and watch how the load moves to the other feed. Transfer to bypass and back, and discharge the batteries.
    6. Continue working through the diagram step by step, switching off the distribution boards.


    Then we check the air conditioning system. We switch off the air conditioners one by one and, if they have a built-in automatic transfer (ATS) function, we check that too.

    If the air conditioners are configured to work as a group and to switch automatically between main and backup units, be sure to check how this works. We:

    • disconnect all the links;
    • reboot the controller responsible for the switchover;
    • turn off the distribution switch that connects the air conditioners;
    • test the automation, since failures occur here all too often;
    • do everything that could be written into a novel called “50 Shades of the Data Center”.

    For the glycol system, it is imperative to check the hydraulics by turning off pumps and taking one of the heat exchangers and one or more sections of the piping out of service.


    Here you can see that each panel is labeled and provided with brief instructions.

    Important: if switching is done manually, the contractor must provide a procedure. The labels on valves and gate valves must indicate the working positions (normally open, normally closed).

    Contractors often say: “This was not in the test plan you provided.” You can answer: accidents do not provide plans :)

    Funny situations also occur. For example, while a UPS is being tested on battery discharge, an angry air conditioning engineer may come running:

    - What are you doing, you Herods?! Why did you turn off the pumps?
    - We have not turned anything off, we are testing the UPS.
    - Then why are you abusing the chillers? They can break!
    - That is exactly why we test: to find such weak spots.


    Another frequent test is a check of the fire suppression system. To do this, we disconnect all the automation from the cylinders and test how the discharge directions work. It happens that the directions are mixed up or that opening/closing does not work.

    Do not forget about the monitoring system (we wrote more about it here and here). As soon as we turn something on or off, the change must appear on the panel. We also check whether the monitoring starts to lag under a large number of alarms.

    Be sure to test the power supply of the monitoring system itself. Under no circumstances should you lose control of the data center in an emergency.
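
    One simple way to verify that monitoring reflects every switching operation is to keep a log of test actions and compare it with the alarm log afterwards. The Python sketch below assumes both logs are lists of (time in seconds, device label) pairs and that an alarm should arrive within a chosen delay. The log format, the device names and the 10-second limit are assumptions made for the example, not features of any particular monitoring product.

```python
def missing_in_monitoring(actions, alarms, max_delay_s=10):
    """Return the test actions for which no alarm appeared in time.

    actions, alarms: lists of (time_s, device) tuples.
    An action is covered if an alarm for the same device was raised
    between 0 and max_delay_s seconds after it.
    """
    missing = []
    for t_action, device in actions:
        seen = any(
            dev == device and 0 <= t_alarm - t_action <= max_delay_s
            for t_alarm, dev in alarms
        )
        if not seen:
            missing.append((t_action, device))
    return missing


if __name__ == "__main__":
    # Invented logs: the UPS alarm arrives 35 seconds late.
    actions = [(0, "CRAC-34"), (60, "UPS-A")]
    alarms = [(3, "CRAC-34"), (95, "UPS-A")]
    print(missing_in_monitoring(actions, alarms))
```

    Late alarms are reported together with absent ones, which also helps to catch the monitoring system lagging under a flood of events.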

    Doing everything with the builder's hands


    At the beginning, I wrote that acceptance tests should be carried out by external experts. But there are things that should be entrusted directly to the contractor: demonstration switching of equipment on and off (as well as some other work). The receiving party walks around with a checklist and records the results. Like this:

    • The receiving party says: “We need to turn off air conditioner No. 34. Colleagues, please turn it off and show us how you do it.”
    • The builder demonstrates and explains.
    • The receiving party records the result.

    This is simply good manners.


    It's a question of time




    As you can see, acceptance testing is a long process. Its duration depends heavily on the size of the data center and the amount of equipment, so below I give average figures (for a data center with 50-100 racks).

    • Documentation check: 3-5 business days of work by strong design engineers.
    • Stand-alone checks: 3-5 days per iteration, since you need to check every element of the data center and give the contractor time to correct errors. How many iterations there will be, only God knows.
    • Comprehensive checks: 2-3 days, if everything works properly.

    Of course, these figures are very approximate. Do not count on fitting into 2-3 weeks. Sometimes the checks can take several months.
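
    To see why 2-3 weeks is optimistic, the figures above can be added up for different numbers of stand-alone iterations. The Python sketch below does only that arithmetic. It treats all figures as working days and assumes the stages run strictly one after another, which is a simplification.

```python
# Ranges from the list above, in days: (minimum, maximum).
DOCUMENTATION = (3, 5)
STANDALONE_PER_ITERATION = (3, 5)
COMPREHENSIVE = (2, 3)


def estimate_days(iterations):
    """Return (min_days, max_days) for a given number of stand-alone iterations."""
    low = DOCUMENTATION[0] + iterations * STANDALONE_PER_ITERATION[0] + COMPREHENSIVE[0]
    high = DOCUMENTATION[1] + iterations * STANDALONE_PER_ITERATION[1] + COMPREHENSIVE[1]
    return low, high


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        low, high = estimate_days(n)
        print(f"{n} iteration(s): {low}-{high} days")
```

    With a single iteration the total is 8-13 days, but by the fourth iteration it is already 17-28 days, and this still assumes that nothing waits on equipment deliveries or rework.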

    A hall is built and a system is launched: you can run a stand-alone check. Once it has been checked and everything fixed, the next system is launched. It is checked in turn, and the acceptance reports are signed. And then, when everything is ready, a comprehensive check is carried out.

    About how we planted pipes




    This story happened to our technical director Sergey Mishchuk, mentioned earlier.

    Once he was accepting a data center within DataLine and inspecting one of the new halls. I was young and lettuce-green back then, walking around with a clever look and taking notes. The first thing he did was ask for a raised floor tile to be lifted. The builders lift it, and under it lies a meter-long piece of pipe 10 cm in diameter.

    The builders immediately clutch their heads: “It wasn't there, everything was checked, it wasn't there!” The origin of the pipe remained a mystery, and the builders agreed that Mishchuk himself had secretly carried it in the sleeve of his shirt. A summer shirt. A white one.

    A few years later, when I was no longer so young but still somewhat green, a new data center was being inspected. This time I was the first to ask for the raised floor to be lifted.

    What do you think was lying there? Right, a pipe. Four times smaller, but just as superfluous and mysterious.

    The builders remained fully convinced that we plant the pipes ourselves in order to torment them. I did not dissuade them: the main thing is that everything gets removed.

    Moral of the story: no matter how experienced and professional you are, there will always be a mysterious pipe, a faulty circuit breaker or an unreadable label. Do not be too lazy to check everything with the utmost meticulousness “here and now”, so that later, when critical IT equipment is running in your data center, you do not have to rush around fixing defects on the fly. Professionalism is not only about building a high-quality system, but also about testing that it works.

    If you have any questions, I will be happy to answer them in the comments.
