Power supply of IT equipment: safety or continuity? Part 2

    This article continues the series whose purpose is to share experience and to show the key features and common mistakes in designing and organizing the power supply subsystems of an IT infrastructure and of the data center as a whole. This time I would also like to widen the audience a little and devote several sections to the basics of electrical safety and the protection of equipment and people.

    If you missed the first part, or want to refresh your memory, you can find it here.

    If you already know what circuit breakers and RCDs are, what they are for, what they protect and what they protect against, skip straight to the section “Do you need RCDs for IT equipment, server room, data center?”.

    Part two

    Let's look at how the power system relates to the end IT equipment, and work out which power interruptions the operating system is guaranteed to survive without failures.

    Switching to a backup power source

    The power supply of information equipment is organized with redundancy. Consider how power is organized along the chain UPS distribution board - power distribution unit (PDU) - power supply unit (PSU). Redundancy comes in the following types:

    1. Redundant cable lines to the rack and equipment, using separate power distribution units (PDUs) (Fig. 1)
    2. Redundant power buses in the power distribution board, using separate PDUs (Fig. 2)
    3. Redundancy at the power supply level directly in the server, switch or other IT device (Fig. 3)
    4. Redundancy using a rack-mount automatic transfer switch (ATS) (Fig. 4)

    The following can be used to switch between the main and backup inputs:

    • in information systems: ATS / STS (Static Transfer Switch) cabinets for high-power systems, which switch the load to a backup UPS in a full 2N system or in combinations of N+1 systems;
    • in power supply systems: ATS circuits of various types (contactor-based, controller-based);
    • at the server rack level: a high-speed rack-mount automatic transfer switch (ATS);
    • at the level of individual information equipment: redundant power supplies.

    As we quoted above, for IT equipment “a break in power supply is not allowed”. But what is hidden behind this phrase? What counts as a “break” in the power supply of information equipment? Let's look at a real example.

    A Customer is building a local server room together with the IT infrastructure of two floors of the company's office. While the power supply system is being discussed, the Customer wants to buy all the information equipment with a single power supply unit (PSU) each, leave the second PSU slot in the servers empty, and install one rack-mount ATS for the entire rack (Fig. 4, diagram).

    Appearance of the back side of the server with duplicated power supply units.
    How the Customer justified this request:

    • Cost savings ($500-800 per device in the rack)
    • Two simple PDUs can be installed to distribute power downstream of the ATS
    • Exactly the same level of system reliability as with the classic distribution scheme

    We took some time and studied the Customer's request in detail from several angles, including the reliability of the services as a whole during the warranty and post-warranty periods, as well as:

    • capital cost (savings) at the implementation stage (CAPEX)
    • the cost of depreciation, of keeping a spare parts kit, and of the labor of the Customer's personnel (OPEX)
    • a comparison of the operating algorithms and of the switching time to the backup line in both options, with a check for “single points of failure”
    • the risk that the operating systems of the information equipment hang and/or reboot, bringing down the information services running on them.

    And it turned out that:

    According to GOST 32144-2013 (Electrical energy. Electromagnetic compatibility of technical equipment. Power quality limits in public power supply systems; in force since July 1, 2014), the main cause of failures in information equipment can be voltage dips, which
    “usually occur due to faults in electrical networks or in consumers' electrical installations, as well as when a powerful load is connected”.

    Reading further:
    “the duration of a voltage dip can be up to 1 minute”.
    This tells us that information equipment should be backed by a UPS and/or a high-speed ATS: voltage dips of this duration are acceptable and normal from the point of view of the power grid, but fatal for IT equipment and services.

    By the way, the current regulatory framework of the Russian Federation contains contradictions regarding the measurement of power quality values. You can read more in the article by our company's technical director, Viktor Cherdak (source: digitalsubstation.com).

    Some excerpts from the article:

    In recent years, state standards for measuring the parameters of electrical energy related to power quality (PQ) have been actively developed and repeatedly revised.

    An important change was the replacement of GOST 13109-97 “Electrical energy. Electromagnetic compatibility of technical equipment. Power quality limits in public power supply systems” [16] with GOST 32144-2013. These standards define different sets of power quality indicators.

    But how fast does the switchover have to be? How do we determine the time in milliseconds during which the Customer's service (and server) will not go down and the operating system will not hit a critical error?

    There is the CBEMA (Computer and Business Equipment Manufacturers Association) curve, which after some adjustments is now known as the “ITIC curve” (Information Technology Industry Council); variants of it are included in the IEEE 446 / ANSI standards. According to these standards, electronic power supply circuits must remain operable for 20 ms (0.02 seconds, that is, one period of a 50 Hz supply).

    Those very ITIC curves.

    According to the Server System Infrastructure (SSI) requirements for server and computer systems, the power supply parameter Tvout_holdup means that after the input voltage fails the information equipment will keep operating for at least 21 ms. In other words, one full mains period is the guaranteed time of normal operation of a server or switch. The Tpwok_holdup parameter is specified as at least 20 ms.

    Some details on the SSI parameters can be found here.
    Note: hold-up time is the interval during which a power supply can keep its output voltages within specified limits after its input voltage disappears. In most computer power supplies, the hold-up time also determines how soon the power good signal (PWR_OK) tells the system that the voltages produced by the power supply are unstable (for computer power supplies this parameter is usually more than 16 ms).

    Here is one of the tables from the document.

    And this is a timing diagram (time-line) with the specified PSU operating algorithms.

    Now let's see what switching time APC states, for example, for the AP7721 rack-mount automatic transfer switch. The typical figure is 8-12 ms, and the maximum switching time is 18 ms.

    We can conclude that the time a rack-mount ATS needs to switch to the backup input fits within the hold-up time in the power supply specification of server equipment. So the switchover should not cause failures in the operation of the information equipment.
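
    The comparison above can be written down as a small check. This is only a sketch: the numbers are the ones quoted in this article (8-12 ms typical and 18 ms maximum for the AP7721, 21 ms Tvout_holdup and 20 ms Tpwok_holdup from SSI), while the names holdup_margin_ms and rides_through are made up for illustration and do not come from any vendor tool.

```python
# Figures quoted in the article, in milliseconds.
ATS_TRANSFER_MS = {"typical, low": 8, "typical, high": 12, "maximum": 18}
PSU_HOLDUP_MS = {"Tvout_holdup": 21, "Tpwok_holdup": 20}


def holdup_margin_ms(transfer_ms: float, holdup_ms: float) -> float:
    """Hold-up time left over after the transfer to the backup input."""
    return holdup_ms - transfer_ms


def rides_through(transfer_ms: float, holdup_ms: float) -> bool:
    """True if the PSU keeps its outputs up for the whole transfer."""
    return holdup_margin_ms(transfer_ms, holdup_ms) > 0


# Use the shorter of the two hold-up figures as the conservative limit.
worst_holdup = min(PSU_HOLDUP_MS.values())

for label, transfer in ATS_TRANSFER_MS.items():
    margin = holdup_margin_ms(transfer, worst_holdup)
    ok = rides_through(transfer, worst_holdup)
    print(f"{label}: transfer {transfer} ms, margin {margin} ms, ok={ok}")
```

    With the quoted figures, the worst case (18 ms against 20 ms) still passes, but with a margin of only 2 ms. A voltage dip of up to 1 minute, as allowed by the GOST, obviously does not pass without a UPS or an ATS.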

    Summary of the timings of the elements of the system

    What about the economics? Which of the options is more cost-effective and more fault tolerant?

    Suppose the rack holds three small servers, each of which can take two power supplies, and three devices with non-redundant power supplies. All of them are critical, and the failure of any device brings down the Customer's entire system. We need a rack-mount ATS in any case. It costs about 18 thousand rubles.

    The Customer states that PDUs are not needed, so the budget contains only the cost of the ATS, the same 18 thousand rubles. Instead of power distribution units, the Customer suggests using the outlets on the rack-mount ATS itself. The Customer also plans to buy servers with two PSU slots but fitted with only one power supply each, to save money (Fig. 4).

    The classic option (Fig. 3) requires two PDUs for about 32 thousand rubles, three additional server power supplies at $500 each for 84 thousand rubles in total, and an ATS for the same 18 thousand rubles. Adding everything up, the classic solution costs the Customer approximately 134 thousand rubles.
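
    For readers who want to replay the arithmetic, here is a minimal sketch of the initial budgets of the two options. The prices are the ones quoted above, in thousands of rubles; the dictionary layout and the name total_cost are just an illustration.

```python
# Initial purchase prices quoted in the article, in thousands of rubles.
customer_option = {
    "rack-mount ATS": 18,
}
classic_option = {
    "2 x rack PDU": 32,
    "3 x additional server PSU": 84,
    "rack-mount ATS": 18,
}


def total_cost(items: dict) -> int:
    """Sum the line items of one option."""
    return sum(items.values())


customer_total = total_cost(customer_option)
classic_total = total_cost(classic_option)
difference = classic_total - customer_total

print(f"Customer's option: {customer_total} thousand rubles")
print(f"Classic option:    {classic_total} thousand rubles")
print(f"Difference:        {difference} thousand rubles")
```

    This covers only the purchase price on day one. Spare parts and downtime are not included here.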

    It seems the Customer is right: the difference in money is substantial. But let's look at both options from the point of view of fault tolerance and ease of maintenance.
    The Customer's option. The single point of failure is the rack-mount ATS. If something happens to it, we lose the entire rack. So a spare unit must be kept on site, which adds 18 thousand rubles to the estimate. Each server has only one power supply, and these are points of failure too. It is therefore desirable to keep at least one, and preferably all three, spare power supplies on site. Let us assume that we keep three power supplies in the spare parts kit: that is another 36 thousand rubles. The power that the rack-mount ATS can switch also has to be checked. For now we assume that 3 kW, or 16 A, is enough for all the equipment in the rack. If we need a 32 A (7 kW) ATS, it will be much more expensive (more than 100 thousand rubles). So, once reliability is examined in detail, the Customer's budget option grows to 160 thousand rubles. And even then, in an emergency, although the spare parts are on site, downtime is needed to replace the failed device.
    Single point of failure (SPOF) - a node, communication line or object of a data availability system whose failure can disable the entire system or make data unavailable.
    The Open Technologies option. As in Fig. 3, but with an ATS added where necessary for small network equipment with a single power supply.

    The point of failure is the same ATS. If something happens to it in the Customer's scheme, the entire rack is lost, and we agree that a spare unit must be kept on site. But in our case, a failure of the ATS alone can affect only the switches and auxiliary equipment. The servers themselves keep working. Spare power supplies in the kit are not needed: if one of the redundant power supplies fails, the server continues to run on the remaining one and can most likely wait for a new power supply from the vendor, however remote the site is.

    Interpretation of the term SPOF as applied to IT systems
    Single point of failure (SPOF) - a node, device or point in a scheme whose failure can take down the entire system and make data and services unavailable. It is considered in the development and design of any critical system. Eliminating single points of failure completely leads to a significant increase in capital costs at implementation, so how critical a given system or service is gets determined at the design stage, based on the project budget and on the wishes and requirements of the Customer. We always find the best-fitting solution for each Customer by working out several options for implementing the project and proposing them to the Customer. As a result, when the project is handed over, the Customer receives exactly the solution they wanted in terms of price, quality and reliability.

    Thus, connecting all the equipment in the rack to a single ATS is possible but not rational, because it creates a single point of failure in the power supply. Buying servers with redundant power supply units is preferable in any case, since fault tolerance at the level of the information equipment increases significantly.

    A rack-mount ATS provides correct and almost instantaneous switching to the backup input: the information equipment will not even notice it, and software and operating systems will keep running correctly. Rack-mount power distribution units are needed in any case, and there is no point in saving on them. Apparent savings in capital costs on power distribution can lead to intractable problems in operation, for example the need to shut down the entire rack just to move the ATS to another rack unit or to inspect it. Ideally, a spare parts kit should be kept for redundant power supplies as well, but this is not always possible or affordable.

    Appearance of removable server power supply:

    Using a rack-mount ATS has its own peculiarities
    For example, the power of such an ATS is limited, so it can switch only a set of loads that are comparatively light in terms of power consumption. There are also questions about the number of output power connectors. For example, the AP7721 ATS mentioned above has C14 input connectors, which means a maximum switched power of 2.5 kW. For higher loads there is the 2U model AP7724, which has a 32 A input connector, so the maximum power of the equipment can be up to 7 kW. This means that a typical equipment rack can be connected to this ATS in full. However, such a solution costs more than 100 thousand rubles.
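
    A rough way to apply these limits is to add up the expected load of the rack and see which ATS covers it. The sketch below uses the limits named in this article (2.5 kW for the AP7721, 7 kW for the AP7724). The device loads are hypothetical, and the helper smallest_suitable_ats is an illustrative name, not part of any APC tool. Real sizing should rely on measured consumption and the vendor datasheet.

```python
# Maximum switched power as stated in the article, in kW.
ATS_LIMITS_KW = {"AP7721": 2.5, "AP7724": 7.0}


def smallest_suitable_ats(load_kw: float, limits: dict = ATS_LIMITS_KW):
    """Return the model with the lowest limit that still covers load_kw.

    Returns None if no listed model can carry the load.
    """
    candidates = [(limit, model) for model, limit in limits.items()
                  if limit >= load_kw]
    return min(candidates)[1] if candidates else None


# Hypothetical rack: three servers and three small network devices, in kW.
hypothetical_loads_kw = [0.45, 0.45, 0.45, 0.15, 0.15, 0.10]
rack_load_kw = sum(hypothetical_loads_kw)

print(f"Rack load: {rack_load_kw:.2f} kW")
print(f"Smallest suitable ATS: {smallest_suitable_ats(rack_load_kw)}")
```

    The check deliberately ignores safety margins and inrush currents, so treat a result close to the limit as a reason to look at the larger model.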

    How information equipment works with two power supplies was described well in the article by Vadim Sinitsky @dimskiy. As you can see, there are advantages and disadvantages. Redundant power supplies in information equipment are necessary in any case, especially if the site lies outside the zone where the vendor can deliver a replacement power supply quickly. We would also like to note that vendors' online calculators for the power consumption of new servers can serve only as a guideline for system administrators and the Customer's personnel.

    The real possibility of connecting a new powerful server to an existing rack should be assessed with regard to the original power supply design and the current state and load of the rack, the servers, the UPS, the generator and so on. As for the connection to the rack, it is also worth considering:

    • the current capabilities of the PDUs, for example whether they have free outlets
    • the ratings of the circuit breakers in the distribution boards, and the cross-section and phase of the cable line to the rack.

    Particular attention should be paid to the reliability of the server power supply system. If it is built according to the scheme shown in Fig. 2 (with two bus systems), a new powerful server may, during repair work, overload the entire power supply system, reduce the battery runtime of the UPS, force the UPS to transfer to bypass on overload, and so on.
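
    The overload risk in a two-bus scheme is easy to miss because each feed looks lightly loaded in normal operation. The sketch below shows the idea under two stated assumptions: servers with two power supplies split their load equally between the feeds, and when one feed is taken out for repair the whole load moves to the other. All numbers and function names are hypothetical.

```python
def load_after_feed_loss(load_a_kw: float, load_b_kw: float) -> float:
    """Load on the surviving feed once the other feed is switched off."""
    return load_a_kw + load_b_kw


def overloaded_after_feed_loss(load_a_kw: float, load_b_kw: float,
                               feed_capacity_kw: float) -> bool:
    """True if a single feed cannot carry the combined load."""
    return load_after_feed_loss(load_a_kw, load_b_kw) > feed_capacity_kw


# Hypothetical figures, for illustration only.
FEED_CAPACITY_KW = 5.0      # what one feed (bus, UPS, breaker) can carry
existing = (2.0, 2.0)       # current load on feed A and feed B
NEW_SERVER_KW = 1.2         # new server, shared equally by both feeds

with_new = tuple(load + NEW_SERVER_KW / 2 for load in existing)

print("Before:", overloaded_after_feed_loss(*existing, FEED_CAPACITY_KW))
print("After: ", overloaded_after_feed_loss(*with_new, FEED_CAPACITY_KW))
```

    In this example each feed carries only 2.6 kW after the new server is added, well below its capacity, yet the combined 5.2 kW no longer fits on one feed during repair work.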

    And how is power distribution organized in your racks?
    What is the service life of the PSUs in your IT equipment, and how does their software redundancy algorithm work?
    Which PDUs do you prefer: basic or monitored? How useful is the “managed PDU” function in practice, and has it ever helped you?

    Author: Oleg Kulikov
    Leading Design Engineer
    Department of integration solutions
    "Open Technologies"
    Registration in the National Register of Specialists "NOPRIZ" P-045870
