How 3,000 rubles and a few simple methods for improving data center efficiency saved a ton of money

    During my work, I have often encountered corporate data centers running short of resources, with complaints like: "We don't have enough physical space for the equipment," "We don't have enough supplied power," and so on. Solving such problems head-on leads to the obvious answer: shut down and decommission part of the IT equipment, or replace it with equipment that has a better performance/consumption/footprint ratio.

    In most cases, it turns out that resources are actually abundant but used, to put it mildly, wastefully. The problem often lies in plain negligence, or in the corporate data center growing expansively, so to speak, by inherited principles. Decisions are never checked against efficient use of available resources, organizations have no methodology for such checks, and the result is what it is.

    If you have decided for yourself that you can't go on living like this, I recommend starting with the blogs of companies such as Krok, Beeline, and Data Line. There you can find articles where they share their experience with energy efficiency. Their methods work: the PUE of commercial sites is in the 1.3-1.4 range (for some, even lower), which for a Tier III facility is an excellent result. At some point, however, you will realize that they have their own party going on, with megawatts, reserves, and experienced staff. And there is no place for you at it.
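For reference, PUE (Power Usage Effectiveness) is simply total facility power divided by IT equipment power, with 1.0 as the theoretical ideal. A minimal sketch with invented numbers:

```python
def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power.
    Commercial sites mentioned above reach 1.3-1.4; 2.0+ is common in
    unoptimized corporate server rooms."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw

# Hypothetical small site: 200 kW of IT load plus 120 kW for cooling,
# UPS losses and lighting gives a PUE of 1.6.
print(round(pue(320.0, 200.0), 2))  # -> 1.6
```

Every kilowatt shaved off the non-IT overhead moves this ratio toward 1.0 without touching a single server.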

    So what can mere mortals do, those whose data center is 10 racks and 200 kW of power, who are always short of hands and time?

    Ideally, you need an easy-to-understand checklist that you pick up and take for a walk around your site, making notes. This document should also help you assess, at least roughly, the impact of each proposed method on efficiency (you have neither the experience nor the best practices yourself). It would be nice if the methods were broken down by life-cycle stage: you are about to buy servers and storage, say, so you open the appropriate section of the manual and find recommendations on the parameters of the hardware to purchase.

    I won't keep you in suspense: such a document exists, and it is called the "EU Code of Conduct on Data Centres". I must say right away that I have almost never met anyone who uses it in their work, which surprises me a great deal. It is freely available online.

    So, what is this document, and why will it be useful to you:

    1. It is a collection of best practices for improving data center efficiency, written with the participation of experts from various fields.
    2. It is well structured by data center life-cycle stage, which makes it easy to prepare for, say, the replacement of IT equipment.
    3. It is well structured by subsystem. So if you have a server maintenance team, they can easily assess their own contribution.
    4. Every practice carries a potential-impact rating from 1 to 5 (1 = minimal impact, 5 = maximum). This lets you evaluate practices at the planning stage, weighing implementation cost against expected return.
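That impact rating folds naturally into a rough prioritization: rank practices by impact per unit of cost. A minimal sketch; the practice numbers come from the document, but the cost figures and the 5.1.4 rating are invented for illustration:

```python
# Rank practices by impact-to-cost ratio.
# (name, impact 1-5, relative implementation cost) -- costs are made up.
practices = [
    ("4.3.1 audit of unused equipment", 5, 0.1),  # near-zero cost
    ("5.1.4 blanking panels", 4, 1.0),            # rating assumed here
    ("5.1.2 aisle containment", 5, 10.0),
]

ranked = sorted(practices, key=lambda p: p[1] / p[2], reverse=True)
for name, impact, cost in ranked:
    print(f"{name}: impact {impact}, cost {cost}, score {impact / cost:.1f}")
```

Even this crude score reproduces the article's advice: the cheap "green" audits come out on top, big retrofits last.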

    I propose we go through the document, figure out how to work with it, and look at a couple of examples.
    First, though, a small warning. Reliability and energy efficiency are two parameters that often pull your data center in opposite directions (not always, but often). Take raising the room temperature: it reduces air conditioning consumption, but the cooling fans in the servers spin faster, which raises server consumption (oops...). It also shortens the life of the fans themselves, and when that runs out, the fans stop and the server's temperature follows them up. So approach any change carefully, track its impact on adjacent systems, and always have a plan to roll back to the starting point.

    So, we take the manual and start reading. Go straight to paragraph 2.2 on page 3, where the color coding of the practices is explained.


    Green - approaches, audits, monitoring, and the like. The most effective items in terms of material investment. Most involve either minimal investment (5.1.4, installing blanking panels in cabinets) or no investment at all, just changes in operational approach (4.3.1, auditing unused equipment).

    Red - the introduction of new software. Mostly trivial advice, such as "check that background processes are not hanging around loading the CPU." You can safely skip it. Although, if you run hundreds of applications...
    Yellow - what to look for when purchasing new IT equipment.

    Blue - what to do at the next refurbishment or scheduled maintenance. There are examples of so-called "retrofit", i.e. upgrades to existing equipment. For example, when the UPS batteries are due for replacement, swap the lead-acid ones for Li-Ion, which lets you abandon the dedicated air conditioning and free up part of the floor space. Or, when servicing an air conditioner, install a fan speed controller.

    White - optional practices, compliance with which is not required of candidates.
    A small digression is needed here. The manual was created for operators who want to join the voluntary program "The European Code of Conduct for Data Centres". Hence the term "candidate" used throughout the document; it should not bother you. The "white" practices contain good recommendations on approaches to operating and building a data center.

    Next, jump straight to chapter 3 on page 9. From there, work through the document sequentially. The subsystems are described in order of their influence on the data center's energy consumption (IT equipment, cooling, power, and so on).

    Let's try to apply and mentally test practices of different colors from different subsystems.
    "Green", paragraph 4.3.1. Impact: 5. It recommends auditing the equipment in use, where it is installed, and the services it provides. Ridiculous as it may sound, in many organizations I have come across the situation where, asked "what is this server for?", every engineer just shrugged. And that is in server rooms with 30 servers at most. Not to mention servers running a service used by three people in the whole organization. Seriously, especially if you have recently joined the company, take a look at the server fleet from this angle.

    Paragraph 4.3.2 follows naturally. Impact: 5. "Decommission unused equipment and conduct regular audits for idle devices."

    A wonderful one, paragraph 4.3.8. Impact: 4. "Audit the environmental requirements of the equipment. Label such equipment for replacement or relocation." Suppose you have some fresh servers, say for ERP, and some older ones with strict temperature requirements: no higher than 25 degrees. They stand there and work, but they don't let you raise the room temperature. Then one day the ERP running on the fresh servers grows and demands more powerful hardware. A new server is bought that replaces a couple of the old ones. In this case, the manual recommends not listing the replaced servers on eBay, but using them to replace the ancient machines with the temperature restrictions. In effect, you migrate not one service to new hardware but several, decommissioning the oldest hardware along the way, even though you did the upgrade for the sake of ERP. In short, look deeper and further.

    A green item, 5.1.4, installing blanking panels in cabinets; and along with it, 5.1.7 and 5.1.8. At minimal cost, you can seriously reduce the mixing of hot and cold air and increase cooling efficiency.

    Now let's move on to the section on mechanical systems (cooling). Clause 5.1.2. Impact: 5. This paragraph suggests separating the hot and cold air flows by means of hot/cold aisle containment. The practice is "blue", i.e. a retrofit. Although the manual recommends doing such upgrades during planned downtime, this particular work can be carried out on a live data center, since you only touch the cabinet structures. There are now solutions for building containment corridors with virtually no tools and no drilling. And once again, remember the interdependencies: once containment is in place, revisit the air conditioner settings; at the very least, you can probably raise the supply air temperature setpoint. And you can immediately make a note against paragraphs 5.4.2.4 (Impact: 2) and 5.5.

    The yellow practices are almost entirely concentrated in subsections 4.1 and 4.2. They mostly concern the procurement of IT equipment. It just so happens that engineering systems have a lifespan of at least 10 years, so what you have now you can only upgrade (i.e., the "blue" practices). IT equipment changes much more often, so the "yellow" practices can be applied as early as next quarter. Take these recommendations as examples. "When drafting the technical specification for new hardware, pay attention to its operating temperature range." This lays the groundwork for energy management without being constrained by your servers, storage, and so on. "Require built-in monitoring of power consumption and of temperature at the server's air intake." This lets you gradually move from capacity assessment based on nameplate data to assessment based on real-time data. Naturally, all this will require changing your monitoring and reporting approaches, which are outlined in Chapter 9.
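The gap between nameplate and measured planning is easy to see with numbers. A sketch with invented figures; in practice the per-server readings would come from the built-in sensors (BMC/IPMI) that the yellow practices tell you to require:

```python
# Compare capacity planning by nameplate rating vs. measured draw.
# Server names and numbers below are invented for illustration.
servers = [
    {"name": "srv-01", "nameplate_w": 750, "measured_w": 310},
    {"name": "srv-02", "nameplate_w": 750, "measured_w": 280},
    {"name": "srv-03", "nameplate_w": 500, "measured_w": 190},
]

nameplate_total = sum(s["nameplate_w"] for s in servers)
measured_total = sum(s["measured_w"] for s in servers)

print(f"Nameplate: {nameplate_total} W, measured: {measured_total} W")
print(f"Headroom hidden by nameplate planning: {nameplate_total - measured_total} W")
```

Planning by nameplate here reserves 2 kW for what actually draws under 800 W; that hidden headroom is exactly where "we have no power left" often comes from.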

    I do not cover the "red" practices, given my disdain for them. I would be glad if someone could demonstrate their effectiveness in the comments.

    The white practices are absolute hardcore for a corporate data center. Slogans like "Go for ASHRAE class A4!", "Blow air straight in from the street!", and "Run without a UPS!" are everywhere. This is exactly the case where playing at energy efficiency reduces reliability.

    Summary:

    1. The suggested practices are simple enough to understand and implement; this is not rocket science. You can start right now.
    2. Start with the "green" methods. They have high impact, are simple and cheap, and will change your approach to planning and operation, which in most neglected cases gives a quick, visible effect.
    3. Naturally, move from the most impactful practices (5) to the least (1).
    4. Make a plan. Implementing the "green" methods will give you a complete picture of what you have now, including the technologies you use. Draw up a modernization plan for every subsystem you use, referencing the relevant points of the manual. Do a budget assessment of the changes, apply correction factors based on the impact ratings, and you will get a prioritized action plan.
    5. Do not forget that the systems are interconnected; track their mutual influence. To do that, start monitoring everything you can reach.

    And I almost forgot about the case from the title.

    Company X asked us to estimate the budget for expanding its corporate data center into additional space. They needed to install 2 heavily loaded racks. According to them, there was no physical room for the racks in the operating hall, no spare cooling capacity, and the UPSs were already running at 85% of rated capacity, which would not be enough. We drew up the budget; it came to a very serious sum. Then we went to look at the site. The inspection revealed the following:

    1. The hall with 40 racks used air distribution through the raised floor. There was no aisle containment, and the cabinets had many empty units not covered by blanking panels. That made the cooling capacity of the existing system more or less clear, and at the same time a solution to the physical placement problem appeared.

    2. We looked at the UPS logs and saw that the load grows at night. Logically, it should decrease, or at least stay roughly the same. The pattern strongly suggests backups or updates of databases or applications. However, it turned out that applications are updated only on weekends, the databases live their own quiet lives, and for two years now backups have been replicated in real time to another site. In theory. In practice, it turned out that someone had failed to decommission part of the old backup infrastructure. Right there we calculated that switching off this unneeded hardware would free up the required kilowatts.

    3. We asked: "Will you order an audit, or have you figured everything out yourselves?" "We have, we have," they answered, and disappeared for a long time.
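The night-time anomaly in point 2 is the kind of thing a trivial script over the UPS log can flag automatically. A sketch, with an invented hourly (hour, load_kW) log format standing in for whatever your UPS exports:

```python
# Flag when night-time UPS load exceeds the daytime average -- the
# pattern that exposed the forgotten backup infrastructure.
# The log below is invented: 80 kW by day, 95 kW at night.
log = [(h, 80.0) for h in range(8, 20)] + \
      [(h, 95.0) for h in list(range(0, 8)) + list(range(20, 24))]

day = [kw for h, kw in log if 8 <= h < 20]
night = [kw for h, kw in log if h < 8 or h >= 20]
day_avg = sum(day) / len(day)
night_avg = sum(night) / len(night)

# 5% tolerance so normal fluctuation doesn't trigger the alert.
if night_avg > day_avg * 1.05:
    print(f"Suspicious: night avg {night_avg:.0f} kW > day avg {day_avg:.0f} kW")
```

Nothing about this requires a monitoring platform; a cron job and a CSV export are enough to catch a load profile that makes no business sense.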

    After our conversation, the customer, with two of his engineers, spent a couple of weeks clearing out the mess that had accumulated over two years. Cold-aisle containment structures and blanking panels for the cabinets were ordered and made. The redundant hardware was physically decommissioned, and in the process several more unused servers were found. The wiring under the raised floor was tidied up. As a result, they got the kilowatts and rack units they needed, even with a margin. Our costs came to 3,131 rubles for gasoline, plus working hours. But we did not bill the customer for that; it would have been uncivil.

    And in the end, they never did install those high-load racks.
