
Temperature control in the data center: why hotter is sometimes possible
Today we’ll talk about data center cooling. A group of researchers from the University of Toronto published a study of an approach to data center cooling in which the temperature is deliberately raised. We decided to look at what the work found and what it means in practice. / photo Emilio Küffer CC
Data centers now account for a significant share of energy consumption and carbon emissions. A large part of that energy goes to cooling, which was the main motivation for research into temperature management. Interestingly, it is still not entirely clear what temperature a data center should be kept at.

Most companies set the temperature recommended by their equipment suppliers, but it is unclear how raising it affects the systems. At the same time, studies suggest that raising the temperature by just 1 degree can cut energy consumption by 2-5%.
This is why the researchers set out to answer the question of how temperature in data centers should be managed. They collected an extensive data set from production equipment, which made it possible to study how temperature affects the hardware, including the reliability of the storage subsystem, the memory subsystem, and the server as a whole.
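To get a feel for the 2-5% per degree figure quoted above, here is a small back-of-the-envelope sketch. It assumes the saving compounds with each additional degree, which the source does not state; the function name and the 5-degree example are purely illustrative.

```python
def remaining_energy(delta_t_c: float, saving_per_degree: float) -> float:
    """Fraction of the original energy use left after raising the setpoint.

    Assumption (not from the study): each extra degree saves the same
    fraction of whatever consumption is left, so savings compound.
    """
    return (1.0 - saving_per_degree) ** delta_t_c


# Lower and upper bound of the quoted range, for a hypothetical 5-degree increase.
for rate in (0.02, 0.05):
    saved = 1.0 - remaining_energy(5, rate)
    print(f"{rate:.0%} per degree over 5 degrees: about {saved:.1%} saved")
```

Under this assumption, a 5-degree increase would save roughly 10% at the low end of the range and roughly 23% at the high end.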
Foreword
Although raising the temperature in the data center looks like the easiest way to save electricity and reduce carbon emissions, it raises several concerns. One of them is a possible drop in system reliability. Unfortunately, detailed information about the effect of high temperatures on servers is scarce, and what exists is contradictory.
Some studies claim that every 10 °C above 21 °C increases the likelihood of electronics failure by 50%. Others say that every 15 °C doubles the failure rate of hard drives, while a more recent Google study found that, on the contrary, it is low temperatures that do more harm to storage devices.
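The two rules of thumb quoted above can be put side by side with a few lines of code. This is only a sketch: it assumes both rules start from the same 21 °C baseline and grow smoothly (exponentially) between the quoted steps, neither of which the source specifies, and the 36 °C comparison point is arbitrary.

```python
def failure_multiplier(temp_c: float, base_temp_c: float,
                       factor: float, step_c: float) -> float:
    """Relative failure rate at temp_c if the rate is multiplied
    by `factor` for every `step_c` degrees above base_temp_c."""
    return factor ** ((temp_c - base_temp_c) / step_c)


# Rule A: +50% for every 10 °C above 21 °C.
rule_a = failure_multiplier(36, 21, factor=1.5, step_c=10)
# Rule B: doubling for every 15 °C (baseline of 21 °C is an assumption).
rule_b = failure_multiplier(36, 21, factor=2.0, step_c=15)

print(f"Rule A at 36 °C: x{rule_a:.2f}, rule B at 36 °C: x{rule_b:.2f}")
```

Even over a modest 15 °C span the two estimates already differ, which illustrates why the published guidance is hard to act on.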
Higher temperatures also raise a second concern: lower server performance. Once the temperature reaches a critical point, the processor starts throttling its clock and the fans spin faster. Higher temperatures can also increase leakage power, so overall energy consumption goes up.
Temperature and Reliability
Let's first look at two hardware components, hard disks and DRAM, because they are the ones replaced most often in modern data centers.
Temperature and latent sector errors (LSE)
A latent sector error (LSE) is one of the most common types of disk error: individual sectors become inaccessible, and the data stored on them is lost unless the system has redundancy and can restore it. About 3-4% of all drives experience LSEs, and the share only grows as drive capacities increase.
Equipment reliability depends on a great many factors (load, humidity, voltage fluctuations, device maintenance), so the results for each drive model were broken down by data center. As expected, the likelihood of LSEs increases with temperature. However, the increase is much slower than standard estimation models predict, for example a model based on the Arrhenius equation. Such models assume an exponential relationship between temperature and the number of errors, which amounts to a doubling of the failure rate for every additional 10-15 °C.
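For reference, the Arrhenius-style model mentioned above is usually written as an acceleration factor between two temperatures, AF = exp(Ea / k * (1/T1 - 1/T2)), with temperatures in kelvin. The sketch below evaluates it; the activation energy of 0.5 eV is an illustrative assumption, not a value taken from the study.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_low_c: float, t_high_c: float,
                           activation_energy_ev: float) -> float:
    """Factor by which the failure rate grows between two temperatures
    according to the Arrhenius model (inputs in degrees Celsius)."""
    t_low_k = t_low_c + 273.15
    t_high_k = t_high_c + 273.15
    exponent = activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / t_low_k - 1.0 / t_high_k)
    return math.exp(exponent)


# Hypothetical activation energy of 0.5 eV, going from 30 °C to 40 °C.
af = arrhenius_acceleration(30, 40, activation_energy_ev=0.5)
print(f"Predicted failure-rate multiplier: {af:.2f}")
```

With this assumed activation energy, the model predicts that the failure rate nearly doubles over 10 °C, which is the kind of growth the field data did not show.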
The researchers ran a statistical analysis and found that, for drives that already have LSEs, higher temperatures do not increase the number of LSEs. This suggests that the causes of latent sector errors are the same for cold and hot drives. At the same time, the LSE rate for a single drive model can vary from one data center to another.
Within the age range covered by the data, 0 to 36 months, older drives are as likely to develop LSEs as new ones. The researchers measured read load as the number of read operations per month and placed a drive in the low-load group if that number was below the median for the data set, and in the high-load group otherwise. Based on this analysis, they concluded that disk utilization does not affect how the likelihood of LSEs changes with temperature.
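The median split described above is easy to sketch. The drive names and read counts below are invented for illustration, and the function is a minimal reading of the description rather than the researchers' actual code.

```python
from statistics import median


def split_by_load(monthly_reads: dict[str, float]) -> tuple[float, set[str], set[str]]:
    """Split drives into low-load and high-load groups around the median.

    monthly_reads maps a drive id to its read operations per month.
    Drives strictly below the median go to the low-load group,
    all others to the high-load group.
    """
    cutoff = median(monthly_reads.values())
    low = {drive for drive, ops in monthly_reads.items() if ops < cutoff}
    high = set(monthly_reads) - low
    return cutoff, low, high


# Invented sample data, not taken from the study.
sample = {"disk-a": 120, "disk-b": 950, "disk-c": 400, "disk-d": 3100}
cutoff, low, high = split_by_load(sample)
print(cutoff, sorted(low), sorted(high))
```

Splitting at the median keeps the two groups roughly equal in size, so any temperature effect can be compared between them without one group dominating the statistics.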
Temperature and disk failures
This section looks at how temperature affects disk failure rates. To answer the question as fully as possible, the analysis took into account the workload as well as differences between drive models and data centers. It is based on data for 5 different drive models, collected from January 2007 to May 2009 in 19 different Google data centers.
For temperatures below 50 °C, the disk failure rate grows much more slowly than classic models predict, and the increase in failures with temperature is insignificant. Following the same methodology as for LSEs, the drives were grouped by load and by age. As it turned out, neither factor significantly affects the frequency of disk failures.
The effect of temperature on performance
To study the effect of ambient temperature on server performance, the researchers built a test bench with a thermal chamber. The chamber was large enough to hold an entire server and allowed the temperature to be controlled in the range from -10 °C to 60 °C with an accuracy of 0.1 °C.
The experiment used one of the most popular servers, a Dell PowerEdge R710. It has a quad-core 2.26 GHz Intel Xeon 5520 processor with 8 MB of L3 cache and 16 GB of DDR3 ECC memory, and runs Ubuntu 10.04 Server with the Linux kernel 2.6.32-28-server. Hard drives (SAS and SATA) from different vendors were connected to it.
The researchers ran a series of load tests using microbenchmarks and macrobenchmarks designed to simulate the workload of real applications. The benchmarks and workloads used were STREAM, GUPS, Dhrystone, Whetstone, random write / random read, sequential write / sequential read, OLTP-Mem, OLTP-Disk, DSS-Mem, DSS-Disk, PostMark, and BLAST.
All SAS drives and one SATA drive (Hitachi Deskstar) show some drop in performance at high temperatures, ranging from 5-10% to 30%. Since the drop occurs in the same temperature range for all models (rather than at an arbitrary moment) and none of the drives reported errors, the likely cause of the degradation is that the drives' protective write mechanisms kick in.
Increased server power consumption
Raising the temperature of the air entering electronic equipment can affect how much power the equipment consumes. Many IT equipment manufacturers have their devices increase fan speed once the ambient temperature reaches a certain threshold.
Although power consumption varies greatly across workloads, it starts to rise when the ambient temperature reaches 30 °C and keeps rising up to 40 °C. The increase amounts to 50%, which is substantial.
Here we can say with confidence that the differences in power consumption come from the fans: fan speed rises at the same temperatures at which power consumption rises. So power consumption grows with ambient temperature mostly because the fans spin faster, while leakage power contributes very little.
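Why can fans alone account for such a large increase? A rough way to see it is the idealized fan affinity law, under which fan power grows with the cube of rotation speed. The sketch below applies that law; the law itself, the 10% baseline fan share, and the 1.8x speed-up are assumptions for illustration and do not come from the study.

```python
def fan_power_ratio(speed_ratio: float) -> float:
    """Idealized fan affinity law: power scales with the cube of speed."""
    return speed_ratio ** 3


def server_power_growth(fan_share: float, speed_ratio: float) -> float:
    """Relative growth in total server power when only the fans speed up.

    fan_share is the fraction of baseline server power drawn by the fans
    (a hypothetical input); everything else is assumed to stay constant.
    """
    return fan_share * (fan_power_ratio(speed_ratio) - 1.0)


# Hypothetical server: fans draw 10% of power at baseline, then spin 1.8x faster.
growth = server_power_growth(fan_share=0.10, speed_ratio=1.8)
print(f"Total power grows by about {growth:.0%}")
```

Under these assumed numbers, a fan that starts out as a small part of the power budget ends up adding close to half of the baseline consumption, so a fan-driven increase of the size reported above is plausible. Real fans deviate from the ideal cubic law, so this is an order-of-magnitude argument only.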
Conclusions
Raising the temperature in data centers could save a huge amount of energy and reduce carbon emissions. Unfortunately, the risks involved are not fully understood, so many data centers keep the room temperature low. Yet temperature affects equipment reliability much less than expected: DRAM errors and server node failures are only weakly associated with high temperatures.
These encouraging results make it worth looking at other temperature-related effects, for example the increase in power consumption of individual servers as the temperature of their intake air rises. The study found that this is caused by the cooling fans spinning faster, while leakage power is negligible. Much of this energy is wasted because of poorly designed fan speed control algorithms.
However, the picture is not simple enough to allow general recommendations or predictions about what the temperature in a data center should be or how much energy can be saved. The answers depend on too many factors related to the data center's location and purpose. Still, most organizations can “warm up” their equipment a little without sacrificing system performance or reliability.