About Intel's Server Fleet, Clusters, and Data Centers

    I bring to your attention an interview with Sergey Kuznetsov, head of the IT department of Intel's Moscow and St. Petersburg offices. Sergey shared many interesting details about his work and about the company's infrastructure as a whole; the conversation turned out to be wide-ranging and informative.

    - Many of the "old-timers" of Intel IT Galaxy know you from the "server room", but both they and the newer members of the community will be interested to hear about your new job and position. Please tell us a little about your unit.

    - Intel offers career growth and rotation to all of its employees. In particular, I used to support the laboratory in the Moscow office and was the technical manager of the innovation center. Currently I am the head of the IT department of Intel's Moscow and St. Petersburg offices. This department is responsible for coordinating the corporation's global programs on these local sites and for their day-to-day operation. My responsibilities also include planning IT resources, developing the sites in line with the corporate strategy and the needs of local business units, and improving the efficiency of the server equipment available on the sites.

    - The range of tasks is impressive. How is the new position working out, and what can you say about ways of increasing the efficiency of your work?

    - After moving to the post of head of the IT department I, of course, felt a change in the load and in the range of tasks to be addressed. If you are responsible for a single service, then your area of responsibility is relatively small and limited to that service. As soon as you take charge of a group of people and a group of services, it turns out that you are responsible not only for the productivity of the department itself, but also for planning its activities, for the reliability of the services it provides, for the efficiency of the people who support those services, and for many of the operational issues mentioned above. Along with technical tasks, many diplomatic ones also have to be handled. Compared with the work of a technical specialist, the duties of a manager are more diverse, and there is a great deal of work to do.

       As for the effectiveness of my own work: when I was responsible for a particular service or for supporting a group of users, it was quite enough to perform my daily functional duties. As soon as you become responsible for a serious area of work, for a whole branch, personal effectiveness alone is no longer enough. You have to keep several notebooks tracking the projects under way on the site, watch your personal calendar very closely, and follow current tasks and requests from the various groups of users of our services, so that not a single request goes unnoticed. Here, as never before, discipline in recording every incoming task is what matters.

    - The load is heavy; is there enough time for everything? Do you and your employees have to work at night?

    - I will not reveal a big secret by saying that the employees of Intel's IT department, as in many other companies, work irregular hours. If a service goes down, we of course do not leave users in trouble: we are obliged to keep the services we provide running at any time of day or night. To handle such situations faster, we have emergency recovery plans that we follow regardless of when an incident occurs. At the same time, Intel has the concept of a balance between work and the personal life of its employees, and we try to get proper rest; for example, we arrange team events. And if an employee was busy after hours setting up an important service, then, by agreement with his management, he can come in later the next day.

    - Let's talk about the structure of the Intel server fleet, about those servers that are in your department's area of responsibility.

    - Unlike many large corporations engaged in production and marketing for the consumer market, Intel is an R&D (Research & Development) company. Our business is based not only on production but also on research. Accordingly, in Intel's server fleet the number of machines intended for research and various kinds of computing significantly exceeds the number of infrastructure servers that support the business and the corporate IT environment. We have three key server segments: global servers responsible for the global infrastructure, local infrastructure servers that keep users in individual branches working, and servers intended for research. The latter are further divided into two categories: compute servers for computation, and performance servers for interactive work and for measuring the performance of various applications. There are also servers related to production, but there are no factories in Russia, so let us leave them aside.

       Global functions such as e-mail, Internet, IM services, the SharePoint infrastructure, Project servers, SAP services, and business-process support are all assigned to global servers. And then, on each local site, we have groups that need servers to support their research: version control systems and code quality control systems, local database servers, servers for local Web applications, and service systems.

       A few years ago Intel organized a working group to develop strategies and solutions for optimizing server resources, based on information about the existing data centers and the usage models on all sites. Small data centers were consolidated into larger ones, and small sites were given the ability to work with them over data networks. Bear in mind that for small representative offices, where only sales and marketing staff are present and there are no large research groups, separate data centers are not created. Thus, over the past few years we have moved from a focus on consolidating data centers to a strategy focused on evaluating and optimizing the value of a local data center for the business.

       In Russia, for example, one data center was organized on each of the sites. We cannot do without them, because in each of Intel's branches in Russia (the research and development branches; we are not talking about sales offices) serious development work is under way that requires a great deal of work with locally located servers: interactive, compute, and performance-measurement machines. However, IT is currently actively researching the requirements and possibilities for working with remotely located research servers. In particular, several groups in Russia are already using computing resources located at remote sites.

    - Intel IT Galaxy community members already know that the winners of the "3 Days with IT@Intel" contest will have the opportunity to get acquainted with the Intel data center in Nizhny Novgorod. What will they see there; is it really a serious, modern data center?


    - Intel's data center development strategy implies that each data center is organized and equipped with the latest technology, and serious investments are made in its creation. Therefore any of our data centers is a serious facility, incorporating industrial-scale ventilation and an uninterruptible power infrastructure (based both on batteries and on diesel generators, which can keep things running for an arbitrarily long time as long as fuel keeps being delivered). We always evaluate the place where a data center is planned to be located in terms of how efficiently the power and cooling systems can be implemented.

       The state of the systems located in the data center is constantly monitored, and the temperature is kept optimal in terms of energy consumption: not so low that extra money is spent on cooling, yet completely safe for the functioning of the servers and other equipment. Incidentally, the allowed temperature band is quite narrow, which in itself is evidence of a competently organized data center; in less efficiently organized data centers it is impossible to hold temperature within such a narrow range. I want to highlight the use of hot-and-cold-aisle design, where the racks are oriented according to the direction of the air leaving them and the air flows are organized so that cooling is as efficient as possible. Many factors are considered: the physical distribution of servers across the racks is optimized in terms of power consumption, phase load, and occupied floor space. For safety reasons, for example, heavy servers are not placed at the top of a rack with the bottom left empty; otherwise the rack becomes unbalanced and may tip over from even a slight jolt (say, during an earthquake) if the rack mounting fixtures are damaged.
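       The interview does not describe the monitoring tooling itself, so the following is only a minimal sketch of the idea of watching a narrow temperature band; the sensor names, thresholds, and data source are assumptions for illustration.

```python
# Hypothetical illustration only: flag temperature readings that drift
# outside a deliberately narrow operating band. Sensor names, thresholds,
# and the data source are assumptions, not Intel's actual monitoring stack.

LOW_C = 22.0    # below this, money is being wasted on extra cooling
HIGH_C = 27.0   # above this, the safety margin for the equipment shrinks

def check_band(readings):
    """Return alert strings for sensors outside the [LOW_C, HIGH_C] band."""
    alerts = []
    for sensor, temp_c in readings.items():
        if temp_c < LOW_C:
            alerts.append(f"{sensor}: {temp_c:.1f} C, overcooled")
        elif temp_c > HIGH_C:
            alerts.append(f"{sensor}: {temp_c:.1f} C, too hot, check aisle airflow")
    return alerts

if __name__ == "__main__":
    sample = {"cold-aisle-1": 23.5, "cold-aisle-2": 28.2, "cold-aisle-3": 21.0}
    for alert in check_band(sample):
        print(alert)
```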

       Our data centers always provide a high degree of operational safety. Access to them, of course, is carefully controlled, and employees go through a series of security trainings covering, on the one hand, the data center equipment and the company's intellectual property stored there, and, on the other, the physical safety of data center personnel and the safety of the work being carried out.

    - How many servers are in your area of responsibility, are there any clusters?

    - The number of servers on each site is determined by the needs of the local business. Usually there is a fairly small number of infrastructure servers, a few dozen, which support the functioning of the business and are responsible for various IT functions. In addition, depending on the number of engineering research groups on the site and the nature of the tasks they solve, there is a significant number of research servers; their count can reach several hundred or even thousands.

       The server equipment available in Russia is also used for resource-intensive computing. Naturally, to increase utilization, the machines are combined into clusters, including clusters based on blade systems. Such solutions can be deployed within each branch where necessary, but recently the tendency has been to use remote resources. For example, we have built a very powerful compute cluster in Nizhny Novgorod, and for some resource-intensive batch computations it is preferable to work with it. Because we try to load this large computing pool with batch jobs from other sites as well, we manage to achieve quite high utilization of the resources placed there.

       But geographic consolidation of resources does not eliminate the need for local data centers, because WAN latency so far remains too high for running interactive applications remotely. Remote servers are much harder to use for interactive research work; users feel discomfort at delays of 100 ms or more. The volume of local work does not always allow the server capacity to be used as efficiently as possible, so for interactive servers in the laboratories we are currently taking measures to increase their energy efficiency, such as automatically shutting down unused machines at night and consolidating low-powered servers.
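       The mechanism behind that nightly shutdown is not described in the interview; as a rough sketch of the idea, a scheduled job might look something like this. The host list, the load-average idle criterion, and the use of ssh/sudo are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Rough sketch of a nightly job that powers down idle lab servers.

The host list, the idle criterion, and the use of ssh/sudo are
illustrative assumptions, not the actual mechanism used at Intel.
"""
import subprocess

LAB_HOSTS = ["lab-perf-01", "lab-perf-02"]   # hypothetical host names
IDLE_LOAD_THRESHOLD = 0.1                    # 1-minute load average

def load_average(host):
    """Read the 1-minute load average from a remote host over ssh."""
    out = subprocess.check_output(
        ["ssh", host, "cat", "/proc/loadavg"], text=True, timeout=30
    )
    return float(out.split()[0])

def main():
    for host in LAB_HOSTS:
        try:
            if load_average(host) < IDLE_LOAD_THRESHOLD:
                # Schedule a clean shutdown; machines would be powered back on
                # before the working day (e.g. via wake-on-LAN or IPMI).
                subprocess.run(["ssh", host, "sudo", "shutdown", "-h", "+5"], check=True)
                print(f"{host}: idle, shutdown scheduled")
            else:
                print(f"{host}: busy, left running")
        except (subprocess.SubprocessError, ValueError) as exc:
            print(f"{host}: skipped ({exc})")

if __name__ == "__main__":
    main()
```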

    - How often is the server park updated, how exactly does this happen? How does the introduction of new platforms and technologies affect the number of servers?

    - Intel follows a strategy based on a four-year server life cycle. In the first quarter of its life cycle, new equipment is installed in the data center and current services are migrated to it. Then comes normal operation. Somewhere around the end of the third year of a server's life, planning for its decommissioning begins. In the last quarter of the fourth year, new equipment is installed to replace the old systems, and the transfer of services, the migration, is planned.
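       As a quick worked illustration of the four-year cycle just described (a sketch only, not Intel's actual planning tooling), the key milestones can be derived directly from a server's installation date:

```python
from datetime import date

def add_months(d, months):
    """Return the first day of the month `months` after date `d`."""
    year, month = divmod(d.month - 1 + months, 12)
    return date(d.year + year, month + 1, 1)

def lifecycle_milestones(installed):
    """Approximate milestones of the four-year life cycle described above."""
    return {
        "installed, services migrated in": installed,
        "decommissioning planning starts": add_months(installed, 36),  # around the end of year 3
        "replacement hardware installed":  add_months(installed, 45),  # last quarter of year 4
        "retired":                         add_months(installed, 48),
    }

if __name__ == "__main__":
    for step, when in lifecycle_milestones(date(2010, 7, 1)).items():
        print(f"{step}: {when:%Y-%m}")
```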

       One of the interesting points is how we choose the architectures and servers to be used. Every year brings new technologies, new models of server hardware, and new brands. Each year the owners of services for particular projects study the new platforms and equipment from various manufacturers and run comparative tests. As a result, server models and configurations are selected and approved as "corporate platforms" for organizing particular services. In other words, the selected optimal configurations become the recommended ones for purchasing and deploying the corresponding services throughout the year. A year later the procedure is repeated.

       Improving the efficiency with which the company uses its computing equipment is another interesting topic. The work here goes in two directions. First, we reduce the number of servers by increasing their computing power. If our servers are busy with serious computational work, then we need computing power adequate to those tasks, and the fewer physical servers it takes to do the same amount of work, the better for energy consumption, cooling equipment, and the floor space they occupy. Right now we are actively buying and deploying servers with the new Intel Xeon processors, which are very effective for consolidating computing resources, and we are replacing four-year-old servers with them at a ratio of roughly ten old machines to one new one.
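       To make that roughly ten-to-one consolidation figure concrete, here is a back-of-the-envelope estimate; the per-server power and rack numbers are assumptions for the example, not measured Intel data.

```python
# Back-of-the-envelope estimate of a roughly 10-to-1 server consolidation.
# Per-server power draw and rack figures are assumptions for the example only.
old_servers = 200          # four-year-old 1U machines being retired
consolidation_ratio = 10   # one new Xeon-based server replaces about ten old ones
old_power_w = 350          # assumed average draw of an old server, watts
new_power_w = 450          # assumed average draw of a new server, watts

new_servers = old_servers // consolidation_ratio
old_total_kw = old_servers * old_power_w / 1000
new_total_kw = new_servers * new_power_w / 1000

print(f"servers: {old_servers} -> {new_servers}")
print(f"power:   {old_total_kw:.1f} kW -> {new_total_kw:.1f} kW "
      f"({100 * (1 - new_total_kw / old_total_kw):.0f}% less)")
print(f"rack units freed (assuming 1U each): {old_servers - new_servers}")
```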


       Secondly, the question arises of what to do with the infrastructure servers. The fact is that, for the most part, infrastructure servers do not use the full power of modern processors. Examples are file servers, which are mostly I/O-bound and work intensively with disk arrays, or hosting servers that are not used very heavily.

       The company, of course, seeks to increase the efficiency of such equipment. For this we use virtualization: we take one powerful machine based on the new Intel Xeon processors and run several virtual servers on it. Moreover, if we take an identical second machine, deploy similar virtual servers on it, and combine all this into a cluster, we get a fault-tolerant system. Even if one physical system fails, the virtual servers continue to work; if a virtual machine crashes, we can easily restore the infrastructure server from an image stored in our system, or its functions are taken over by another virtual server. In addition to virtualization, we consolidate services on a single machine. If one of our services is used by a single workgroup and not very intensively, we survey other workgroups and, for example, load that hosting Web server with services from other groups. If the services do not conflict, we simply install additional services on the same physical server without virtualization, where that makes sense, to raise processor utilization. There are certain infrastructure servers that we have not yet virtualized: for them the technology is still being tested, and for now, for reasons of efficiency, it has been decided to leave some services on physical servers.
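       The interview does not name the virtualization stack in use, so the sketch below uses libvirt/KVM purely as a stand-in to illustrate the idea of bringing a crashed virtual infrastructure server back up from its stored definition; the guest names are hypothetical.

```python
#!/usr/bin/env python3
"""Illustrative watchdog: restart defined-but-stopped virtual servers.

libvirt/KVM is used here only as a stand-in; the interview does not name
the virtualization stack, and the guest names below are hypothetical.
"""
import libvirt  # pip install libvirt-python

WATCHED_GUESTS = ["web-hosting-vm", "file-svc-vm"]

def ensure_running(conn, name):
    try:
        dom = conn.lookupByName(name)
    except libvirt.libvirtError:
        print(f"{name}: not defined on this host (perhaps failed over elsewhere)")
        return
    if dom.isActive():
        print(f"{name}: running")
    else:
        dom.create()  # boot the guest again from its stored definition
        print(f"{name}: was down, restarted")

if __name__ == "__main__":
    conn = libvirt.open("qemu:///system")
    try:
        for guest in WATCHED_GUESTS:
            ensure_running(conn, guest)
    finally:
        conn.close()
```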

       The infrastructure we support has to offer maximum stability, compatibility with current applications, and reliable operation. Dedicated groups evaluate the available solutions, test them for compatibility with existing infrastructure services, and approve particular models and configurations as the recommended corporate platform for particular services.

    - Is virtualization used on computing servers?

    - When we talk about infrastructure, we keep in mind that the company's business, the day-to-day life of the offices, and the productivity of our programmers and developers all depend on the infrastructure servers. The laboratories, on the other hand, in most cases serve as a kind of proving ground both for IT and for the developers, and it is, of course, in our laboratories that we try the most advanced solutions first. There we are happy to use the latest developments in virtualization. For batch processing within a single branch, virtualization is not very interesting: it has to be understood that a certain amount of resources is spent on keeping the cloud itself alive, whereas without virtualization we can devote all of the computing power to processing the data.

    - The process of updating the server fleet is also tied to questions of energy efficiency and energy saving. Presumably the transition to new platforms has a noticeable effect?

    - Our infrastructure services include processes with different computing demands. For highly resource-intensive processes we use high-performance servers based on top-end processors of the latest line, and thanks to this we consolidate less efficient servers. As a result we get maximum performance per unit of data center floor area. For less resource-intensive infrastructure applications we consider buying machines from the Nehalem line with the most aggressive energy-saving technologies, which give us the best performance per watt consumed. In addition, we are now exploring not only ways to raise equipment utilization, but also the possibility of automatically powering down servers that, due to the specifics of our business processes, have no work at night.

    - Does Intel use "data center in a container" technology? Is there a practice of deploying mobile data centers to cover temporary peaks in demand for computing resources?

    - We have studied the use of the so-called “data centers in containers”, including the question of how these solutions are suitable for working in a real corporate environment, with our real infrastructure. A number of tests were carried out, but their final results have not yet been reported. However, there is no doubt that this technology is very interesting and can be useful if it is necessary to urgently deploy data centers in a new area or during natural disasters. In particular, the “data center in the container”, which was used by our engineers to evaluate its effectiveness, was sent to Haiti to the Emergency Control Center as humanitarian aid.

    - We have touched on the topic of disasters... Unfortunately, even major accidents are becoming more and more likely. Tell us how important resources and data are backed up and made redundant at Intel.


    - I don't know of any plans to use container data centers if some disaster were to put our Russian data centers out of service. As for more realistic scenarios, for redundancy of servers, services, and data center systems, we certainly have such capabilities. We have redundant power supply and ventilation systems. Each infrastructure server and service has an owner who regularly prepares and updates disaster recovery and mitigation plans that spell out what must be done if that server or service fails.

       All server infrastructure equipment in Intel data centers is serviced by the manufacturers. We buy servers with service contracts, which lets us not worry about their hardware. As for troubles that can be caused by various conflicts within the systems, the company takes the redundancy of services and data very seriously. We have well-established procedures, proven by long experience, for restoring operation after various incidents; everyone knows what needs to be done and in which cases, and the relevant trainings are held regularly.

    - The greater the processing power, the more data is processed, and it all has to be stored somewhere. Can you tell us something about Intel's data storage?

    - Yes, this is one of the important components of any data center. Best practice recommends that the file array be a separate specialized device whose main function is data storage.

       Such storage systems, of course, exist in each of our data centers. De facto they are large RAID arrays configured in specialized devices: disk shelves connected by high-speed optical channels to a controller device. The latter is equipped with network interfaces of considerable bandwidth and has enough processing power to handle a very large number of file requests. If ordinary servers stuffed with disks were used here instead of specialized devices, we would never achieve the same request-processing speed or the same level of reliability. However, specialized industrial-grade solutions are not only very fast but also very expensive. They have a closed architecture; in essence each is a conglomerate of hardware, architecture, interfaces, and its own specialized operating system. The tight coordination of these components ensures high performance and reliability, but the price matches.

       In practice, of course, we would like to reduce the cost of storing data. For example, our research groups often request additional storage space for their information, but because of the high cost of solutions of this class we cannot always satisfy their needs; we have to weigh requirements against available budgets. To store less important data we can afford to use ordinary servers with disk arrays. It is certainly of interest to us to analyze the possibility of using different storage systems, including ones built on our new Intel Xeon C5500 and C3500 processors.

       Backup is also an interesting topic. We use special software and hardware systems capable of managing more than a hundred tape cassettes. The company has a policy of off-site storage of backup copies, so even if an accident in one of the branches takes the whole data center down, the information can be restored from a backup kept in external storage. Yes, data volumes are growing, and you cannot cope simply by adding capacity and cassettes. We therefore strive to optimize what gets backed up. For example, by agreement with the owners of the information and based on a risk analysis, IT may choose not to back up intermediate computation results and other process data whose lifetime is measured in days.
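       As a toy illustration of that kind of agreement with data owners (the directory names and file markers are assumptions, not Intel's actual backup policy), a pre-backup filter might simply skip transient computation output:

```python
import os

TRANSIENT_DIRS = {"scratch", "tmp", "intermediate"}   # owner-agreed transient areas (hypothetical)
TRANSIENT_SUFFIXES = (".tmp", ".partial")             # hypothetical markers for short-lived files

def files_to_back_up(root):
    """Yield paths worth backing up, skipping agreed-upon short-lived data."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune whole subtrees the data owners agreed not to back up.
        dirnames[:] = [d for d in dirnames if d not in TRANSIENT_DIRS]
        for fname in filenames:
            if fname.endswith(TRANSIENT_SUFFIXES):
                continue
            yield os.path.join(dirpath, fname)

if __name__ == "__main__":
    for path in files_to_back_up("/srv/research-data"):   # hypothetical root
        print(path)
```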

    - Among the questions asked before the interview, the topic of interserver connections was discussed, which can also become a bottleneck with the growth of computing power and the density of servers. Is the problem being solved?

    - Of course the problem exists, and there are ways to address it. For example, for serious cluster computing we have long used blade systems, which eliminate the interserver-connection bottleneck thanks to their own built-in interconnect for exchanging data between servers. Also, all the servers we purchase come with two gigabit ports by default, and I can say that even with very active use of a server we have not managed to generate a data stream powerful enough to saturate both gigabit ports on one machine. As for storage systems, as mentioned, they use specialized solutions with much higher bandwidth.

    - Thanks for the informative conversation!

       Well, that's all :) We will soon publish something else interesting! In the meantime, despite the heat, I urge everyone to take part in the "summer" Intel contest, with pleasant summer prizes.
    Good luck!
