Choosing a data center: what to look for

    Today, a growing number of domestic companies face the problem of selecting a data center that meets all the needs of their business, whether for renting IT infrastructure or for hosting and centrally maintaining their own equipment. Each company, of course, has its own criteria for data center reliability. In some respects they are similar and in others they differ, but one requirement is universal: every component of the IT infrastructure must work stably. Otherwise, at best the company will operate inefficiently, and at worst many business processes will simply stop.
    In this article I want to describe what deserves special attention when choosing a data center and which questions to ask in order to get a reasonably complete picture of its reliability, without relying on the operator's claims of compliance with Tier standards.
    The Tier classification itself defines four levels of data center reliability.

     
    Reliability level    Availability    Downtime per year
    Level I              99.671%         28.8 hours
    Level II             99.749%         22 hours
    Level III            99.982%         1.6 hours
    Level IV             99.995%         0.4 hours
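
    The downtime figures follow directly from the availability figures: annual downtime is the unavailable fraction of the year multiplied by the number of hours in a year. A minimal sketch of that arithmetic, assuming a 365-day year (8,760 hours); the function name is illustrative and not part of the Tier classification:

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the expected downtime per year, in hours, for a given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# Availability figures taken from the table above
for level, availability in [("I", 99.671), ("II", 99.749), ("III", 99.982), ("IV", 99.995)]:
    print(f"Level {level}: {annual_downtime_hours(availability):.1f} hours per year")
```

    Rounded to one decimal place, the results match the downtime column of the table.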

     
    The best approach, of course, is to engage a consulting company, which will audit the sites you have shortlisted and deliver a conclusion on whether a particular data center suits your business. This type of consulting is becoming increasingly popular in Russia, but the vast majority of companies still prefer to save on a service that is so important to the business and to survey data centers on their own.

    Infrastructure Resiliency


    As a rule, data center operators limit themselves to a general statement of their facility's fault-tolerance level, although often not all of the data center's systems and subsystems actually have the declared redundancy scheme. In data centers that have passed Uptime Institute certification, the reliability level of all engineering systems does, of course, fully conform to the established standard. At the time of writing, however, only two data centers in Russia have both an officially certified design and certified implemented engineering solutions (both at the Tier III fault-tolerance level): Sberbank's "South Port" data center and DataSpace. Still, and this is important to understand, in Russia even certification by the respected American Uptime Institute does not guarantee continuity of service, especially in the event of an accident. That is a subject for a separate discussion; today we will talk about the hundreds of other Russian data centers.
    To understand how closely a data center's actual reliability matches what the operator declares, compile a table listing the key components of the data center infrastructure and send it to the candidates you have selected to fill in.
     
    Below is a short list of questions that I recommend putting to the data center operator.
     
    Architectural part:
    • the owner of the building (premises) that houses the data center, and the lease term;
    • floor load capacity;
    • finishing materials used on walls and ceilings;
    • availability of a freight elevator and a loading and unloading area;
    • fire resistance of walls and doors.

    Power supply system:
    • the number of feeds from the transformer substation, their capacity and reliability category;
    • the number of feeds from different transformer substations and the load carried by each;
    • diesel generator sets: availability, capacity, running time without refueling, start-up time and time to reach full power, fuel supply contracts, redundancy level;
    • UPS: availability, battery runtime, redundancy level;
    • how the air conditioners are connected to power.

    Air conditioning system:
    • air conditioners used: manufacturer, quantity and redundancy level;
    • temperature conditions maintained;
    • smoke exhaust system and pressure relief valves.

    Automatic fire extinguishing system:
    • availability of an automatic fire extinguishing system, type of extinguishing agent, availability of a reserve supply;
    • availability of a fire alarm system, number and types of sensors.

    Security systems:
    • availability of an access control system;
    • availability of a video surveillance system;
    • site access procedure.

    Technical support:
    • the number of specialists and engineers on site during and outside working hours;
    • working hours of technical support staff;
    • request response time;
    • availability of a multi-line telephone number, a ticket system and a web interface.
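
    The question list above can be turned into the table that is sent to each candidate. A minimal sketch, assuming a CSV file is an acceptable format; the section names follow the list above, while the abbreviated item wording, the column names and the build_questionnaire helper are illustrative assumptions:

```python
import csv
import io

# Sections follow the list above; items are abbreviated for illustration.
QUESTIONS = {
    "Architectural part": ["Building owner and lease term", "Floor load capacity"],
    "Power supply system": ["Number of feeds and their capacity", "Diesel generator sets", "UPS"],
    "Air conditioning system": ["Units, manufacturer, redundancy level"],
    "Automatic fire extinguishing system": ["Extinguishing agent and reserve"],
    "Security systems": ["Access control", "Video surveillance"],
    "Technical support": ["Staff on site", "Request response time"],
}


def build_questionnaire(questions: dict[str, list[str]]) -> str:
    """Render the questions as CSV text with an empty answer column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Section", "Question", "Operator's answer"])
    for section, items in questions.items():
        for item in items:
            writer.writerow([section, item, ""])
    return buffer.getvalue()


print(build_questionnaire(QUESTIONS))
```

    Sending the same file to every candidate keeps the answers comparable row by row once the completed tables come back.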

     

    Emergency plan


    After receiving the completed tables describing the infrastructure from all the data center operators you are considering, visit each facility and see everything with your own eyes. Arrange in advance for a competent representative of the operator's technical service, someone able to answer most of your questions, to be among those accompanying you.
    During the tour, do not hesitate to ask how the duty staff act in routine and emergency situations. Walk through various emergency scenarios and ask for a minute-by-minute account of what the on-duty engineers will do in each case, both during and outside working hours. This will show how well prepared and trained the operator's technical specialists are.
    An important condition for confirming the claimed reliability class is that the operator has step-by-step emergency instructions for its duty personnel. Be sure to read these instructions: they will tell you how quickly typical and non-standard emergencies will be resolved.
    Having visited many data centers as a potential customer, I regret to say that operators pay very little attention to preparing such emergency plans. Very few have the appropriate documentation, and fewer still keep it up to date and consistent with their actual staffing.
    Ask whether the facility has a round-the-clock technical support service, how many specialists it has and what responsibilities are assigned to them. Most often the engineers on site can perform only basic actions, such as pressing a button to restart a server or connecting a KVM, while on-call specialists are summoned from home to solve more serious problems outside working hours. Clearly, this extends the time needed to resolve an accident by at least the time it takes the competent employee to reach the data center.

    Emergency response exercises


    Of course, a flow chart of procedures that is not backed by practical experience is unlikely to be of use in an emergency. Such documents should be continually refined and updated based on the results of comprehensive exercises and training in preventing and resolving emergencies, preferably held at least two to three times a year.
    Regular staff training and simulation of various emergencies are a direct indication of how well prepared the data center personnel are and how responsibly the operator runs the facility. Operators that have developed up-to-date regulations on personnel actions in emergencies are uncommon, and operators that conduct exercises regularly are harder still to find: many limit themselves to a test start of the diesel generator set once a month.
    In recent years the number of new data centers has been growing exponentially, while our country still has few competent specialists with real practical experience in data center operation. Owners of new sites therefore sometimes try to acquire the necessary knowledge during operation, which inevitably leads to a data center shutdown. In Russia, for some reason, it is customary to handle most problems in-house and to bring in professionals only after an emergency has already occurred.

    Preventive maintenance and repair


    Routine preventive maintenance of the infrastructure will minimize the risks of accidents.
    Make sure that the data center operator carries out the preventive work required by its regulations. To do this, ask to see the logs in which all events occurring in the data center are noted and routine equipment maintenance is recorded. There should be several such logs:
    1. Duty shift handover log for the data center.
    2. Data center visitor log.
    3. Log of equipment and valuables brought in and taken out.
    4. Daily inspection log, including the sections:

    a) visual inspection of the data center's technological equipment (doors, hatches, turnstiles, raised floor, service platforms and corridors, appearance of IT equipment);
    b) monitoring of environmental parameters (temperature, humidity);
    c) monitoring of energy consumption (recording the readings of the meters at the input and of the ammeters on the busbars, per phase);
    d) monitoring of water consumption (recording meter readings at the input).
    5. Technical maintenance log for the data center ITIS, which records equipment malfunctions, inspections, maintenance and repair of all infrastructure systems, broken down by main component:

    a) security systems complex (KSB):
    - security and alarm system (SOTS) - records of the scheduled (monthly) functional check, false alarms logged during operation, notes on the replacement of failed components;
    - access control and management system (ACS) - access failures and false alarms logged during operation, notes on the replacement of failed components;
    - screening equipment (DT) - records of the scheduled (monthly) functional check, false alarms logged during operation, notes on the replacement of failed components;
    - television surveillance system (STN) - notes on the replacement of failed components;
    - Central Dispatch Post (DAC) - logged service failures, notes on the replacement of failed components;
    b) fire protection systems complex (KSPZ):
    - automatic fire alarm system (SAPS) - records of the scheduled (monthly) functional check, false alarms logged during operation, notes on the replacement of failed components;
    - fire voice alarm and evacuation control system (SGA) - records of the scheduled (monthly) functional check, notes on the replacement of failed components;
    - automatic gas fire extinguishing system (SAGP) - records of the scheduled (monthly) functional check, system pressure monitoring data, notes on refilling the IHL and on the replacement of failed components;
    - smoke removal and air pressurization system (SDP) - records of the scheduled (monthly) functional check, notes on the replacement of failed components;
    - personal respiratory protective equipment (RPE) - records of the (monthly) check of the manufacturer's seals on self-rescuers, notes on replacement after the expiration date;
    c) communication and telecommunication systems complex (KSSTS):
    - structured cabling system (SCS) - a log of cable connections and records of changes to them;
    - electric clock system (MF) - notes on the replacement of failed components;
    d) electrical equipment systems complex (CSE):
    - protective and technological grounding system (SZ) - data from the scheduled (annual) measurement of parameters, records of connection tightening (performed as necessary, but at least once a year);
    - dedicated power supply system (SVE) - temperature monitoring data for the busbars, measurements of electrical cable parameters (insulation), records of connection tightening (performed as necessary, but at least once a year);
    - guaranteed power supply system (part of the SVE system) - notes on scheduled preventive maintenance (PPR) carried out by a specialized contractor (outsourced) according to its own maintenance schedule;
    - backup power supply system (part of the SVE system) - notes on scheduled preventive maintenance (PPR) carried out by a specialized contractor (outsourced) according to its own maintenance schedule;
    - main electric lighting system (COO) - illumination monitoring data, notes on the replacement of failed components;
    - emergency (standby) electric lighting system (SAO) - illumination monitoring data, notes on the replacement of failed components;
    e) engineering systems complex (CITS):
    - precision air conditioning (microclimate) system in the data center (SPM) - data on temperature, humidity and system pressure, notes on air filter replacement and preventive maintenance of steam generators;
    - ventilation and air conditioning system in data center premises with permanent workplaces (ICS) - data on temperature, air speed and system pressure, notes on air filter replacement and air duct cleaning;
    - process water treatment system (SPW) - water quality monitoring data, notes on refilling the filters and on filter replacement.
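
    When reading the maintenance log, the main thing to verify is that each scheduled check was actually performed within its interval. A minimal sketch of such a check, assuming the log entries have been transcribed by hand into (system, schedule, last check date) tuples; the 31-day and 366-day limits, the sample dates and the overdue_checks helper are illustrative assumptions, not values from any regulation:

```python
from datetime import date, timedelta

# Assumed maximum intervals between checks, based on the "monthly" and
# "annual" schedules mentioned in the list above.
MAX_INTERVAL = {
    "monthly": timedelta(days=31),
    "annual": timedelta(days=366),
}


def overdue_checks(entries, today):
    """Return the systems whose last recorded check is older than allowed.

    entries: iterable of (system, schedule, last_check_date) tuples.
    """
    return [
        system
        for system, schedule, last_check in entries
        if today - last_check > MAX_INTERVAL[schedule]
    ]


# Hypothetical entries transcribed from a maintenance log
log = [
    ("SAPS", "monthly", date(2024, 5, 20)),
    ("SAGP", "monthly", date(2024, 3, 1)),
    ("SZ", "annual", date(2023, 9, 15)),
]
print(overdue_checks(log, today=date(2024, 6, 1)))
```

    A system that shows up in the result is a reason to ask the operator why the scheduled check was missed or not recorded.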

    Customer notification


    No matter how highly qualified the duty shift is and how well the various data center operating procedures have been worked out, emergencies still cannot be avoided entirely. For you as a customer, it is important to be informed promptly of any accident in the data center that could adversely affect your equipment. Timely information reduces the time needed to restore the IT infrastructure. Find out which means of communication (Internet, telephone) are used, on what principle customers are notified, which information systems the operator uses for this, where those systems are located and how soon you will be notified of an emergency.
    It is also worth asking how the customer notification systems are backed up and how notification will work if the entire data center loses power.
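
    The question of how notification works during a total power loss comes down to whether at least one channel is independent of the data center itself. A rough sketch of that check, assuming you have noted for each channel whether it is hosted inside the data center; the channel names and the NotificationChannel type are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class NotificationChannel:
    name: str
    hosted_in_data_center: bool  # True if the channel stops when the site loses power


def survives_blackout(channels):
    """Return the names of channels that keep working if the whole data center is de-energized."""
    return [c.name for c in channels if not c.hosted_in_data_center]


# Hypothetical answers collected from an operator
channels = [
    NotificationChannel("ticket system", hosted_in_data_center=True),
    NotificationChannel("e-mail via on-site server", hosted_in_data_center=True),
    NotificationChannel("SMS via external provider", hosted_in_data_center=False),
]
print(survives_blackout(channels))
```

    An empty result would mean that customers cannot be notified at exactly the moment notification matters most.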
     
    Of course, following all of the above recommendations requires serious study and a good deal of time, but once you have clarified all the details you will most likely be able to identify a good site for hosting your IT infrastructure.
    I wish you success in this difficult search!

    Author: Aleksey Degtyarev, TsODy.RF magazine, issue No. 1
