Interview with Grigory Kornilov (Kaspersky Lab)
We are interviewing Grigory Kornilov, senior service manager at Kaspersky Lab. Grigory provides users with computing infrastructure built on the resources of cloud IaaS providers and is responsible for ensuring that this service complies with strict corporate SLAs.
Grigory will tell us about two years of experience using IaaS and how the company came to the decision to use the resources of an external cloud.
Grigory, how long ago and under what circumstances did Kaspersky Lab turn its attention to cloud services?
Our use of cloud services began 5 years ago with a small project with a company that built a private cloud for us. However, we dealt with this company not so much as a service provider but as a company that hosted our equipment and supported it without any obligations regarding the functionality of the virtualization environment. In effect, we bought the equipment, placed it with a third-party company, and that company became the operator of our equipment. The lack of formal legal responsibility for the functionality of the virtualization environment affected how they treated us and our requests. Neither their motivation nor the speed with which they solved emerging problems suited us.
This made us look toward service providers, i.e. companies that, when working with us, would understand the full scope of their responsibility. At that time we expected a fairly large expansion, and we did not want to increase the number of our administrators, buy our own hardware, or rent rack space. We made some estimates and realized that a service provider would make a better or comparable offer, and our labor costs for infrastructure development would be minimal. I can also note from experience that services usually get cheaper, while employees, on the contrary, get more expensive. So we decided to use a service provider's cloud infrastructure, clearly defining what we are prepared to be responsible for and what the service provider should be responsible for.
Did you consider only Russian cloud providers, or Western players as well?
First of all, of course, we considered Russian providers, since we needed high-quality network connectivity with our infrastructure. Having studied the market, company certifications, and cloud service practices, we realized that this segment of the IT services market is most developed in Moscow and St. Petersburg.
We looked at several European service providers but did not find significant advantages for ourselves. In addition, hosting abroad raises questions about network connectivity, speed, mentality, and the time difference when third-line specialists are in a different time zone. All these factors could adversely affect the level of service support. Therefore, we gave preference to a Russian service provider. As for cost, I would not say that services in the West are much more cost-effective.
Grigory, what is behind the rising cost of employees?
It is difficult to keep in-house the employees who support such a solution. These are very high-level experts. Our company's critical goal is to support the business process of releasing updates, which must work 24/7. This means we would have to invest in a round-the-clock support service staffed with top-class specialists. And the cost of highly qualified specialists in Moscow keeps growing.
Grigory, did you consider renting dedicated physical servers? The debate comparing cloud rental with dedicated server rental is still going on across the Internet.
We did not consider dedicated servers, because our main expense is the system administrators who work not with servers but with virtualization software. These are the most expensive specialists. Therefore, if you use a third-party service provider, the service should certainly include administration of the virtual environment.
We have ample experience renting rack space and acquiring and operating equipment. We know how much a support line costs. Keeping the most valuable part in-house means not really moving toward a service model. Really serious problems rarely happen with equipment whose units and components are duplicated. Therefore, handing only the hosting over to the service provider is inefficient.
Grigory, when considering cloud providers, which parameters did you pay attention to?
We formed our own vision of the service we wanted to receive. We immediately screened out providers who disagreed with our SLA (for example, the requested penalties) or with the division of responsibilities we expected. We also dismissed service providers who did not agree with us on technical issues. Our users are accustomed to working with certain tools within our own infrastructure, so we required the provider to offer identical tools for working with the cloud. Those are the main points.
How many service providers rejected your SLA?
Of the eight, one refused: it was not satisfied with the tools we requested or with the required SLA.
Which tools are we talking about?
We need access to manage resource pools and virtual machines directly in VMware vCenter. We agreed that the service provider would be responsible for the overall virtualization environment, and we would be given restricted administrative rights to specific resource pools so that our administrators could create, connect, and shut down machines on their own. The provider that refused insisted on using VMware vCloud Director, whose management interface differs significantly from the vCenter we already use.
Do I understand correctly that you only considered service providers that also use VMware?
Yes. Our administrators who work with virtual machines (they do not support the virtualization platform itself, they work specifically with virtual machines) are used to vCenter. Retraining them on a different interface or forcing them to work in different environments is an additional and unnecessary cost.
Grigory, how thoroughly did you vet the providers? Did you verify their actual ability to meet the declared SLA?
We began vetting providers by checking their certification with the vendor and their partner status. We looked at the quality of the data centers on which the service would be built and studied each provider's experience with VMware solutions.
A very important point for us was the provider's expertise in data storage systems; in our opinion, storage is the most complex part of the infrastructure. A server failure causes only brief downtime for some virtual machines, because the virtualization environment restarts them on standby servers. A storage failure, however, can bring down almost all virtual servers at once. Sometimes improperly installed software can make the entire storage system unavailable.
Usually a single storage system is used, fully redundant and fault-tolerant in the manufacturer's configuration. In this case, the competence of the staff who manage and maintain this storage is extremely important. Moreover, a partner status reflecting large sales volumes of storage systems is not enough on its own. The storage administrators themselves must be certified.
Did you pay attention to the storage class, its configuration, and disk types?
Yes, we required a storage system of at least mid-range class, so that any maintenance work, such as storage updates, could be carried out without downtime. This was a prerequisite for us.
Based on our own experience with storage systems, we already knew roughly what disk subsystem configuration we needed, and we defined that as a requirement as well. Service providers, for their part, offered additional benefits (caching, solid-state drives).
Which workloads did you move to the cloud?
Our company has a business process for releasing updates to the modules and anti-virus databases of our anti-virus products. This process is critical: a pipeline running 24x7 ensures that each update is tested on all supported platforms and delivered to our users on time.
Does the entire staff of specialists releasing these updates work around the clock?
Yes, round-the-clock teams of system administrators, developers, and virus analysts are involved in the process. If this process stopped, our products would provide lower-quality anti-virus protection, and that must not be allowed.
What does the resource request process look like for your internal customers?
For an internal user, it makes no fundamental difference where the necessary resources come from. The user can send a request to our internal IT support service to expand an existing resource pool or to create a new one. The user knows that the cost our company pays for resource pools will be allocated by user, and internal billing will be carried out accordingly.
I would also note that internal users are warned that expanding resources beyond a certain level takes a certain amount of time. In addition, only a designated contact person may contact the external service provider for additional resources.
How strict are the parameters you included in the service provider's SLA?
We started from the principle that the SLA with an external service provider should be stricter than our SLA with our internal users.
2 hours is the maximum downtime we allow the service provider, given that this is not our only environment. We have our own computing capacity, which can offset the risks associated with downtime.
If downtime exceeds two hours, penalties apply. Downtime of more than two days results in a penalty equal to 100% of the monthly fee.
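The two thresholds above can be expressed as a small classifier. This is only a sketch: the interview does not say whether downtime is counted per incident or accumulated over the month, and the penalty schedule between 2 hours and 2 days is not disclosed, so the function reports which tier applies instead of computing an amount. The tier names are hypothetical.

```python
from datetime import timedelta

# Thresholds from the interview; tier names are hypothetical labels.
PENALTY_THRESHOLD = timedelta(hours=2)   # downtime above this triggers penalties
FULL_FEE_THRESHOLD = timedelta(days=2)   # downtime above this costs 100% of the monthly fee


def penalty_tier(downtime: timedelta) -> str:
    """Classify a downtime duration into the SLA tiers described above.

    Assumption: downtime is measured as a single duration (per incident or
    per month, whichever the contract specifies). The penalty amount between
    the two thresholds is contract-specific, so only the tier is returned.
    """
    if downtime <= PENALTY_THRESHOLD:
        return "within_allowance"
    if downtime <= FULL_FEE_THRESHOLD:
        return "penalty_per_contract"
    return "full_monthly_fee"


if __name__ == "__main__":
    for d in (timedelta(minutes=90), timedelta(hours=5), timedelta(days=3)):
        print(d, penalty_tier(d))
```

Using strict "greater than" comparisons mirrors the wording "more than two hours" and "more than two days", so exactly 2 hours is still within the allowance.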
Have you assessed the results of running your infrastructure in the cloud?
Of course. We collected service availability statistics and discussed them with our internal business customer. The quality of service was rated excellent. The only request was to add the same service from a second provider with the same quality, in order to reduce dependence on a single service provider.
How do you measure service availability? Do you use any monitoring tools?
Of course, service availability is tracked by our monitoring systems, and we use them to report to the internal customer. Problems can arise not only at the service provider but also at the junction of the infrastructures or on our side. It is worth noting that our internal SLA is less strict than the service provider's SLA.
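The interview does not name the monitoring tools used, but the availability figure reported to a customer is usually the share of the reporting period during which the service was up. A minimal sketch, assuming monitoring records each outage as a start/end pair; overlapping alerts are merged so the same downtime is not counted twice.

```python
from datetime import datetime, timedelta


def availability(period_start, period_end, outages):
    """Return availability over [period_start, period_end) as a fraction.

    outages: list of (start, end) datetime pairs recorded by monitoring
    (an assumed input format). Outages are clipped to the reporting
    period and overlapping ones are merged before summing downtime.
    """
    total = period_end - period_start
    clipped = sorted(
        (max(s, period_start), min(e, period_end))
        for s, e in outages
        if e > period_start and s < period_end
    )
    downtime = timedelta(0)
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Disjoint from the current merged interval: close it out.
            if cur_end is not None:
                downtime += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlaps the current interval: extend it.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        downtime += cur_end - cur_start
    return 1 - downtime / total


if __name__ == "__main__":
    june = (datetime(2024, 6, 1), datetime(2024, 7, 1))
    alerts = [
        (datetime(2024, 6, 3, 10, 0), datetime(2024, 6, 3, 11, 0)),
        (datetime(2024, 6, 3, 10, 30), datetime(2024, 6, 3, 12, 0)),
    ]
    print(f"{availability(*june, alerts):.4%}")  # 99.7222%
```

Since problems can also occur at the junction of infrastructures or on the customer's own side, one option is to tag each outage with its cause and compute provider-attributable availability separately from end-to-end availability.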
Are you planning to migrate other services to the cloud?
We have no such plans at the moment. I would note that our own infrastructure is likely to remain with us. We need in-house expertise and a minimum level of independence.
Did you set any additional requirements for the service provider that we have not discussed?
Yes, for example, integration of service desk systems. We did not accept an arrangement in which someone had to go somewhere and do something manually to submit a request. We required the IT service desk systems to be integrated through the exchange of e-mail messages in a specific format.
A user contacts our company's IT service desk. In our system, a support specialist assigns the request to a specific group. At the same time, a request is automatically sent to an agreed mailbox at the service provider. The service provider replies in a specific format so that its response is linked to the original request in our IT service desk system and is immediately visible to our user. In this way, we speed up interaction between the service provider and the user and reduce the load on our first-line support.
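The actual message format agreed with the provider is not described. A common way to make replies attach to the original request is to embed the ticket number in the subject line and parse it back out of the reply. The sketch below assumes a hypothetical `[SD-<number>]` tag and example addresses; only the standard-library `email` and `re` modules are used.

```python
import re
from email.message import EmailMessage
from typing import Optional

# Hypothetical tag format; the real format agreed with the provider is not public.
TICKET_TAG = "[SD-{id}]"
TICKET_RE = re.compile(r"\[SD-(\d+)\]")


def build_provider_request(ticket_id: int, summary: str, body: str,
                           sender: str, provider_mailbox: str) -> EmailMessage:
    """Compose the e-mail sent to the provider's mailbox when a ticket is assigned to it."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = provider_mailbox
    # The tag in the subject lets the reply be matched back to our ticket.
    msg["Subject"] = f"{TICKET_TAG.format(id=ticket_id)} {summary}"
    msg.set_content(body)
    return msg


def ticket_id_from_reply(subject: str) -> Optional[int]:
    """Extract our ticket number from the provider's reply subject, or None if absent."""
    match = TICKET_RE.search(subject)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    req = build_provider_request(
        12345, "Expand resource pool", "Please expand the resource pool.",
        "servicedesk@example.com", "support@provider.example",
    )
    print(req["Subject"])  # [SD-12345] Expand resource pool
    print(ticket_id_from_reply("RE: [SD-12345] Expand resource pool"))  # 12345
```

Matching on a tag in the subject keeps the integration independent of either side's service desk product, which fits the e-mail-based approach described in the interview.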
Grigory, what advice can you give colleagues at other companies who are now considering using the cloud?
First of all, you need to agree with your security service on what data you are willing to hand over to the service provider. The security service will determine what can be handed over and what cannot. We carried out this risk assessment and imposed certain restrictions on how we use this service.
Interviewed by Sergey Chukanov, Development Director of IT-GRAD