How to use Yandex.Cloud interruptible virtual machines to save on large-scale workloads
Today we want to talk about a useful Yandex.Cloud feature: interruptible virtual machines. This is an option you can select when creating a virtual machine to get computing resources at a reduced price. What makes interruptible virtual machines special, why are they cheaper than regular ones, and when does it make sense to use them?
The capacity of Yandex.Cloud, or more precisely of the Yandex Compute Cloud infrastructure service, is noticeably greater than what users actually consume. By default, users are assumed to be able to scale arbitrarily. For this reason alone, leaving other aspects aside, the platform's available resources significantly exceed current demand. Interruptible virtual machines are created on this spare capacity.
Main limitations
In short, interruptible virtual machines can be described as follows: the service offers its spare computing resources at a lower price, on the condition that these resources can be reclaimed at any time.
In general, interruptible virtual machines work like regular virtual machines, but with a number of limitations:
- They are not covered by a service level agreement (SLA).
- The ability to create and start them is not guaranteed.
- They can be forcibly stopped at any time. The probability of a stop is small but non-zero; it can change over time and vary between Yandex.Cloud availability zones.
- An interruptible virtual machine cannot be converted into a regular one, nor a regular one into an interruptible one. The corresponding flag is set once and does not change.
- The machine is guaranteed to be stopped within 24 hours of being started.
In practice, in the vast majority of cases interruptible virtual machines run for the full 24 hours allowed by the service terms. A forced stop, as a rule, occurs only when a large number of regular virtual machines are created in a specific availability zone within a short period: a new user with substantial needs appears, or existing users scale up en masse.
A stopped virtual machine can be started again: all data on its disks is preserved after both automatic and manual shutdown.
Use cases
These limitations raise a logical question: how can you use such machines if their resources can be reclaimed at any time? Here are a few possible use cases.
Batch processing
Batch processing involves the parallel execution of a large number of resource-intensive tasks, such as file format conversion, image processing and recognition, or ETL operations. The key point is that batch processing has a job queue and a pool of worker processes (executors) that pull jobs from the queue. If an executor running on an interruptible machine stops, its job is simply handed to the next executor. In other words, stopping one or even several virtual machines has no significant negative impact on the process or on the result.
Batch data processing typically involves dozens of virtual machines, so interruptible machines yield very noticeable savings. One of the main consumers of high-performance 32-core interruptible virtual machines today is Seismotek, a long-time Yandex.Cloud client. Seismotek processes the seismic data needed for oil and gas exploration. Seismic exploration involves large volumes of information, and the data is processed in batches. The company runs more than 60 interruptible machines at a time: up to 2000 vCPUs and 4000 GB of RAM in total.
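The queue-and-executors scheme described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function `run_batch`, the worker names, and the `interrupt_prob` value are hypothetical and chosen for illustration, and a real system would use a proper message queue rather than an in-memory deque.

```python
from collections import deque
import random


def run_batch(jobs, workers, interrupt_prob=0.2, seed=0):
    """Process every job; a job whose worker is interrupted returns to the queue."""
    if not 0 <= interrupt_prob < 1:
        raise ValueError("interrupt_prob must be in [0, 1)")
    rng = random.Random(seed)          # seeded so the simulation is repeatable
    queue = deque(jobs)
    done = []                          # (job, worker) pairs that completed
    requeued = 0
    while queue:
        job = queue.popleft()
        worker = rng.choice(workers)   # any executor may pick up the job
        if rng.random() < interrupt_prob:
            # Hypothetical interruption: the VM stopped mid-job,
            # so the job goes back to the queue for another executor.
            queue.append(job)
            requeued += 1
            continue
        done.append((job, worker))
    return done, requeued


done, requeued = run_batch(jobs=range(10), workers=["vm-1", "vm-2", "vm-3"])
print(f"{len(done)} jobs finished, {requeued} retries after interruptions")
```

However many interruptions occur, every job ends up in `done` exactly once; interruptions only add retries, which is why the overall result is unaffected.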
Projects on Hadoop
Hadoop is used to develop and run distributed programs on clusters of hundreds or thousands of low-cost nodes. Its file replication and its automatic restart of tasks from failed nodes make a distributed system resilient to the failure of individual machines. That is why, wherever Hadoop is used, at least some of the nodes can easily be deployed on interruptible virtual machines. If they stop early, their tasks are sent to other nodes.
Web service fault tolerance
The continuous availability of a web service can be ensured with a cluster of two or more servers. For web services, one of the cluster's jobs is to keep the service stable during peak loads. Typical examples are online stores and sports sites, where traffic growth is tied to specific dates. For stores, these are traditional holidays or sales periods; for sports sites, they are event days, when live broadcasts, reviews and photo reports are published. At such moments traffic can increase significantly.
The cluster must cope with the influx of visitors by distributing traffic across its nodes. During a sharp but short-lived load increase, fault tolerance can be provided by adding servers on interruptible virtual machines. This option is inexpensive and does the job well. One condition is important: the cluster must be hybrid, that is, it must also include regular virtual machines. Then even the unlikely stop of the interruptible machines will not bring the service down.
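One way to reason about such a hybrid cluster is to size the regular machines for the everyday load and cover only the peak with interruptible ones. The sketch below assumes exactly that sizing rule, which is our illustration rather than a rule from the service; the function `plan_cluster` and all the numbers are hypothetical.

```python
import math


def plan_cluster(baseline_rps, peak_rps, rps_per_node):
    """Return (regular_nodes, interruptible_nodes) for a hybrid cluster.

    Assumption: regular nodes alone must carry the baseline load, so the
    service survives even if every interruptible node stops at once.
    """
    if rps_per_node <= 0:
        raise ValueError("rps_per_node must be positive")
    regular = max(1, math.ceil(baseline_rps / rps_per_node))
    total_at_peak = max(regular, math.ceil(peak_rps / rps_per_node))
    return regular, total_at_peak - regular


regular, interruptible = plan_cluster(baseline_rps=900, peak_rps=2500, rps_per_node=300)
print(f"regular nodes: {regular}, interruptible nodes for the peak: {interruptible}")
```

With these made-up figures the plan is 3 regular nodes plus 6 interruptible ones: losing all six degrades peak capacity but leaves the baseline fully served.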
Projects on Kubernetes
Kubernetes automates the deployment, scaling, and management of containerized applications across a large number of nodes. One of its main entities, which can be called the building block of Kubernetes, is the pod. A pod runs one or more containers on a single node. The Kubernetes scheduler selects and assigns a node for each pod. If a node with a running pod fails, the pod is automatically moved to a node that is operating normally. This scheme means that some of the nodes can be hosted on interruptible virtual machines.
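The rescheduling idea can be shown with a toy model. This is not how the Kubernetes scheduler is implemented and it calls no Kubernetes API; the function `reschedule`, the least-loaded placement rule, and the pod and node names are all assumptions made for illustration.

```python
def reschedule(assignments, failed_node, healthy_nodes):
    """Move pods from failed_node to the least-loaded healthy nodes.

    assignments maps pod name -> node name; a new mapping is returned
    and the input is left unchanged.
    """
    if not healthy_nodes:
        raise RuntimeError("no healthy nodes left to receive pods")
    result = dict(assignments)
    # Count the pods already running on each healthy node.
    load = {node: 0 for node in healthy_nodes}
    for node in result.values():
        if node in load:
            load[node] += 1
    # Reassign every pod from the failed node, one at a time.
    for pod, node in sorted(result.items()):
        if node == failed_node:
            target = min(load, key=lambda n: (load[n], n))  # ties broken by name
            result[pod] = target
            load[target] += 1
    return result


pods = {"web-1": "regular-1", "web-2": "interruptible-1", "web-3": "interruptible-1"}
print(reschedule(pods, "interruptible-1", ["regular-1", "regular-2"]))
```

In this example the two pods from the stopped interruptible node are spread over the two regular nodes, while `web-1` stays where it was.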
Continuous integration testing
Continuous integration is based on frequently building and testing the project, mostly with automated tests. Schematically, it looks like this: a test environment is created on a virtual machine, the latest build of the application is uploaded to it, the automated tests run, the test results are exported, and the virtual machine is deleted. As a rule, testing takes a few tens of minutes, less often several hours.
Traditionally, the weak points of continuous integration are the significant cost of supporting the integration process itself and its high demand for computing resources. From this point of view, and given how long automated tests take, interruptible virtual machines are more than suitable for continuous integration. They are much cheaper, and the likelihood of a machine stopping right in the middle of a test run is vanishingly small. Moreover, even if a machine does stop, the damage to the business is minimal.
Use in conjunction with other Yandex.Cloud services
The Yandex Instance Groups service lets you automatically monitor the state of a whole group of interruptible VMs. It can create virtual machines with the specified characteristics on its own, maintain the required number of machines in the group, and restart interruptible instances when they stop. It doesn't matter whether a forced stop occurred or 24 hours passed since the start. Only one thing matters: the restart happens only if resources are available. Yandex Instance Groups makes working with interruptible virtual machines more convenient, but it cannot guarantee that there will be free capacity in a specific availability zone.
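The behaviour described above resembles a reconciliation loop: compare the desired group size with the actual state and fix the difference as far as capacity allows. The sketch below is a simplified model of that idea, not the Yandex Instance Groups API; the function `reconcile`, the state names, and the `free_capacity` counter are hypothetical.

```python
def reconcile(group, target_size, free_capacity):
    """One pass of a hypothetical group controller.

    group maps VM name -> "running" or "stopped"; free_capacity is how many
    more VMs the zone can start. Returns (new group state, capacity left).
    """
    state = dict(group)
    # First restart stopped VMs, but only while the zone has resources.
    for name in sorted(state):
        if state[name] == "stopped" and free_capacity > 0:
            state[name] = "running"
            free_capacity -= 1
    # Then create missing VMs until the group reaches its target size.
    next_id = 1
    while len(state) < target_size and free_capacity > 0:
        name = f"vm-{next_id}"
        next_id += 1
        if name in state:
            continue                    # name already taken, try the next one
        state[name] = "running"
        free_capacity -= 1
    return state, free_capacity


group = {"vm-1": "running", "vm-2": "stopped", "vm-3": "stopped"}
print(reconcile(group, target_size=4, free_capacity=1))
```

With capacity for only one machine, the pass restarts `vm-2` and leaves `vm-3` stopped until resources appear, which mirrors the caveat that a restart is not guaranteed.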
Economic indicators
As we mentioned, interruptible virtual machines reduce the cost of computing resources. Inside Yandex, we started implementing a similar capability several years ago. Dividing computing tasks into guaranteed and interruptible ones required considerable investment. But it was not in vain: in the end, we raised the useful utilization of our server infrastructure from 30-40% to 70-80%.
Now similar capabilities are available to all Yandex.Cloud users at the click of a button. A simple example: if you move half of your virtual machines running at 100% core load to the interruptible format, you can save up to 35-40% of the budget.
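The arithmetic behind such an estimate is simple: the saving is the share of machines moved, times the discount, times the share of the bill that the discount applies to. In the sketch below the discount values of 70% and 80% are not published prices; they are assumptions picked to show what discount would produce the 35-40% figure if compute were the whole bill. The function name `budget_saving` is likewise hypothetical.

```python
def budget_saving(moved_fraction, discount, compute_share=1.0):
    """Fraction of the total budget saved by moving VMs to interruptible mode.

    moved_fraction: share of VMs switched to interruptible mode (0..1)
    discount:       assumed price reduction on CPU and RAM (0..1)
    compute_share:  share of the bill that is CPU and RAM; disks and IP
                    addresses are billed at regular rates and get no discount
    """
    for name, value in (("moved_fraction", moved_fraction),
                        ("discount", discount),
                        ("compute_share", compute_share)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return moved_fraction * discount * compute_share


for discount in (0.7, 0.8):            # assumed discounts, not a price list
    saving = budget_saving(moved_fraction=0.5, discount=discount)
    print(f"discount {discount:.0%} -> budget saving {saving:.0%}")
```

Lowering `compute_share` below 1.0 shows how disk and IP address costs, which are not discounted, dilute the overall saving.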
CPU and RAM resources are available at a reduced price. Disk space and IP addresses are billed at regular rates. Here is what a simple calculation shows for the Cascade Lake platform.
If you wish, you can compare the cost of running virtual machines in different modes yourself using the calculator.
We hope we have brought some clarity and shown useful examples of when you can use interruptible virtual machines to cut computing costs without sacrificing the quality of the results.