So, you are trying to assess the reliability of your cloud service

Published on February 27, 2014

    An SLA (Service Level Agreement) is a form of service reliability guarantee often offered by service providers. Typically, an SLA is a take-it-or-leave-it offer: either you are satisfied and use the service, or you look for another one. A typical wording is “industry leading 99.95% monthly uptime SLA”, which seems to suit most users.

    Typically, a potential user who reads about the "99.95% monthly uptime SLA" is very pleased - a guarantee of no more than about 21 minutes of downtime in a 30-day month sounds pretty promising.
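
    The 21-minute figure follows from simple arithmetic. The sketch below reproduces it; the function name is made up for illustration, and the 30-day month is the same assumption the text makes.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime per period that still satisfy the uptime percentage."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    # Compare a few common SLA levels over a 30-day month.
    for pct in (99.9, 99.95, 99.99):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% uptime -> {minutes:.1f} minutes of downtime allowed")
```

    For 99.95% this gives about 21.6 minutes, which is the "21 minutes" quoted above.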

    Everything is relatively simple as long as you only consume the cloud service for your own needs. You look at 99.95%, think about no more than 21 minutes per month, and are impressed and satisfied. But what if you build your own service on top of another service and have to decide which SLA you can offer?

    Take, for example, an image processing service (suspiciously similar to the ABBYY Cloud OCR SDK). What SLA can such a service offer? It would seem that you need to take all the dependencies on other services, carefully read their SLAs, look at the number of nines, and decide how many nines after the decimal point you can write in your own SLA.

    Suppose the image processing service runs on Windows Azure and uses the so-called web and worker roles from Azure Cloud Services to execute code and Azure Storage to store data. Fine. We open the Cloud Services SLA and see that (TL;DR) role instances are guaranteed to be available 99.95% of the time each month, provided that each role has at least two instances. We open the Azure Storage SLA and see that (TL;DR) at least 99.9% of storage requests are guaranteed to be processed successfully. If the service level falls short of the guarantee, you need to contact support, and then the provider will refund part of the money.

    This was a very brief summary of the SLAs of these two services. If you use either of them, you should read the SLAs carefully and take all the caveats into account.

    The following is fundamentally important: even in the worst case, only a relatively small amount of money will be returned to you, which will cover ... well, it will not cover anything, because it is tied to a fraction of the cost of the service consumed, and the cost of using cloud services is very low compared to, for example, the pay of the employee who will be contacting the provider's support. The meaning of an SLA with three nines is very simple: “dear users, this is a very reliable service, we try very hard, TAKE IT and use it, we will bill you by the 10th of next month”. An SLA essentially sets expectations for the service, which is also very important. If availability were guaranteed for, say, only 15% of the time, expectations for the service would be fundamentally different.

    Now back to the question of what guarantees you can give your own users if your service depends substantially on other services with the SLAs described above. It seems that the machines on which the code runs are guaranteed to be available at least 99.95% of the time. Some storage requests may fail, but since we are talking about no more than a tenth of a percent, that is not scary: you will have to design the service so that failed storage operations are retried several times with increasing pauses, and if an operation still fails after several attempts, the user's request is failed - if this does not happen too often, the user will be completely satisfied.
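
    Retrying with an increasing pause could look roughly like the sketch below. It is only an illustration: `call_with_retries`, its parameters and the doubling delay are assumptions rather than anything taken from the Azure SDK, and real code would catch only the transient error types of its storage client instead of every exception.

```python
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on failure wait base_delay, 2*base_delay, ... and retry.

    `operation` is any zero-argument callable (hypothetical storage call).
    `sleep` is injectable so the pauses can be skipped in tests.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                # Out of attempts: re-raise so the caller can fail the user's request.
                raise
            # Pause doubles after every failed attempt.
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_storage_read():
        # Simulated storage call that fails twice and then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise IOError("transient storage error")
        return "blob contents"

    print(call_with_retries(flaky_storage_read, base_delay=0.01))
```

    If every attempt fails, the last exception propagates, which corresponds to failing the user's request as described above.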

    Accordingly, after a few meetings and two weeks of correspondence with everyone on CC, we can multiply everything together and decide that we can promise, for example, that the service is operational 99.9% of the time during the month. Having formulated such an SLA, we tell our users “our service is reliable, use it, everything will be fine, and if not, we will fix it very quickly, DON'T PANIC”.
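
    "Multiplying everything" can be sketched as follows. This is a rough model rather than how either SLA is actually measured: it assumes the dependencies fail independently, treats the per-request Storage figure as if it were an availability, and assumes retries fail independently of each other (optimistic, since real outages tend to be correlated). The function names are made up for illustration.

```python
from math import prod


def composite_availability(*parts: float) -> float:
    """Probability that all parts work at once, assuming independent failures."""
    return prod(parts)


def success_with_retries(success_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - success_rate) ** attempts


if __name__ == "__main__":
    compute = 0.9995  # Cloud Services figure quoted above
    storage = 0.999   # Storage request success rate quoted above

    naive = composite_availability(compute, storage)
    retried = composite_availability(compute, success_with_retries(storage, 3))
    print(f"naive product:       {naive:.4%}")
    print(f"with 3 storage tries: {retried:.4%}")
```

    Under these assumptions the naive product is roughly 99.85%, and it only climbs back toward 99.95% once storage retries are factored in, which is why the retry logic matters for the promised figure.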

    You publish such an SLA, and some time later, VERY UNEXPECTEDLY ...

    ... you realize that you very urgently need to publish a fix for some extremely annoying bug. Or you need to change settings at the infrastructure level. Or the service itself notices that the load has increased and decides that it is time to issue a scaling command.

    All these actions rely on an additional management service in the cloud infrastructure (you may be using a portal that runs on top of such a service, or a program that sends calls to it). This is a very important service: it is thanks to its existence that clouds are so flexible and convenient to use. And precisely at this very important moment, when you very, very urgently need to do something, this very important service refuses to process your request.

    In numerous presentations, screencasts and tutorials, you see this service used left and right for deploying new virtual machines, publishing a package with the service's contents and many other operations. No one tells you one important thing: this service is your only means of managing the cloud. As soon as something goes wrong with the management service, you potentially have very big problems.

    Back to the wording of our SLA. Obviously, you need to somehow anticipate operations such as scaling and publishing updates, and take them into account in your SLA. And yes, our service has to process a large (and unknown in advance) number of user images quickly enough, and for that it must be able to scale. And these necessary operations require the “auxiliary” management service.

    It is logical then to look at the SLA of this management service in order to understand what to expect from it.

    In Windows Azure, the Management API is used to manage the infrastructure (the management portal and cmdlets also work through it). So, open the SLA of the Management API service and ...

    ... but no, you will not be able to read this document, because it simply does not exist. And Amazon EC2 does not have an SLA for its infrastructure management service either.

    Wait ... OH SHI ~

    Yes, we very nearly overlooked the complete lack of an SLA for a service on which our own service depends significantly. It is not just about code updates (which, it would seem, can wait, but in fact sometimes need to be published very urgently) - the ability to scale is needed constantly.

    Why is there no SLA for the management service? One can only speculate.

    We can assume that it is not so easy to make the cloud management infrastructure reliable enough. It is one thing to promise that a particular virtual machine will be accessible over the network, and quite another to promise that it will definitely be possible to scale out to a few more nodes.

    Alternatively, we can assume that users do not regard the management service as important and are quite satisfied with the current SLA wording for the “main” services.

    Or we can assume both at the same time (and you can hold the bread).

    In any case, cloud providers still have room to improve their services, and users should pay closer attention to the dependencies of their own services. Otherwise, the impressive number of nines after the decimal point is of no use.

    Dmitry Meshcheryakov,
    product department for developers