Graceful degradation. A Yandex.Taxi talk

    Services need to be written so that at least minimal functionality is always preserved, even when critical components fail. Ilya Sidorov, head of one of the backend product development teams at Yandex.Taxi, explained in his talk how we still let the user order a car when parts of the system are down, and by what logic we switch to simplified versions of the service.


    It is important to write not only services that work well, but also services that break well.

    - I am very glad to see you all. Today I will talk about graceful degradation. If you search for this term in Yandex, you will most likely find advice on how to make your site work without JS. I will talk about something a little different: graceful degradation applied to the backend.



    Let's start with a definition. What does it look like in reality?



    This is how our Yandex.Taxi application looks when one of the services is down: the service for choosing the destination point the driver should take you to. As you can see, there is no big “Order a taxi” button on this screen, which means the user cannot use the service at all. But we can try to degrade and let the user skip choosing point B.

    Then the user will not be able to find out the exact price of the trip and we will not be able to build a route, but the “Order a taxi” button will be there and the user will be able to use our service. The main function of our application will stay available. That is what I want to talk about today: how to degrade properly and what can be done with a service that has broken down.

    Here is the plan of the talk. First, I'll cover what to do with a broken service: you can turn it off, or you can switch it to a different behavior. Then I will explain how to understand when it is time to turn the service off. And at the end I will talk about a few nuances we had to deal with when we built the automatic degradation system for Yandex.Taxi.

    So what can be done with a service that is broken? You can turn the functionality off. If the service that predicts individual destination points does not work, you turn it off. If the chat between the driver and the passenger does not work, you turn off the chat. If you cannot order a car, then you turn off the “Order a car” button - oh no, that does not work. Not all functionality can be turned off. And if something cannot be turned off, you need a different approach. For example, you can provide a stub or simplified functionality. In Yandex we call such simplified behavior a pumpkin - we say that the service has turned into a pumpkin.

    Consider these solutions in more detail.



    How do you disable services? One way is to get the architecture right. Suppose we have one monolithic service. If one of its parts fails, the whole service breaks. But if we divide the service into parts, so that clients use different services for different requests, things get much better.

    How does this work in practice? There is the Yandex.Taxi service, which has two main functions: ordering a taxi and chatting with the driver. As long as we have one monolithic backend, a failure in the chat with the driver also affects the basic functionality of ordering a taxi.





    What can you try to do? Divide the monolithic service into two parts. One part will be responsible for ordering a taxi, and the other - for communicating with the driver.

    Now everything looks much better. If the chat with the driver breaks, then everything else continues to work correctly.



    As you can see, the client uses different APIs, different requests to make an order and communicate with the driver.

    But in reality things are not so good yet, because there is a parasitic dependency between the chat service and the order service. It may happen that the order service calls the broken chat service, and then the main functionality will not work either.



    In this version everything is much better. The parasitic dependency is gone, and now our services are genuinely independent of each other. So even if the chat service breaks, you can still order a taxi.

    The conclusion is this: if you want to degrade by splitting services, it is very important that the services are independent of each other. This means they must have different entry points, different endpoints; they must have different runtimes; and of course they must use different databases. Otherwise one broken service can break all the other services along the chain. A minimal sketch of such a split is shown below.
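    As a rough sketch of what "independent services" can mean in code, here are the two halves of the split as separate web applications, each with its own entry point and its own database. The framework, routes and file names are illustrative assumptions, not the real Yandex.Taxi setup.

```python
# A sketch of the split: two independent services, each with its own entry
# point and its own database. In reality these run as separate processes
# (and codebases); they are shown together here only for brevity.
import sqlite3
from flask import Flask, jsonify

# --- order_service.py ----------------------------------------------------
orders_app = Flask("orders")
orders_db = sqlite3.connect("orders.db", check_same_thread=False)  # orders-only storage

@orders_app.route("/v1/orders", methods=["POST"])
def create_order():
    # Touches only orders.db and never calls the chat service, so a chat
    # outage cannot break the "Order a taxi" button.
    return jsonify({"status": "created"}), 201

# --- chat_service.py -----------------------------------------------------
chat_app = Flask("chat")
chat_db = sqlite3.connect("chat.db", check_same_thread=False)  # chat-only storage

@chat_app.route("/v1/chat/messages", methods=["POST"])
def send_message():
    # Touches only chat.db; if this service goes down, only the chat degrades.
    return jsonify({"status": "sent"}), 201
```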



    Well, we have figured out how to disable functionality. Now let's see how to provide default functionality, how to make a pumpkin. On this screen is our destination point prediction service. It uses smart AI to predict the destination point that is best for the user at the moment. And if the AI is tired, we use the default behavior and simply suggest that the user leave Moscow.

    Let's see how this works in practice.



    We have a client; it contacts the destination point service and gets an error.



    Two situations are possible now. The first: the failure is an isolated one, just a single failed request. In that case we simply return the error to the client; the client retries the request and gets its favorite destination points.

    But if the failure is massive, we turn on the pumpkin and the user gets the default behavior.



    Such hard-coded behavior is much easier to implement, and this pumpkin is very reliable, so it keeps working even when the AI fails. And if we know that users often go to airports, they will hardly notice any deterioration in their experience.



    Even if the degradation mode is on and the pumpkin is enabled, but the user contacts the service and receives a successful response, we use that response, not the pumpkin. And this behavior - using the response when we get one and the pumpkin when we get an error - is what we call the fallback mode.



    No error: a successful response. An error: a pumpkin. We say that the fallback has been enabled.
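    As a minimal sketch of this fallback behavior, here is what a client-facing wrapper around the suggestion service might look like. The endpoint, the timeout and the hard-coded destinations are illustrative assumptions, not the real Yandex.Taxi code.

```python
import requests

# Hypothetical hard-coded "pumpkin": popular destinations such as airports.
PUMPKIN_DESTINATIONS = ["Sheremetyevo Airport", "Domodedovo Airport", "Vnukovo Airport"]

def get_destination_suggestions(user_id: str) -> list[str]:
    try:
        resp = requests.get(
            "https://suggest.example.net/v1/destinations",  # assumed endpoint
            params={"user_id": user_id},
            timeout=0.2,  # fail fast so the order flow is not blocked
        )
        resp.raise_for_status()
        # Successful response: always prefer the real answer over the pumpkin.
        return resp.json()["destinations"]
    except requests.RequestException:
        # Error: serve the pumpkin so the "Order a taxi" flow keeps working.
        return PUMPKIN_DESTINATIONS
```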

    So, we have covered what can be done with a service that broke: you can turn it off, or you can turn on the pumpkin. Now let's move on to the second part and figure out how to diagnose the breakage.

    We have two big questions to answer. The first: when do we turn the service off and turn the pumpkin on? The second: when do we turn the pumpkin off and switch the service back on? Before answering these questions, one point needs to be clarified.



    In any complex system that interacts with a large number of agents there is always some background level of errors. On this slide we see a real graph of requests to one of our services. It reaches several thousand RPS, and errors account for a little less than 1%. The scale is logarithmic.

    Errors can be caused by different things. Maybe it is some internal process, a database update or just background jobs. Maybe clients send malformed requests. The fact remains: we will always have a background level of errors. Let's accept that and move on.



    So, we use a solution based on statistics. We have a special database in which we store statistics: the number of successful requests, the number of requests with errors and the number of requests for which the fallback was used. We accumulate these statistics for our service over a sliding time window. When the share of requests with errors in this sliding window exceeds a certain threshold, we enable the fallback. And when the share of errors drops back below the threshold, we turn it off.
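    Here is a rough in-process sketch of such a sliding-window switch. In production the statistics live in a shared database rather than in memory, and the window size and threshold below are made-up numbers for illustration.

```python
import time
from collections import deque

class FallbackSwitch:
    def __init__(self, window_seconds=60.0, error_threshold=0.05, min_requests=100):
        self.window_seconds = window_seconds
        self.error_threshold = error_threshold   # e.g. 5% of requests failing
        self.min_requests = min_requests         # ignore statistically tiny samples
        self.events = deque()                    # (timestamp, is_error) pairs

    def record(self, is_error: bool) -> None:
        self.events.append((time.monotonic(), is_error))

    def _trim(self) -> None:
        cutoff = time.monotonic() - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def fallback_enabled(self) -> bool:
        self._trim()
        total = len(self.events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, is_error in self.events if is_error)
        # The fallback turns on when the error share in the sliding window
        # exceeds the threshold, and turns off once it drops back below it.
        return errors / total > self.error_threshold
```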

    Pay attention to the highlighted areas. At 19:01 the first errors start to appear, but their share is still small, so until 19:02 we do not enable the fallback. At 19:02 the threshold is exceeded and we enable the fallback. At 19:08 the reverse process starts: the errors have stopped, but the fallback stays on for a while, because the threshold in our sliding window is still exceeded. At 19:09 we turn the fallback off.

    So we have worked out when to turn the service off. Now the second question: when to turn it back on. It's simple: we use the same statistics-based solution.



    It is important that we do not remove the load from the service even when the degradation mode is enabled. This is what lets us keep collecting statistics even while we show the user the pumpkin. That way we can detect that the errors have stopped and the service has been fixed, and it can be switched back on in full.
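    Putting the two previous sketches together, a request handler might look roughly like this: it always calls the upstream service and records the outcome, serves the pumpkin only when the call fails and the failure is massive, and lets an isolated failure surface as an error so the client can retry. The helper names (fetch_from_suggest_service, FallbackSwitch, PUMPKIN_DESTINATIONS) are the illustrative ones introduced above, not real Yandex.Taxi code.

```python
switch = FallbackSwitch()

def handle_suggest_request(user_id: str) -> list[str]:
    try:
        # The upstream call is made unconditionally: the load is never removed,
        # so the statistics keep flowing even while the fallback is enabled.
        destinations = fetch_from_suggest_service(user_id)  # assumed raw upstream call
    except Exception:
        switch.record(is_error=True)
        if switch.fallback_enabled():
            return PUMPKIN_DESTINATIONS   # massive failure: degrade gracefully
        raise                             # isolated failure: let the client retry
    switch.record(is_error=False)
    return destinations                   # a successful answer always wins over the pumpkin
```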



    When we talk about degradation, it is impossible not to mention monitoring. Good monitoring is half the battle, half the way to automatic shutdown or automatic degradation. It is important to understand what problems our service has in general, what the nature of the errors is and how often they occur. At the first stage we may not even need a circuit breaker: if a monitoring alert fires, we can simply turn the service off manually, and when the alert clears, we turn the service back on.

    If we build automatic degradation, an automatic switch, then it is important to monitor the fallback itself. If the degradation system works well enough, users may not even notice that something has broken, and without monitoring we may not notice it ourselves. It is important to monitor the fallback, to know when it turns on and off, so that we have statistics and can understand how long the functionality was unavailable and whether our backend is getting better or worse over time, depending on how much time it spends in fallback.

    That's all for the main part.

    To finish, I would like to tell you about a few nuances we had to face while developing the automatic degradation system for Yandex.Taxi.



    The first thing to pay attention to is consistency. If you are building automatic degradation for some service, it is important that the service responds consistently to all of its clients. If you have two clients that use the service, their answers in case of degradation must be consistent with each other. And if the service participates in a long-running process, you need to keep in mind that the service may work correctly at the beginning and at the end of the process, while the fallback kicks in somewhere in the middle.

    It sounds difficult, but let's try to explain with an example. Perhaps it will become clearer.



    Here is our chat between the driver and the passenger. The easiest way to degrade it is to disable it. Now imagine that the chat is broken only for the driver. What happens? The passenger writes to the chat, and the driver does not see the messages. They will probably be very unhappy and will curse our application when they finally meet. So it is important that the chat is either enabled or disabled at the same time for all participants of the chat. This is what I call consistency.



    The second nuance is that our Yandex.Taxi application is geo-distributed: a taxi can be ordered in Moscow, Krasnoyarsk or Helsinki. This has to be taken into account when building degradation systems. Imagine that we have a lot of successful requests and rather few requests with errors. It would seem that this is a normal situation; the background of errors is always there. But you can look at the same picture differently.

    You can see that the service does not work in Mytishchi, and the fallback needs to be enabled for those users. The conclusion: you need to build the statistics correctly. For us, as a geo-distributed service, this means building statistics per city. If we build the statistics correctly, we will immediately see that most of the requests from Mytishchi fail, and we will enable the fallback specifically for users from Mytishchi. For all other users we will keep working in normal mode, because for them the service works correctly. A sketch of such per-city statistics follows.
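    Here is a minimal sketch of keying the statistics by city, reusing the illustrative FallbackSwitch from the sliding-window sketch above; the function names are assumptions made for this example.

```python
from collections import defaultdict

# One independent sliding window per city.
switches_by_city = defaultdict(FallbackSwitch)

def record_result(city: str, is_error: bool) -> None:
    switches_by_city[city].record(is_error)

def fallback_enabled_for(city: str) -> bool:
    # Errors in Mytishchi enable the fallback only for Mytishchi;
    # Moscow, Krasnoyarsk and Helsinki keep working in normal mode.
    return switches_by_city[city].fallback_enabled()
```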



    Perhaps for other services there will be other conditions and other nuances.

    Our services are becoming more and more complex. They often depend on an outside world that we cannot predict. That is why it is important to write not only services that work well, but also services that break well. If you have learned something new, tell your colleagues about it, share it. Like, share, repost. Degrade properly.
