Fine-tuning load balancing

    This article will focus on load balancing in web projects. Many believe that the solution to this problem is to distribute the load between servers - the more accurate, the better. But we know that this is not entirely true. The stability of the system is much more important from a business point of view .

    A small minute peak at 84 RPS "five hundred" is five thousand errors that were received by real users. This is a lot and it is very important. It is necessary to look for the reasons, to work on the errors and try to avoid such situations in the future.

    Nikolay Sivko ( NikolaySivko ) in his report on RootConf 2018 spoke about the thin and not very popular aspects of load balancing:

    • when to repeat the request (retries);
    • how to select values ​​for timeouts;
    • how not to kill the underlying servers at the time of the accident / overload;
    • do you need health checks;
    • how to handle flickering problems.

    Under the cut of this report.

    About Speaker: Nikolay Sivko co-founder He worked as a system administrator and group administrator. Supervised the operation in Founded monitoring service In this report, experience in monitoring development is the main source of case studies.

    What are we going to talk about?

    This article will discuss web projects. Below is an example of live production: the graph shows requests per second for a web service.

    When I talk about balancing, many people perceive it as “we need to distribute the load between servers — the more precisely, the better.”

    In fact this is not true. This problem is relevant for a very small number of companies. More often business is concerned with errors and stability of the system.

    The small peak on the chart is “pyatisotki”, which the server returned within a minute, and then stopped. From a business point of view, such as an online store, this little peak at 84 RPS "five hundred" is 5040 errors for real users. Some did not find something in your catalog, others could not put the goods in the basket. And this is very important. Suppose this peak on the graph does not look very large, but  in real users it is a lot .

    As a rule, everyone has such peaks, and administrators do not always react to them. Very often, when a business asks what it was, he is answered:

    • "This is a short burst!"
    • "This is just a release rolling."
    • "The server is dead, but everything is in order."
    • “Vasya switched the network of one of the backends”.

    Often, people do not even try to understand the reasons why this happened, and they do not do any post-work so that this does not happen again.

    Thin tuning

    I called the report “Fine Tuning” (English Fine tuning), because I thought that not everyone would get to this task, but it would be worth it. Why not get?

    • Not everyone gets to this task, because when everything works, it is not visible. This is very important in case of problems. Fakapy do not happen every day, and such a small problem requires very serious efforts to solve it.
    • Need to think a lot. Very often the admin - the person who sets up the balancing - is not able to solve this problem on his own. Next we will see why.
    • Cling lower levels. This task is very closely connected with the development, with the adoption of decisions that affect your product and your users.

    I argue that it’s time to do this task for several reasons:

    • The world is changing, becoming more dynamic, there are many releases. It is said that it is now correctly released 100 times a day, and release is a future fix with a probability of 50 to 50 (just like the probability of meeting a dinosaur)
    • In terms of technology, everything is also very dynamic. Kubernetes and other orchestrators appeared. There is no good old deployment, when one backend on some IP is turned off, the update rolls in, and the service rises. Now in the rollout process in k8s, the list of IP upstream is completely changing.
    • Microservices: now everyone communicates over the network, which means you need to do it reliably. Balancing plays an important role.

    Test stand

    Let's start with simple obvious cases. For clarity, I will use the test bench. This is a Golang application that gives out http-200, or you can switch it to "give up http-503" mode.

    Run 3 instances:


    We serve 100rps via yandex.tank via nginx.

    Nginx out of the box:

    upstream backends {   
    server {    
      location / {
        proxy_pass http://backends;

    Primitive script

    At some point, we turn on one of the backends in a 503 give mode, and we get exactly a third of errors.

    It is clear that nothing works out of the box: nginx out of the box does not retry if received any response from the server .

    Nginx default: proxy_next_upstream error timeout;

    In fact, this is quite logical on the part of nginx developers: nginx is not entitled to decide for you what you want to retry and what not.

    Accordingly, we need retries - repeated attempts, and we start talking about them.


    We need to find a compromise between:

    • User request - holy, hurt yourself, but answer. We want to answer the user at all costs, the user is the most important.
    • It is better to answer with an error than server overload.
    • Data integrity (with nonidempotent queries), that is, you cannot repeat certain types of queries.

    The truth, as usual, is somewhere between - we have to balance between these three points. Let's try to understand what and how.

    I divided the failed attempts into 3 categories:

      1. Transport error
    For HTTP, the transport is TCP, and, as a rule, here we are talking about connection setup errors and connection setup timeouts. In my report I will mention 3 common balancers (let's talk about Envoy a little further):

    • nginx : errors + timeout (proxy_connect_timeout);
    • HAProxy : timeout connect;
    • Envoy : connect-failure + refused-stream.

    Nginx has the ability to say that a failed attempt is a connection error and a connection timeout; HAProxy has a connection timeout, Envoy also - everything is standard and normal.

      2. Request timeout:
    Suppose we sent a request to the server, successfully connected to it, but the answer does not come to us, we waited for it and we understand that there is no point in waiting any longer. This is called request timeout:

    • In  nginx has: timeout (prox_send_timeout * + proxy_read_timeout * );
    • In  HAProxy  - OOPS :( - it basically no Many do not know that HAProxy, if successfully established a connection, will never try to re-send the request..
    • Envoy can do everything: timeout || per_try_timeout.

      3. HTTP status
    All balancers, except HAProxy, are able to process, if the backend responded to you, but with some erroneous code.

    • nginx : http_ *
    • HAProxy : OOPS: (
    • Envoy : 5xx, gateway-error (502, 503, 504), retriable-4xx (409)


    Let's talk now in detail about timeouts, it seems to me that it is worth paying attention to this. There will be no further rocket science - this is simply structured information about what generally happens and how it applies to it.

    Connect timeout

    Connect timeout is the time to establish a connection. This is a characteristic of your network and your specific server, and is independent of the request. Usually, the default for connect timeout is set to small. In all proxies, the default value is large enough, and this is wrong - there should be one, sometimes tens of milliseconds (if we are talking about a network within the same DC).

    If you want to identify problem servers a little faster than these units, tens of milliseconds, you can adjust the load on the backend by setting a small backlog to accept TCP connections. In this case, you can, when the backlog of the application is full, tell Linux to reset the backlog overflow. Then you will be able to shoot a “bad” overloaded backend a bit earlier than the connect timeout:

    fail fast: listen backlog + net.ipv4.tcp_abort_on_overflow

    Request timeout

    Request timeout is not a network characteristic, but a  characteristic of a group of requests (handler). There are different requests - they are different in severity, they have completely different logic inside, they need to apply to completely different repositories.

    Nginx, as such, does not have a timeout for the entire request. He has:

    • proxy_send_timeout: time between two successful write operations write ();
    • proxy_read_timeout: time between two successful read operations read ().

    That is, if you have a backend slowly, one byte one time, it gives something to the timeout, then everything is fine. As such, nginx has no request_timeout. But we are talking about upstream. In our data center, they are controlled by us, so with the assumption that there is no slow loris in the network, then, in principle, read_timeout can be used as request_timeout.

    Envoy has it all: timeout || per_try_timeout.

    Choose request timeout

    Now the most important thing, in my opinion, is which request_timeout. We proceed from how long it is permissible for a user to wait — this is a certain maximum. It is clear that the user will not wait more than 10 seconds, so you need to respond to him faster.

    • If we want to handle the failure of a single server, then the timeout must be less than the maximum allowable timeout: request_timeout <max.
    • If you want to have 2 guaranteed attempts to send a request for two different backends, then the timeout for one attempt is equal to half of this allowable interval: per_try_timeout = 0.5 * max.
    • There is also an intermediate option - 2 optimistic attempts in case the first backend blunted, but the second one responds quickly: per_try_timeout = k * max (where k> 0.5).

    There are different approaches, but in general it’s difficult to choose a timeout . There will always be boundary cases, for example, the same handler is processed in 99% of cases in 10 ms, but there are 1% of cases when we wait for 500 ms, and this is normal. It will have to settle.

    With this 1% you need to do something, because the whole group of requests must, for example, comply with the SLA and fit into 100 ms. Very often in these moments the application is processed:

    • A paging appears in those places where it is impossible to give all the data entirely in timeout.
    • Admin / reports are separated into a separate group of urls in order to raise timeout for them, and yes user requests, on the contrary, lower them.
    • We fix / optimize those requests that do not fit into our timeout.

    Immediately we need to take a not very simple psychological decision that if we don’t have time to answer the user within the allotted time, we give an error (it’s like in the ancient Chinese saying: “If the mare died, let it fall!”) .

    After that, the process of monitoring your service from the user's point of view is simplified:

    • If there are errors, everything is bad, it needs to be repaired.
    • If there are no errors, we are at the right time to answer, then everything is fine.

    Speculative retries

    We made sure that choosing the timeout value is quite difficult. As it is known, in order to simplify something, it is necessary to complicate something :)

    Speculative Retry  - a repeated request to another server, which is started by some condition, but the first request is not interrupted. The answer we take from the server that responded faster.

    I have not seen this feature in the known to me balancers, but there is a great example of Cassandra (rapid read protection):

       speculative_retry = N ms | M th percentile

    Thus, you do not need to time out . You can leave it at an acceptable level and in any case have a second attempt to get an answer to the request.

    In Cassandra there is an interesting opportunity to set a static speculative_retry or dynamic, then the second attempt will be made through the response time percentile. Cassandra collects statistics on the response times of previous requests and adapts a specific timeout value. It works quite well.

    In this approach, everything is on the balance between reliability and parasitic load. Not the servers You provide reliability, but sometimes you get extra requests to the server. If you hurry somewhere and send a second request, and the first one answered, the server received a little more load. In a single case, this is a minor issue.

    Timeout consistency is another important aspect. We'll talk more about the request cancellation, but in general, if the timeout for the entire user request is 100 ms, then there is no point in setting the timeout for the request to the database for 1 s. There are systems that allow you to do this dynamically: the service to the service transmits the rest of the time that you will wait for an answer to this request. It's complicated, but if you suddenly need it, you can easily find how to do it in the same Envoy.

    What else do you need to know about retry?

    Point of no return (V1)

    Here V1 is not version 1. In aviation, there is such a thing - speed V1. This is the speed, after which the acceleration along the runway cannot be slowed down. It is necessary to take off, and then decide on what to do next.

    The same point of no return is in load balancers: when you sent 1 byte of response to your client, no errors can be corrected . If the backend dies at this point, no retries will help. You can only reduce the likelihood of such a scenario triggering, make a graceful shutdown, that is, say to your application: “You are not accepting new requests now, but you’re working upon the old ones!” And only then extinguish it.

    If you control the client, this is some tricky Ajax or mobile application, it can try to repeat the request, and then you can get out of this situation.

    Point of no return [Envoy]

    Envoy was such a strange thing. There is per_try_timeout - it limits how much each attempt to get an answer to a request can take. If this timeout worked, but the backend had already begun to respond to the client, then everything was interrupted, the client received an error.

    My colleague Pavel Trukhanov ( tru_pablo ) made a patch , which will be in 1.7 in master Envoy. Now it works as it should: if the response began to be transmitted, only global timeout will work.

    Retries: need to limit

    Retries are good, but there are so-called killer queries: heavy queries that perform very complex logic, access the database a lot and often do not fit per_try_timeout. If we send retry again and again, then we kill our base. Because in most (99.9%) database services there is no request cancellation .

    Request cancellation means that the client has unhooked, you need to stop all work right now. Golang now actively promotes this approach, but, unfortunately, it ends with a backend, and many database repositories do not support this.

    Accordingly, retries need to be limited, which allows almost all balancers (we are no longer considering HAProxy from now on).


    • proxy_next_upstream_timeout (global)
    • proxt_read_timeout ** as per_try_timeout
    • proxy_next_upstream_tries


    • timeout (global)
    • per_try_timeout
    • num_retries

    In Nginx, we can say that we are trying to do retries for the duration of the X window, that is, at a given time interval, for example, 500 ms we do as many retries as will fit. Or there is a setting that limits the number of repeated samples. In  Envoy as well - the number or timeout (global).

    Retries: use [nginx]

    Consider an example: we set up retry in nginx - accordingly, having received HTTP 503, we try to send a request to the server again. Then turn off the two backends.

    upstream backends {   
    server {    
    proxy_next_upstream error timeout http_503;
    proxy_next_upstream_tries 2;            
      location / {
        proxy_pass http://backends;

    Below are the graphics of our test bench. There are no errors on the top chart, because there are very few of them. If you leave only errors, it is clear that they are.

    What happened?

    • proxy_next_upstream_tries = 2.
    • In the case when you make the first attempt to the "dead" server, and the second to the other "dead", you get HTTP-503 in the case of both attempts to the "bad" servers.
    • There are few errors, since nginx "bans" a bad server. That is, if some errors returned from the backend in nginx, it stops making the next attempts to send a request to it. This is governed by the variable fail_timeout.

    But there are mistakes, and it does not suit us.

    What to do with it?

    We can either increase the number of retries (but then we return to the problem of “killer requests”), or we can reduce the likelihood of a request for dead backends. This can be done with health checks.

    Health checks

    I propose to consider health checks as optimizing the process of choosing a “live” server. This in no way gives any guarantees. Accordingly, during the execution of a user request, we are more likely to get only to the “live” servers. The balancer regularly calls at a specific URL, the server responds to it: "I am alive and ready."

    Health checks: in terms of backend

    From the point of view of the backend you can do interesting things:

    • Check the availability of all underlying subsystems on which the work of the backend depends: the necessary number of connections to the database are established, there are free connections in the pool, etc., etc.
    • On the Health checks URL, you can hang your logic if the balancer you are using is not particularly intelligent (let's say you take the Load Balancer from the hoster). The server can remember that “in the last minute I have given so many errors - probably, I’m some kind of“ wrong ”server, and for the next 2 minutes I’ll respond with“ five hundred ”on Health checks. Thus, I will ban myself myself! ”This sometimes saves a lot when you have an uncontrolled Load Balancer.
    • As a rule, the check interval is about a second, and it is necessary that the Health check handler does not kill your server. It should be easy.

    Health checks: implementation

    As a rule, everything here is all about the same:

    • Request;
    • Timeout on it;
    • Interval, through which we do checks. The clever proxies have jitter , that is, some randomization so that all Health checks do not come to the backend all at once, and do not kill it.
    • Unhealthy threshold  - the threshold for how many unhealthy Health checks must pass in order to mark the service as Unhealthy.
    • Healthy threshold  - on the contrary, how many successful attempts must go through in order for the server to return to service.
    • Additional logic. You can disassemble Check status + body, etc.

    Nginx implements the Health check functions only in the paid version of nginx +.

    I will note feature of Envoy , it has Health check panic mode. When we are banned as “unhealthy,” more than N% of hosts (say, 70%), he believes that all of our Health checks are lying, and all the hosts are actually alive. In a very bad case, this will help you not to run into a situation where you yourself have shot your leg, and banned all the servers. This is a way to hedge once again.

    Putting it all together

    Usually for Health checks put:

    • Or nginx +;
    • Or nginx + something else :)

    In our country, there is a tendency to put nginx + HAProxy, because in the free version of nginx there are no health checks, and until 1.11.5 there was no limit on the number of connections to the backend. But this option is bad because HAProxy does not know how to retract after a connection is established. Many people think that if HAProxy returns an error to nginx, and nginx retry, then everything will be fine. Not really. You can go to another HAProxy and the same backend, because the backend pools are the same. So you enter for yourself another level of abstraction, which reduces the accuracy of your balancing and, accordingly, the availability of the service.

    We have nginx + Envoy, but, if confused, we can restrict ourselves to Envoy only.

    What is such an Envoy?

    Envoy is a trendy youth load balancer, originally developed in Lyft, written in C ++. Out of the box can a bunch of buns on our current topic. You probably saw it as a Service Mesh for Kubernetes. As a rule, Envoy acts as a data plane, that is, directly balances traffic, and there is also a control plane, which provides information about what to distribute the load (service discovery, etc.).

    I'll tell you a few words about his buns.

    To increase the likelihood of a successful response when retry on the next attempt, you can sleep for a while and wait until the backend comes around. In this way we will handle short problems on the database. Envoy has backoff for retries - pauses between retries. Moreover, the delay interval between attempts increases exponentially. The first retry occurs after 0-24 ms, the second after 0-74 ms, and then for each next attempt the interval increases, and the specific delay is chosen randomly from this interval.

    The second approach is not an Envoy-specific thing, but a pattern called Circuit breaking (lit. chain break or fuse). When we have a backend blunts, in fact, we are trying to finish it every time. This is because users in any incomprehensible situation press the refresh-at page, sending you more and more new requests. Your balancers are nervous, send retries, the number of requests is increasing - the load is growing, and in this situation it would be good not to send requests.

    Circuit breaker just allows you to determine that we are in such a state, to quickly shoot off an error and let the backends “catch their breath”.

    Circuit breaker (hystrix like libs), original on ebay blog.

    Above is the Circuit breaker circuit from Hystrix. Hystrix is ​​a Java library from Netflix that is designed to implement fault tolerance patterns.

    • The “fuse” can be in the “closed” state when all requests are sent to the backend and there are no errors.
    • When a certain fail-threshold is triggered, that is, some errors have occurred, the circuit breaker goes into the “Open” state. It quickly returns an error to the client, and requests do not fall on the backend.
    • Once in a while, still a small part of requests is sent to the backend. If an error is triggered, the state remains “Open”. If everything starts to work well and respond, the "fuse" is closed, and work continues.

    In Envoy, as such, this is not all. There are high-level limits on the fact that there can not be more than N requests for a particular upstream group. If more, something is wrong here - we return an error. There can be no more N active retries (i.e. retries that are happening right now).

    You did not have retries, something exploded - send retries. Envoy understands that more than N is not normal, and all requests should be shot off with an error.

    Circuit breaking [Envoy]

    • Cluster (upstream group) max connections
    • Cluster max pending requests
    • Cluster max requests
    • Cluster max active retries

    This simple thing works well, understandably configurable, no need to come up with special parameters, and the default settings are quite good.

    Circuit breaker: our experience

    Previously, we had an HTTP metrics collector, that is, agents installed on our clients' servers sent metrics to our cloud via HTTP. If we have any problems in the infrastructure, the agent writes the metrics to his disk and then tries to send them to us.

    And the agents are constantly trying to send us data, they are not upset that we somehow do not respond, and do not leave.

    If, after recovery, we allow the full flow of requests (besides, the load will be even more normal, as we have accumulated metrics) on the servers, most likely everything will fall again, as some components will end up with a cold cache or something.

    To cope with this problem, we clamped the incoming stream of the record through nginx limit req. That is, we say that we are processing, say, 200 RPS. When it all came to normal mode, we removed the restriction, because in a normal situation the collector is able to record much more than the crisis limit req.

    Then for some reason we switched to our protocol on top of TCP and lost HTTP proxying buns (the ability to use nginx limit req). And in any case, it was necessary to put this site in order. We no longer wanted to change limit req with our hands.

    We have enough opportunity, because we control both the client and the server. Therefore, we are  in the agentCircuit breaker codes, which understood that if he got N errors in a row, he needs to sleep, and after some time, and exponentially increasing, try again. When everything is normalized, it adds all the metrics, as it has a spool on the disk.

    On the  server, we added the Circuit breaker code for all calls to all subsystems + request cancellation (where possible). Thus, if we received N errors from Cassandra, N errors from Elastic, from the base, from a neighboring service - from anything, we quickly give an error and do not perform further logic. We just shoot a mistake and that's it - wait until it normalizes.

    The graphs above show that we do not get a surge of errors in case of problems (conditionally: gray is “two hundred,” red is “five hundred”). It can be seen that at the time of problems from 800 RPS to the backend flew 20-30. This allowed our backend to "catch our breath", to rise, and continue to work well.

    The most difficult mistakes

    The most difficult mistakes are those that are ambiguous.

    If your server just turned off and does not turn on, or you realize that it is dead and you finished it yourself, this is actually a gift of fate. This option is well formalized.

    Much worse when one server starts to behave unpredictably, for example, the server responds to all requests with errors, and to Health checks - HTTP 200.

    I will give an example from life.

    We have a Load Balancer, 3 nodes, on each of which is an application and under it Cassandra. Applications access all instances of Cassandra, and all Cassandra interact with neighbors, because Cassandra has a two-level coordinator model and data noda.

    Schrödinger's server is one of them:kernel: NETDEV WATCHDOG: eth0 (ixgbe): transmit queue 3 timed out.

    The following happened there: in the network driver, the bug (in the Intel drivers they happen), and one of the 64 transmission queues just stopped sending. Accordingly, 1/64 of all traffic is lost. This can occur before the reboot, it can not be treated.

    I, as the admin, are worried in this situation, not why there are such bugs in the driver. I’m worried why when you build a system without a single point of failure, problems on one server ultimately lead to the failure of the entire production. I was wondering how this can be resolved, and on the machine. I do not want to wake up at night to turn off the server.

    Cassandra: coordinator -> nodes

    Cassandra, there are those speculative retries, and this situation is handled very easily. There is a slight increase in latency on the 99 percentile, but this is not fatal, and in general everything works.

    App -> cassandra coordinator

    Nothing works out of the box. When an application knocks on Cassandra and hits the coordinator on a “dead” server, it doesn’t work out in any way, errors are generated, latency increases, etc.

    In fact, we use gocql - a heaped enough cassandra client. We just did not use all its features. There is a HostSelectionPolicy in which we can slip the bitly / go-hostpool library . It uses Epsilon greedy algorithms to find and remove outliers from the list.

    I will try to briefly explain how the Epsilon-greedy algorithm works .

    There is a multi-armed bandit problem: you are in a room with slot machines, you have several coins, and you have to try to win as much as possible for N attempts.

    The algorithm includes two stages:

    1. The phase of " explore"  - when you explore: 10 coins spend on it to determine which machine is better.
    2. Phase " exploit"  - the remaining coins are lowered into the best machine.

    Accordingly, a small number of requests (10 - 30%) are sent to the  round - robin simply to all hosts, we consider failures, the response time, we choose the best. Then 70 - 90% of requests are sent to the best and update their statistics.

    Host-pool each server evaluates by the number of successful responses and response time. That is, he takes the fastest server in the end. He calculates the response time as a weighted average (fresh measurements are more significant - what was right now is much more significant). Therefore, we periodically reset the old measurements. So with a good chance our server, which returns errors or tupit, goes, and it almost stops receiving requests.


    We added “armor-piercing” (fault tolerance) at the application level — Cassandra and Cassandra coordinator-data. But if our balancer (nginx, Envoy — whatever) sends requests for a “bad” Application, which will blunt the address to any Cassandra, because it itself has a non-working grid, in any case, we will get problems.

    In the Envoy out of box there is Outlier detection  by:

    • Consecutive http-5xx.
    • Consecutive gateway errors (502,503,504).
    • Success rate.

    By successive "five hundred" it can be understood that something is wrong with the server and to ban it. But not forever, but for a time interval. Then a small number of requests begins to arrive there - if it is stupid again, we will ban it, but for a longer interval. Thus, it all comes down to the fact that virtually no requests fall on this problematic server.

    On the other hand, to protect against the uncontrollable explosion of this “cleverness”, we have max_ejection_percent. This is the maximum number of hosts we can count as an outlier, as a percentage of all available. That is, if we banned 70% of the hosts, it does not count, everyone is unbanned - in short, amnesty!

    This is a very cool thing that works great, while it is simple and clear - I advise you!


    I hope I convinced you that you need to fight with similar cases. I believe that to deal with the emission of errors and latency is definitely worth it in order to:

    • To do autodelivery, to be released the necessary number of times a day, etc.
    • Sleep well at night, do not jump up once again by SMS, but fix problem servers during working hours.

    Obviously, nothing works out of the box ; accept this fact. This is normal, because there are so many decisions that you have to make yourself, and the software cannot do it for you - it waits for you to configure it.

    No need to chase the most sophisticated balancer . For 99% of the problems, the standard features of nginx / HAProxy / Envoy are enough . A more sophisticated proxy will be needed if you want to make everything absolutely perfect and remove the five hundred micro-spikes.

    The point is not about the specific proxy (if it is not HAProxy :)), but how you set it up.

    At  DevOpsConf Russia, Nikolay will talk about the experience of implementing Kubernetes, taking into account the preservation of fault tolerance and the need to save human resources. What else awaits us at the conference can be found in the  Program .

    Want to receive updates, new transcripts, and important conference news - sign up for Ontiko's thematic newsletter on DevOps.

    Love to watch reports more than read articles, go to the  YouTube channel  - there are collected all the videos on the operation in recent years and the list is updated all the time.

    Also popular now: