Adaptive Concurrency Limits at Netflix

Original authors: Eran Landau, William Thurston, Tim Bozarth


Netflix is obsessed with service availability, and we have written about it on this blog more than once, describing how we work toward that goal. We use circuit breakers, concurrency limits, chaos testing (deliberately injecting failures), and much more. Today we present another approach that significantly improves application stability under extreme load and helps avoid cascading failures between services: adaptive concurrency limits. With them, you no longer need to spend effort figuring out what the concurrency limits should be for the system to keep its response times low. As part of this announcement, we are also open-sourcing a simple Java library with integrations for servlets, executors, and gRPC.

Let's start with the basics


A concurrency limit is the maximum number of requests a system can handle at any point in time. It is usually determined by a constrained resource, such as CPU. A system's concurrency limit generally follows Little's law: for a system in a steady state, concurrency is the product of the average time spent processing a request and the average arrival rate of requests (L = λW). Any requests in excess of the concurrency limit cannot be served immediately and must be queued or rejected. Some queuing is useful, as it allows the system to stay fully utilized in the face of bursts of requests and uneven processing times.
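For illustration, with hypothetical numbers: if a service takes W = 50 ms on average to process a request and receives λ = 200 requests per second, then by Little's law it holds L = 200 × 0.05 = 10 requests in flight at any moment, and anything beyond that must wait in a queue or be rejected.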



If the queue is unbounded, the system can fail, for example when the arrival rate exceeds the processing rate for an extended period of time. As the queue grows, so does latency, until requests start exceeding their timeouts; this continues until the system runs out of memory and crashes. If the growing latency is left unchecked, it also hurts the calling services and leads to cascading failures across the system.



Enforcing concurrency limits is standard practice; the hard part is determining them for large, dynamic distributed systems, where parameters such as latency and achievable concurrency change constantly. The essence of our solution is to determine the concurrency limit dynamically. This limit can be thought of as the number of in-flight requests (both executing and queued) the system can accept before performance starts to degrade and latency starts to rise.

Solution


Previously at Netflix, concurrency limits were fixed values determined manually through time-consuming performance testing and profiling. The resulting number was correct for a while, but before long the topology of the system would change due to partial outages, auto scaling, or code changes that affected latency, and the limit would become stale. We knew we could do better than statically defined concurrency limits: we needed a way for the system to learn its own limits automatically. At the same time, we wanted the method to:

  1. require no manual work;
  2. require no central coordination;
  3. determine the limit without any knowledge of the hardware or system topology;
  4. adapt to changes in the system topology;
  5. be simple to compute and enforce.

To solve this problem we turned to the well-proven congestion control algorithms of TCP. These algorithms determine how many packets can be in flight at once (the size of the congestion window) without increasing latency or causing timeouts. They use various metrics to estimate this limit and grow or shrink the congestion window accordingly.



The blue line in the figure shows the system's unknown concurrency limit. The client starts with a low level of concurrency and then periodically probes the system to see whether it can handle more, growing its congestion window as long as latency does not increase. When latency does increase, the sender concludes that it has reached the limit and shrinks the congestion window again. This continuous probing of the limit is what you see reflected in the graph above.

Our algorithm builds on the TCP congestion control algorithms that look at the ratio between the minimum observed latency (the best-case scenario in which no queue has formed) and the latency measured periodically as requests are processed. This ratio tells us whether a queue has formed and is driving latency up. It gives us the gradient, or magnitude of change, in latency: gradient = RTT_noload / RTT_actual. A value of one means there is no queuing and the limit can be increased; a value less than one indicates that a queue is building up and the limit should be reduced. With each new latency sample, the limit is adjusted using this ratio, plus an allowance for a queue, according to this simple formula:

newLimit = currentLimit × gradient + queueSize

Over a number of iterations the algorithm converges on a limit that keeps latency low while still allowing enough of a queue to absorb bursts of traffic. The allowed queue size is configurable and determines how quickly the concurrency limit can grow. As a default we chose the square root of the current limit. The square root has a useful property: for small limits it is large relative to the limit, allowing rapid growth, while for large limits it is relatively small, which keeps the system stable.
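To make the update rule concrete, here is a minimal sketch in Java. It is not the library's actual implementation; the class name, the starting limit of 20, and the 0.9/0.1 smoothing factor are assumptions made for this example.

```java
// Illustrative sketch of the update rule described above (not the library's code).
public class GradientLimitSketch {
    private double limit = 20;                    // current concurrency limit (assumed starting value)
    private double rttNoLoad = Double.MAX_VALUE;  // lowest latency seen so far, approximating the no-load RTT

    /** Feed one sampled request latency (in nanoseconds) into the limit. */
    public synchronized void onSample(long rttNanos) {
        rttNoLoad = Math.min(rttNoLoad, rttNanos);          // track the minimum (no-queue) latency
        double gradient = rttNoLoad / rttNanos;             // 1.0 => no queuing; < 1.0 => a queue is forming
        double queueSize = Math.sqrt(limit);                // allowed queue: square root of the current limit
        double newLimit = limit * gradient + queueSize;     // newLimit = currentLimit * gradient + queueSize
        limit = Math.max(1, 0.9 * limit + 0.1 * newLimit);  // smooth so one noisy sample cannot swing the limit
    }

    /** The number of in-flight requests the server should currently allow. */
    public synchronized int getLimit() {
        return (int) limit;
    }
}
```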

Adaptive limits in action


On the server side, adaptive limits reject excess requests and keep latency low, which lets each instance protect both itself and the services it depends on. Previously, when excess requests could not be rejected, any sustained increase in requests per second or in latency led to even worse latency and ultimately brought down the whole system. Today, services can shed excess load and keep latency low while other stabilizing mechanisms, such as auto scaling, do their work.
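As an illustration of what shedding load against such a limit might look like, here is a hedged sketch that builds on the GradientLimitSketch class above. The real servlet and gRPC integrations ship with the library; the names used here are not its API.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: admits requests while in-flight count is under the adaptive limit.
public class AdaptiveLimiter {
    private final GradientLimitSketch limit = new GradientLimitSketch();
    private final AtomicInteger inFlight = new AtomicInteger();

    /** Try to admit a request; an empty result means reject it immediately (e.g. with HTTP 503). */
    public Optional<Runnable> acquire() {
        if (inFlight.incrementAndGet() > limit.getLimit()) {
            inFlight.decrementAndGet();
            return Optional.empty();                    // over the limit: shed the request immediately
        }
        long start = System.nanoTime();
        return Optional.of(() -> {                      // run this when the request completes
            limit.onSample(System.nanoTime() - start);  // feed the observed latency back into the limit
            inFlight.decrementAndGet();
        });
    }
}
```

A request handler would call acquire() before doing any work, return an immediate error such as 503 Service Unavailable when the result is empty, and otherwise invoke the returned callback once the response has been sent.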



It is important to keep in mind that limits are detected and enforced at the level of an individual server, with no coordination, and that the traffic hitting each server can rise and fall sharply. It is therefore no surprise that the discovered limit and the observed concurrency can differ from server to server, especially in a multi-tenant cloud environment. As a result, one server may be shedding load while others still have spare capacity. With client-side load balancing, however, a single retry is enough for the request to reach a server with free capacity in nearly 100% of cases. Better yet, there is no longer any concern that retries will turn into a retry storm or a self-inflicted DDoS, because services can reject traffic quickly (in under a millisecond) with minimal impact on performance.

Conclusion


Adaptive concurrency limits remove the need to manually figure out how and when our services should reject traffic, and they improve the overall reliability and availability of our entire microservice ecosystem.

We are happy to share our implementation along with common integrations in the open-source library at github.com/Netflix/concurrency-limits . We hope our code helps users protect their services from cascading failures and growing latency, and improves their availability.
