Examining Kafka Throughput Limits in the Dropbox Infrastructure
The widespread use of the Apache stack is an obvious trend, and Kafka is at the forefront of its popularity: these days, the people who know this message broker probably outnumber those who are used to seeing the word Franz next to the word Kafka.
We actively use this technology in our own projects. But it is always interesting to see how it works for others, and doubly so when it is not just an example from someone else's practice but a targeted test of the technology. That is why we translated a recent article describing how Dropbox empirically searched for Kafka's limits of capacity and endurance. And it found what it was looking for.
Exploring Kafka throughput limits in the Dropbox infrastructure
Apache Kafka is a popular solution for distributed streaming and sequential processing of large amounts of data. It is widely used across the industry, and Dropbox is no exception. Kafka plays an important role in the data fabric of many of our critical distributed systems: data analytics, machine learning, monitoring, search, and stream processing (Cape), to name just a few.
At Dropbox, Kafka clusters are managed by the Jetstream team, whose main responsibility is to provide high-quality Kafka services. Understanding the Kafka throughput limit within the Dropbox infrastructure is critical for making the right resource-allocation decisions for different use cases, so this was a priority for the team. We recently built an automated test platform to achieve this goal, and in this post we would like to share our method and findings.
The figure above shows the setup of our test platform for this study. We use Spark to host Kafka clients, which lets us produce and consume traffic at any scale. We built three Kafka clusters of different sizes, so that changing the size of the cluster under test was literally a matter of redirecting traffic to a different endpoint. A Kafka topic was created for producing and consuming the test traffic. For simplicity, we spread traffic evenly across all brokers: we created the test topic with ten times as many partitions as brokers, so that each broker hosts exactly 10 partitions. Since writes to a partition are sequential, allocating too few partitions to a broker can lead to write contention, which limits throughput. Our experiments showed that 10 partitions per broker are enough to avoid this bottleneck.
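The partition layout described above can be sketched as a small helper. The function name and the round-robin assignment rule are illustrative assumptions, not the actual Jetstream tooling:

```python
def plan_test_topic(num_brokers, partitions_per_broker=10):
    """Return the total partition count and a round-robin
    broker assignment for the test topic, so that every broker
    hosts exactly `partitions_per_broker` partitions."""
    total = num_brokers * partitions_per_broker
    # Partition i is placed on broker i % num_brokers.
    assignment = {p: p % num_brokers for p in range(total)}
    return total, assignment

total, assignment = plan_test_topic(num_brokers=6)
# 60 partitions in total; broker 0 hosts partitions 0, 6, 12, ...
```

With this layout, produced traffic lands evenly on all brokers as long as producers write to partitions uniformly.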
Due to the distributed nature of our infrastructure, our clients are located in various regions of the United States. Since our test traffic stays well below the limits of Dropbox's backbone links, we can safely assume that the limits measured on inter-regional traffic also apply to local traffic.
What influences the load?
There are many factors that can affect the load on a Kafka cluster: the number of producers, the number of consumer groups, the initial consumer offset, the number of messages per second, the size of each message, and the number of topics and partitions involved, to name just a few. The parameter space is large, so we needed to find the dominant factors in order to reduce testing complexity to a manageable level.
We studied various combinations of the parameters we considered relevant and came to the perhaps surprising conclusion that the dominant factors are the basic components of the load: the number of messages produced per second (msg/s) and the number of bytes per message (bytes/msg).
We took a formal approach to understanding Kafka's limits. For a given Kafka cluster there is an associated traffic space. Each point in this multidimensional space corresponds to a unique traffic state applicable to Kafka, represented as a parameter vector: <msg/s, bytes/msg, # producers, # consumer groups, # topics, ...>. All traffic states that do not overload Kafka form a closed subspace, and its surface is the limit of the Kafka cluster.
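The traffic-state vector above can be modeled as a simple value type. This is a minimal sketch with illustrative field names, not the actual representation used by the test platform:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrafficState:
    """One point in the traffic space: a parameter vector
    <msg/s, bytes/msg, # producers, # consumer groups, # topics>."""
    msgs_per_sec: float
    bytes_per_msg: int
    producers: int
    consumer_groups: int
    topics: int

    @property
    def write_throughput(self) -> float:
        """Aggregate write bandwidth in bytes per second."""
        return self.msgs_per_sec * self.bytes_per_msg

state = TrafficState(msgs_per_sec=100_000, bytes_per_msg=1_000,
                     producers=32, consumer_groups=5, topics=1)
# state.write_throughput -> 100 MB/s of aggregate writes
```

Fixing all coordinates except msg/s and bytes/msg is exactly the dimensionality reduction applied in the first test below.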
For our first test, we chose msg/s and bytes/msg as the main parameters, reducing the traffic space to a two-dimensional plane. The set of acceptable traffic states forms a clearly bounded region, and finding Kafka's limit is equivalent to mapping the boundary of this region.
To establish the boundary with sufficient accuracy, we needed to run hundreds of tests with different parameters, which would be extremely impractical to do by hand. We therefore developed an algorithm that runs all the experiments without human intervention.
It was essential to find a set of indicators that allows the state of Kafka to be assessed programmatically. We explored a wide range of candidate metrics and settled on the following small set:
- I/O thread idle ratio below 20%: this means that the pool of worker threads Kafka uses to process client requests is too busy to keep up with incoming tasks.
- in-sync replica (ISR) set changing more than 50% of the time: this means that, under the applied traffic, for at least 50% of the observed time at least one broker cannot keep up replicating the data received from its leader.
Jetstream uses the same indicators to monitor the health of Kafka, where they serve as the first alarm signals of cluster overload.
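The two indicators combine into a single programmatic overload check. The thresholds come from the text; the function shape and metric names are illustrative assumptions:

```python
def is_overloaded(io_thread_idle_ratio, isr_change_fraction,
                  idle_floor=0.20, isr_ceiling=0.50):
    """Overloaded if the request-handler threads are idle less than
    20% of the time, or the ISR set is changing during more than
    50% of the observation window."""
    return (io_thread_idle_ratio < idle_floor
            or isr_change_fraction > isr_ceiling)

is_overloaded(0.10, 0.05)  # True: I/O threads have almost no idle time
is_overloaded(0.60, 0.70)  # True: ISR churn exceeds the ceiling
is_overloaded(0.60, 0.05)  # False: both indicators are healthy
```

A predicate like this is what lets the boundary search below run without human intervention.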
To determine a single boundary point, we fix bytes/msg and then vary msg/s to push Kafka into overload. The boundary msg/s value can be determined once we know a safe msg/s value and a nearby value that already causes overload; of these two, the safe value is taken as the boundary. As shown below, the boundary line is formed by the results of similar tests at different bytes/msg values:
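Tracing the whole boundary line is then an outer loop over fixed bytes/msg values. In this sketch, `find_boundary_msgs_per_sec` is a placeholder for one full boundary test (in practice, the binary search over producer count); the 60 MB/s stand-in below is purely illustrative:

```python
def trace_boundary(byte_sizes, find_boundary_msgs_per_sec):
    """For each fixed bytes/msg value, record the highest safe msg/s,
    producing the boundary line of the acceptable traffic region."""
    return [(b, find_boundary_msgs_per_sec(b)) for b in byte_sizes]

# Stand-in model: pretend total safe write bandwidth is 60 MB/s,
# so the safe msg/s falls as messages get larger.
boundary = trace_boundary([100, 1_000, 10_000],
                          lambda b: 60_000_000 // b)
# boundary == [(100, 600000), (1000, 60000), (10000, 6000)]
```

Plotting these (bytes/msg, msg/s) pairs yields the boundary curve of the two-dimensional traffic plane.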
It is worth noting that instead of adjusting msg/s directly, we varied the number of producers with identical production rates, denoted np. The reason is that message batching makes it hard to control the production rate of an individual producer, whereas changing the number of producers scales traffic linearly. Our early studies showed that merely increasing the number of producers does not by itself noticeably change the load on Kafka.
We find each boundary point using binary search. The search starts with a very wide range of np, [0, max], where max is a value certain to cause overload. In each iteration, the midpoint of the range is used to generate traffic. If Kafka is overloaded at this value, the midpoint becomes the new upper bound; otherwise it becomes the new lower bound. The search stops when the range is sufficiently narrow. We then take the msg/s value corresponding to the established safe lower bound as the boundary value.
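The binary search described above can be sketched as follows. Here `overloads(np)` stands in for running one traffic experiment and checking the overload indicators; the stand-in lambda with a tipping point at 37 producers is purely illustrative:

```python
def find_boundary_np(overloads, max_np, tolerance=1):
    """Binary-search the producer count np at which Kafka tips into
    overload. `overloads(np)` runs one experiment and reports whether
    the cluster was overloaded."""
    low, high = 0, max_np            # low is safe, high causes overload
    while high - low > tolerance:
        mid = (low + high) // 2
        if overloads(mid):
            high = mid               # mid overloads: new upper bound
        else:
            low = mid                # mid is safe: new lower bound
    return low                       # highest known-safe producer count

# Stand-in experiment: suppose overload begins at 37 producers.
safe_np = find_boundary_np(lambda np: np >= 37, max_np=1024)
# safe_np == 36
```

Multiplying the safe np by the fixed per-producer rate gives the boundary msg/s for that bytes/msg setting.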
As the diagram above shows, we established the boundary values for Kafka clusters of different sizes. Based on the results, we concluded that the maximum achievable throughput in the Dropbox infrastructure is 60 MB/s per broker.
It should be emphasized that this is a conservative limit, since the content of our test messages was made as random as possible to minimize the effect of Kafka's internal message compression. At the limit, both disk and network are fully utilized. In production scenarios, Kafka messages usually follow a specific pattern, since they are often produced by similar algorithms, which leaves significant room for message compression. We tested an extreme scenario in which messages consist of a single character and recorded much higher throughput, since disk and network were no longer the bottleneck.
In addition, these throughput figures hold with up to 5 consumer groups subscribed to the topic under test. In other words, the stated write bandwidth is achieved when read bandwidth is up to 5 times higher than write bandwidth. Once the number of consumer groups exceeds 5, write bandwidth begins to decline as the network becomes the bottleneck. Since the read-to-write traffic ratio in Dropbox production clusters is well below 5, the measured throughput applies to all production clusters.
This result will help us plan resources for future Kafka deployments. For example, if we want to tolerate up to 20% of brokers being offline, the maximum safe throughput per broker is 60 MB/s * 0.8 ≈ 50 MB/s. This number can be used to size a cluster for the planned capacity of future use cases.
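The capacity-planning arithmetic above can be captured in two small helpers. The function names and the sample planned load are illustrative assumptions:

```python
def safe_broker_throughput(measured_mb_s=60, offline_fraction=0.20):
    """Derate the measured per-broker limit so the cluster still
    carries the load with `offline_fraction` of brokers offline."""
    return measured_mb_s * (1 - offline_fraction)

def brokers_needed(planned_write_mb_s, per_broker_mb_s=48):
    """Minimum broker count for a planned aggregate write load
    (ceiling division)."""
    return -(-planned_write_mb_s // per_broker_mb_s)

safe_broker_throughput()   # 48.0 MB/s (rounded to ~50 MB/s in the text)
brokers_needed(500)        # 11 brokers for a 500 MB/s planned load
```

The exact derated value is 48 MB/s; the article rounds it to approximately 50 MB/s.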
Tools for future work
The platform and the automated tester will be valuable tools for the Jetstream team's future work. When we move to new hardware, change the network configuration, or upgrade Kafka, we can simply re-run these tests and obtain fresh limits for the new configuration. The same methodology can be applied to study other factors that affect Kafka's performance. Finally, the platform can serve as a Jetstream test bed for simulating new traffic patterns or reproducing issues in an isolated environment.
In this article, we presented our systematic approach to understanding Kafka's limits. It is important to note that these results were obtained on the Dropbox infrastructure, so our numbers may not apply to other Kafka installations due to differences in hardware, software, and network conditions. We hope the methodology presented here helps readers understand their own systems.