Serving thousands of queries per second using the XBT Tracker
A test was recently conducted, the results of which showed that one application processes 2000 requests per second on a modest server, where this was not the only load. In this case, the result of each query is recorded in 3-5 tables in MySQL. Honestly, I was surprised by this result, so I decided to share a description of the architecture of this application with the habrasociety. A similar approach is applicable from banner displays to chats and microblogging, I hope someone will find it interesting.
Firstly, this application is single-threaded. Everything is done by one process, work with sockets - non-blocking epoll / select, no threads waiting for input / output (threads). With the development of HTTP, first with the advent of Keep-Alive, then with AJAX and the growing popularity of COMET, the number of permanent connections to the web server is growing, on loaded projects it is measured in the thousands and even tens of thousands, and if for each you create your own thread with its own stack and constantly switch between them - server resources will not be enough quickly.
The second key point is that one SELECT ... WHERE pk in (k1, k2, ..., kN) is faster than several SELECT ... WHERE pk = ... Performing work with the database in large batches can reduce not only the number of queries per second, but and overall load.
XBT Tracker (XBTT) - bittorrent tracker. Please refrain from the topic of copyright, because the torrent is officially used, for example, to distribute Linux distributions and patches for World of Warcraft. Unlike ed2k and DC ++, it is possible to put several files in one torrent without packing them in the archive, and at any time check the integrity of the file, and if necessary restore it by downloading broken pieces.
When downloading, the client periodically turns to the tracker, reporting traffic statistics and receiving addresses of other distributors and downloaders. The more often such calls are made, the more accurate is the traffic count (if this is a closed tracker) and the faster the new distribution participants will learn about each other.
XBT Tracker, about which this post is written in C ++ and is used on many foreign trackers, both open and closed, and even a couple of Russian ones. Another high-performance tracker, OpenTracker , does not support closed trackers based on traffic, so it does not need to write the results of queries to the database, therefore it is less interesting in this context.
In the 90s, when working with sockets, blocking input-output was used, when, when calling the recv and send methods, the current thread “hung” until the results were expected. For each received connection, a separate process was created (fork) in which its request was processed. But each process requires stack memory and processor time to switch context between processes. On small loads, this is not scary, and the web was not interactive at that time, completely in request-response mode, there was little dynamic context (CGI), mainly page counts and primitive under-forums. Apache still works this way. In apache2 there is the possibility of using lighter threads (threads) instead of processes, but the essence remains the same.
As an alternative to this, non-blocking I / O appeared, when one process could open many sockets, periodically poll their status, and if any events appeared, for example, a new connection came in or data to read came in - serve them. This is how nginx works, for example . In Java version 1.4 and above, there is NIO for this.
Further improvements appeared, for example, TCP_DEFER_ACCEPT, which allows “deferring” the connection until data came through it, SO_ACCEPTFILTER, delaying the connection until a full HTTP request was received. Now you can increase the queue of missed connections (by default there are only 128) with sysctl kern.ipc.somaxconn in BSD and sysctl net.core.somaxconn in Linux, which is especially important if there are pauses in socket processing.
Requests in XBTT are very simple, their processing does not require special computing resources, it keeps all the necessary data in memory, so there is no problem to execute them in the same process as working with sockets. In the case of more serious tasks, it is still necessary to create separate threads for their maintenance.
One way out is to create a thread pool (thread pool), which sends a request for processing, after which the thread returns back to the pool. If there are no free threads, the request is waiting in line. This approach allows you to reduce the total number of threads used, and each time you do not have to create a new one and kill it after completion of the request processing.
An even better mechanism, called actors, is in the erlang and scala languages, possibly in the form of libraries, and for other languages. Processing is carried out by means of asynchronous transfer of messages between actors, which can be imagined as sending e-mails with an inbox for everyone, but this topic is beyond the scope of this post (for example, here is a fresh post about this).
The result of each call to the XBTT tracker is recorded in several tables. The user increases his downloaded and flooded traffic. Torrent statistics are increasing. The table of current distribution participants is filled. Plus a couple of service tables with download history.
With the traditional processing method, at least 3 separate INSERT or UPDATE would be executed for each request to the tracker, the client would wait for their execution, so the database server would have to execute 3 requests for each call to the tracker.
XBTT does not execute them immediately, but accumulates a large bundle of INSERT ... VALUES (...), (...). ..., (...) ON DUPLICATE KEY UPDATE f1 = VALUES (f1), ..., fN = VALUES (fN), and executes once every few seconds. Due to what, the number of queries to the database decreases from several per query to the tracker to several per minute. He also periodically re-reads the necessary data that could change from the outside (the web interface is independent of the tracker).
In this application, the loss of some data is not at all critical. If the torrent traffic statistics are not recorded in the database in a few seconds, nothing terrible will happen. Although during an abnormal termination it writes the accumulated buffers to the database, the server may have UPS in case of a power outage, etc. - there is no guarantee that all data transferred by the client is written to disk. For a banner network, this is also not scary, but there are tasks where saving all the data is critical.
Similarly, not all applications have the ability to store all data in memory. To process a client request, it may be necessary to select data from the database.
But in this case, block data processing is possible. A pipeline is organized (pipeline; actors are perfectly suited for its implementation) from several stages, at each stage a group of data is assembled for the query as soon as a sufficient amount (naturally, custom) has accumulated or some time has passed (for example, 10-100 milliseconds), during which the required quantity was not accumulated - a group request is made to the database, where instead of “key = value” the condition “key IN (accumulated list)” is set.
If it is necessary to lock these records, then FOR UPDATE SKIP LOCKED can be added to the request (naturally, the results will need to be written in the same connection to the database, the same transaction). You can use the Prepared Statement in those databases that support it, to perform a query analysis once (parse), choose the optimal plan for its execution, and then each time only substitute data in it. To reduce the number of such prepared requests, the number of parameters to it can be taken only in powers of two: 1, 2, 4, 8, 16, 32 ... You can also group (batch) requests, at first not executing each, but only adding to the packet, and then complete all at once.
Firstly, this application is single-threaded. Everything is done by one process, work with sockets - non-blocking epoll / select, no threads waiting for input / output (threads). With the development of HTTP, first with the advent of Keep-Alive, then with AJAX and the growing popularity of COMET, the number of permanent connections to the web server is growing, on loaded projects it is measured in the thousands and even tens of thousands, and if for each you create your own thread with its own stack and constantly switch between them - server resources will not be enough quickly.
The second key point is that one SELECT ... WHERE pk in (k1, k2, ..., kN) is faster than several SELECT ... WHERE pk = ... Performing work with the database in large batches can reduce not only the number of queries per second, but and overall load.
Subject area
XBT Tracker (XBTT) - bittorrent tracker. Please refrain from the topic of copyright, because the torrent is officially used, for example, to distribute Linux distributions and patches for World of Warcraft. Unlike ed2k and DC ++, it is possible to put several files in one torrent without packing them in the archive, and at any time check the integrity of the file, and if necessary restore it by downloading broken pieces.
When downloading, the client periodically turns to the tracker, reporting traffic statistics and receiving addresses of other distributors and downloaders. The more often such calls are made, the more accurate is the traffic count (if this is a closed tracker) and the faster the new distribution participants will learn about each other.
XBT Tracker, about which this post is written in C ++ and is used on many foreign trackers, both open and closed, and even a couple of Russian ones. Another high-performance tracker, OpenTracker , does not support closed trackers based on traffic, so it does not need to write the results of queries to the database, therefore it is less interesting in this context.
Non-Blocking I / O
In the 90s, when working with sockets, blocking input-output was used, when, when calling the recv and send methods, the current thread “hung” until the results were expected. For each received connection, a separate process was created (fork) in which its request was processed. But each process requires stack memory and processor time to switch context between processes. On small loads, this is not scary, and the web was not interactive at that time, completely in request-response mode, there was little dynamic context (CGI), mainly page counts and primitive under-forums. Apache still works this way. In apache2 there is the possibility of using lighter threads (threads) instead of processes, but the essence remains the same.
As an alternative to this, non-blocking I / O appeared, when one process could open many sockets, periodically poll their status, and if any events appeared, for example, a new connection came in or data to read came in - serve them. This is how nginx works, for example . In Java version 1.4 and above, there is NIO for this.
Further improvements appeared, for example, TCP_DEFER_ACCEPT, which allows “deferring” the connection until data came through it, SO_ACCEPTFILTER, delaying the connection until a full HTTP request was received. Now you can increase the queue of missed connections (by default there are only 128) with sysctl kern.ipc.somaxconn in BSD and sysctl net.core.somaxconn in Linux, which is especially important if there are pauses in socket processing.
Request service
Requests in XBTT are very simple, their processing does not require special computing resources, it keeps all the necessary data in memory, so there is no problem to execute them in the same process as working with sockets. In the case of more serious tasks, it is still necessary to create separate threads for their maintenance.
One way out is to create a thread pool (thread pool), which sends a request for processing, after which the thread returns back to the pool. If there are no free threads, the request is waiting in line. This approach allows you to reduce the total number of threads used, and each time you do not have to create a new one and kill it after completion of the request processing.
An even better mechanism, called actors, is in the erlang and scala languages, possibly in the form of libraries, and for other languages. Processing is carried out by means of asynchronous transfer of messages between actors, which can be imagined as sending e-mails with an inbox for everyone, but this topic is beyond the scope of this post (for example, here is a fresh post about this).
Batch work with the database
The result of each call to the XBTT tracker is recorded in several tables. The user increases his downloaded and flooded traffic. Torrent statistics are increasing. The table of current distribution participants is filled. Plus a couple of service tables with download history.
With the traditional processing method, at least 3 separate INSERT or UPDATE would be executed for each request to the tracker, the client would wait for their execution, so the database server would have to execute 3 requests for each call to the tracker.
XBTT does not execute them immediately, but accumulates a large bundle of INSERT ... VALUES (...), (...). ..., (...) ON DUPLICATE KEY UPDATE f1 = VALUES (f1), ..., fN = VALUES (fN), and executes once every few seconds. Due to what, the number of queries to the database decreases from several per query to the tracker to several per minute. He also periodically re-reads the necessary data that could change from the outside (the web interface is independent of the tracker).
Postponement criticality
In this application, the loss of some data is not at all critical. If the torrent traffic statistics are not recorded in the database in a few seconds, nothing terrible will happen. Although during an abnormal termination it writes the accumulated buffers to the database, the server may have UPS in case of a power outage, etc. - there is no guarantee that all data transferred by the client is written to disk. For a banner network, this is also not scary, but there are tasks where saving all the data is critical.
Similarly, not all applications have the ability to store all data in memory. To process a client request, it may be necessary to select data from the database.
But in this case, block data processing is possible. A pipeline is organized (pipeline; actors are perfectly suited for its implementation) from several stages, at each stage a group of data is assembled for the query as soon as a sufficient amount (naturally, custom) has accumulated or some time has passed (for example, 10-100 milliseconds), during which the required quantity was not accumulated - a group request is made to the database, where instead of “key = value” the condition “key IN (accumulated list)” is set.
If it is necessary to lock these records, then FOR UPDATE SKIP LOCKED can be added to the request (naturally, the results will need to be written in the same connection to the database, the same transaction). You can use the Prepared Statement in those databases that support it, to perform a query analysis once (parse), choose the optimal plan for its execution, and then each time only substitute data in it. To reduce the number of such prepared requests, the number of parameters to it can be taken only in powers of two: 1, 2, 4, 8, 16, 32 ... You can also group (batch) requests, at first not executing each, but only adding to the packet, and then complete all at once.