The performance of a trading platform, explained with a simple example
In this article I want to give a popular-science explanation of response-time optimization in the trading platforms of exchanges and banks (HFT). For reference, we are talking about times from hundreds of nanoseconds to hundreds of microseconds. For most other applications, many of the optimization methods below are irrelevant simply because the requirements are nowhere near as stringent.
We usually measure performance in units of throughput, for example in gigaflops. The optimization problem then reduces to performing the maximum number of calculations per unit of time, or to solving a problem in the shortest time. Processors are designed primarily to maximize the number of calculations per unit of time, and the standard optimization techniques target the same goal.
However, there are applications where response time matters more: trading platforms in computerized trading (HFT), search engines, robotics, and telecom. Response time is the execution time of a single operation of a given type, for example from receiving a packet with current quotes from the exchange to sending an order back to the exchange. Response time and throughput (the number of such operations per unit of time) are closely related, but the difference is fundamental. You can often increase throughput simply by adding hardware (more servers), but it is problematic to improve response time the same way (except under peak load).
Several different methods are used to optimize response time. Some improve both response time and throughput, while others improve one at the expense of the other. For example, to improve throughput it is typical to buffer input and process a batch of packets at a time. Obviously, for the response time to a single packet, such an approach is harmful.
In trading platforms, the stability of the response time is also very important. Most profits and losses occur during sharp market movements, accompanied by abnormally high activity. The platform must withstand such loads: any stall can lead to tangible losses.
In general, such low-level response time optimization is a complex topic, requiring a good understanding of the network stack, the operating system kernel, processor and platform performance, and efficient thread synchronization. My task is to explain all these complex things with a simple and clear example.
Let's use the following analogy. Imagine a group of people working in an office. Communication takes place through the exchange of messages on paper (letters). Each letter contains the addressee, the sender and the task. Letters are placed on certain tables in the office. There are workers whose task is to receive letters from the outside world and put them on the tables. Others pick up letters from the tables and pass them on to decision makers. Each decision maker works only with a certain type of letters (or tasks).
The decision maker reads the letters intended for him and decides whether each task will be carried out, delayed, or ignored. Tasks marked for execution are added to a separate table. Special workers pick up letters from this table and distribute them to the executors. Some letters must be answered outside the office, for example by sending a confirmation to the external sender.
To be closer to reality, let's make things a little more complicated. The office is a complex network of rooms and corridors, and each type of worker can only go to certain places where they have access. As mathematicians say, without loss of generality, let us assume that under normal conditions our office processes 200 messages per day with an average message-processing time of 5 minutes.
So, our task is to minimize the message-processing time. It is desirable that the maximum processing time does not exceed the average by more than, say, a factor of two. That is, bursts of activity must be handled efficiently.
So where do we start? The simplest thing is to hire more workers to process more messages. It would also not hurt to look for fast workers, so that the processing time goes down. Suppose we hired Usain Bolt and other Olympic finalists. Perhaps the processing time decreased to 2 minutes. But it is obvious that in this direction there is nowhere further to go: no one runs faster, the limit is reached. Comparing these approaches to a computer, hiring more people is buying additional hardware (servers, processors, cores) to increase the number of execution threads. Hiring athletes is like buying the fastest possible hardware (maximum clock frequency first of all).
Perhaps the layout of our office is not optimal. Enough space must be provided for workers to work efficiently. Maybe the corridors should be widened, so that people no longer have to give way to each other, losing precious time? Let's widen them. Let's also slightly enlarge the rooms so that people do not crowd each other when approaching the tables. This is like buying servers with more cores, more memory, and more I/O bandwidth.
In addition, we can switch to an express service instead of regular mail to exchange messages with the outside world. In computer terms, this is similar to selecting and optimizing network equipment and the operating system's network stack. All this comes at additional cost, but we will assume it will definitely pay off.
So, after these innovations, our message-processing time dropped to, say, one minute. We can still train the workers to improve communication and execution. Perhaps that will yield 15 percent with the right motivation, bringing us to 51 seconds. This is analogous to software optimization.
The next step is to avoid collisions between our fast-running workers. A probable bottleneck is the approach to the tables. Workers should have instant and simultaneous access to the tables they need. You can sort messages into separate folders when laying them out to speed up access. Messages may also have different priorities. In a program, this is the analogue of thread synchronization. Threads should have unrestricted, parallel, and fastest possible access to data. Fixing thread-synchronization problems often gives a huge increase in system throughput and helps improve response time. And when it comes to handling bursts of activity, the influence of an optimal synchronization algorithm is hard to overestimate.
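To make the "parallel access to the tables" idea concrete, here is a minimal sketch of a classic single-producer/single-consumer ring buffer, a structure often used instead of a mutex-protected queue: the producer only advances `tail`, the consumer only advances `head`, so neither blocks the other. This is an illustrative Python toy (the class name and sizes are my own); a real trading system would implement this in a compiled language with atomics and cache-line padding.

```python
# Illustrative single-producer/single-consumer (SPSC) ring buffer.
# No lock is needed: each index is written by exactly one thread.
import threading

class SpscQueue:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0   # next slot to read  (advanced only by the consumer)
        self.tail = 0   # next slot to write (advanced only by the producer)

    def push(self, item):
        nxt = (self.tail + 1) % self.capacity
        if nxt == self.head:
            return False              # queue full
        self.buf[self.tail] = item
        self.tail = nxt               # publish the item
        return True

    def pop(self):
        if self.head == self.tail:
            return None               # queue empty
        item = self.buf[self.head]
        self.head = (self.head + 1) % self.capacity
        return item

q = SpscQueue(1024)
total = 0

def producer():
    for i in range(1, 1001):
        while not q.push(i):
            pass                      # spin until there is room

t = threading.Thread(target=producer)
t.start()
received = 0
while received < 1000:
    v = q.pop()
    if v is not None:
        total += v
        received += 1
t.join()
print(total)   # 500500 == 1 + 2 + ... + 1000
```

Note the design choice: because ownership of each index is exclusive to one thread, the hot path contains no lock acquisition at all, which is exactly the kind of "instant, simultaneous access to the tables" the analogy describes.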
In addition, workers may sometimes find themselves in front of a closed door, and other minor problems of this kind cause inconvenience and delay. It is desirable to satisfy the following conditions: the number of people in a given room never exceeds its capacity, the speed of the workers is not limited by anything, nothing unrelated to the main job is done, and no outsider interferes with the work process. In computer terms, this means that the number of threads never exceeds the number of available cores, the platform is tuned for maximum frequency and performance, power-saving modes are disabled, Turbo mode is enabled, and the operating-system kernel and other applications are isolated from, and (almost) do not affect, the trading platform.
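One small, concrete piece of this tuning is pinning work to a dedicated core so the scheduler never moves it. A hedged, Linux-only sketch using Python's `os.sched_setaffinity` (real deployments would also isolate the core from the kernel scheduler, e.g. with `isolcpus`, and pin individual threads, not just the process):

```python
# Linux-only sketch: pin the current process to a single core.
# In the office analogy: permanently assign a worker to one room.
import os

available = os.sched_getaffinity(0)   # CPUs this process may run on
chosen = min(available)               # pick one (ideally an isolated core)
os.sched_setaffinity(0, {chosen})     # pin: the scheduler now uses only it
print(os.sched_getaffinity(0))        # the set now contains only `chosen`
```

Pinning keeps the thread's working set warm in that core's caches and removes migration jitter, which matters for the stability of response time discussed above.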
Now it is time to look even more closely at conditions in the office. Do the doors open easily? Is the floor slippery? This is roughly the analysis of interaction with the operating system. If there is nothing left to improve, you can try to avoid using some parts of the office altogether. For example, instead of sending letters through the corridors, why not try throwing them from window to window? Uncomfortable, you say? Maybe, but fast. This is analogous to the kernel-bypass approach in the network stack.
Instead of using the operating system's network stack, kernel bypass runs the network stack in user space. This eliminates unnecessary copies of data between kernel and user space, as well as the scheduling delay in waking up the thread that receives the message. With kernel bypass, the receiving thread usually waits actively: it does not sleep on an operating-system lock but continually polls a flag variable until it is given permission to proceed.
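The active-waiting pattern can be sketched in a few lines. This is a toy illustration of busy-polling only (the variable names are mine): the receiver spins on a flag instead of sleeping on an OS primitive, trading a burned CPU core for lower wake-up latency. CPython's GIL makes this safe here; a native implementation would use atomics with explicit memory ordering.

```python
# Busy-waiting (active polling): the receiver never yields to the OS.
import threading

message = None
ready = False          # the "flag variable" the receiver polls

def receiver(out):
    while not ready:
        pass           # spin instead of blocking on a lock
    out.append(message)

out = []
t = threading.Thread(target=receiver, args=(out,))
t.start()
message = "quote update"
ready = True           # publish: receiver picks it up on its next poll
t.join()
print(out[0])          # quote update
```

The trade-off is deliberate: a blocked thread must be woken by the scheduler (microseconds), while a spinning thread reacts within nanoseconds, at the cost of occupying a core full-time.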
Now that we have started throwing messages through the windows, let's do it effectively. The most reliable option is to pass a letter through the window from hand to hand. This is the principle used in the TCP protocol, but it is not the fastest one. UDP allows you to simply throw the message without confirmation. That is faster: no one has to wait. Do you think this is the limit? No: you can also learn to throw through the window so that the letter lands right on the desired table, in the desired folder. This approach is called remote direct memory access (RDMA). I think we have lowered the processing time to 35 seconds.
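The "throw without confirmation" contrast is easy to see in code. A minimal sketch of UDP messaging over the loopback interface (addresses and the message text are illustrative): unlike TCP, there is no connection handshake and no protocol-level acknowledgement, the datagram is simply fired at the receiver.

```python
# UDP "fire and forget" over loopback: no handshake, no acks.
import socket

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))            # let the OS pick a free port
port = rx.getsockname()[1]

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"order: BUY 100", ("127.0.0.1", port))   # throw the letter

data, addr = rx.recvfrom(1024)       # the receiver just catches it
print(data)                          # b'order: BUY 100'
tx.close()
rx.close()
```

The price of this speed is that delivery is not guaranteed; trading protocols built on UDP typically add sequence numbers and a recovery channel on top.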
Or maybe we should build an office from scratch instead of adapting the existing one to our needs? One that provides ideal working conditions. Perhaps this will improve the response time to 20 seconds, or even less. Designing your own office corresponds to using a field-programmable gate array (FPGA). An FPGA is something like a processor whose hardware is programmed to solve a specific problem. A conventional processor is hard-wired to execute a specific instruction set on certain data types, and its execution pipeline (not to be confused with an application thread) is also fixed. Unlike a processor, an FPGA is not pre-committed to an instruction set, data types, or execution flow. It is programmed for a specific task and, in that state, can execute only that task (until it is reprogrammed). Effective FPGA programming is not easy, and making changes to the program can also require considerable effort. And although an FPGA does not let you hire Usain Bolt (its clock frequencies are much lower than a processor's), its unlimited parallelism in executing instructions makes it possible to achieve lower message-processing times than on a processor.
In conclusion, I will recommend performance-analysis tools for software: Intel VTune™ Amplifier and Intel Processor Trace technology will help you see in detail where and why CPU time is spent.
If you are interested in the topic, you can read my articles on the Intel Developer Zone (in English), which also contains practical technical tips on optimizing response time.