High Frequency Trading (HFT) using FPGA

Original author: Christian Leber, Benjamin Geib, Heiner Litz
  • Translation
This article describes the development of a highly specialized hardware device for HFT purposes. Its design is aimed at achieving the lowest possible latency for processing market data and, therefore, at reducing the round-trip time of transactions. The implementation described in this work parses Ethernet, IP and UDP packets, as well as the FAST protocol, which is the most common protocol for transmitting market data. For this purpose, a custom microcode engine with its own instruction set and compiler was developed, which makes it possible to support the wide range of protocols used in trading. The final system was implemented in RTL code and runs on an FPGA. This approach shows a 4-fold latency advantage over purely software solutions.

Introduction


High-frequency trading (HFT) has received a lot of attention recently and has become a major force in the financial markets. The term HFT refers to a set of techniques for trading stocks and derivatives in which a large stream of orders is sent to the market with round-trip times below a millisecond [1]. HFT traders aim to end the day with no open positions in securities, profiting from their strategies by buying and selling shares at very high speed. Studies show that HFT traders hold a stock for only about 22 seconds on average [2]. The Aite Group claims that HFT has a significant impact on the market: more than 50% of US stock transactions in 2010 were attributable to HFT, with growth of 70% in 2009 alone [3].

HFT traders employ several kinds of strategies, such as liquidity-providing strategies, statistical arbitrage strategies, and liquidity-detection strategies [2]. In a liquidity-providing strategy, HFT traders try to capitalize on the bid-ask spread, which reflects the difference between the price at which sellers are willing to sell and buyers are willing to buy. High volatility and a wide bid-ask spread can be turned into profit for the HFT trader, who at the same time acts as a liquidity provider and narrows the bid-ask spread, effectively playing the role of a market maker. Liquidity and a small bid-ask spread matter because they reduce trading costs and allow the value of an asset to be determined more accurately [4]. Traders using arbitrage strategies exploit the correlation between the prices of derivatives and their underlying assets [5]. Liquidity-detection strategies probe the market for large orders by sending small orders that help uncover large hidden ones. All of these strategies share one requirement: uncompromisingly low latency, since only the fastest HFT firms are able to exploit the opportunities that arise in the market.

Electronic trading of shares is carried out by sending orders in electronic form to the exchange. Buy and sell orders are then matched at the exchange and a transaction is executed. Orders are made visible to all participants through so-called feeds. A feed is a compressed or uncompressed stream of data delivered in real time by some independent organization, such as the Options Price Reporting Authority (OPRA). A feed carries financial information about stocks and is multicast to market participants over standardized protocols, mainly UDP over Ethernet. The de facto standard protocol for delivering market data is FIX (Financial Information eXchange) Adapted for Streaming (FAST), which is used by most exchanges [18].

To achieve minimal latency, an HFT engine must be optimized at every level. To reduce network transmission delays, colocation is used: the HFT server is installed next to the exchange gateway. The data feed must be distributed to the HFT servers with minimal delay, efficient UDP and FAST packet processing is required, and, finally, the trading decision and the transmission of the resulting order must be made as quickly as possible. To achieve these goals, a new HFT engine was developed and implemented on an FPGA board. Using an FPGA makes it possible to offload UDP and FAST processing from the CPU into specially optimized blocks on the board. In the presented system the entire processing cycle, except for the trading decisions themselves, is implemented in hardware, including an extremely flexible microcode engine for processing FAST messages. The approach reduces latency by more than 70% compared with software solutions and, at the same time, makes changing or adding support for new protocols much easier than with application-specific integrated circuits.

Background


This section covers the basic concepts and structure of exchange infrastructure. In addition, the main features of the FAST protocol are described in order to establish the requirements for a hardware implementation.

Infrastructure for trading

A typical trading infrastructure consists of several components controlled by different parties: the exchange, the feed provider, and the market participants, as shown in Figure 1. The matching engine, the data-center router, and the gateway are operated by the exchange. The feed engine is run by another organization. The access routers of the market participants are owned by the participants themselves, in this case HFT firms.



The procedure is as follows:
  1. A trading opportunity arises in the matching engine
  2. The matching engine sends a notification to the feed provider
  3. The feed engine multicasts this information to all participants
  4. A participant evaluates the situation and responds with an order
  5. The gateway receives the order and passes it to the matching engine
  6. The matching engine executes the first order received, and the transaction is complete

Protocol stack

Electronic trading relies on several layers of the network stack. These layers are illustrated in Figure 2.



  1. Ethernet (ETH) layer
    The lowest layer of the network stack. It provides the basic functionality for sending packets over a network, defines framing, and includes a Cyclic Redundancy Check (CRC) to protect data integrity. In addition, ETH identifies both communicating parties by their Media Access Control (MAC) addresses.

  2. IP layer
    IP sits one level above ETH. It is a widely used protocol that groups computers into logical subnets and assigns each end node a unique address within such a subnet. It can also be used to send messages to many participants at once (multicast), which is why it is used on the vast majority of trading venues.

  3. UDP layer
    The User Datagram Protocol (UDP) is used by software applications to send messages between participants. It allows multiple receivers and senders on a single node and protects data integrity with a checksum.

  4. FAST
    The FAST protocol was developed to transfer market data from the exchange to users via feeds. The protocol describes various fields and operators for identifying securities and their prices. An important aspect of this protocol is the ability to compress data to reduce the amount of data transmitted, however, this introduces an additional load on the processor. Processing feeds is a pretty bottleneck, and so putting this functionality on the FPGA is a good idea. A more detailed description of the protocol will be provided in the following parts of the document.

  5. Making trading decisions
  5. Making trading decisions
    Making trading decisions can be a very difficult and computationally expensive task, depending on the complexity of the algorithms. In essence, the decision-making process compares given parameters against incoming data using mathematical or statistical methods. A detailed analysis of such mechanisms is beyond the scope of this article. Because of its high complexity, trading decision making is not placed on the FPGA but is implemented in software.


FAST Protocol Basics

The FAST protocol is used by the feed provider to transmit financial information to market participants. To reduce overhead, several FAST messages can be combined into one UDP packet. These messages use no framing and contain no information about the message size. Instead, each message type is described by a template that must be known in order to process the data stream correctly. Most feed providers define their own variant of the protocol by supplying their own independent template specifications. Processing must be done with great care, since an error in one message leads to the loss of the entire UDP packet. Templates describe a set of fields, sequences, and groups. A group is a set of fields that can occur at most once per message, while a sequence is a set of fields that may be repeated multiple times.



Each message begins with a presence map (PMAP) and a template identifier (TID), as shown in Figure 3. The PMAP is a bit mask that indicates which of the declared fields, groups, and sequences are present in the current stream. Fields can be either mandatory or optional, and an operator may be associated with them. Whether a field occupies a bit in the PMAP depends both on its presence attribute and on whether an operator is bound to it. This adds complexity, because it is not known in advance whether PMAP processing is required.

The TID identifies the template required to process the message. Templates are described in XML; an example is sketched below.
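
The field names in the following sketch are hypothetical; it merely illustrates what a template with TID 1, two fields, and one group (matching the description below) might look like:

     <template name="ExampleTemplate" id="1">
         <uInt32 name="MsgSeqNum"/>
         <decimal name="Price" presence="optional"/>
         <group name="Instrument">
             <string name="Symbol"/>
         </group>
     </template>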


In this example the TID is 1, and the template consists of two fields and one group. A field can be a string; a 32- or 64-bit integer; or a decimal, in which 32 bits are allocated to the exponent and 64 bits to the mantissa.

To reduce the load on the network, only bytes that carry significant bits are transmitted. For example, only one byte is transmitted for a 64-bit integer with the value '1', even though the number occupies 64 bits in memory.

Only the first seven bits of each transmitted byte carry payload data; the eighth (most significant) bit is a stop bit that separates the fields. The stop bit must be stripped, and the remaining 7-bit groups concatenated if a field spans more than one byte.

Suppose the following binary stream was received:

     1 1100011  0 0000111  1 0101010

The stop bits (shown separated at the start of each byte) indicate that these three bytes contain two fields. The first byte has its stop bit set, so the first field consists of that byte alone; replacing the stop bit with 0 gives:

     Binary value: 0 1100011
     Hexadecimal value: 0x63

The second field consists of the following 2 bytes. To obtain its value, replace the stop bit in each byte with 0 and concatenate the remaining 7-bit groups. The result is:

     Binary value: 00 000011 10101010
     Hexadecimal value: 0x03 0xAA

In reality things are more complicated still; for example, fields must be processed differently depending on their presence attribute: the value of an optional integer must be decremented by one, while a mandatory one must not.
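
As an illustration of the stop-bit scheme, here is a minimal C sketch of a software decoder for a single integer field (the paper's decoder is implemented in hardware; the function name and signature here are invented):

     #include <stddef.h>
     #include <stdint.h>

     /* Decode one stop-bit-encoded FAST unsigned integer field.
      * Each byte carries 7 data bits; the most significant bit is the
      * stop bit that terminates the field. For optional fields the wire
      * value is decremented by one, as described above. Returns the
      * number of bytes consumed, or 0 if the input is truncated. */
     size_t fast_decode_uint(const uint8_t *buf, size_t len,
                             int optional, uint64_t *out)
     {
         uint64_t value = 0;
         for (size_t i = 0; i < len; i++) {
             value = (value << 7) | (buf[i] & 0x7F); /* append 7 data bits */
             if (buf[i] & 0x80) {                    /* stop bit set: field ends */
                 *out = optional ? value - 1 : value;
                 return i + 1;
             }
         }
         return 0; /* no stop bit found within len bytes */
     }

Applied to the example stream above (the bytes 0xE3 0x07 0xAA, with optional = 0), the first call yields 0x63 and consumes one byte; the second yields 0x3AA and consumes the remaining two.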

Operators specify an action to perform on a field after it has been received, or when the field is absent from the stream. The available operators are constant, copy, default, delta, and increment. For example, constant means that the field always contains the same value and is therefore never transmitted in the feed.
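
A rough C sketch of how a decoder might apply two of these operators is shown below; the state layout and function names are assumptions made for illustration, not the authors' implementation:

     #include <stdint.h>

     /* Per-field decoder state: the previously decoded value (used by
      * copy, delta and increment) and the template-defined constant. */
     typedef struct {
         uint64_t prev;
         uint64_t constant;
     } field_state;

     /* 'copy': if the field is present in the stream, take the new value
      * and remember it; if it is absent, reuse the previous value. */
     uint64_t apply_copy(field_state *s, int present, uint64_t wire)
     {
         if (present)
             s->prev = wire;
         return s->prev;
     }

     /* 'constant': the value is fixed by the template and is never
      * transmitted in the feed at all. */
     uint64_t apply_constant(const field_state *s)
     {
         return s->constant;
     }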

The FAST specification also allows the declaration of a byte vector that uses no stop bits, so that all eight bits of each byte carry payload data; it is preceded by a field giving the total number of bytes. This part of the specification was not implemented, owing to its high complexity and the fact that it is not used on the target exchange, in this case Frankfurt.

Similar works


A lot of literature on the topic has appeared recently, but most of the work is nevertheless carried out inside organizations and is not published. A very good overview of HFT optimization is given in [6], which also presents a system for optimized processing of the OPRA FAST feed using a Myrinet MX device. A multi-threaded "faster FAST" engine for multi-core systems is presented in [7]. Although both systems focus on optimizing FAST processing, the approach in this article is the first to use an FPGA. Morris [8] describes an HFT system that uses an FPGA to optimize UDP/IP processing, similar to [9]. Other related work applies FPGAs to algorithmic trading techniques based on the Monte Carlo method [10], [11], [12]. Sadoghi et al. [13] use FPGA-based technology for efficient event processing in algorithmic trading. Mittal et al. present a software FAST decoder running on the PowerPC 405 embedded in several FPGAs [17]. Finally, Tandon published "A Programmable Architecture for Real-time Derivative Trading" [14].

Implementation


Three different approaches are proposed to reduce latency in HFT applications: first, UDP optimization; second, hardware processing of FAST messages; and third, the use of parallel hardware structures to process several streams simultaneously.

UDP optimization

In conventional systems, UDP data arrives at the network interface card (NIC) as raw ETH packets. The NIC passes the packets to the OS kernel, which verifies the checksum and processes each packet. The flow of execution for this process is shown in Figure 4. This approach exhibits high latency, because interrupts are used to inform the OS about incoming packets and because an additional layer, sockets, is needed to move the data into user space. Moreover, the OS kernel may be busy with other activities, which further hurts processing times. A detailed breakdown of the various delays that arise during TCP/IP processing can be found in [15]; the situation with UDP is very similar.



The first and most effective approach to reducing latency is to bypass the OS kernel and process incoming packets in hardware, as shown in Figure 5. Receiving and sending packets correctly then requires Address Resolution Protocol (ARP) support in the hardware; ARP maps IP addresses to the physical MAC addresses of the participants.

To receive multicast traffic, Internet Group Management Protocol (IGMP) support is also needed, for joining and leaving multicast groups. The other protocols that must be implemented are ETH, IP, and UDP. All of these protocols are implemented efficiently, with the focus on minimal latency placed on ETH, IP, and UDP. The IGMP and ARP implementations are not speed-critical, since they do not take part in the actual data transfer during trading. All protocol header checksums and the ETH CRC are computed in parallel, while their verification is performed at the very last stage.
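
For reference, the IP and UDP header checksums that the hardware computes in parallel use the standard one's-complement Internet checksum algorithm (RFC 1071); a plain C version of the algorithm (not the parallel hardware implementation) is sketched below:

     #include <stddef.h>
     #include <stdint.h>

     /* One's-complement Internet checksum over a buffer (RFC 1071),
      * as used by the IP header checksum and, together with a
      * pseudo-header, by the UDP checksum. */
     uint16_t inet_checksum(const void *data, size_t len)
     {
         const uint8_t *p = (const uint8_t *)data;
         uint32_t sum = 0;

         while (len > 1) {                 /* sum 16-bit big-endian words */
             sum += ((uint32_t)p[0] << 8) | p[1];
             p += 2;
             len -= 2;
         }
         if (len == 1)                     /* pad a trailing odd byte */
             sum += (uint32_t)p[0] << 8;

         while (sum >> 16)                 /* fold carries back into 16 bits */
             sum = (sum & 0xFFFF) + (sum >> 16);

         return (uint16_t)~sum;            /* one's complement of the sum */
     }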



The verified packet is then handed either to the FAST handler or to a DMA engine capable of writing raw data into the address space of the trading software. The configuration in which FAST processing is performed on the FPGA and the results are passed to the user application through the DMA engine is shown in Figure 6.



Hardware Processing FAST

Because of the high complexity of FAST, the decoder is divided into three independent parts. As shown in Figure 7, the three parts are decoupled by FIFO buffers to compensate for the differing and sometimes unpredictable delays of each module. Since decompression increases the amount of data to process, the FAST processor can run at a higher clock frequency than the ETH core, increasing throughput and compensating for the larger data volume.



  1. FAST decompressor
    The FAST decompressor removes the stop bits and aligns each field to a width of 64 bits. This gives all fields the same size, which simplifies processing in the subsequent stages.

  2. FAST microcode engine
    Since FAST messages vary widely depending on the template used, it was decided to use a microcode engine flexible enough to handle any variant of the FAST protocol. Partial reconfiguration, as an alternative to a microcode engine, was rejected because of its high reconfiguration latency of several milliseconds.

    At device initialization the engine loads a program stored on the FPGA, with a separate routine for each template. A jump table holds pointers to the required routine, selected by the template ID that appears at the beginning of each FAST message. All fields of a FAST message are processed by the corresponding routine, which either discards the contents of a field or passes them on to the next module. Programs for the microcode engine are written in an assembly-like form using a simple domain-specific language (DSL) developed specifically for this purpose. A sketch of the assembler code for the template presented in the example above is given below.
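
    The mnemonics and identifiers in the following sketch are invented for illustration, since the original listing is not available here; only the four-column structure follows the description below (for TID 1: two fields, one group):

         uInt32    mand    ID_SEQNUM    POP      -
         decimal   opt     ID_PRICE     POP      -
         group     mand    -            PMAP?    5
         string    mand    ID_SYMBOL    POP      -
         -         -       -            HALT     -

    In this sketch, POP fetches the next field from the FIFO buffer, PMAP? checks the group's presence bit and jumps to the row given in the last column when the bit is clear, and HALT ends the routine.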



    As the listing shows, the proposed domain-specific language has four columns. The two left columns describe the data to be processed at the given step; the two right columns specify the command to be executed by the microcode engine. Specifically, the first column describes the field together with its presence attribute, and the second maps the field to a unique identifier that can be processed further by the software. The third column gives a control command, which can advance the current pointer, jump over commands, fetch data from the FIFO buffer, or check the PMAP. The last column holds the jump target.

    Using this assembly language together with the microcode engine makes it easy to adapt the FPGA when templates change, or even when moving to another exchange, without rethinking the architecture. This greatly simplifies working with the FPGA whenever the protocol changes.

  3. FAST DMA Engine
    A DMA engine was developed to deliver the processed data to the trading software. Each field of every template is marked with a unique identifier so that the trading software can recognize it unambiguously. The DMA engine transfers the received information into a ring buffer located in the address space of the software. Each entry consists of eight quadwords (64-bit words), which is exactly the size of a cache line on x86 processors. Seven of these quadwords carry field contents, and the eighth holds the seven field identifiers (the second column of the listing above) together with additional status information. The status information also includes an extra bit that the hardware inverts as it writes to the ring buffer. Using this bit, the software can efficiently determine whether an entry contains new data.

    This implementation achieves the lowest possible latency when the buffer is polled continuously. Polling shows much lower delay than interrupts, whose main cost is the forced switch of the execution context. The chosen approach does increase CPU load and power consumption, but the low latency outweighs these disadvantages.
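
    A simplified C sketch of such a polling loop is given below; the entry layout and the position of the toggle bit are assumptions for illustration, since the exact layout is not specified here:

         #include <stdint.h>

         #define RING_ENTRIES 1024   /* assumed ring size */

         /* One 64-byte entry = one x86 cache line: seven quadwords of
          * field contents plus a status quadword holding the seven field
          * identifiers and the toggle bit (assumed here to be bit 0). */
         struct ring_entry {
             volatile uint64_t field[7];
             volatile uint64_t status;
         };

         void poll_ring(struct ring_entry *ring)
         {
             unsigned idx = 0;
             uint64_t expect = 1;   /* toggle value expected on this pass */

             for (;;) {
                 /* Spin until the hardware flips this entry's toggle bit:
                  * burns CPU cycles but avoids interrupt latency. */
                 while ((ring[idx].status & 1) != expect)
                     ;
                 /* ... consume ring[idx].field[0..6] here ... */
                 if (++idx == RING_ENTRIES) {
                     idx = 0;
                     expect ^= 1;   /* the bit inverts on each pass */
                 }
             }
         }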


Parallelism

The FAST protocol is strictly sequential: the position of each field in the stream depends on the previous field. Even after the FAST decoder has aligned the data, the stream still cannot be processed in parallel; this is an inherent limitation of the protocol.

Fortunately, exchanges deliver data over several feeds at once. Throughput can therefore be increased, and latency reduced, by processing FAST feeds in parallel, with each FAST processor handling a separate FAST stream. This approach raises throughput considerably. Each FAST processor occupies 5620 FPGA lookup tables (LUTs).



Computer interface

To shave off further nanoseconds, the HyperTransport interface offered by AMD for Opteron processors is used. This technology achieves much lower latency than PCI Express, because the FPGA is connected directly to the CPU, bypassing intermediate buses [16]. The design uses the HTX slot found in AMD Opteron systems.

Final architecture

The device was implemented in synthesizable Verilog RTL and tested on an FPGA card based on the Xilinx Virtex-4 FX100.

Figure 8 schematically illustrates the architecture of the HFT device. On the far left is the HT core, operating at 200 MHz. Next comes the HTAX interconnect, linking the FAST processors, the UDP engine, and the RegisterFile to the HT core; this part operates at up to 160 MHz. On the right, shown in red, is the network packet processing infrastructure, connected to a Gigabit Ethernet MAC running at 125 MHz. The entire design, including 4 FAST processors, uses 77% of the slices.

Performance


For the measurements, recorded data from the feed provider was first sent to a standard server NIC in order to establish a baseline for subsequent comparisons. Figure 9 shows the message flow in such a system. The measured latency is 9 µs in the NIC, 1.8 µs in kernel space, and 2 µs in user space, giving a total latency of 12.8 µs.



The same feed was then sent to the FPGA device. In the first experiment, only the network-stack processing was taken out of kernel space and moved into hardware, as shown in Figure 10. The measured latency of the network stack on the FPGA board is 2 µs, and the user-space latency is 2.1 µs. Total latency thus falls to 4.1 µs, a reduction of about 68 percent compared with the standard implementation.



In the second experiment the FAST processor was enabled as well, so that messages are processed in hardware and the results are transferred directly into the address space of the software; this flow is shown in Figure 11. With both the network stack and FAST processing moved to the FPGA, the measured latency drops by a further 1.5 µs (about 37 percent), to only 2.6 µs. Overall, this amounts to more than a 4-fold reduction in latency compared with the standard solution.



Conclusion


An implementation of a device aimed at optimizing HFT was presented, which significantly reduces the time required to receive and process FAST-based trading information. In particular, latency was reduced by a factor of four, substantially shortening the round-trip time of transactions.

The special nature of the FAST protocol, which prevents its efficient processing on standard CPUs, makes it very attractive for implementation on FPGAs, where a significant latency reduction can be achieved. This article demonstrated a successful implementation of that approach.

References

[1] J. A. Brogaard, "High Frequency Trading and its Impact on Market Quality," 5th Annual Conference on Empirical Legal Studies, 2010.
[2] M. Chlistalla, "High-frequency trading: Better than its reputation?," Deutsche Bank research report, 2011.
[3] Aite Group, "New World Order: The High Frequency Trading Community and Its Impact on Market Structure," 2009.
[4] K. H. Chung and Y. Kim, "Volatility, Market Structure, and the Bid-Ask Spread," Asia-Pacific Journal of Financial Studies, vol. 38, Feb. 2009.
[5] J. Chiu, D. Lukman, K. Modarresi, and A. Velayutham, "High-frequency trading," Stanford University Research Report, 2011.
[6] H. Subramoni, F. Petrini, V. Agarwal, and D. Pasetto, "Streaming, low-latency communication in online trading systems," International Symposium on Parallel & Distributed Processing, Workshops (IPDPSW), 2010.
[7] V. Agarwal, D. A. Bader, L. Dan, L.-K. Liu, D. Pasetto, M. Perrone, and F. Petrini, "Faster FAST: multicore acceleration of streaming financial data," Computer Science - Research and Development, vol. 23, May 2009.
[8] G. W. Morris, D. B. Thomas, and W. Luk, "FPGA Accelerated Low-Latency Market Data Feed Processing," 17th IEEE Symposium on High Performance Interconnects, Aug. 2009.
[9] F. Herrmann and G. Perin, "An UDP/IP Network Stack in FPGA," Electronics, Circuits, and Systems (ICECS), 2009.
[10] G. L. Zhang, P. H. W. Leong, C. H. Ho, K. H. Tsoi, C. C. C. Cheung, D. Lee, R. C. C. Cheung, and W. Luk, "Reconfigurable acceleration for Monte Carlo based financial simulation," IEEE International Conference on Field-Programmable Technology, 2005.
[11] D. B. Thomas, "Acceleration of Financial Monte-Carlo Simulations using FPGAs," Workshop on High Performance Computational Finance (WHPCF), 2010.
[12] N. A. Woods and T. VanCourt, "FPGA acceleration of quasi-Monte Carlo in finance," International Conference on Field Programmable Logic and Applications, 2008, pp. 335-340.
[13] M. Sadoghi, M. Labrecque, H. Singh, W. Shum, and H.-A. Jacobsen, "Efficient Event Processing through Reconfigurable Hardware for Algorithmic Trading," Proceedings of the VLDB Endowment, 2010.
[14] S. Tandon, "A Programmable Architecture for Real-time Derivative Trading," Master's Thesis, University of Edinburgh, 2003.
[15] S. Larsen and P. Sarangam, "Architectural Breakdown of End-to-End Latency in a TCP/IP Network," International Journal of Parallel Programming, Springer, 2009.
[16] D. Slogsnat, A. Giese, M. Nüssle, and U. Brüning, "An open-source HyperTransport core," ACM Transactions on Reconfigurable Technology and Systems, vol. 1, Sep. 2008, pp. 1-21.
[17] G. Mittal, D. C. Zaretsky, and P. Banerjee, "Streaming implementation of a sequential decompression algorithm on an FPGA," International Symposium on Field Programmable Gate Arrays (FPGA '09), 2009.
[18] FIX Adapted for Streaming, www.fixprotocol.org/fast
