
How to lose $172,222 per second for 45 minutes
This is perhaps the most painful bug report I have ever read. It vividly describes the steps that led to Knight Capital losing $465 million to a software bug last year, a loss that bankrupted the company.
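As a quick sanity check, the per-second figure in the title is consistent with the total loss quoted above. The short Python calculation below uses only those two numbers (45 minutes and $172,222 per second); the variable names are illustrative.

```python
# Figures taken from the title of this post.
loss_per_second = 172_222        # dollars lost per second
duration_seconds = 45 * 60       # 45 minutes expressed in seconds

total_loss = loss_per_second * duration_seconds
print(f"${total_loss:,}")        # $464,999,400, i.e. roughly $465 million
```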
The report has all the hallmarks of technical debt in a huge, unmaintained code base that was still running in production (the error was caused by executing code that had not been used for almost 9 years), along with a sad history of poor interaction between software developers and IT operations.
Highlights:
To let its customers participate in the Liquidity Program (PL) on the New York Stock Exchange, which was scheduled to launch on August 1, 2012, Knight made a number of changes to its systems and program code related to order processing. These changes included developing and deploying new code in SMARS. SMARS is an automated, high-speed, algorithmic router that sends orders to the market. One of the main functions of SMARS is to receive orders from other components of the Knight trading platform (“parent” orders) and, as necessary, based on available liquidity, send one or more representative (“child”) orders to external venues for execution.
13. Once deployed, the new PL code in SMARS was supposed to replace unused code in the corresponding part of the router. This unused code had previously served the Power Peg function, which the company had not used for many years. Despite this, the code was still present and could still be called at the time of the PL deployment. The new PL code reused a flag that had previously been tied to Power Peg. Knight intended to remove the Power Peg code so that setting this flag would trigger the new PL functionality rather than Power Peg.
14. Earlier, when Power Peg was in use, a cumulative quantity function counted the number of shares executed in child orders and signaled that no more child orders should be placed once the parent order was filled. In 2003, Knight stopped using Power Peg. In 2005, Knight modified the Power Peg code, moving the function that tracked parent orders to an earlier stage of the SMARS code sequence. Knight did not retest the Power Peg code after this change and did not verify that it still worked correctly.
15. Starting July 27, 2012, Knight deployed the new PL code in SMARS, placing it on a limited number of servers. During the deployment, one of the technicians failed to copy the new code to one of the eight SMARS servers. Knight did not have a second technician review the deployment, and no one realized that the Power Peg code had not been removed from the eighth server and that the new PL code had not been added to it. Knight had no written procedures that required such a review.
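The review that paragraph 15 says was missing boils down to comparing what each server actually runs with what was meant to be deployed. Below is a minimal Python sketch of that comparison. The host names (smars-1 to smars-8) and version labels are hypothetical and do not come from the report, and a real check would collect the versions from the servers rather than from a hard-coded dictionary.

```python
def find_stale_servers(deployed: dict[str, str], expected: str) -> list[str]:
    """Return the hosts whose deployed version differs from the expected one."""
    return sorted(host for host, version in deployed.items() if version != expected)


# Hypothetical inventory: seven servers received the new code, one did not.
deployed = {f"smars-{i}": "pl-new" for i in range(1, 8)}
deployed["smars-8"] = "power-peg-old"

stale = find_stale_servers(deployed, expected="pl-new")
if stale:
    print("Deployment incomplete on:", ", ".join(stale))
```

Comparing against an explicit expected version, rather than against whatever most servers run, means the check still fails loudly when a rollout goes wrong on the majority of the cluster.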
16. On August 1, Knight received orders from broker-dealers whose clients were eligible to participate in the PL. Seven servers processed these orders correctly. But orders sent to the eighth server with the reused flag set triggered the defective Power Peg code, which was still present on that server. As a result, the server treated the orders as parent orders and started sending child orders to trading centers. Because the function that checked whether the parent order was complete had been moved to another stage of the process, the server kept placing child orders non-stop, ignoring the fact that the parent order had already been filled. Although another part of the order processing system recorded that the parent order was complete, this information never reached SMARS.
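The failure mode in paragraph 16 can be illustrated with a toy loop. This is not Knight's code: it is a minimal Python sketch that assumes every child order executes in full, and the function name, quantities, and the `max_children` cap (added only so the simulation terminates) are all invented for illustration.

```python
def route_parent_order(parent_qty: int, child_qty: int,
                       check_filled: bool, max_children: int = 1000) -> int:
    """Send child orders for one parent order and return how many were sent.

    check_filled=False mimics a router that never learns the parent is complete.
    max_children is a safety cap for this simulation only.
    """
    filled = 0
    children_sent = 0
    while children_sent < max_children:
        # The completion check that the old code path no longer reached.
        if check_filled and filled >= parent_qty:
            break
        filled += child_qty      # assumption: each child order fills completely
        children_sent += 1
    return children_sent


print(route_parent_order(1000, 100, check_filled=True))    # 10
print(route_parent_order(1000, 100, check_filled=False))   # 1000 (stopped only by the cap)
```

With the check in place, the loop stops as soon as the executed quantity covers the parent order. Without it, only the artificial cap stops the loop, which is the simulated equivalent of a server placing child orders non-stop.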
19. On August 1, Knight also received orders that were eligible for the PL but were intended for trading before the market opened. Six SMARS servers processed these orders and, starting around 8:01 AM, the internal systems generated automatic messages (called “BNET Failure”) that referred to SMARS and described the error as “Power Peg Disabled”. The Knight system sent 97 such messages before the market opened at 9:30 AM. Messages of this type were not treated by the system as alerts, and the staff did not read them.
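Paragraph 19 describes 97 identical error messages that nobody acted on. One simple way to make such a stream harder to ignore is to count repeated errors and escalate any that cross a threshold. The Python sketch below is only an illustration of that idea: the threshold of 3 and the "heartbeat ok" message are assumptions, while the error text and the count of 97 are taken from the paragraph above.

```python
from collections import Counter


def errors_to_escalate(messages: list[str], threshold: int = 3) -> dict[str, int]:
    """Return each message text seen at least `threshold` times, with its count."""
    counts = Counter(messages)
    return {text: n for text, n in counts.items() if n >= threshold}


# 97 identical errors, as in the report, plus one unrelated hypothetical message.
messages = ["Power Peg Disabled"] * 97 + ["heartbeat ok"]

for text, n in errors_to_escalate(messages).items():
    print(f"ESCALATE: '{text}' seen {n} times")
```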
And then it gets even more fun:
27. On August 1, Knight did not have any incident response procedures. In other words, the company had no procedures to guide its staff when serious problems occurred. On August 1, Knight relied on its team of technicians to identify and fix the problems in SMARS in a live trading environment. The Knight system continued to send millions of “child” orders while staff tried to identify the source of the problem. The company even removed the new PL code from the seven servers on which it had been installed correctly. This made the situation worse, because new “parent” orders then activated the Power Peg code that was still present on these servers, just as had already happened on the eighth server.
Of course, the entire document is worth reading. It pays great attention to new manual verification procedures intended to prevent such a disaster. The developers' mistakes were, of course, down to human error, but the scale of the consequences was the result of a poor deployment script and appalling monitoring. What kind of shop does not even check the software version across its cluster? Not to mention having a deployment script that checks return codes.
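For readers who have not written one, a deployment script that checks return codes can be very small. The Python sketch below runs one step per host and reports every host where the step exited with a non-zero code. The host names and the `fake_copy` stand-in are hypothetical: a real script would run an actual copy or install command, whereas here a tiny Python subprocess simulates a failure on one host so the example can be run anywhere.

```python
import subprocess
import sys


def run_step(host: str, command: list[str]) -> bool:
    """Run one deployment step; return True only if it exited with code 0."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"{host}: step failed with exit code {result.returncode}", file=sys.stderr)
        return False
    return True


def deploy(hosts, command_for) -> list[str]:
    """Run the step on every host and return the hosts where it failed."""
    return [host for host in hosts if not run_step(host, command_for(host))]


def fake_copy(host: str) -> list[str]:
    """Stand-in for a real copy command; fails on smars-8 for demonstration."""
    exit_code = 1 if host == "smars-8" else 0
    return [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]


hosts = [f"smars-{i}" for i in range(1, 9)]
failed = deploy(hosts, fake_copy)
print("Failed hosts:", failed)
```

The point is that the script collects failures instead of ignoring them, so a rollout that misses one server out of eight cannot be reported as a success.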
One can only hope that the “written verification procedures” for unused code meant systematic tests, although Wikipedia says this is not the case.
And for dessert: the fine added another $12 million, and an audit showed that the system had been constantly trying to make speculative short sales.