Nightmare of the “Knight”: an instructive story about DevOps
The year is 1066; more than 200 years have passed since the Viking invasions of England began. King Harold, having gathered a force of knights, marched to the River Derwent for a decisive battle with the army of his namesake, the Norwegian king Harald. The armorers had worked for a month to forge enough new-generation armor to protect a knight from the blow of a Scandinavian ax. And how many experiments and tournament trials had come before that! But the wait was supposed to pay off: light yet reliable armor would let the knights sweep away the Viking hird even on foot, without heavy losses. And finally, the armies met at Stamford Bridge. The main detachment of knights, led by a commander in shining armor, clashed with the enemy in the middle of the bridge. Yes, the steel of the under-mountain smiths takes the blow!
Slowly but surely, the Vikings are forced into an all-round defense. Victory seems near. And finally, on the battlefield, the knight commander and the Norwegian jarl find each other.
The jarl's two-handed ax is already broken, and he is forced to defend himself with an ordinary seax, no match for a knight's hand-and-a-half sword. A swing of the dagger: against the commander's armor it should be no more than a penknife, yet it passes through the armor and ... it seems that magic has come into play, because the armor of the neighboring knights falls apart! Another swing, again on target, and now the armor of the knights on the left flank suddenly glows red-hot. A third blow, and the knights' vision blurs; they stumble, fall, and never rise again.
"*** ** **** ****!" cried Petrovich, waking up in a cold sweat at 5 a.m. on Monday. Everything had gone completely off track: at 11 p.m. he had gone on Wikipedia looking for material for his child's report on the nature of the tundra, but by the middle of the night he had somehow ended up reading about the Viking invasion of England. On top of that, over the weekend the team had rolled the next release out to production, and it was due to go live today. As always, they had tested it long and thoroughly and run it through the continuous integration systems a million times, so surely everything would go smoothly ...
Fortunately for some, and unfortunately for the victims, there is always an opportunity to learn from other people's mistakes, to improve something, and to gain a little extra confidence. With this translated article we want to recall, once more and in more detail, one of the cases in "our" industry.
Last year I gave a conference talk about DevOps, configuration as code, and continuous delivery. I used the story below to explain why fully automated, reproducible deployments matter as part of a DevOps / Continuous Delivery initiative. After the conference, several people asked me to share the story on my blog. This is an absolutely true story. It is a retelling of what I have read; I was not involved in the events myself.
So, here is the story of how a company with almost $400 million in assets went bankrupt in 45 minutes because of a failed deployment.
A little background
Knight Capital Group is an American global financial services company engaged in market making, electronic execution, and institutional sales and trading. In 2012, Knight was the largest trader of US equities, with a market share of about 17% on both the NYSE and NASDAQ. Knight's Electronic Trading Group (ETG) handled an average daily volume of more than 3.3 billion trades, worth over $21 billion ... per day. No joke!
As of July 31, 2012, Knight had approximately $365 million in cash and cash equivalents.
On August 1, 2012, the NYSE planned to launch its new Retail Liquidity Program (RLP), designed to give retail investors better prices through retail brokers such as Knight. In preparation, Knight updated SMARS, its automated, high-speed algorithmic order router, which sends orders to the market for execution. One of SMARS's core functions is to receive orders from other components of Knight's trading platform (“parent” orders) and then send one or more “child” orders out for execution. In other words, SMARS takes large orders from the trading platform and breaks them into several smaller ones in order to find buyers / sellers for the shares. The larger the parent order, the more child orders are generated.
The SMARS update was meant to replace old, unused code known as “Power Peg”, functionality that Knight had not used in 8 years (why code that had been dead for so long was still in the code base is a mystery, but that is beside the point). The updated code repurposed the old flag that had been used to activate Power Peg. The new code was thoroughly tested and worked correctly and reliably. What could possibly go wrong?
What could possibly go wrong? Indeed!
Between July 27 and July 31, 2012, Knight manually deployed the new software to a limited number of servers per day, eight (8) servers in total. Here is what the SEC document says about the manual deployment (the SEC is the Securities and Exchange Commission, the US securities market regulator):
“During the deployment of the new code, one of Knight's technicians did not copy the new code to one of the eight SMARS servers. Knight did not have a second technician review this deployment, so the Power Peg code was not removed from the eighth server and the new RLP code was not added. The company had no written procedures requiring such a review.” SEC Release No. 70694, October 16, 2013
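The second review that was missing here is the kind of check that is easy to automate. The following is a minimal sketch under stated assumptions, not Knight's actual tooling: it assumes every server can report a checksum of its deployed build, and the server names and the helpers `fingerprint` and `find_stale_servers` are hypothetical.

```python
import hashlib


def fingerprint(artifact: bytes) -> str:
    """Return a SHA-256 hex digest identifying a deployed build."""
    return hashlib.sha256(artifact).hexdigest()


def find_stale_servers(expected: str, deployed: dict[str, str]) -> list[str]:
    """List the servers whose deployed build does not match the expected one."""
    return sorted(name for name, digest in deployed.items() if digest != expected)


# Hypothetical fleet: seven servers received the new build, one was skipped.
new_build = fingerprint(b"smars with RLP code")
old_build = fingerprint(b"smars with Power Peg code")
fleet = {f"smars-{i}": new_build for i in range(1, 8)}
fleet["smars-8"] = old_build

stale = find_stale_servers(new_build, fleet)
print(stale)
```

A release gate built on a check like this would refuse to open for trading while the list of stale servers is non-empty.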
At 9:30 a.m. Eastern Time on August 1, 2012, the markets opened and Knight began processing orders from broker-dealers on behalf of their customers under the new Retail Liquidity Program. The seven (7) servers that had been deployed correctly processed these orders correctly. The orders that reached the eighth server, however, triggered the supposedly repurposed flag and resurrected the old Power Peg code.
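The hazard of reusing a flag can be shown in a few lines: the same flag value means one thing to the new build and something entirely different to the old one. This is only an illustrative sketch; the flag name `rlp_enabled` and both handler functions are assumptions, since the SEC document does not describe the code at this level.

```python
def handle_order_new_build(flags: dict[str, bool]) -> str:
    """New build: the repurposed flag routes the order through the RLP logic."""
    return "RLP" if flags.get("rlp_enabled") else "default"


def handle_order_old_build(flags: dict[str, bool]) -> str:
    """Old build: the very same flag still activates the legacy Power Peg path."""
    return "Power Peg" if flags.get("rlp_enabled") else "default"


# Seven servers run the new build; the eighth still runs the old one.
servers = [handle_order_new_build] * 7 + [handle_order_old_build]

order_flags = {"rlp_enabled": True}
results = [handler(order_flags) for handler in servers]
print(results)
```

Using a brand-new flag for new behavior, and deleting dead code before its flag can be reached again, removes this ambiguity.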
Zombie Attack: Killer Code
Here it is worth explaining what the “dead” Power Peg code was for. It was designed to count the shares bought / sold against a parent order as its child orders were executed. Once the parent order was filled, Power Peg stopped sending child orders. In essence, Power Peg kept track of the child orders and stopped them once the parent order was complete. In 2005, Knight moved this cumulative tracking to an earlier point in the code's execution (thereby removing the share counting from Power Peg itself).
When the flag activated Power Peg on the eighth server, Power Peg started routing child orders for execution but did not track them against the number of shares in the parent order, so an endless loop of sorts arose.
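The difference between routing with and without cumulative tracking can be sketched as two generators. This is a simplified model, not the SMARS code: the function names and quantities are made up, and every child order is assumed to fill completely.

```python
from itertools import islice
from typing import Iterator


def child_orders_tracked(parent_qty: int, child_qty: int) -> Iterator[int]:
    """Yield child orders until the cumulative quantity covers the parent order."""
    filled = 0
    while filled < parent_qty:
        size = min(child_qty, parent_qty - filled)
        filled += size  # cumulative tracking; assumes each child order fills
        yield size


def child_orders_untracked(parent_qty: int, child_qty: int) -> Iterator[int]:
    """Yield child orders forever: nothing compares fills to the parent order."""
    while True:
        yield child_qty


tracked = list(child_orders_tracked(1000, 300))
# The untracked generator never ends, so the demo takes only its first 10 orders.
runaway = list(islice(child_orders_untracked(1000, 300), 10))

print(tracked, sum(tracked))
print(len(runaway), sum(runaway))
```

The tracked version stops exactly at the parent quantity, while the untracked one keeps producing orders for as long as anything consumes them.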
Infernal 45 minutes
Imagine: you have a system that can send automated, high-speed orders to the market with no tracking of whether enough orders have already been executed. Yes, it was that bad.
When the market opened at 9:30 a.m., people quickly noticed that something was wrong. By 9:31 a.m., many on Wall Street had realized that something serious was happening. The market was being flooded with orders at a volume far out of the ordinary for certain stocks. By 9:32 a.m., Wall Street was wondering why this mess had not stopped; in high-speed trading, that is practically an eternity. Why hadn't someone hit the kill switch on the system responsible? As it turned out, there was no kill switch. During the first 45 minutes of trading, Knight's executions accounted for more than 50% of the trading volume, pushing some stocks up by more than 10% of their value. Other stocks, in turn, fell in value in response to the erroneous trades.
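In its simplest form, a kill switch is a guard that every outgoing order must pass: it trips once volume exceeds a limit, or when an operator trips it by hand, and then rejects everything. The sketch below is illustrative only; the class name, the share-count limit, and the numbers are assumptions, not a description of any real trading system.

```python
class KillSwitch:
    """Blocks all further orders once the total share volume exceeds a limit."""

    def __init__(self, max_shares: int) -> None:
        self.max_shares = max_shares
        self.sent_shares = 0
        self.tripped = False

    def allow(self, shares: int) -> bool:
        """Return True if the order may be sent; trip the switch otherwise."""
        if self.tripped:
            return False
        if self.sent_shares + shares > self.max_shares:
            self.tripped = True  # stays tripped; re-enabling is a deliberate step
            return False
        self.sent_shares += shares
        return True

    def trip(self) -> None:
        """Manual stop: an operator can halt order flow at any time."""
        self.tripped = True


switch = KillSwitch(max_shares=1_000)
decisions = [switch.allow(400) for _ in range(4)]
print(decisions, switch.tripped)
```

The important design choice is that the switch fails closed: once tripped, nothing goes out until a person decides otherwise.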
To make matters worse, Knight's system had started sending automated emails even before all this, as early as 8:01 a.m. (when SMARS processed orders eligible for pre-market trading). The messages referred to SMARS and reported the error “Power Peg disabled.” Between 8:01 a.m. and 9:30 a.m., 97 of these emails were sent to Knight employees. Of course, the emails were not designed as system alerts, so nobody looked at them right away. Oops.
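One way to keep such messages from drowning in an inbox is to route errors to a channel that is treated as an alert. The following is a rough sketch using Python's standard `logging` module; the `PagerHandler` class is hypothetical and only records messages where a real one would call a paging service, and the logger name and message texts are illustrative.

```python
import logging


class PagerHandler(logging.Handler):
    """Collects ERROR-and-above records that should page an on-call engineer."""

    def __init__(self) -> None:
        super().__init__(level=logging.ERROR)
        self.pages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        # A real handler would call a paging service; this one only records.
        self.pages.append(record.getMessage())


logger = logging.getLogger("smars")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo limited to our own handler
pager = PagerHandler()
logger.addHandler(pager)

logger.info("pre-market order received")
logger.error("Power Peg disabled")

print(pager.pages)
```

Only the error reaches the pager; the informational message is filtered out by the handler's level.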
For 45 hellish minutes, Knight tried to stop the erroneous trades. There was no way to shut the system down (and no documented procedure for responding to a situation like this), so the staff had to diagnose the problem in a live trading environment where 8 million shares were being traded every minute. Unable to determine where the erroneous orders were coming from, they removed the new code from the servers where it had been deployed correctly. In other words, they removed the working code and left the broken code in place. This only made things worse: now parent orders activated the Power Peg code on all the servers, not just the one where the deployment had gone wrong. Eventually they managed to stop the system, after 45 minutes of trading.
In those 45 minutes, the Power Peg code received and processed 212 parent orders. As a result, SMARS sent millions of child orders to the market, resulting in 4 million executions in 154 stocks for more than 397 million shares. For stock market connoisseurs, this means Knight ended up with net long positions of about $3.5 billion in 80 stocks and net short positions of about $3.15 billion in 74 stocks. For the rest of us, it means Knight Capital Group lost $460 million in 45 minutes. And Knight had only $365 million in cash and cash equivalents. In 45 minutes, Knight went from being the largest trader of US equities and a major market maker on the NYSE and NASDAQ to being bankrupt. The company had 48 hours to raise the capital needed to cover its losses (which it managed to do thanks to a $400 million investment from about half a dozen investors).
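To put these figures in proportion, here is a quick back-of-the-envelope calculation. It uses only the numbers quoted above, with "more than 397 million" shares rounded down to 397 million, so the per-minute values are approximate.

```python
loss = 460_000_000    # reported loss, USD
cash = 365_000_000    # cash and cash equivalents on July 31, 2012, USD
shares = 397_000_000  # shares executed ("more than 397 million")
minutes = 45

shortfall = loss - cash               # how far the loss exceeded available cash
loss_per_minute = loss / minutes      # average loss per minute of trading
shares_per_minute = shares / minutes  # average shares executed per minute

print(f"shortfall: ${shortfall:,}")
print(f"loss per minute: ${loss_per_minute:,.0f}")
print(f"shares per minute: {shares_per_minute:,.0f}")
```

The loss exceeded the company's cash by $95 million and averaged a little over $10 million per minute.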
What lessons should we draw
The events of August 1, 2012 should be a lesson to all development and project teams. It is not enough to build great software and test it; you also have to make sure it is delivered to market correctly, so that your customers get the value you intended (and so that you do not bankrupt your company). The engineer(s) who deployed SMARS are not solely to blame; the process Knight had in place did not match the risk the company was exposed to. The process (or lack of one) was inherently flawed. Any time your deployment depends on people reading and following instructions, you are exposing yourself to risk. People make mistakes. The mistakes can be in the instructions, in the interpretation of the instructions, or in their execution.
Deployment should be automated, reproducible, and as free from potential human error as possible. If Knight had had an automated deployment system, complete with configuration, deployment, and test automation, the mistake that turned into Knight's nightmare could have been avoided.
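As a rough illustration of what "automated and reproducible" can mean in practice, the sketch below rolls a release out to every server in an inventory and fails loudly if any server ends up on a different version. It is a toy model under stated assumptions: the servers are in-memory stand-ins, and `roll_out`, the server names, and the version strings are all hypothetical.

```python
from typing import Callable


def roll_out(
    servers: list[str],
    release: str,
    install: Callable[[str, str], None],
    installed_version: Callable[[str], str],
) -> None:
    """Install `release` on every server, then verify each one; raise on mismatch."""
    for server in servers:
        install(server, release)
    stale = [s for s in servers if installed_version(s) != release]
    if stale:
        raise RuntimeError(f"release {release} missing on: {', '.join(stale)}")


# In-memory stand-ins for real servers (hypothetical names and versions).
state = {f"smars-{i}": "2005.1" for i in range(1, 9)}


def install(server: str, release: str) -> None:
    state[server] = release


def installed_version(server: str) -> str:
    return state[server]


roll_out(sorted(state), "2012.8", install, installed_version)
print(set(state.values()))
```

Because the server list comes from the inventory rather than from someone's memory, no server can be skipped silently: either all eight match, or the rollout raises an error.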
Here are a couple of principles of continuous delivery (even if you do not implement the complete process of continuous delivery):
- Software release should be a repeatable and reliable process.
- Automate everything you can.