bevzuk June 26, 2019 at 18:32

Be like Munch, or a few words about technical debt

A sense of death, loneliness and, at the same time, a wild thirst for life... You might think we have decided to lecture you on expressionism and immerse you in Munch’s work. But no. These are the stages you go through at the moment you see that your technical debt is about to push your company into the abyss of crisis.

Over 8 years, the Dodo Pizza IT team has grown from 2 developers serving one country to 80 people serving 12 countries. Three years ago, I joined Dodo Pizza as Chief Agile Officer and began helping teams build processes and adopt engineering practices. Often this adoption was too slow. We also found that when several teams work on the same product, it is hard to get them to keep code quality high.

We chased business features and postponed technical excellence for later. That is how we trapped ourselves. A huge technical debt raised its fist over us, yet it did not crush us: with a mere flick of its fingers it threw our company into the abyss of crisis. In 2018, the marketing team launched a massive advertising campaign; the system could not bear the load and went down. Shame, shame and shame. But during the crisis we realized that we could work many times more efficiently. The crisis forced us to quickly adopt well-known engineering practices and revolutionize our processes.

Background

Dodo Pizza is a cyborg company that sells pizza. Our business runs on the Dodo IS platform, which manages all business processes: receiving orders, pizza preparation, inventory management, people management and much, much more. In just 8 years, we have grown from 2 developers serving one pizzeria to 80+ developers serving 498 pizzerias in 12 countries.

Three years ago, Dodo IS was a monolith of 1 million lines of code. Unit test coverage was low, and there were no API or UI tests at all. The quality of the code itself was disappointing. Everyone knew this, or at least suspected it. In our dreams of a brighter future, we split the monolith into a dozen services and rewrote the ugliest parts of the system. We even drew a diagram of the “future” architecture but, frankly, did nothing to get closer to it.

The more the team grew, the more we suffered from the lack of a clear process and engineering practices. Releases grew larger and larger, because all six development teams made changes simultaneously in different branches. When the teams merged their changes into one branch, we sometimes lost up to 4 hours resolving merge conflicts. There were no automated regression tests, and with each release we spent more and more time on manual regression.

Shit happens

In 2018, the marketing team launched our first federal TV advertising campaign with a budget of 100 million rubles. It was a great event for Dodo Pizza. The IT team also prepared thoroughly for the campaign. We automated and simplified our deployment: with a single button in TeamCity we could now deploy the monolith to 12 countries. We used performance tests to look for weak spots. We did our best, but screwed up anyway.

The advertising campaign was amazing. We received from 100 to 300 orders per minute. That was the good news. The bad news: Dodo IS could not withstand such a load and went down. We had reached the limit of vertical scaling and could no longer process orders. The system rebooted every 3 hours. Each minute of downtime cost us tens of thousands of rubles, not counting the loss of respect from angry customers.

When I arrived at Dodo Pizza three years ago, I immediately began to introduce engineering practices. Most teams adopted pair programming, unit testing and DDD fairly quickly. But not everything was that simple. I had to overcome resistance from developers, product owners and the support team.

Unlike engineering practices, the idea of feature teams did not win everyone's support at first. Developers were used to thinking that a team focused on one component writes the best code. It was unclear how to combine rapid development of business features with the long-overdue massive refactoring of a complex system. On top of that, an endless stream of bugs constantly demanded attention... We released no more than once a week, and each release took a long time: it required a huge amount of manual regression and upkeep of UI tests. I tried to fix this, but the process changes were too slow and fragmented.

The story of the fall and rise

Initial state: monolithic architecture

In pursuit of fast delivery of business features, we did not always think our technical solutions through. Lack of experience took its toll. We had a monolithic application with a single database that held the data of all components in one place. The tracker, accounting, the website, the API for landing pages: every component of the system worked with the same database, which became the bottleneck.

True story

A monolithic architecture is a good place to start because it is simple. But it copes poorly with high load and is a single point of failure. Once, all our restaurants in Russia stopped accepting orders because of a blog post. How could this happen?

Our CEO, Fedor, published a post on his blog, and it quickly became popular. Fedor’s blog has a counter showing the number of pizzerias in our network and their total revenue. Every time someone opened the blog, the web server sent a request to the master database to calculate the revenue. These requests overloaded the database so badly that it stopped serving requests from the restaurant cash desks. We quickly fixed the problem, but it was one of many signs that our architecture no longer met the needs of the business and had to be redesigned. However, we kept ignoring these signs.
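The article does not say how the counter was fixed. One common way to keep a public counter from hammering the master database is to cache the computed value for a short time. The sketch below is only an illustration of that idea: `TtlCache`, `total_revenue_from_db` and the 60-second lifetime are hypothetical names and values, not part of Dodo IS.

```python
import time
from typing import Callable, Optional


class TtlCache:
    """Caches one expensive value and recomputes it at most once per ttl_seconds."""

    def __init__(self, compute: Callable[[], float], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[float] = None
        self._expires_at = 0.0

    def get(self) -> float:
        now = self._clock()
        # Recompute only on the first call or after the cached value has expired.
        if self._value is None or now >= self._expires_at:
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value


db_calls = 0


def total_revenue_from_db() -> float:
    """Hypothetical stand-in for the expensive revenue query on the master database."""
    global db_calls
    db_calls += 1
    return 1_000_000.0


revenue_counter = TtlCache(total_revenue_from_db, ttl_seconds=60)

# Ten thousand page views within the cache lifetime reach the database once.
for _ in range(10_000):
    revenue_counter.get()
```

With a cache like this, the database load from the counter depends on the cache lifetime rather than on how popular the post becomes.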

Early crash in 2017

February 14. For couples celebrating Valentine's Day, we make a special pizza: a heart-shaped Pepperoni. I will always remember February 14, 2017, because on that day, with every pizzeria working at full capacity, Dodo IS began to go down. Each pizzeria has 4-5 tablets for production management; they show which order the pizza maker is rolling dough for, adding ingredients to, baking or sending out for delivery. By then we had 150+ pizzerias, and each tablet refreshed several times a minute. All these queries put such a load on the database that it could not cope and began to fail. Dodo IS died at the peak of sales. And a busy holiday season lay ahead: February 23, March 8, May 1 and 9. For those holidays we expected even more orders.
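A rough back-of-the-envelope estimate shows why tablet polling adds up. The pizzeria and tablet counts come from the text; the refresh rate of 4 per minute and the idea that each refresh issues exactly one query are assumptions made only for illustration.

```python
def polling_queries_per_minute(pizzerias: int, tablets_per_pizzeria: int,
                               refreshes_per_minute: int) -> int:
    """Queries per minute, assuming every tablet refresh issues one database query."""
    return pizzerias * tablets_per_pizzeria * refreshes_per_minute


# 150 pizzerias and 5 tablets are from the article; 4 refreshes/min is assumed.
load_at_150 = polling_queries_per_minute(150, 5, 4)
# The same polling scheme at 498 pizzerias: the load grows linearly with the network.
load_at_498 = polling_queries_per_minute(498, 5, 4)

print(load_at_150, load_at_498)  # prints "3000 9960"
```

The point is not the exact numbers but the shape: with polling against a shared database, every new pizzeria adds a fixed amount of constant background load.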

The day you die. Knowing our growth plans and the maximum load we could withstand, we worked out how long we could stay alive. Armageddon was due in about six months: August-September 2017. What is it like to live knowing the date of your death?

Stopping feature development for a year. Together with our CEO Fedor, we had to make a difficult decision, perhaps one of the most difficult in the history of the company. Over the next year, we shipped only one business feature. The rest of the time the teams paid off technical debt. This debt cost us dearly: more than 100 million rubles in developers' salaries alone.

Some improvements after a year

Over the year, we improved noticeably:

We automated the deployment process and cut it to 4-5 hours
We finally started to cut up the monolith: the tracker and TV boards moved to a separate service with its own database
We began to split off the delivery cash desk, the second component that created a high load
We rewrote the user and device authentication system

It would seem that we could be proud of ourselves. But ahead of us was a huge disappointment.

Failure during the federal advertising campaign. Second crisis of confidence

Technical debt is easy to accumulate, but very difficult to repay. It is unlikely that you will be able to understand in advance how much it will cost you.

Even though we had fought technical debt for a whole year, we were not ready for a mass marketing campaign and let our business down again. The trust we had earned drop by drop evaporated.

Under the load of the federal marketing campaign, we went down again. The system crashed and rebooted every 3 hours. Our business was losing tens of millions of rubles.

Thanks to the crisis, we learned that under extreme conditions we can work many times more efficiently. We released 20 times a day. Everyone worked as one team, focused on one goal. In the two weeks of the crisis, we did things we had been afraid even to start, believing they would take months of work. Asynchronous order intake, disabling orders, stress tests, clean logs: this is only a small part of what we did. We wanted to keep working just as efficiently, but without the overtime and stress.

Lessons learned

After the retrospective, we completely reorganized our processes. We took LeSS as the basis and supplemented it with engineering practices. Over the next few months, we made a breakthrough in adopting them. Based on LeSS, we implemented and continue to use:

Single Product Backlog
Fully cross-functional and cross-component teams
Pair and mob programming
True Continuous Integration (CI): code from 12 teams integrated into one branch
Simplified branching (trunk-based development)
Frequent releases: continuous deployment for microservices, daily release for the monolith
No separate QA team: QA experts are part of the development teams

6 practices that we chose after the crisis:

1. The power of focus. Before the crisis, each team worked on its own debt and specialized in its own area. During the crisis, the teams had no specific tasks; they had one big, difficult goal. For example: the mobile application and the API must process 300 orders per minute, no matter what. The team takes the goal and works out on its own how to achieve it. The team itself formulates hypotheses and quickly tests them in production. Teams do not want to be mere coders; they want to solve problems.

The power of focus shows in complex tasks. For example, during the crisis we created stress tests even though we had no experience with them. We also made the order intake logic asynchronous. We had thought and talked about it for a long time, and it seemed a very difficult task that could take ages. It turned out that a team is quite capable of doing it in 2 weeks if it is not distracted and focuses completely on the problem.
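The article gives no details of how asynchronous order intake was implemented. The general idea can be sketched as follows: the request handler only enqueues the order and returns at once, while a background worker does the slow processing. The in-memory `queue.Queue` below is an assumption standing in for whatever durable queue or message broker a production system would use; `Order` and `AsyncOrderIntake` are hypothetical names.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Order:
    order_id: int
    items: List[str]


class AsyncOrderIntake:
    """Accepts orders immediately and processes them on a background worker."""

    def __init__(self, handler: Callable[[Order], None]) -> None:
        self._queue: "queue.Queue[Order]" = queue.Queue()
        self._handler = handler
        self.failed: List[Order] = []
        threading.Thread(target=self._run, daemon=True).start()

    def accept(self, order: Order) -> str:
        # The caller is not blocked by slow processing: we only enqueue here.
        self._queue.put(order)
        return "accepted"

    def _run(self) -> None:
        while True:
            order = self._queue.get()
            try:
                self._handler(order)
            except Exception:
                # Keep the worker alive and remember the order for a later retry.
                self.failed.append(order)
            finally:
                self._queue.task_done()

    def wait_until_processed(self) -> None:
        """Blocks until every accepted order has been handled (useful in tests)."""
        self._queue.join()


processed: List[Order] = []
intake = AsyncOrderIntake(processed.append)
for i in range(3):
    intake.accept(Order(order_id=i, items=["pepperoni"]))
intake.wait_until_processed()
```

The design choice here is that a burst of orders lengthens the queue instead of overloading the component that processes them.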

2. Internal hackathons. We held a 500 Errors Hackathon. All teams together cleaned up the logs and eliminated the causes of 500 errors on the site and in the API. The goal was to keep the logs clean. When the logs are clean, new errors are clearly visible and it is easy to set alert thresholds.
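Why clean logs make alert thresholds easy can be shown with a small sketch. If normal traffic produces almost no 5xx responses, even a low threshold on the share of server errors fires only when something is really wrong. The 1% threshold and the function names are assumptions chosen for illustration, not values from the article.

```python
from typing import Sequence


def server_error_ratio(status_codes: Sequence[int]) -> float:
    """Share of responses in the window that are 5xx server errors."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def should_alert(status_codes: Sequence[int], threshold: float = 0.01) -> bool:
    """Fires when the 5xx share in the window exceeds the threshold (assumed 1%)."""
    return server_error_ratio(status_codes) > threshold


clean_window = [200] * 1000
noisy_window = [200] * 990 + [500] * 20

print(should_alert(clean_window), should_alert(noisy_window))  # prints "False True"
```

When the logs are full of long-standing errors, the same threshold would fire constantly, and a real incident would be lost in the noise.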

Another example is a hackathon for bugs. We used to have a backlog full of bugs, some of which had been sitting there for years. They never seemed to end, and new ones appeared every day. We combined work on bugs with the regular backlog items.

#zerobugspolicy.

If a bug has been in the backlog for more than 3 months, just delete it. It has been lying there for ages, and no one has died.
Assess the pain the remaining bugs cause customers. Keep only those that make life difficult for a large group of users.
Hold an internal bug hackathon. Ours took a few sprints: every sprint, each team took several bugs and fixed them. After 2-3 sprints, we had a clean backlog. Now you can introduce #zerobugspolicy.
#zerobugspolicy. If a bug gets into the backlog, it will definitely be fixed. Any bug in the backlog has a higher priority than any other backlog item. But to get into the backlog, a bug must be serious: either it does irreparable harm or it affects a large number of users.
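The triage rules above can be written down as a small sketch. The article does not define what counts as "a large group of users" or exactly how "3 months" is measured, so `LARGE_USER_GROUP = 1000` and the 90-day cutoff are assumptions, and the `Bug` fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_AGE = timedelta(days=90)   # "more than 3 months", approximated as 90 days
LARGE_USER_GROUP = 1000        # assumed threshold; the article gives no number


@dataclass
class Bug:
    title: str
    reported_on: date
    affected_users: int
    irreparable_harm: bool = False


def cleanup_decision(bug: Bug, today: date) -> str:
    """One-off cleanup of the old backlog: the first two steps of the list."""
    if today - bug.reported_on > MAX_AGE:
        return "delete"
    return "keep" if bug.affected_users >= LARGE_USER_GROUP else "delete"


def admit_to_backlog(bug: Bug) -> bool:
    """Ongoing policy: only serious bugs enter the backlog, and those get fixed first."""
    return bug.irreparable_harm or bug.affected_users >= LARGE_USER_GROUP


today = date(2019, 6, 26)
stale = Bug("Old layout glitch", date(2018, 1, 1), affected_users=5000)
minor = Bug("Typo in a tooltip", date(2019, 6, 1), affected_users=3)
painful = Bug("Checkout fails on retry", date(2019, 6, 1), affected_users=4000)

print(cleanup_decision(stale, today),
      cleanup_decision(minor, today),
      cleanup_decision(painful, today))  # prints "delete delete keep"
```

Making the admission rule explicit is what keeps the backlog small enough for "every bug in the backlog gets fixed" to stay true.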

3. From project teams to stable teams. There is a funny story about project teams. During the crisis, we formed expert teams from the people best qualified for each task. After the crisis ended, the teams decided to continue this practice. I did not like the idea at all, but we tried it. Just 2 weeks (one sprint) later, at the next retrospective, the teams abandoned the practice (a decision that made me happy). If a team lacks some skills, it can learn them gradually. But team spirit, support and mutual help take a very long time to build: months. Short-lived project teams are permanently stuck in the forming and storming stages. You can tolerate that for a few weeks, but you cannot work this way all the time.

4. No manual regression. We set a goal to get rid of manual regressions. It took us 1.5 years to reach it. But having a long-term ambitious goal makes you think about the steps leading to the goal.

We did it in 3 steps.

Critical path automation.
In June 2017, we formed a QA team. Its task was to automate regression of the most critical Dodo IS functionality: receiving and producing orders. Over the next 6 months, the new 4-person QA team covered all critical system functionality with automated tests. Feature team developers actively helped. Together we wrote a clean, readable domain-specific language (DSL) that even customers could understand. In parallel with the end-to-end tests, developers covered the code with unit tests. Some new components were designed using TDD. After that, we disbanded the QA team. Its former members joined the teams working on business features to pass on their experience of developing and maintaining autotests.
Shadow mode.
With autotests in place, we ran manual regression in shadow mode for 5 releases. The teams relied only on automated testing, but once a team decided it was ready to release, we ran a manual regression to check whether the autotests had missed anything. We tracked how many errors were caught manually but not by the autotests. After 5 releases, we analyzed the data and decided that we could trust our autotests: no major errors had been missed.
Dropping manual regression.
Once we had enough tests to trust them, we abandoned manual testing completely. The more tests we write, the more we trust them. But this happened only 1.5 years after we began automating regression testing.
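The shadow-mode bookkeeping can be sketched as a comparison of two sets per release: defects the autotests caught and defects the manual regression caught. The data model and the "no major defect missed over 5 releases" rule below are a reading of the text, not the team's actual tooling; all names and the sample defects are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Defect:
    key: str
    major: bool


@dataclass
class ShadowRelease:
    caught_by_autotests: Set[Defect]
    caught_manually: Set[Defect]

    def missed_by_autotests(self) -> Set[Defect]:
        """Defects that only the manual regression found."""
        return self.caught_manually - self.caught_by_autotests


def can_drop_manual_regression(releases: List[ShadowRelease],
                               required_releases: int = 5) -> bool:
    """True once enough shadow releases have passed with no major defect missed."""
    if len(releases) < required_releases:
        return False
    return not any(defect.major
                   for release in releases
                   for defect in release.missed_by_autotests())


order_bug = Defect("order-total-wrong", major=True)
typo = Defect("typo-in-footer", major=False)

history = [ShadowRelease({order_bug}, {order_bug, typo})] + [
    ShadowRelease(set(), set()) for _ in range(4)
]

print(can_drop_manual_regression(history))  # prints "True"
```

In this sample history the manual run found one extra defect, but it was minor, so the autotests are still considered trustworthy.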

5. Stress tests are part of the regression. During the crisis, we wrote stress tests. This was a completely new experience for us, yet in just 2 weeks we managed to build something with Visual Studio tools. Among other things, we used these tests to generate artificial load on the server in order to find performance limits. For example, if the organic load in production was 100 orders/min, we added another 50 orders/min with our tests to see whether the system could handle the increased load.
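The arithmetic behind "add synthetic load on top of organic load" can be sketched as a simple ramp plan. This is not the Visual Studio tooling the team used; it only shows how much synthetic load to add at each step and how often a generator would need to send an order. The 100 and 50 orders/min come from the example above and the 300 orders/min target from the goal mentioned earlier; the function names are hypothetical.

```python
from typing import List


def synthetic_load_steps(organic_per_min: int, target_total_per_min: int,
                         step_per_min: int) -> List[int]:
    """Synthetic orders/min to add at each step until the total reaches the target."""
    if step_per_min <= 0:
        raise ValueError("step_per_min must be positive")
    max_synthetic = max(target_total_per_min - organic_per_min, 0)
    return list(range(step_per_min, max_synthetic + 1, step_per_min))


def seconds_between_orders(orders_per_min: int) -> float:
    """Pause between synthetic orders needed to sustain the given rate."""
    return 60.0 / orders_per_min


steps = synthetic_load_steps(organic_per_min=100, target_total_per_min=300,
                             step_per_min=50)
print(steps)                          # prints "[50, 100, 150, 200]"
print(seconds_between_orders(50))     # prints "1.2"
```

Ramping in steps rather than jumping straight to the target makes it possible to see at which total load the system starts to degrade.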

The following year, we rewrote stress tests with an experienced PerformanceLab team. Today, these tests run weekly and provide quick feedback to development teams.

6. Engineering practices. All our teams use pair programming. I consider pair programming one of the simplest yet most powerful practices. If you do not know which engineering practice to start with, I recommend pair programming.

Results

The main result for us was a shake-up. We woke up and started acting. The crisis helped us see our maximum potential. We saw that we can work many times more efficiently and quickly achieve our goals. But for this it is necessary to change the usual way of working. We are no longer afraid of bold experiments.

As a result of these experiments, over the past year we have significantly improved the quality and stability of Dodo IS. During the spring holidays of 2018 our pizzerias could not work because of Dodo IS; in 2019, with the network grown from 300 to 498 pizzerias, Dodo IS works flawlessly. We calmly got through the New Year sales peak, the second marketing campaign and the spring holidays.

For the first time in a long time, we are confident in the quality of the system and can afford to sleep soundly at night. This is the result of continued use of engineering methods and a focus on technical excellence.

Business Results

Engineering practices are not needed on their own if they do not benefit your business. As a result of focusing on technical excellence, we improve the quality of the code and develop business functions with predictable speed. Releases have become a common event for us.

Results for Teams

Today we use a wide range of engineering methods:

Fully cross-functional and cross-component teams
Pair / mob programming
Continuous Integration: 12 teams continuously integrating into one branch
Subject Matter Expert as a Team
There is no separate QA team, QA experts are part of the development teams
Replacing manual regression with autotests
No Bug Policy (#Zerobugspolicy)
Stop the Line as a driver to accelerate deployment

What we have learned

I wish the crisis had never happened. As a developer, I felt personally responsible for accumulating too much technical debt and for failing to foresee the consequences.

Engineering practices protect business from crisis
Do not accumulate technical debt. It may become too late and too expensive to repay
Evolutionary changes take several times longer than revolutionary ones
A crisis is not always a bad thing. Use a crisis to revolutionize processes
However, long evolutionary preparation is required in advance
Do not blindly apply every method you like. Some methods are waiting for their moment, and when it comes, the teams will adopt them without resistance. Wait for the right moment
Over time, the teams themselves begin to make important decisions and implement them. Give them a supportive environment to experiment in, let them fail and learn from their mistakes

Technical debt led us into a terrible crisis. I am very glad that our team found the strength to turn this dead end into a point of growth. We learned first-hand that a time of crisis can and should be used for massive organizational and process changes. So never give up, because even in the most difficult situations there is room for heroic deeds.

Acknowledgments

I would like to say a big thank you to all the people who helped me on my journey from crisis to LeSS transformation. I constantly feel your support.

Many thanks to our CEO Fedor Ovchinnikov for his trust. You are a true leader in a company with a true, flexible culture.

Many thanks to Dmitry Pavlov, our Product Owner, my old friend and co-trainer.

Thanks to Alexander Andronov and Andrey Morevsky for their support.

Many thanks to Dasha Bayanova, our first full-time Scrum Master, who always helps and supports me in all our initiatives. Your help is hard to overestimate.

Special thanks to Joanna Rothman, who helped me write this report whatever the circumstances: on vacation or while recovering from an illness. Joanna, it was a pleasure working with you. Your advice, attention to detail and hard work helped me a lot.
