How to make an online store withstand a load of 280,000 visitors per hour?


    Hello, Habr!

    Unfortunately, at the current stage of web development in our country, refactoring a project is most often seen as work done in "everything is broken" mode, undertaken only when the site is already in critical condition. We ran into exactly such a situation in 2012, when a large Russian online store came to us with the following problem: starting at 10 a.m., the site went down every 5 minutes for 5-10 minutes at a time and came back up either with great difficulty or only after a hard reboot. After each reboot the site worked for a short while, then crashed again. What made it especially painful was that the New Year, the high season for every retail site, was approaching, and in this case the phrase "in 10 minutes the company loses tens of thousands of dollars" was not a joke.


    Let's take a break from our story and note that developers today are exceptionally lucky people. Server capacity has finally become so cheap that any system short on resources can easily be scaled to the required size. Something wrong with the program? CPU load went up a few percent? Fine, add another processor to the server. Not enough RAM? Add more RAM. A lack of resources is no longer a problem.

    Many of you remember very well the days of dial-up, when people listened thoughtfully to the modem's song, telling from the first notes whether the connection would succeed or needed to be retried. Back then, a slow site was simply closed after a minute, because waiting for it was expensive in the literal sense of the word. Today, a slow site is one that takes longer than 3 seconds to open, channel speed aside. The world is speeding up, and time is expensive. But what if the site lacks speed? There are wonderful accelerators, ever more powerful and very cheap servers, a Web Cluster module in the same 1C-Bitrix, for example, and, at a pinch, a sea of other software for the same purposes. It would seem you can stop watching code quality, drastically lower the level of your developers, and significantly save the company's budget.

    However, it only seems that way. In our reality, the reality of web developers, for any hardware, no matter how powerful, there is always an infinite loop or a "miracle" framework to bring it to its knees. And situations regularly arise where even the most powerful hardware cannot cope with the code written for it.

    So it was with our client, the large online store mentioned at the beginning. The client had six powerful servers, each of which could be scaled freely. The most logical and fastest action at that moment was to add a couple more servers, which we did right away. Yet the site kept going down on schedule.

    To buy ourselves a window for normal work, we first decided to stabilize the project somehow so that it would survive the New Year, and only then carry out a global reengineering and refactoring.

    What did we start with?


    We knew that the load graphs on different parts of the system should be fairly smooth. For example, like these:



    This is because it never happens that the site has zero users one second and all 110,000 users of peak load the next.

    However, what we saw on the client's project contradicted all our experience: every graph spiked and lived a hectic life.
    It looked something like this:

    Apache:


    One might be tempted to rejoice at the growing load on the graphs - more visitors, more money - but Google Analytics showed that the number of visitors had not only failed to grow, it had even fallen.

    As a rule, in a stable, correctly configured system, an increase in Apache load correlates with an increase in MySQL load. In our case, everything looked different.

    MySQL:


    Now let's compare the graphs of MySQL and Apache:


    Something had obviously gone wrong, and we started looking for the problem.

    What did we find?


    First, we found a modified 1C-Bitrix CMS core, and it had been changed precisely in the parts that work with the cache, which meant potentially incorrect cache handling. There was some logic to this: if the cache were crashing and being flushed, we would see spikes on Apache at those moments. But then there should also have been spikes in MySQL, and there were none.
    Second, because of the modified CMS core, the project had not been updated for several years, since any update broke the site.
    Third, the server was configured incorrectly. Since our story is about refactoring, we will not dwell on this point in detail.
    Fourth, the project had been created around 2005-2006 and designed for a completely different load, ten times lower than the current one. Neither the architecture nor the code was built for the increased traffic. Queries that fit within an acceptable 0.5-1 second under the old load now took 4-15 seconds and landed in the slow query log.
    Fifth, over the years of the client working with various contractors, the project had accumulated a lot of duplicated code, redundant loops, and inefficient caching.
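    Catching queries like the ones from the fourth point requires MySQL's slow query log to be switched on; a minimal configuration sketch (the thresholds and file path here are illustrative, not taken from the client's setup):

```ini
# my.cnf fragment (illustrative values)
[mysqld]
slow_query_log                = 1
slow_query_log_file           = /var/log/mysql/slow.log
long_query_time               = 1    # log anything slower than 1 second
log_queries_not_using_indexes = 1    # also log queries that skip indexes
```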

    Take a little debugging, a couple of gigabytes of logs, one live analyst weighing at least 60 kg, add salt and pepper, mix - and we had an excellent recipe for quickly stabilizing the project. It survived the New Year peak and gained a small, but real, margin of safety.


    How did we fix everything else?


    Recall that the project ran on 1C-Bitrix. First, we updated the CMS to the latest version, which, as we had expected from the start, broke a significant part of the functionality; it had to be restored. Basic analysis showed that more than 1,000 files had been changed in the core and in standard components, so the first step after the upgrade was to bring the site back to a fully working state. In return, this let us enable the Web Cluster module with all its goodies.

    In the second stage, we analyzed the project's architecture, database queries, and code, and found that, due to the specifics of the architecture, a single catalog page generated 3,000 to 4,000 database queries without a cache, and about 200 with one. On top of that, the cache was invalidated inefficiently: it was flushed entirely on any content update.
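    Flushing everything on every update is exactly what tag-based caching avoids: each cached page is marked with tags, and an update invalidates only the entries sharing a tag. The idea can be sketched in Python (all class and key names here are illustrative, not the project's actual code):

```python
# Minimal tag-based cache sketch: a content update drops only the
# entries marked with the affected tag, not the whole cache.
class TaggedCache:
    def __init__(self):
        self._data = {}   # key -> cached value
        self._tags = {}   # tag -> set of keys marked with it

    def set(self, key, value, tags=()):
        self._data[key] = value
        for tag in tags:
            self._tags.setdefault(tag, set()).add(key)

    def get(self, key):
        return self._data.get(key)

    def invalidate_tag(self, tag):
        # Drop only the entries marked with this tag.
        for key in self._tags.pop(tag, set()):
            self._data.pop(key, None)

cache = TaggedCache()
cache.set("catalog:page:1", "<html>page 1</html>", tags=["section:42"])
cache.set("main:page", "<html>main</html>", tags=["news"])
cache.invalidate_tag("section:42")         # one catalog section changed
print(cache.get("catalog:page:1"))         # None: catalog page dropped
print(cache.get("main:page") is not None)  # True: main page survives
```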

    To optimize the database queries, we had to rework the structure of the information blocks: denormalize some of the data and move some of it into separate tables. This let us get rid of the heaviest joins, which ran for 30-40 seconds, and replace them with several quick selects. We also added indexes on the most frequently used data and removed stale data left over from the old structure. All of this significantly increased overall query speed.
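    To illustrate the index part (the table and column names below are made up; the client's real schema consisted of Bitrix information block tables, and MySQL rather than SQLite was in play), here is how adding an index turns a full table scan into an index search:

```python
import sqlite3

# Toy schema standing in for the real catalog tables (names are made up).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, section_id INTEGER, name TEXT)")
cur.executemany("INSERT INTO products (section_id, name) VALUES (?, ?)",
                [(i % 100, "item %d" % i) for i in range(10000)])

query = "SELECT * FROM products WHERE section_id = 7"

# Without an index, filtering by section_id scans the whole table.
plan_before = cur.execute("EXPLAIN QUERY PLAN " + query).fetchone()

# An index on the filtered column turns the full scan into an index search.
cur.execute("CREATE INDEX idx_products_section ON products (section_id)")
plan_after = cur.execute("EXPLAIN QUERY PLAN " + query).fetchone()

print(plan_before[-1])  # a SCAN over the table
print(plan_after[-1])   # a SEARCH using idx_products_section
```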

    We also left a bonus for the project's future developers: to make the code easy to read, we split thousands of lines of monolithic code into separate classes and files, commented them, and cleaned out old, obsolete files.

    What happened in the end?


    Carrying out all the listed work in sequence took us about 6 months. We then ran load tests on the site, both from an external network and from inside the cluster (to eliminate the effect of channel speed on the results).

    The external test graphs show that before the refactoring and reengineering, with a load of 25 concurrent users, a page loaded in about 8 seconds, and afterwards in only 4.2 seconds.
    Before:

    After:


    The test graphs from inside the cluster show that, for example, with 200,000 visitors the main page used to load in 40 seconds, and afterwards loaded in 1.9 seconds.
    Before:

    After:


    What did the developers gain?


    We reduced the number of database queries both with and without the cache, increased code readability, simplified the project's internal structure, deleted obsolete data, changed more than 14,000 files, and simplified project scaling. And, finally, we gave the project a significant margin for future load growth.

    To close the article, here is what we would advise all web developers:


    • Use server health monitoring. Usually everything is visible there.
    • Use debuggers; track the most resource-intensive operations and loops.
    • Cache everything that is used constantly. Invalidate the cache in parts, not all at once.
    • Keep track of the number of database queries, both with and without the cache.
    • Profile queries and monitor the amount of data transferred from the database to Apache. Watch the logic and efficiency of each query.
    • Add indexes to your database tables.
    • Keep your CMS updated.
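    For the query-counting advice, even a trivial counter wrapped around the database connection is enough to spot pages that explode into thousands of queries. A sketch in Python with sqlite3 (a real project would hook its own driver or ORM; all names here are illustrative):

```python
import sqlite3

class CountingConnection:
    """Wraps a DB connection and counts every statement it executes."""
    def __init__(self, conn):
        self._conn = conn
        self.query_count = 0

    def execute(self, sql, params=()):
        self.query_count += 1
        return self._conn.execute(sql, params)

# Simulate rendering one "page": a schema query plus five inserts.
conn = CountingConnection(sqlite3.connect(":memory:"))
conn.execute("CREATE TABLE t (x INTEGER)")
for i in range(5):
    conn.execute("INSERT INTO t (x) VALUES (?)", (i,))
print(conn.query_count)  # 6
```

    Logging this number per request makes regressions visible immediately: a page that used to issue 200 queries and suddenly issues 3,000 shows up in the first graph you look at.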

    Natalya Chilikina, Head of ADV / web-engineering bitrix development department
