How e-commerce survives major promotions. Getting ready for peak loads on the web [Part 2]

Hi, Habr!
A week ago I published an article that opened a conversation about preparing an e-commerce project for explosive traffic growth and the other delights of large-scale campaigns.
We have dealt with the key technical details; now we turn to administrative issues and to optimizing support processes during peak loads:

what makes the site unstable and why the cloud is not a panacea;
which business parameters need to be monitored in order to detect the problem before it causes significant losses;
how to route an incident from detection to resolution without chaos and localize the failure.

And much more. Read on below the cut!

In my experience, the biggest headache in preparing for large-scale promotions is strong administrative pressure. A business that was perfectly calm suddenly wants everyone on alert, blowing specks of dust off the site, “or else, God forbid something happens, we will fine you”. Let's try to satisfy this generally sound desire. We will use Black Friday as the example, since it is the most striking case of a sharp increase in the load on a site.
We begin with the fundamental question: what exactly causes our site to be unstable?

What makes the site unstable

It is time to do what you have been putting off for a long time. To understand which factors make the site less stable, dig up and analyze your incident history. Just do not tell me you do not have one.

Your top list will look roughly like this:

1. Failed releases.
2. Admins slipped up: they repaired one thing and broke another. Unfortunately, such mishaps are often hushed up and never make it into the history.
3. The business slipped up: launched a promotion incorrectly, deleted something, and so on.
4. Partner services broke down.
5. The software “got sad”. Most often this is a consequence of items 1 and 2.
6. Physical damage.
7. Other problems.

Of course, every situation is different, and your “ranking” may come out slightly different. But the leaders will still be problems related to changes on the site and to the human factor, as well as the fruit of their union: releases or attempts to optimize something.

Eradicating these problems, so that necessary changes go in on the first attempt without breaking what works, is a task on which many lances have been broken. And we have very little time, only about four months. Fortunately, the problem can be handled locally. To do so, follow a few simple rules:

1. If it works, do not touch it.

Complete all planned work as early as possible: a couple of weeks or a month ahead. Your incident history will tell you how early to wrap up improvements, because it shows how long the main tail of problems lasts after a change. After that, do not touch the site or the production infrastructure until the peak load has passed.

2. If you do have to go into production for an urgent repair, test.

Test regularly and tirelessly, even the smallest and most minor changes. First in a test environment, including under load, and only then move the change to production. Then test again and recheck the key parameters of the site. It is better to work at night, when the load is minimal, so that you have time to save the situation if something goes wrong. Good testing is an entire science, but even merely reasonable testing is better than none. The main thing is not to rely on luck.

Freezing changes for the period of high load is the only reliable remedy.

What to do with partner services we already discussed in the previous article. In short: at any sign of trouble, ruthlessly disconnect them. Most often the problems hit many users of the service at once, and contacting technical support is ineffective. Your emails will not help them fix it faster; at such hours the service's IT department has its hands full without them.

However, if you do not report the problem and do not receive an incident number with its registration time, you will most likely be unable to claim a penalty from the service for violating the SLA.

A little about reliability

In preparation, you need to replace all failing hardware and cluster your services. Read more about this in one of my previous articles.

Let me draw your attention to a popular fallacy: many believe that moving a site from their own servers to the cloud immediately gives +100 to reliability. Unfortunately, it is only +20.

To increase the resilience of a virtual server, a commercial cloud simply automates the “replacement” of failed hardware and speeds it up to a matter of seconds by automatically starting the virtual machine on one of the live servers. The key words are “speeds up” and “failed hardware”. The virtual machine will still be restarted. VMware Fault Tolerance and its analogs, which let you avoid the reboot, are as a rule not used in commercial virtualization because they are resource-intensive and reduce the performance of the protected virtual machines. Hence the conclusion: a commercial cloud is not a panacea for fault tolerance; its main advantages are flexibility and scalability.

Look in your history at how many outages you had because of replacing or repairing physical equipment. After moving to the cloud, their number will decrease, and yes, life will get a little easier. You will no longer have to run to the warehouse or the store for a new server. But virtualization quirks will now be added to hardware failures.

It may happen that a virtual machine becomes unavailable while the physical host still responds. The cloud will not see this problem. Or exactly the opposite: the host is not responding, but the virtual machines are fine. In that case, virtualization will restart them elsewhere. The launch will take some time, and again you get downtime out of the blue. Under load this can be fatal. Therefore, even in the cloud you need to remember redundancy. By the way, telling the virtualization provider which machines back each other up is a great idea. Otherwise, all your machines may end up on the same physical server and die at the same time.
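
The placement concern can be checked mechanically if the provider lets you see which host each virtual machine runs on. Below is a minimal sketch in Python; the placement map, the redundancy groups and all names in it are hypothetical and would have to come from your provider's inventory.

```python
from collections import defaultdict


def find_colocated(placement, redundancy_groups):
    """Return {group: {host: [vms]}} for hosts that carry 2+ VMs of one group."""
    problems = {}
    for group, vms in redundancy_groups.items():
        by_host = defaultdict(list)
        for vm in vms:
            by_host[placement[vm]].append(vm)
        shared = {host: names for host, names in by_host.items() if len(names) > 1}
        if shared:
            problems[group] = shared
    return problems


# Hypothetical inventory: VM name -> physical host it currently runs on.
placement = {"web-1": "host-a", "web-2": "host-a", "db-1": "host-b", "db-2": "host-c"}
# VMs inside one group are supposed to back each other up.
redundancy_groups = {"web": ["web-1", "web-2"], "db": ["db-1", "db-2"]}

colocated = find_colocated(placement, redundancy_groups)
print(colocated)
```

In this made-up inventory both web machines share one host, so a single hardware failure would take out the whole web tier, which is exactly the case worth raising with the provider.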

When conducting load tests, it makes sense to schedule failover testing under load.

This is when you “drop” a node in a cluster right in the middle of a load test and see what happens. With properly configured clusters and correctly allocated resources, this should neither harm the test results nor cause a pile of errors.

It seems we are done with all the typical pitfalls. Before reading further, I recommend that you refresh your memory of the technical details described in the previous article. After all, if the site is technically unable to withstand the load, reaction speed will not save you.

Now let's think about how to prepare for unusual or sudden failures. By definition we cannot prevent them, so it remains to roll up our sleeves and learn to repair them as quickly as possible.

Stages of incident elimination

Consider what the time to eliminate an incident consists of:

Failure detection time: the monitoring delay, waiting for an email from a user, etc.
Reaction time to the detected incident: someone has to notice the report and take it on.
Time to confirm the incident: was there a problem at all?
Time to analyze the incident and find ways to eliminate it.
Time to eliminate the incident and the underlying problem. It is not always possible to fix everything on the first try, so this stage may take several iterations.

Typically, the support service handles the detection and elimination of failures. If the team is large, each of these stages can be performed by different people. And time, as you know, is money. In our case, literally. Black Friday has a fixed duration, and competitors are not asleep: customers may spend all their money with them. Accordingly, it is critically important that every employee knows their area of responsibility and that incidents are resolved “assembly-line” style.

Let's look at each stage separately, define problem points and consider ways to optimize them quickly.

All the tips, hints and recommendations below are not a recipe for a “beautiful life” but specific things you can manage to implement in the 3-4 months left before Black Friday.

Detect the accident

In the worst scenario, it is a customer who informs you about the failure. That means the problem is serious enough that they spent their time reporting it. And only a very loyal customer will write or call; an ordinary user will simply shrug and leave.

In addition, the customer often has no direct access to the IT department. So they either write to info@business.ru or call the call center. By the time the information crawls its way to IT, a lot of time will have passed.

Suppose we have many loyal customers, and each considers it their duty to write to technical support about problems. By the time the incident is classified as widespread, escalated and decided on, hours will pass. Meanwhile, isolated reports may get lost, and the info@business.ru mailbox sometimes goes unchecked for weeks.

Therefore, it is very useful to set up your own tracking of key business parameters. At a minimum: the number of users on the site, the number of purchases made, and their ratio. These data let you respond quickly if something goes wrong and significantly reduce the time needed to identify (and solve) a specific problem in the operation of the site.
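
A check on these three numbers can be very small. The sketch below is only an illustration: the function name, the thresholds (`min_users`, `max_drop`) and the baseline conversion are assumptions that you would derive from your own sales history.

```python
def check_business_metrics(users, purchases, baseline_conversion,
                           min_users=50, max_drop=0.5):
    """Return a list of alert messages for one monitoring interval."""
    alerts = []
    if users < min_users:
        # With almost no visitors the conversion ratio is meaningless.
        alerts.append("too few users on the site")
        return alerts
    conversion = purchases / users
    # Alert when the ratio falls below the allowed share of the baseline.
    if conversion < baseline_conversion * (1 - max_drop):
        alerts.append(
            f"conversion {conversion:.2%} is far below baseline {baseline_conversion:.2%}"
        )
    return alerts


# Hypothetical numbers for one five-minute interval.
print(check_business_metrics(users=1000, purchases=30, baseline_conversion=0.03))
print(check_business_metrics(users=1000, purchases=5, baseline_conversion=0.03))
print(check_business_metrics(users=10, purchases=0, baseline_conversion=0.03))
```

Comparing against a baseline rather than a fixed number matters on Black Friday, when absolute traffic and sales are far from their usual values.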

No users? We need to find out where they went. There are users on the site, but no sales? That signals a problem, and a rather late one. Automated scenario testing will help you detect that something has happened somewhere. Usually autotests are run on builds or releases, but they are also great for monitoring. With their help you can see the breakage or slowdown of an important business process through the eyes of the user.

Of course, if you have no scenario tests yet, you will not cover all of production with them in the few months left before Black Friday. Besides, they can create a serious load. But covering a dozen basic processes is quite feasible.
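
To show the idea of a scenario test used as a monitor, here is a minimal sketch of a runner that executes the steps of one business process in order, times each step and reports the first one that fails. The step functions are placeholders; in practice they would drive a browser or call the site's HTTP endpoints.

```python
import time


def run_scenario(steps):
    """Run (name, callable) steps in order; stop at the first failure."""
    timings = {}
    for name, action in steps:
        started = time.perf_counter()
        try:
            action()
        except Exception as exc:
            return {"ok": False, "failed_step": name, "error": str(exc), "timings": timings}
        timings[name] = time.perf_counter() - started
    return {"ok": True, "failed_step": None, "error": None, "timings": timings}


# Placeholder steps standing in for real user actions.
def open_catalog():
    pass


def add_to_cart():
    pass


def checkout():
    raise RuntimeError("payment gateway timeout")


result = run_scenario([
    ("open_catalog", open_catalog),
    ("add_to_cart", add_to_cart),
    ("checkout", checkout),
])
print(result["ok"], result["failed_step"], result["error"])
```

The per-step timings are as useful as the pass/fail flag: a step that still succeeds but takes several times longer than usual is an early sign of trouble.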

It is also very useful to track the average response time of servers. If it grows, you can expect sales problems. Such data should be automatically tracked by the monitoring system.
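
As an illustration of such tracking, the sketch below keeps a sliding window of response times and flags degradation when the window average exceeds a multiple of the baseline. The class name, window size, baseline and factor are all assumptions; a real monitoring system would provide its own equivalent.

```python
from collections import deque


class ResponseTimeMonitor:
    """Sliding-window average of response times with a simple threshold."""

    def __init__(self, window=5, baseline_ms=200.0, factor=2.0):
        self.samples = deque(maxlen=window)
        self.baseline_ms = baseline_ms
        self.factor = factor

    def average(self):
        return sum(self.samples) / len(self.samples)

    def is_degraded(self):
        # Judge only a full window, so one slow request right after start
        # does not raise an alarm.
        full = len(self.samples) == self.samples.maxlen
        return full and self.average() > self.baseline_ms * self.factor

    def add(self, response_ms):
        self.samples.append(response_ms)
        return self.is_degraded()


monitor = ResponseTimeMonitor(window=3, baseline_ms=100.0, factor=2.0)
states = [monitor.add(ms) for ms in (150, 150, 150, 600)]
print(states, monitor.average())
```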

As you can see, with proper monitoring you can reduce the time it takes to detect a problem from hours and days to a few minutes, and sometimes spot a problem before it rises to its full height.

Incident Response Time

We did a great job and, thanks to monitoring, detected the failure instantly. Now we need to open an incident, assign a priority, route it and appoint the person responsible for further handling.

Two things are important here:

Get notified of the problem as soon as possible;
Be prepared to promptly process the notification.

Many IT professionals are not used to responding quickly to emails, even if they have a mail client on their smartphone. So important notifications should not be sent by email.

Use SMS for incident alerts. Even better, implement a calling bot for the most critical cases. I personally have not seen practical implementations of such bots, but if resources allow, why not? In a pinch, use WhatsApp / Viber / Jabber. Alas, for many well-known reasons Telegram cannot serve as a reliable emergency notification channel in the Russian Federation.

Automatic escalation of an incident in the absence of acknowledgment can also be useful. That is, monitoring notifies the next person in the queue if the primary recipient does not respond. This will back you up if something (or someone) goes wrong.
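
The escalation logic itself is simple. Below is a minimal sketch, assuming two hypothetical callbacks: `notify`, which sends the alert over whatever channel you chose, and `wait_for_ack`, which blocks for a fixed timeout and returns whether the person acknowledged.

```python
def escalate(chain, notify, wait_for_ack):
    """Notify people in order until one acknowledges; return that person or None."""
    for person in chain:
        notify(person)
        if wait_for_ack(person):
            return person
    return None


# Stand-ins for a real notification channel and acknowledgment check.
notified = []
acknowledging = {"backup"}  # only the backup engineer answers in this example

owner = escalate(
    chain=["primary", "backup", "team-lead"],
    notify=notified.append,
    wait_for_ack=lambda person: person in acknowledging,
)
print(owner, notified)
```

A `None` result means nobody in the chain responded, which deserves its own loud handling, for example alerting the whole team at once.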

Now let's talk about how to ensure a prompt response to failure messages. First, a specific person must be responsible for handling alerts. Alerts to the whole team are useful, but only for keeping people informed.

Collective responsibility is unreliable when speed is required.

If you do not set up on-call duty with a clear schedule for the duration of the promotion, you may find that during an emergency someone is asleep, someone has no access from home, and someone is on the road. In effect, there is no one to tackle the problem for the next hour. Of course, you could set up round-the-clock shift duty, but there is a catch. You will not get good specialists to work permanently in shifts, so when they are needed you will still have to find them and wake them up. And those who do work shifts drop out of the general context of team life, which is fatal for their effectiveness on planned tasks.

What saves us is that on most projects we need to respond promptly, understand what happened and repair it urgently for only about 18 hours a day. Usually the period from 6-8 am to 1-2 am of the following night accounts for up to 90% of traffic and sales.

To avoid gaps, it is enough to shift the work schedule of the people on duty to something like this:

06:00-15:00 and 17:00-02:00: duty “from home”;
15:00-17:00: covered by those who are in the office;
02:00-06:00: little traffic. Still, do not appoint a very heavy sleeper as the person responsible.

Do not forget the weekend. This issue can be resolved in the same way.

If your daily activity of users is distributed differently, select a similar schedule, in which the site will not be left unattended during prime time.
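
Whatever schedule you choose, it is worth verifying that it leaves no hour uncovered. The sketch below encodes the example schedule from this section; the shift labels are made up for illustration.

```python
from datetime import time


def coverage(t):
    """Return who covers the site at local time t under the example schedule."""
    if time(6, 0) <= t < time(15, 0):
        return "home shift 1"
    if time(15, 0) <= t < time(17, 0):
        return "office"
    # The evening shift wraps past midnight: 17:00-24:00 and 00:00-02:00.
    if t >= time(17, 0) or t < time(2, 0):
        return "home shift 2"
    return "night (low traffic)"


for hour in (1, 3, 9, 16, 23):
    print(f"{hour:02d}:00 ->", coverage(time(hour, 0)))
```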

Being on duty means being responsible for handling monitoring events and calls from the previous support lines (customer support), and for watching the system as a whole. But while everything is quiet, the person on duty does their regular work.

Be sure to start the duty rota a few days before the load. First, this lets you verify once more that everyone has all the access they need. Second, a change of working mode is stressful, and many will need time to “tune in”. It is better if the adjustment period does not coincide with the main heat.

Great, alerts arrive, and they reach exactly the people who must respond to them. But the reaction time of the person on duty is strongly affected by unnecessary and unprocessed alerts, as well as by notifications that in principle require no action. It is very important not to leave alerts unprocessed. If many similar events occur regularly, investigate the cause and fix it. No active alarms should remain in the monitoring system.

In my experience, if something cannot be fixed quickly, or does not need fixing but still blinks, it is better to suppress the alert and create a task to investigate. A constantly blinking alarm will sooner or later become familiar and stop attracting attention. The trouble is that when a real problem occurs, people may mix up the lights and ignore a truly important event.

Another extremely important thing is the competent configuration and prioritization of events in the monitoring system. The system should notify you about exactly what needs fixing: specific failures or the risk of them. You are not going to “repair” 100% CPU usage, are you? You will eliminate high latency on the web server, because CPU usage is debugging information, not a problem. If on Black Friday the processor is 100% loaded at the target load, with the target response speed and the planned reserves accounted for, it means you calculated everything correctly.

The utilization of system resources must be monitored, but that is a slightly different task, important for capacity planning and for identifying the areas affected by an incident.

We have set up the events; now it is important to correctly prioritize what we will fix first. To do this, let's look at the differences between the Critical and Warning alert levels. I will give a little exaggerated, but understandable examples.

Critical is when you are riding the subway to visit your grandmother, get an alert and get off at the nearest station. You take out your laptop, sit down on a bench and start working, because sales have stopped or heavy losses have begun. That is, Critical is something with a direct and significant impact on users.

Warning is when you do not leave work until you have fixed it. There is no need to drop everything and rush to the rescue for a Warning. You can have your smoke, finish what you were doing, and then make a decision. An example is a clear risk of critical problems, such as one server of an HA pair going down, or errors pouring into the logs. If you do not let such events pile up and fix them conscientiously (and also get to the root causes and work on prevention), there will be very few of them.
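
The distinction between the levels can be written down as an explicit rule so that nobody has to guess at three in the morning. A minimal sketch follows; the event fields `user_impact` and `risk_of_failure` are hypothetical flags that your monitoring rules would have to set.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"  # users are affected right now: drop everything
    WARNING = "warning"    # clear risk of a critical problem: fix before going home
    INFO = "info"          # debugging data such as CPU usage: no notification


def classify(event):
    """Map a monitoring event to a severity by its effect on users."""
    if event.get("user_impact"):
        return Severity.CRITICAL
    if event.get("risk_of_failure"):
        return Severity.WARNING
    return Severity.INFO


events = [
    {"name": "sales stopped", "user_impact": True},
    {"name": "one server of an HA pair is down", "risk_of_failure": True},
    {"name": "CPU usage 100%"},
]
for event in events:
    print(event["name"], "->", classify(event).value)
```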

Another thing that is often forgotten: do not put only admins on duty. Be sure to involve developers, forming working pairs for each shift. This will come in handy at the following stages.

If the project is functionally complex, it makes sense to put consultants, analysts, testers and anyone else who may be useful on duty as well. Make them available at least by phone. Such a specialist can confirm the problem (or rule it out) and help with functional localization; when you have to call someone in for the repair, this will save you time. I will discuss this in more detail in the next section.

And the last important point. Every employee on duty must know by heart the contacts and areas of responsibility of all colleagues for emergencies. If they cannot solve the problem on their own and start searching for available rescuers in a panic, chaos will follow, and you will lose a lot of time.

Following these simple rules helps avoid problems caused by missed alerts and ensures that when the day comes (read that as either “Black Friday” or “emergency”), people will be able to solve problems quickly.

We confirm the presence of the incident

The next step after receiving the notification is to understand what exactly went wrong and whether there is a problem in principle: it is not always easy to determine who is right, the user or the system. The fact is that the same alert can be interpreted differently depending on the angle of view.

For example, a typical admin who receives a report about bugs in the search (products have disappeared) will go check the search server and read the logs. He will spend a lot of time and confirm that the search works. Then he will dig even deeper trying to understand what is broken. In the end it turns out that the “missing” products were hidden deliberately and there was no problem; the user just did not know.

Or the admin will be stumped and then close the ticket for lack of evidence. After all, other products are found just fine! But in fact someone accidentally deleted the landing page products from the database, and the entire advertising campaign turned into a demotivator.

In the first case, the admin wasted time localizing a non-existent problem because of incomplete information. In the second, the “point of view” is to blame: the admin looks for a technical problem, whereas an analyst would quickly find the logical one and restore the products.

There is only one solution: when you receive an automatic notification, you must clearly know what it means and how to verify it, preferably in the form of written instructions. As for messages from users, they should first be handled not so much by a technical specialist as by a functional specialist with a technical background. This person will also take on another annoying problem: the all-too-familiar confused messages à la “everything hangs”, “your website is not working” and “I click, but nothing happens”.

Before digging further, you need to understand what exactly happened to the person and make sure the problem is “real”. To do this, technical support, where the user reports the problem, must be staffed by polite and experienced specialists. Their task is to extract as much information as possible and understand what, in the visitor's opinion, is wrong. Based on this information you can determine whether this is a technical problem with the site or, say, the interface was not intuitive enough.

Localize failure

Great, we got the alert and confirmed that there is a problem. Next we need to understand its technical nature and outline its area of impact. We have to see what exactly is not working, why, and how to fix it. At this stage our main enemy is the same as before: lack of information.

Good monitoring and logging help fill the gap. First, the key parameters of the system that we discussed in the detection section (sales, visitors, page generation speed, technical errors in server responses) should be displayed as graphs on a large screen (the bigger, the better) in the support team's room.

All important data must always be in front of your support team. During an emergency or a promotion, this lets them react quickly to changes in the indicators and prevent a problem.

To localize the failing component, you will need a site map with data on the interaction of components and their relationships. To quickly detect problem points, you need to track the data for each interaction flow in dynamics.

For example, an application accesses a database. This means that for each database server both from the server and from the client side, we should see the following:

Number of requests per second;
Number of replies;
Response time;
The volume of transmitted responses;
Technical errors of this interaction (authorization, connections, etc.).
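
One way to hold these numbers is a small record per interaction and per side, so that the client's view and the server's view can be compared directly. The sketch below is an assumption about how such counters might be structured for one measurement interval; the field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FlowStats:
    """Counters for one side of one interaction over one interval."""
    requests: int
    replies: int
    total_response_ms: float
    response_bytes: int
    errors: int

    @property
    def avg_response_ms(self):
        return self.total_response_ms / self.replies if self.replies else 0.0

    @property
    def unanswered(self):
        return self.requests - self.replies


def lost_between(client, server):
    """Requests the client sent that the server never registered."""
    return client.requests - server.requests


# Hypothetical one-minute counters for an application talking to one database.
client = FlowStats(requests=1000, replies=950, total_response_ms=47500.0,
                   response_bytes=1_000_000, errors=20)
server = FlowStats(requests=980, replies=950, total_response_ms=19000.0,
                   response_bytes=1_000_000, errors=0)

print(client.avg_response_ms, server.avg_response_ms)
print(client.unanswered, lost_between(client, server))
```

In this made-up example the server answers in 20 ms on average while the client waits 50 ms, and 20 requests never reach the server at all, which points at the network or the connection pool rather than at the database itself.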

Once the problem component is localized, you can go to its logs and see what is wrong with the poor thing. A centralized log collector, for example one built on ELK, greatly speeds up the process.

Also, as I wrote in the previous article, convenient search across cluster logs and the ability to trace request processing along the whole chain save significant time.

We eliminate the failure

At this stage we finally repair what broke and figure out how to speed up the process.

Obviously, our best helper is a troubleshooting instruction. Unfortunately, we will have one only if we have already encountered the situation before and did not forget to write down the working solution. If there is no instruction, you will have to proceed by trial and error.

When you have to repair something new, you need to weigh the safety of the work against the need for prompt intervention. Verifying the fix in a test environment reduces risk on the one hand but delays the solution on the other.

I try to follow this rule: if I am quite sure that it will not get any worse, or I cannot reproduce the problem in a test environment, I may try to fix it right away. But this approach is justified only when three factors coincide:

Everything is down;
The cure will not affect valuable data;
There are backups.
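
Written out as a checklist, the rule looks like the sketch below. The function and parameter names are made up; the point is only that every one of the three conditions is required, while the two entry conditions are alternatives.

```python
def may_fix_in_production(sure_not_worse, reproducible_in_test,
                          everything_down, data_safe, backups_exist):
    """Decide whether an untested fix may go straight to production."""
    # Entry condition: confident it cannot get worse, or no way to test anyway.
    candidate = sure_not_worse or not reproducible_in_test
    # All three safety factors must hold at the same time.
    return candidate and everything_down and data_safe and backups_exist


print(may_fix_in_production(True, True, True, True, True))   # full outage, safe fix
print(may_fix_in_production(True, True, False, True, True))  # site still partly works
```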

In other cases, it is worth reproducing the problem in the test environment and double-checking everything before moving the fix to production. High-quality work at the previous stages (understanding the problem and localizing it) helps avoid repeated fix iterations. As a rule, a repair fails the first time when we are repairing something that is not broken or have overlooked something.

And here load testing comes to our aid again. We emulate production operation and start breaking it deliberately. This is needed to understand how it works and what kinds of problems affect it. In addition, it is a great way to learn how to repair the application and, at the same time, write repair instructions.

After that, you can run tactical exercises in localizing and eliminating problems in the test environment. For example, one of the senior specialists cunningly breaks something, maybe in more than one place, and sends someone to sort it out and fix it on their own, against the clock. It is a very useful practice: it teaches people to work under stress, to learn the system and to hone their skills.

To conclude our short methodological primer, I want to draw your attention to the importance of up-to-date instructions, formal schedules and other paperwork that many dislike. Yes, it eats up the lion's share of time and energy. But the time spent will be repaid a hundredfold when thunder strikes and you “fix it all” without unnecessary nerves.

Operations means SLA. And an SLA is about meeting time limits, both overall and at each stage separately. To control SLA compliance and those very time limits, you need to know the time limit for each stage. Otherwise you will not realize that you are already late somewhere until you exceed the overall limit. And without fixing the work algorithms and specific actions for each stage, you can neither estimate nor guarantee the duration of those stages.
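
To illustrate why per-stage limits matter, the sketch below compares the time spent at each stage of an incident with a limit for that stage. The stage names follow the list of stages given earlier; the limits in minutes are invented for the example and are not a recommendation.

```python
# Hypothetical per-stage limits in minutes; the overall SLA is their sum.
STAGE_LIMITS_MIN = {"detect": 5, "react": 10, "confirm": 15, "localize": 30, "fix": 60}


def overdue_stages(elapsed_min, limits=STAGE_LIMITS_MIN):
    """Return the stages whose elapsed time exceeded their own limit."""
    return [stage for stage, limit in limits.items()
            if elapsed_min.get(stage, 0) > limit]


elapsed = {"detect": 3, "react": 25, "confirm": 10, "localize": 45, "fix": 20}
late = overdue_stages(elapsed)
total_spent = sum(elapsed.values())
total_limit = sum(STAGE_LIMITS_MIN.values())

print(late)
print(total_spent, "of", total_limit, "minutes")
```

In this example the incident as a whole fits into the overall limit, yet two stages were late. Without per-stage limits those delays would stay invisible until the day they add up to a missed SLA.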

Creativity is very interesting but completely unpredictable. Pursue it for the soul, and test and implement the most successful solutions, just not during the preparation for Black Friday or another promotion. The business will thank you for it.

That is all I wanted to say on this topic for now. I will be glad if my advice, adapted to the realities of your business, helps you get through the high load calmly and comfortably.

If you want advice on exactly how to act in your situation, I invite you to my seminar “Black Friday. Secrets of survival”. In a question-and-answer format we will talk about preparing a site for traffic growth and discuss both the technical and the organizational subtleties of the process.

The seminar will be held on August 16 in Moscow. Since the event will be quite intimate (maximum 25 people), pre-registration is required. And I look forward to discussing this with everyone else in the comments. :)
