Supporting high-load projects
Evgeny Potapov ( eapotapov )
This talk is about what to do with a project after it has launched. You planned the architecture, thought through how the infrastructure would work and how you would balance the load, and finally launched. What next? How do you support the project, keep it running, and make sure that in the end nothing falls over?
First, a little about myself. I am from ITSumma; our company provides round-the-clock support and administration of websites. We were founded in 2008 and now have 50 employees. We support more than 1000 servers, which are visited by more than 100 million people every day.
We support both large and small projects, mainly Russian ones, and based on that experience I would like to talk about what is worth doing.
If we talk about support, what does it consist of? In my opinion, it consists of three components:
- monitoring the project and responding to the alerts that come in,
- organizing redundancy and backups (we treat these as two separate things),
- organizing the support service itself, that is, the response to what is happening with the project.
As for monitoring, we all understand that it is necessary. What you need to understand is how and why to use it. The use of monitoring can be divided into three stages, in the order in which people and companies usually arrive at them.
At the first stage, monitoring is seen simply as a warning system about the state of the project. The easiest step is alerting about problems on the server. Alerts about problems in the application logic are more complicated: we notice that something has gone wrong with the site from indicators of what is happening on it. The last step is alerting about problems with business metrics.
If we talk about the requirements for the alerting system on any high-load project, then first of all the monitoring system must be independent of the project itself. It should not be hosted where the project is hosted and should not be tied to it. If the project goes down, the monitoring system must not go down with it.
The monitoring system should notify you as quickly as possible about what is happening on the project. If you have an accident, or if it looks like one is about to happen, the notification should not arrive long after the fact; you should find out as soon as possible. And you must be sure that you reliably receive information about what is happening with the project. That is, you must know for certain that when the critical thresholds you set for the project are reached, or when an accident occurs, you will receive a notification.
Now about alert levels. The first level is problems on the server, where we look, in the most basic way, at physical indicators and at indicators related to the operating system and basic software. The second is problems at the application level, where we look at how the application's subsystems interact with each other and with external services, and what happens there. The third level is problems at the level of business logic.
Problems on the server. The simplest things you can monitor there are:
- statistics on CPU load, where we look at the processor load itself and at an indirect indicator of it, the load average (a moving average);
- statistics on the load on the disk subsystem;
- statistics on the use of RAM and swap.
Of all the indicators we can monitor on a server, these three are the first to tell us that something is wrong. CPU load grows with traffic or with suboptimal code. Load on the disk subsystem most often comes from suboptimal interaction with the database. RAM usage rises sharply, and memory spills into swap, when traffic increases or, again, when the code is suboptimal.
Statistics on server software. We set alerts for drops and surges in the number of requests to the server. We also check site availability from outside, and when we set up external availability checks we should check more than the page response time and response code. Development-side mistakes are common where, instead of HTTP 502, the server returns HTTP 200: the page shows an error, but for the monitoring system and for search engines it still counts as a valid, normal page that simply contains the error text. Therefore, we recommend:
- in addition to checking the response time and code, look at the size of the page response. It usually does not change much over time, and when a really big accident happens, the size most often drops sharply;
- using tools that have appeared by now, such as CasperJS, measure the time it takes to actually render the page. Very often we see that because of some external service (in the last year, the Cackle comment service has been guilty of this several times) rendering slows down drastically. That is, the pages you check on the server are generated fine, but in the browser people see only a few pictures and the page never finishes loading.
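As a rough illustration of the first recommendation, here is a minimal Python sketch of an external check that looks at the status code, the response time, and the size of the body. The function names, the 2-second limit, and the "half of the usual size" rule are assumptions made for the example, not values from the talk.

```python
import time
import urllib.error
import urllib.request


def evaluate_check(status, elapsed_s, body_size, baseline_size,
                   max_elapsed_s=2.0, min_size_ratio=0.5):
    """Return a list of problems found; an empty list means the page looks healthy."""
    problems = []
    if status != 200:
        problems.append(f"unexpected status {status}")
    if elapsed_s > max_elapsed_s:
        problems.append(f"slow response: {elapsed_s:.2f}s")
    # An error page served with HTTP 200 is usually much smaller than the normal page.
    if body_size < baseline_size * min_size_ratio:
        problems.append(f"body shrank to {body_size} bytes (baseline {baseline_size})")
    return problems


def fetch(url, timeout=10):
    """Fetch a URL and return (status, elapsed seconds, body size in bytes)."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
            status = resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; we still want the status and the body size.
        body = err.read()
        status = err.code
    return status, time.monotonic() - started, len(body)
```

The baseline size would come from earlier successful checks of the same page. Measuring render time in a real browser, as with CasperJS, is a separate check that this sketch does not cover.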
Application-level alerts. First of all, you should monitor the number of application-level errors. Yes, there are logs you can look into, but on most projects, sooner or later, in the rush to ship new features, there is no time left to sort out all the notice-level messages coming from the application. Instead of untangling that stream, one of the simplest methods is to monitor the number of such messages per minute. That is, we see that we usually have about 100-200 such messages per minute, and then after some routine deploy this number jumps sharply to 2000-3000. We are not interested in an individual line; we are interested in the overall volume of these messages.
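A minimal sketch of this idea in Python: count messages per minute and flag the minutes that stand out from the usual level. The function names, the baseline, and the multiplier are assumptions chosen for the example; in practice the baseline would come from your own history.

```python
from collections import Counter


def errors_per_minute(timestamps):
    """Bucket message timestamps (seconds since epoch) into per-minute counts."""
    return Counter(int(ts) // 60 for ts in timestamps)


def spiking_minutes(counts, baseline, factor=5):
    """Return the minutes whose message count exceeds baseline * factor, in order."""
    return sorted(minute for minute, n in counts.items() if n > baseline * factor)
```

With a usual level of about 200 messages per minute and a factor of 5, a minute with 2500 messages is flagged and a minute with 150 is not.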
The number of calls to subsystems ("nodes"). Suppose you have a project that makes good use of a caching system and rarely accesses the database. You can and should monitor the number of database calls (selects, inserts, updates, deletes) and then watch how it changes. As long as there are no sharp jumps in traffic, this number will stay roughly constant. If something happens to the caching system, you will see a surge in such requests and understand that something is wrong and needs investigating. This should be done for every subsystem, and then, based on the information you collect, you can add further alerts.
If you have any external services (by which I mean databases, services of your own located on other servers, and external APIs), then you should definitely monitor the interaction with them. Especially with external APIs. We all think that we are the ones with problems and that at large companies everything is fine, but our experience with clients that use the Facebook API shows that bugs are very common there and the response time can increase sharply. You might assume that your own site has started to slow down, when in fact the interaction with an external API has degraded badly, and you never even suspected it could. Accordingly, we monitor the response time of the important requests we depend on, and when it jumps, we begin to investigate.
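One simple way to collect such timings, sketched in Python: wrap each important external call in a timer and keep the durations per service. The names `timed` and `is_slow`, the in-memory dictionary, and the "three times slower than usual" rule are assumptions for the example; a real setup would send the durations to the monitoring system.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(name, store, clock=time.monotonic):
    """Record how long the wrapped block took under store[name]."""
    started = clock()
    try:
        yield
    finally:
        # Recorded even if the call raised, so failures that time out are visible too.
        store.setdefault(name, []).append(clock() - started)


def is_slow(samples, usual_s, factor=3):
    """True if the latest sample is more than `factor` times the usual duration."""
    return bool(samples) and samples[-1] > usual_s * factor
```

Usage would look like `with timed("external_api", durations): call_the_api()`, where `call_the_api` stands for whatever request you depend on.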
There is one cool thing that, surprisingly, is very rarely done: monitoring business logic.
When we monitor a server, there are a million indicators we could watch, and about some of them we think: "This one is not important now. I won't put an alert on it; I can't see how it could become critical or what I would react to." Sooner or later that indicator fires, and it turns out that you had no alert for the accident that just happened. In the end, all these million indicators turn into problems with the thing the application exists for. And the application exists, roughly speaking, for things ranging from the amount of money earned per day and the number of users who performed some action to simple things: the number of purchases, the number of orders placed, the number of posts on the site. All of this is fairly easy to monitor, especially if you are a developer, and you can set alerts from this point of view. Then, even if we missed something in the server and software monitoring, say the hardware began to slow down and users started to drop off at some step, we will see the drop in traffic or actions, and we can then go through the graphs, investigate the cause, and figure it out.
The second thing worth doing is emulating the user's path through the application. We often monitor server-level things and check that individual scripts respond. But it is very useful to monitor a whole flow. Say the site has sign-up: the user comes, fills out the form, receives an email with a registration link, clicks the link, and lands on the site. That is at least five places where something can break. If we monitor each of these five places separately, we still cannot be sure that the sequence works as a whole. Create a script that goes to the site, fills out the form, and submits the data with a POST; then a script that checks the mailbox and follows the link from the email; and finally check that the cookies were set. Technically this is about two days of work, and the control you gain significantly outweighs the cost of writing the script. Therefore, I suggest making this mandatory for the most critical functions on the site.
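The skeleton of such a scenario check can be very small. The Python sketch below only shows the control flow: steps run in order, share a context (for example, the registration link found in the email), and the check reports the first step that failed. The step functions themselves are hypothetical and would use an HTTP client and a mailbox client of your choice.

```python
def run_scenario(steps):
    """Run (name, step) pairs in order.

    Each step is a callable that receives a shared context dict and raises
    an exception on failure. Returns None if every step passed, otherwise
    a (step name, error message) tuple for the first failing step.
    """
    context = {}
    for name, step in steps:
        try:
            step(context)
        except Exception as exc:
            return name, str(exc)
    return None
```

A registration scenario would then be a list such as `[("submit form", submit_form), ("check mail", find_link_in_mail), ("follow link", open_link), ("check cookies", check_cookies)]`, and a non-None result is what gets sent as an alert.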
Another thing we often see done wrong in monitoring is alerts set at critical values that are already close to a terrible disaster. People put an alert on CPU utilization of 99% for 5-10 minutes. By the time such an alert reaches you, there is technically very little you can do: you already have an overloaded server and a crashed application, and you have to decide what to do in firefighting mode. Some time after launch, once you can analyze your traffic, you can understand the nature of the load. For example, if your processor is usually loaded at 30%, you do not need an alert at 90% or at 99%; you need to choose key points at which new decisions have to be made. That is, the alert fires at 50%, and you know that you were used to 30% and now it is 50%. This is not scary from the server's point of view, but you should think about how the growth will continue and what happens when you reach 70%, and so on. This gives you a margin of time in which to decide whether you can still live with it or whether it is time to change the architecture, buy new hardware, or do something with the code if a recent change made it respond more slowly.
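The idea of several decision points instead of one "everything is on fire" threshold can be sketched in a few lines of Python. The 50% and 70% points come from the example above; the 90% point and the level names are assumptions for illustration.

```python
def alert_level(cpu_percent, thresholds=((50, "plan"), (70, "act"), (90, "critical"))):
    """Return the name of the highest threshold reached, or None below the first one.

    Thresholds must be listed in ascending order.
    """
    level = None
    for limit, name in thresholds:
        if cpu_percent >= limit:
            level = name
    return level
```

With a usual load of 30% this returns None, at 55% it returns "plan", and at 75% it returns "act", so the first notification arrives long before the server is actually overloaded.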
Monitoring as an analysis system. So far we have talked about monitoring as a warning system, but when choosing a monitoring system we must understand that it is not only a warning system but also an analysis system that:
- a) must store as much data as possible;
- b) must make it possible to quickly select this data for the desired period;
- c) must be able to quickly display it in the form we need.
Among other things, it should be able to compare the same kind of data across different servers, if we have a multi-server system. If we have five servers of the same type and look at each one separately, we may not notice that one of them has started to deviate from the others by 30%; if we plot the statistics for all five together, we will see it.
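The same comparison can be automated. A minimal Python sketch, assuming we compare each server with the median of the group and treat a 30% deviation as noteworthy (the function name and the use of the median are choices made for this example):

```python
from statistics import median


def deviating_servers(metric_by_server, tolerance=0.3):
    """Return the servers whose metric differs from the group median by >= tolerance."""
    mid = median(metric_by_server.values())
    if mid == 0:
        return {}
    return {
        name: value
        for name, value in metric_by_server.items()
        if abs(value - mid) / mid >= tolerance
    }
```

For five web servers with CPU load of 40, 42, 41, 39, and 56 percent, only the last one is reported.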
The monitoring system should be able to compare current data with historical data, because a slight decline or slight growth over time is easy to miss when you just eyeball a chart. If we plot two lines, one current and one from, say, a month ago, we will see the real difference and can start to understand what is happening.
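Numerically, the comparison of the two lines boils down to something like the following Python sketch. Comparing plain averages of the two periods is an assumption of the example; a real system might compare point by point or by time of day.

```python
def relative_change(current, month_ago):
    """Relative change of the average of `current` versus the average of `month_ago`."""
    cur = sum(current) / len(current)
    old = sum(month_ago) / len(month_ago)
    return (cur - old) / old
```

A result of 0.1 means the metric is 10% higher than it was a month ago, which is hard to spot by eye on a single chart but easy to alert on.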
Quick retrieval of a large range of historical data. We are recording a huge amount of data, so monitoring is itself a high-load application. If monitoring stores this data, it must also be able to return it quickly.
In a talk at RootConf, my colleague and I reviewed the monitoring systems currently available. If we talk about what can be used more or less out of the box, these are the classics, Zabbix and Graphite. What is interesting, regarding the third point about quickly retrieving a large range of historical data: among our customers, about 70% are on Zabbix and about 20% on Graphite, and almost every Graphite node has its CPU loaded to nearly 100%. That is, the system is good at recording data and good at displaying it, but there are regular problems where monitoring simply cannot draw what is asked of it. The system has to do this quickly, and you cannot get it to.
An additional thing that is very useful when choosing a monitoring system is complex aggregation of metrics. Some metrics on a project can be fairly noisy, and that noise needs to be reduced somehow: take a moving average, look at percentiles for requests (the time within which 99% of requests complete), group requests by the minimum over a period, and so on.
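For reference, the two aggregations mentioned above are simple to state in code. This Python sketch uses a trailing window for the moving average and the nearest-rank definition of a percentile; monitoring systems may use slightly different definitions.

```python
import math


def moving_average(values, window):
    """Average of each run of `window` consecutive values (trailing window)."""
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value that covers pct% of the samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Applied to request durations, `percentile(durations, 99)` answers the question "within what time do 99% of requests complete", and the moving average smooths a noisy series before it is compared with a threshold.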
In addition, monitoring should be thought of as a decision-making system.
That is, we have monitoring as a warning system, and we have monitoring as a system for analyzing why a problem occurred and how to avoid it. Most importantly, we must use the monitoring system as something that allows us to:
- a) understand how an accident happened and what to do so that it does not happen again;
- b) see how the system will develop in the future;
- c) work out how to avoid repeating mistakes that have already been made.
If we talk about affordable solutions, ones that are really good, work for a small project, and do not cost too much to implement, I believe you should use SaaS services. Three services are listed above; you do not have to use these particular ones, since they are almost all the same. For a small number of servers they are inexpensive. And most importantly, integrating them will not eat up your time.
Complex data collection. Here I would look at Graphite, but you will suffer a lot deciding how to store data in it. If you have a very complex system and very complex data, you can think about building your own system or heavily customizing an open-source one, but you need to understand that developing such a system is a very, very large investment.
We use our own system because, historically, our infrastructure is very heterogeneous and we have high requirements for how it should be supported and how problems should be escalated within the company. We have been developing it since 2008. Two developers work on it full time, and administrators have a great many requests for how to change it further. The only reason we use it rather than Graphite or Zabbix is that we understand what we can get from our monitoring system. We understand how to extend it and whether we can or cannot do something. But besides everything else, you need to understand that you get used to a monitoring system, and from the moment you choose it, the longer you use the same system,
Let's move on to the second part: redundancy and backup.
First of all, I would like to talk about backup. Very often, both among our customers and among others we see, it is treated as a simple thing that works by itself and is not even worth thinking about. However, when making backups, you need to understand that:
- Backup creates a load on the server.
- The frequency of backups can matter a lot for your project. For example, a huge number of users are constantly performing some action; let it be a service where people process photos for money. If your backups are made once a day, the data is lost, and you restore a copy from 20 hours ago, you will get a huge number of problems with users saying: "OK, where is my data for the last 20 hours?"
- The time it takes to restore from a backup also matters. That is, you can make a dump that is created quickly, but in some cases restoring from it can take a very long time.
- Making a backup is a heavy procedure in itself.
- Because of this, backups are best taken from standby machines, without creating load on production machines.
- You need to understand what has to be backed up. The classic case: the database is backed up often, "but we have a lot of user-generated static content, so we'd better back that up once a day." But what is the point of backing up the database every few hours if, when you restore from it, the static content is missing and users are just as unhappy? Roughly speaking, they will leave. Say you run a second VKontakte: you dropped the database, your server crashed, you restored it from the database backup, and now you have links to photos but no photos...
- The classic rule: a backup without a regular restore procedure is not a backup. Very often we see that people have set up a backup, they copy it off somewhere, and they believe this is enough. In four cases out of ten, it turns out to be impossible to restore from a backup that was never checked. The organization should have a regular restore procedure, a regular schedule of drills in which, on a given date, you restore from a backup and check how much you can recover, how up to date the data is, and how well the site works after the restore.
We use the slave server only as a standby in case the master server fails. There is a classic story about the Sberbank outage a few years ago, when they used the slave server as their backup and did nothing else. That is, there were no backups, only a slave. But you need to understand that a slave replays all requests, and if someone came to the master server and said "delete everything from the master", the slave receives exactly the same command.
In some databases (I can't speak for Postgres, but I have seen this in MySQL) it is possible to set up a slave with delayed replication. This is a small hack that sometimes helps. That is, you delay the slave by about 30 minutes, which gives you time: when you realize that you accidentally ran a drop on the master database, you can immediately stop replication and switch to the slave instead of going through a long restore from backup.
Hot-backup services for regular external backups. That is, we have a slave for protection against failures, and a hot backup for protection against human error.
We keep at least one copy at the same site as the project so that it can be downloaded quickly, and at least one copy at an external site for recovery after a global accident.
You cannot rely on a copy stored in the same place where you expect the accident to happen, because when a Hetzner data center goes down completely for several days, you cannot recover. For all those days, even with the right backup and the right procedures, you will just be waiting for the hosting to come back up. And you need to understand that hosting outages are not always a matter of 10, 15, or 20 minutes. Amazon has had four accidents over the past five years: the longest lasted 48 hours, the next 24 hours, the next 16 hours, and the next about 10 hours. Each time they said, "We terribly apologize, we will refund what the hosting cost for that period", but clearly this is not comparable to the actual losses. Therefore, it’s ideal to store backups in another place,
Content. A complicated topic, especially if you have a lot of it. If you don't have much, run rsync regularly and take snapshots. If you have a lot, it is best to implement backup right at the application level. Say a photo or some other file is uploaded or processed by the user. Right after the response has been returned to the user, have that file duplicated to the backup. Then you will not need to copy the whole batch of changes once a day, once an hour, and so on. It is a single simple action, and the user does not notice it, because the content has already been returned to them. The file is simply copied in the background to another site that is independent of the main one.
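A minimal Python sketch of the application-level approach: the upload is saved to the main storage, and a background thread copies it to the backup location. Here `backup_dir` is a local directory standing in for the independent external site, and the function name is hypothetical; a real implementation would upload to remote storage and would need retries and a queue that survives restarts.

```python
import shutil
import threading
from pathlib import Path


def store_upload(data, name, primary_dir, backup_dir):
    """Save an uploaded file and duplicate it to the backup location in the background.

    Returns the primary path and the worker thread, so the caller can answer
    the user immediately and does not have to wait for the copy.
    """
    primary = Path(primary_dir) / name
    primary.write_bytes(data)

    def replicate():
        # Runs after the caller has already moved on to responding to the user.
        shutil.copy2(primary, Path(backup_dir) / name)

    worker = threading.Thread(target=replicate)
    worker.start()
    return primary, worker
```

The point of the design is that each file is copied once, at the moment it appears, instead of scanning for changes on a schedule.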
A very common thing we see is that the database is backed up, content is backed up, and configs are not. Then the data center or the server crashes, we need to recover, we restore from backups, we have backups of the code and of the static content, everything is great, but the web server configuration is very, very complicated, and the project team spends the next few hours restoring the config from memory or trying to adapt a config from two years ago.
There is redundancy and there is backup; these are two different things.
Choosing a site for the standby.
The standby site should not be connected to the current data center. We very often see a standby server rented in the same place as the main one: roughly speaking, the same Hetzner, where one machine in the rack is the main server and a second machine in the rack acts as the standby. Yes, this protects against a crash of the server itself, but not against a failure of the entire data center. When choosing a standby site, people most often want to make it cheaper because it does not receive traffic. You need to understand that after switching to the standby you may have to stay on it for a long time. And if you think that you can live on it for an hour but not for a day or more, think again about whether it is worth setting up that way.
The question of standby capacity. At the start of a project, nobody wants to bring up a copy of the production architecture. That is, suppose you have ten servers in production; for a full standby you would need to bring up ten more such servers somewhere else and do nothing with them. You can try to bring up not ten plus ten servers but five plus five, and distribute traffic between them. But here we run into two issues:
- the first is synchronization between the data centers and the connectivity between them;
- the second is that when an accident occurs in one of them, five servers receive the traffic that ten used to handle, and load problems occur anyway. If the other site is down for a long time, you run on those five machines in overloaded mode for quite a while.
What people often use is the so-called hybrid cloud. You have a site that serves production, and you have a minimal configuration in the cloud that is only capable of receiving replication from the production site. If an accident happens at the production site, the cloud configuration scales up to match what the site was, and traffic switches to it. If you have to run there for a long time, it will be somewhat more expensive than production on physical hardware, but in regular operation, when there are no accidents, you save time and money on the standby and still do not risk getting into trouble when switching.
Important things when setting up a standby. First of all, you need to understand how long the switchover will take. You need to understand how consistent the data synchronized to the standby is. And you need to understand the risk of downtime during the switchover.
There is a very, very complicated procedure that people don't like... Checking recovery from backups is, in principle, simple, but you also need to regularly check the ability to switch to the standby site. You cannot build a standby site and simply wait for the moment when you have to switch to it. There are a million ways the standby can fail. The simplest is that you overlooked something when designing the standby architecture. Yes, if we switch without an accident and something goes wrong and the project goes down, at least it is a conscious risk, taken at night or at some quiet hour when this can be done. That is better than discovering, while switching during a real accident, that something is missing, some files are not synchronized, some database is lagging behind, and so on.
- Slave is not a backup.
- A backup in the same data center is not a backup.
- A standby in the same data center is not a standby.
- A backup that has never been restored is not a backup.
- A backup without static content is not a backup, because people simply will not know what happened to their content.
- "There is not enough space" is not an excuse. Very often we hear: "Guys, our backup is in bad shape, we don't have enough space right now, we will sort it out in a week." Just the other day, clients who were going to solve the problem in a week lost two servers. An extra standby that we had set up saved them. But it is very upsetting to realize that you spent a week wondering when you would find the time to set up a backup and rent a server at Hetzner for 60 euros with 2 TB disks, and instead did nothing and lost data that cost much more.
- "I'll sort out the backup one of these days" is a phrase that is one step away from lost data.
And it is important to understand that none of this works without a team that will respond to these things and do these things.
What this team should be like. The team should be in-house. It can be partly external, but there must still be people inside who understand how everything works and how it should work. They must understand the product from the inside and be able to localize a problem. They must feel part of your company's team, simply because, by the law of meanness, accidents most often occur at night, and whatever time zone you are in, when a person is woken up by a call from a manager or partner, or by an SMS about an accident, they should not shrug it off and hang up, but understand that there is trouble and their help is needed.
They need tremendous stress resistance, because sometimes, especially at the start of a project, such SMS messages come every day for the first six months. Most often, at the start, these are developers with the makings of an administrator, who know how the code works and can fix it, and who know how to configure the server to make it work. And they should be extroverts, because when you work with them or depend on them, it is very unpleasant to wait 30 minutes or an hour while a person is figuring something out but says nothing about it. That is, you ask, "Vasya, how is it going?" and Vasya does not respond. It is better if at least one person on the team is an extrovert who says: "Guys, it's like this and this, here is what we are doing, and we'll fix it now."
None of this works without organization. Yes, there is a team, but in the process of support the team very often works in chaotic conditions. An accident came that had never happened before, and it was fixed. Now it is necessary to understand how the accident happened, how it was fixed, and what to do so that it does not happen again. If we fixed the accident but did not work out how to prevent it, we did not really fix it, and we will get another one. If we had an incident that monitoring did not catch, the first steps to take are adding server monitoring and application monitoring so that such an accident cannot go unnoticed again.
Such teams really dislike bureaucracy while they are fighting a problem, but afterwards we need to make sure that what was done is formalized, and that all of these procedures are formalized.
How to organize support at the start of your project. At first it will simply be SMS messages to key people and phone calls to people who do not sleep and constantly watch how things work. Then it can become on-call duty, covered by people within the team or by additional people. At first, two people in two 12-hour shifts. But people will not last long this way, simply because those who work the night shift quite possibly have wives, children, families. When they work from 8 in the evening until 8 in the morning (I say this from life experience), at 10-11 in the morning those same wives, children, family members, and friends start calling and saying: "Vasya, come on, let's go." And at that moment it is very hard to sleep and to realize that you simply have no life left.
From experience: each person who receives alerts should have the phone numbers of at least three key people to call and report that there is a problem. One person may not pick up, two may not pick up, but I don't remember a case where nobody picked up after calls to three.
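If the calls are made by a script rather than by a person, the same rule is a short loop. In this Python sketch `call` is a hypothetical function that dials a contact and returns True if the call was answered; the telephony itself is outside the scope of the example.

```python
def escalate(contacts, call):
    """Call contacts in order and return the first one who answered, or None."""
    for contact in contacts:
        if call(contact):
            return contact
    return None
```

A None result means that nobody on the list picked up, which is itself worth a separate, louder alert.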
Additional metrics for support that are worth looking at over a long period. One is alerts per hour. A live project with no incidents is quite rare; something always goes wrong. But the number of incidents in a given period (per hour, per week, per month, depending on the scale) is also a metric that can be monitored. If you had five incidents per week over the last six months and you have had five incidents per day over the last month, something is going wrong and you need to investigate.
Time spent on servers. Perhaps you have 20 servers and spend much more time on one of them than on the others. Perhaps neither the team nor you notice this, but there is something on that server that needs to be dealt with.
Something we recently introduced for ourselves is recording SSH sessions. This is not for control but to understand the causes after an accident. Instead of trying to remember what you did in a panic over the last half hour while performing a quick surgical operation, you can just look at the recording, find what was done, write it down, formulate it, and understand it.
More about work. This one is more of a joke. If several key people are on the same flight, the servers will certainly go down. You need to make sure that at least one of the key people always remains reachable.
People who receive SMS alerts. Ideally, they should have at least two ways of communicating: being able to take phone calls, having a laptop with them, and so on. At least two ways to reach them. Because cellular coverage is not available everywhere, phones get lost, a phone can be dropped; two SIM cards are already more reliable protection... Ideally, call forwarding on unavailability is set up on these SIM cards to another person who will respond. And lastly: life is hard for these people, so love them and appreciate them.
Instead of conclusions.
- Just building a project and launching it does not work. That is, if you believe that you will plan the architecture, prepare the project itself, prepare the code, launch, and everything will just work: no. A project is a living thing; it changes constantly. Even a person is not healthy for a whole lifetime; illnesses and ailments appear.
- Setting up monitoring and expecting that nothing will ever go down again does not work either. Monitoring is a thing that notifies you about what is happening.
- Setting up redundancy and not testing it does not work.
- Setting up redundancy and monitoring but not organizing a support service does not work. You may get a great many alerts, but if nobody knows what to do with them, it will be unclear what to do next.
This talk is a transcript of one of the best presentations at HighLoad++ Junior, a training conference for developers of high-load systems.