Failover: perfectionism ruins us, and... laziness

    In summer, Captain Obvious reminds us, both purchasing activity and the pace of infrastructure changes in web projects traditionally slow down, simply because even IT people go on vacation (and the CTO too). It is harder for those who stay at their posts, but that is not the point right now: this lull may be exactly why summer is the best time to go over your existing failover scheme and draw up a plan for improving it. And here the experience of Yegor Andreev from AdminDivision, which he shared at the Uptime Day conference, will come in handy.

    When building standby sites, when adding redundancy, there are several traps you can fall into, and falling into them is absolutely not allowed. What ruins us in all this, as in many other things, is perfectionism and... laziness. We try to make everything-everything-everything perfect, but we don't have to! We only need to do certain things, do them correctly, and see them through to the end so that they actually work.

    Failover is not some "fun to have" toy; it is a thing that should do exactly one thing: reduce downtime so that the service, and the company, loses less money. And with every redundancy method I suggest thinking in the following context: where is the money?



    First trap: we think that when we build large reliable systems and add redundancy, we reduce the number of incidents. This is a terrible fallacy. When we add redundancy, we most likely increase the number of incidents. If we do everything right, we will reduce the total downtime: there will be more incidents, but each one will cost less. After all, what is redundancy? It is a complication of the system. Any complication is bad: we get more cogs, more gears, in a word, more components, and therefore a higher chance of breakdown. And they really do break, and they will break more often. A simple example: say we have a website on PHP and MySQL, and it urgently needs a standby.

    Welp (c). We take a second site and build an identical system there... The complexity immediately doubles: we now have two of everything. On top of that we layer the logic for moving data from one site to the other, that is, data replication, copying of static files and so on. Replication logic is usually very complex, so the total complexity of the system may grow not 2 times, but 3, 5, or 10 times.

    Second trap: when we build truly large, complex systems, we fantasize about what we want to get in the end. Voila: we want a super-reliable system that runs with no downtime at all and switches over in half a second (or better, instantly), and we set about turning the dream into reality. But there is a nuance: the shorter the desired switchover time, the more complex the system logic becomes. The more complex we have to make that logic, the more often the system will break. And you can end up in a very unpleasant situation: we do our best to reduce downtime, but in fact we complicate everything, and when something goes wrong the downtime will be longer. That is when you catch yourself thinking: it would have been better not to build the standby at all; better a single system with a predictable downtime.

    How do we deal with this? We must stop lying to ourselves, stop flattering ourselves that we are about to build a spaceship here, and honestly understand how long the project can afford to stay down. For that maximum time we then choose which methods we will actually use to increase the reliability of our system.
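
    To make the "where is the money?" question concrete, here is a back-of-the-envelope comparison in Python. Every number in it is invented purely for illustration: it only shows how the "more incidents, but each one cheaper" logic from the first trap can be weighed against the cost of the standby itself.

        # Back-of-the-envelope failover economics; all numbers are made up.
        revenue_per_hour = 500.0          # money lost per hour of downtime
        incidents_per_year_simple = 4     # simple system: fewer, longer incidents
        hours_per_incident_simple = 6.0

        incidents_per_year_standby = 8    # redundant system: more, shorter incidents
        hours_per_incident_standby = 0.5
        standby_cost_per_year = 3000.0    # servers plus engineering time

        loss_simple = (incidents_per_year_simple
                       * hours_per_incident_simple * revenue_per_hour)
        loss_standby = (incidents_per_year_standby
                        * hours_per_incident_standby * revenue_per_hour
                        + standby_cost_per_year)

        print(f"expected yearly cost, no standby:   {loss_simple:.0f}")
        print(f"expected yearly cost, with standby: {loss_standby:.0f}")
        # The standby only pays off when the second number is smaller.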



    Now it's time for some stories from the a... from real life, of course.

    Example number one


    Imagine the business-card website of Pipe Rolling Plant No. 1 in the city of N. In huge letters it says: PIPE ROLLING PLANT No. 1. A little lower is the slogan: "Our pipes are the roundest pipes in N". And below that, the CEO's phone number and his name. We understand that this needs a standby; it is a very important thing! We start figuring out what it consists of. Static HTML, that is, a couple of pictures where the CEO is, in fact, sitting at a table in the banya with his partner, discussing the next deal. We start thinking about downtime. The first thought: it must not be down for more than five minutes. Then the question: how many sales did this site ever bring in? How many? What do you mean, zero? Exactly that: all four deals over the past year were made by the CEO at that same table, with the same people he goes to the banya with. And we understand that nothing terrible will happen even if the site lies down for a day.

    Given that brief, we have a whole day to bring this thing back up. We start thinking about a redundancy scheme and choose the most ideal one for this example: no redundancy at all. The whole thing can be brought back by any admin in half an hour, smoke breaks included: install a web server, copy the files, and that's it. It will work, and afterwards nothing needs to be watched or given special attention. So the conclusion from example number one is pretty obvious: services that can do without a standby do not need one.



    Example number two


    A company blog: specially trained people write news there, such as "we took part in such-and-such an exhibition" or "we have released yet another new product", and so on. Say it is standard PHP with WordPress, a small database and a bit of static content. Of course, the first thought again is that it must never go down, "no more than five minutes!", and that's that. But let's think further. What does this blog actually do? People arrive there from Yandex and Google on organic search queries. Great. And are sales connected with it in any way? Insight: not really. Advertising traffic goes to the main site, which lives on another machine. We start thinking about a redundancy scheme for this little blog. In a good way, it should come back up within a couple of hours, and it would be nice to prepare for that in advance. It would be reasonable to take a machine in another data center, set up the environment on it (a web server, PHP, WordPress, MySQL) and leave it idle. The moment we realize everything is broken, two things need to be done: load the MySQL dump of about 50 MB, which will fly over in a minute, and copy some number of pictures from the backup. That is not a big deal either. So the whole thing comes back up in half an hour, with no replication and, God forgive me, no automatic failover. Conclusion: whatever we can quickly roll out from a backup does not need a standby.
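
    For illustration, the whole restore described above could be scripted as a short runbook. This is only a sketch: the database name, dump path, backup host and directories below are made up, and the standby machine is assumed to already have the web server, PHP, WordPress and MySQL installed.

        #!/usr/bin/env python3
        """Minimal restore runbook for the blog standby (hypothetical paths and hosts)."""
        import subprocess

        DB_NAME = "blog"                                   # assumed database name
        DUMP_PATH = "/backups/blog-latest.sql.gz"          # assumed nightly dump
        MEDIA_SRC = "backup-host:/backups/blog-uploads/"   # assumed picture backup
        MEDIA_DST = "/var/www/blog/wp-content/uploads/"

        def run(cmd: str) -> None:
            print(f"+ {cmd}")
            subprocess.run(cmd, shell=True, check=True)

        def restore() -> None:
            # 1. Load the ~50 MB MySQL dump into the pre-installed standby MySQL.
            run(f"zcat {DUMP_PATH} | mysql {DB_NAME}")
            # 2. Copy the uploaded pictures over from backup storage.
            run(f"rsync -a {MEDIA_SRC} {MEDIA_DST}")
            # 3. Bring the web server up; it was installed ahead of time and left idle.
            run("systemctl start php-fpm nginx")

        if __name__ == "__main__":
            restore()

    Half an hour of manual work compressed into one script, and still no replication anywhere.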



    Example number three, more complicated


    An online store. PHP with OpenCart, slightly customized; MySQL with a sizable database; quite a lot of static content (after all, an online store has beautiful HD pictures and all that jazz); Redis for sessions and Elasticsearch for search. We start thinking about downtime. Here, of course, it is obvious that an online store cannot painlessly lie down for a day: the longer it is down, the more money we lose. It is worth speeding things up. By how much? I believe that if we are down for an hour, nobody will go crazy. Yes, we will lose something, but if we get overzealous, it will only get worse. So we set the allowable downtime at one hour.

    How can all this be made redundant? In any case we need a standby machine: an hour is not much time. MySQL: here we already need live replication, because 100 GB will most likely not load from a dump within an hour. Statics, pictures: again, 500 GB may not manage to copy over in an hour, so it is better to copy the pictures continuously, right away. Redis is more interesting. Sessions live in Redis, and we cannot simply take it and bury it, because that would not end well: all users would be logged out, their carts emptied, and so on. People would be forced to re-enter their logins and passwords, and many might give up and never complete the purchase. Conversion would drop again. On the other hand, a Redis that is up to date exactly one-to-one, down to the last logged-in user, is probably not needed either. A good compromise is to restore Redis from a backup taken yesterday or, if we take them hourly, an hour ago. The benefit of restoring from backup is that it is a single file to copy. And the most interesting story is Elasticsearch. Who here has ever set up MySQL replication? And who has set up Elasticsearch replication? And for whom did it keep working fine afterwards? My point is this: we see a certain entity in our system. It seems useful, but it is complex.
    Complex in the sense that our fellow engineers have no experience with it, or only negative experience, or we understand that it is still a fairly new technology with its nuances and rough edges. We think: damn, the Elasticsearch index is also huge, and restoring it from backup also takes a long time; what do we do? We realize that in our case Elasticsearch is used for search. And how does our online store actually sell? We go to the marketers and ask where people come from. They answer: "90% come from Yandex Market straight to the product card." And they either buy or they don't. So search is needed by 10% of users. And keeping Elasticsearch replication going, especially across data centers in different availability zones, really does involve a lot of nuances. So what is the way out? We put Elasticsearch on the standby site and do nothing with it. If the outage drags on, maybe we will bring it up someday later, but that is not certain. The conclusion is, give or take, the same: once again, we do not add redundancy to services that do not affect the money. It keeps the scheme simpler.
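
    The Redis compromise above boils down to shipping one snapshot file to the standby on a schedule. A minimal sketch of that hourly copy might look like this; the paths and the standby host name are made up, and a real script would poll for BGSAVE completion instead of sleeping.

        #!/usr/bin/env python3
        """Hourly Redis snapshot copy to the standby site (hypothetical hosts and paths)."""
        import subprocess
        import time

        REDIS_RDB = "/var/lib/redis/dump.rdb"              # default RDB location, may differ
        STANDBY = "standby-host:/var/lib/redis/dump.rdb"   # assumed standby target

        def copy_snapshot() -> None:
            # Ask Redis for a fresh background snapshot, then ship the single RDB file.
            subprocess.run(["redis-cli", "bgsave"], check=True)
            time.sleep(30)  # crude wait for BGSAVE to finish; poll in real life
            subprocess.run(["rsync", "-a", REDIS_RDB, STANDBY], check=True)

        if __name__ == "__main__":
            copy_snapshot()

    On failover, the standby Redis simply starts with that dump.rdb in place: sessions are at most an hour stale, which is exactly the compromise described above.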



    Example number four, even harder


    An aggregator: selling flowers, hailing taxis, selling goods, anything at all. A serious thing that works 24/7 for a large number of users, with a full-fledged interesting stack: interesting databases, interesting solutions, high load, and, most importantly, it hurts to be down for more than 5 minutes. Not only and not so much because people will not buy, but because people will see that the thing is not working, get upset, and may not come back a second time.

    Okay. Five minutes. What do we do about that? In this case we go about it like adults: we spend real money and build a proper standby site with replication of absolutely everything, and maybe we even automate the switchover to that site as much as possible. And in addition to this, we must not forget to do one important thing: actually write the switchover runbook. The runbook, even if everything is automated, can be very simple, along the lines of "run this Ansible playbook", "tick that checkbox in Route 53" and so on, but it must be an exact list of actions.

    And everything seems clear. Promoting a replica is a trivial task, or it will switch over by itself. Rewriting a domain name in DNS is from the same series. The trouble is that when a project like this goes down, panic sets in, and even the mightiest, most bearded admins are prone to it. Without a clear instruction, "open a terminal, go here, the address of our standby server is such-and-such", the 5 minutes allotted for resuscitation are hard to keep. And as a bonus, every time we use this runbook we can record recent infrastructure changes, for example, and update the runbook accordingly.
    And if the standby system is very complex and at some point we make a mistake, we may take down our standby site as well and, on top of that, turn the data into a pumpkin on both sites; that would be truly sad.
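
    For illustration, here is how one step of such a runbook, pointing the DNS record at the standby site, might be scripted. This is only a sketch: the hosted zone ID, record name and IP address are made up, and it assumes the domain is served by Route 53 with credentials already configured for boto3.

        #!/usr/bin/env python3
        """Runbook step: point the site's A record at the standby (hypothetical IDs and IPs)."""
        import boto3

        HOSTED_ZONE_ID = "Z0000000000EXAMPLE"   # made-up Route 53 zone ID
        RECORD_NAME = "shop.example.com."       # made-up record name
        STANDBY_IP = "203.0.113.10"             # documentation-range IP, for illustration

        def switch_dns() -> None:
            route53 = boto3.client("route53")
            route53.change_resource_record_sets(
                HostedZoneId=HOSTED_ZONE_ID,
                ChangeBatch={
                    "Comment": "Failover: point traffic at the standby site",
                    "Changes": [{
                        "Action": "UPSERT",
                        "ResourceRecordSet": {
                            "Name": RECORD_NAME,
                            "Type": "A",
                            "TTL": 60,  # keep the TTL low so the switch propagates quickly
                            "ResourceRecords": [{"Value": STANDBY_IP}],
                        },
                    }],
                },
            )

        if __name__ == "__main__":
            switch_dns()

    Whether a step like this runs from a script or by hand, the point of the runbook stays the same: during a panic, nobody should have to remember zone IDs and addresses.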



    Example number five, full hardcore


    An international service with hundreds of millions of users worldwide. Every time zone there is, high load at full tilt, and it must not go down at all: one minute and it gets sad. What to do? Redundancy, again, to the fullest. We do everything mentioned in the previous example, and a little more. An ideal world, and our infrastructure follows every precept of DevOps and Infrastructure as Code: everything lives in git, and switching over is just a press of a button.

    What is missing? One thing: drills. You cannot do without them. It seems everything is perfect and completely under control: we press the button and it all happens. Even if that were so, and we understand it never is, our system interacts with other systems: DNS in Route 53, S3 storage, integrations with some API. We cannot foresee everything in a thought experiment, and until we actually pull the switch, we will not know whether it works or not.
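
    Even a scheduled drill benefits from a small automated sanity check before and after the switch. Here is a tiny sketch of such a check; the health-check URL, the response format and the lag threshold are all assumptions for the sake of the example.

        #!/usr/bin/env python3
        """Pre-drill smoke check for the standby site (hypothetical URL and thresholds)."""
        import json
        import urllib.request

        STANDBY_URL = "https://standby.example.com/health"   # assumed health endpoint
        MAX_REPLICATION_LAG = 5.0  # seconds; threshold chosen arbitrarily for illustration

        def check_standby() -> bool:
            with urllib.request.urlopen(STANDBY_URL, timeout=5) as resp:
                status_ok = resp.status == 200
                health = json.loads(resp.read())   # assumed to be a JSON document
            lag_ok = health.get("replication_lag", float("inf")) < MAX_REPLICATION_LAG
            print(f"standby healthy: {status_ok and lag_ok}, details: {health}")
            return status_ok and lag_ok

        if __name__ == "__main__":
            check_standby()

    A green check here is necessary but not sufficient: only pulling the switch for real, on a schedule and with the runbook in hand, shows whether the whole chain of DNS, storage and external APIs actually cooperates.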



    That is probably all. Do not be lazy and do not overdo it. And may uptime be with you!
