A list of useful tips for highly loaded services
In this article I have assembled a hodgepodge of tips, gained from practical experience, on developing highly loaded services. For each tip I will try to give a brief justification without going into detail (otherwise the article would rival War and Peace in size). Since the justifications are brief, do not take this article as dogma: in any specific situation, the tips given here can do harm. Always think with your own head before doing anything.
1. Think with your own head and check the facts
This is the most important point. There should be no unconditional authorities for you. If someone says something that is complete nonsense, or that contradicts your own practice, do not follow that advice, no matter how famous or respected the person is. If you develop a large system and it works poorly, you are the one who will answer for it, and "we followed the best world practices" is no excuse. The ability to apply the right technology in the right place is what makes you a valuable specialist; blindly following someone else's advice requires no qualifications at all.
2. Simple is better than complex and “right”
The systems you develop should be understood first of all by you and your colleagues, not by some "spherical horse in a vacuum." For example, if a ready-made system that you want to migrate to as a team is not transparent and not clear to all participants, you can hardly expect people to use it effectively. As a rule, all sorts of problems surface under load even in the best and most polished systems, and there is usually little time to fix them, because at that moment a large number of users are suffering. Without a good understanding of how everything works, you have little chance of solving the problem quickly. Recall, for example, how Odnoklassniki was down for several days: in a well-designed and well-understood system, even serious failures are fixed in hours, not days.
3. Do not forget about monitoring
Usually nobody forgets to test changes before deploying them. Far fewer people think about what happens after deployment and how the system behaves in production. Yet monitoring is the logical continuation of QA for your code: your infrastructure needs quality assurance too. You should spot problems in advance, and when they do occur, be able to diagnose the causes quickly. Without monitoring this is simply impossible.
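To make this concrete, here is a minimal sketch of instrumenting a request handler with metrics, assuming the Python prometheus_client library; the handler body and the port are illustrative, not prescribed:

```python
# A minimal sketch: latency and error rate for a handler, exposed for scraping.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("request_latency_seconds", "Request latency in seconds")
REQUEST_ERRORS = Counter("request_errors_total", "Total failed requests")

def handle_request():
    # Hypothetical handler body; replace with real application logic.
    start = time.monotonic()
    try:
        time.sleep(0.005)  # stand-in for the actual work
    except Exception:
        REQUEST_ERRORS.inc()  # failures become a visible, alertable metric
        raise
    finally:
        REQUEST_LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for the monitoring system
    handle_request()
```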
4. Write post-mortems (every outage has a cost)
When something important breaks, write a post-mortem that describes:
- what happened
- why it happened
- how many users were affected (and, if it can be calculated, how much money the company lost)
- what needs to change systematically so it does not happen again ("be more careful" does not work in practice)
Often, already while preparing a post-mortem, you conclude that nothing terrible actually happened: eliminating the causes of the incident would cost the company more than leaving everything as is. Do not forget about this option. Sometimes things will break, and that is fine as long as it stays within acceptable boundaries (every company draws those boundaries differently).
5. Performance is a feature
Although there is no formal definition of a "high-load project", people usually more or less agree on what is meant. One property of applications under serious load is that it is often much easier to optimize a piece of code than to spread its logic across several servers. So when developing highly loaded systems you regularly have to optimize, and it is a useful skill to have. If you optimize not only throughput but also response time, you are also improving the user experience. Win-win.
6. A two-phase commit is difficult, but inevitable
Unless your system fits entirely on one server, you will inevitably run into the need for a so-called "two-phase commit": atomically updating data in several places at once, for example in the database and in an external service. In practice, two-phase commit is a myth: atomically updating anything on more than one server is not possible. There are different ways to achieve consistency in such cases, but the most common one is a queue: in a single transaction you write both the data itself and a record into a separate table that serves as the queue of updates for the service. Since this is a transaction on a single server, it either completes fully or not at all. So even if the service could not be updated immediately, the data will eventually become consistent.
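Here is a minimal sketch of this "transaction plus queue table" idea (often called a transactional outbox), using sqlite3 from the Python standard library. The table layout and the deliver_to_service stub are illustrative assumptions, not a prescribed schema:

```python
# A minimal sketch of the outbox pattern: data and queue record in one transaction.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event TEXT, payload TEXT, processed INTEGER DEFAULT 0);
""")
with conn:
    conn.execute("INSERT INTO users (id, email) VALUES (1, 'old@example.com')")

def deliver_to_service(event: str, payload: str) -> None:
    ...  # hypothetical: push the update to the external service (index, cache, ...)

def update_email(user_id: int, email: str) -> None:
    # One local transaction: either both writes happen or neither does.
    with conn:
        conn.execute("UPDATE users SET email = ? WHERE id = ?", (email, user_id))
        conn.execute("INSERT INTO outbox (event, payload) VALUES (?, ?)",
                     ("email_changed", f"{user_id}:{email}"))

def drain_outbox() -> None:
    # A background worker keeps retrying until the external service accepts
    # the update, so the data becomes consistent eventually.
    rows = conn.execute(
        "SELECT id, event, payload FROM outbox WHERE processed = 0").fetchall()
    for row_id, event, payload in rows:
        deliver_to_service(event, payload)  # may raise; the row stays unprocessed
        with conn:
            conn.execute("UPDATE outbox SET processed = 1 WHERE id = ?", (row_id,))

update_email(1, "new@example.com")
drain_outbox()
```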
7. Page navigation is a challenge
"What is still complicated? Do a normal SELECT with LIMIT / OFFSET and that’s all, there is page navigation! ” - you probably thought. You are right, but this approach works well only if the data does not change (otherwise there will be duplicates and, conversely, omissions of some records). In addition, the use of large OFFSET values usually leads to a serious loss in performance, since the database needs to literally select all the requested rows in the amount of LIMIT + OFFSET, discard OFFSET pieces and return the remaining LIMIT. This task is linear in time with respect to the OFFSET value, and for large values, this design usually slows down significantly.
Depending on the task, the solution may differ, but it is almost never simple and unambiguous. For example, you can pass the minimum id seen so far instead of a page number, so that most records are skipped via the index. Or, if the pages are the result of a search query, you need to be able to ignore new changes, or store a snapshot of the data for that query somewhere.
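For illustration, a minimal sketch comparing OFFSET pagination with the "minimum id" (keyset) approach, using sqlite3 from the Python standard library; the items table is invented for the example:

```python
# A minimal sketch: OFFSET pagination versus keyset ("min id") pagination.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO items (title) VALUES (?)",
                 [(f"item {i}",) for i in range(1000)])

PAGE = 20

def page_by_offset(page_no: int):
    # O(page_no * PAGE): the database walks and discards every skipped row.
    return conn.execute("SELECT id, title FROM items ORDER BY id LIMIT ? OFFSET ?",
                        (PAGE, page_no * PAGE)).fetchall()

def page_after(last_seen_id: int):
    # O(PAGE): the primary-key index jumps straight past last_seen_id, and rows
    # inserted or deleted on earlier pages no longer shift the result.
    return conn.execute("SELECT id, title FROM items WHERE id > ? ORDER BY id LIMIT ?",
                        (last_seen_id, PAGE)).fetchall()

assert page_by_offset(3) == page_after(3 * PAGE)  # same page, very different cost

rows = page_after(0)
while rows:
    rows = page_after(rows[-1][0])  # pass the last id of the previous page
```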
8. (Re)sharding is a difficult task
In the general case, there is no way to scale a database out and get a near-linear performance gain for free. You need to choose a sharding strategy that suits your particular project, and you need to choose it very carefully, because resharding, and all the more so changing the sharding scheme, is one of the hardest problems in storing a large data set, and in its general form it is practically unsolved.
If you think that projects like CockroachDB are a silver bullet that solves all your problems, you are wrong. To get decent performance out of such a system, you still need to understand, at least in general terms, how sharding works inside the database, so that you can minimize communication between nodes (intensive inter-node communication is the main reason why performance rarely grows linearly as new nodes are added).
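As a rough illustration of routing by a shard key, here is a minimal sketch; the fixed shard list and user_id as the key are assumptions invented for the example:

```python
# A minimal sketch of shard routing: one key -> one node, no cross-node chatter.
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(user_id: int) -> str:
    # A stable hash (not Python's randomized hash()) keeps the mapping
    # identical across processes and restarts.
    digest = hashlib.sha1(str(user_id).encode()).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

# All data for one user lives on one shard, so a typical request touches a
# single node. Note the catch from the text: changing len(SHARDS) remaps
# almost every key, which is exactly why resharding is so hard (consistent
# hashing and similar schemes reduce, but do not eliminate, the pain).
print(shard_for(42))
```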
9. Good code differs from bad in how errors are handled
Contrary to popular belief, good code is determined not only by how logically it is split into packages, how readable it is, whether it follows every conceivable and inconceivable style guide, and so on. All of that matters, but the users of the system (meaning not only end users of the site but also system administrators, other programmers, and so on) mostly care that everything works as expected, and that when something does not work, it is clear what is going on.
Surprisingly, a lot of software does not handle errors at all, handles them in the style of "oops, something happened", or starts retrying endlessly and effectively DDoSes itself. Error handling is hard, and responding to errors adequately is even harder. Errors vary widely, and not all of them require, say, terminating the program. For example, if you are developing a file system and it stops working completely (including refusing to delete data) when the disk is 100% full, it is a bad file system. You may have followed all the "best practices" from famous people and have a million stars on GitHub; that by itself means nothing. Roughly speaking, shit happens: even for the coolest teams, disk space sometimes runs out unexpectedly, and you should be able to handle it.
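As one concrete defense against the self-DDoS pattern, here is a minimal sketch of retries with exponential backoff and jitter; call_service is a hypothetical stand-in for any flaky dependency:

```python
# A minimal sketch: bounded retries with exponential backoff and full jitter,
# so a transient failure does not turn into a synchronized retry stampede.
import random
import time

def call_with_retries(call_service, max_attempts: int = 5):
    delay = 0.1
    for attempt in range(1, max_attempts + 1):
        try:
            return call_service()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: surface a clear error instead of retrying forever
            # Full jitter spreads retries out so clients do not hammer in sync.
            time.sleep(random.uniform(0, delay))
            delay = min(delay * 2, 5.0)  # cap the backoff at a sane maximum
```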
That's all
Thank you for reading this "article". I would be interested to read your opinion, or your own tips based on practical experience; write about them in the comments!