5 lessons for developers of high-load systems

Since 2010, we have been developing a service for collaboration and process management. Today, thousands of organizations and tens of thousands of users work in our Pyrus system. Over four years we have gained solid experience in ensuring reliability, and we want to share it with you.

1. Everything can break

We try to build in safeguards wherever possible. All database servers are fully mirrored. Data centers schedule maintenance windows of 2-3 hours, and our service must keep running through them. That is why the mirrors are located in different data centers, and even in different countries. In addition, security updates must be installed on the servers regularly, and they sometimes require a reboot. In such cases, a hot switchover to the backup server is essential.

The servers use RAID arrays, and we make daily backups. We run several application servers; this provides scalability and lets us update them in turn without interrupting the service. To balance the load, we use round-robin DNS. We assumed that DNS was the most reliable system on the Internet, because without it no site would open at all. However, a surprise awaited us here.
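The rotation behind round-robin DNS can be sketched in a few lines. This is an illustrative simulation, not our production code: the IP addresses are hypothetical, and in a real deployment the DNS resolver itself rotates the A records rather than the client.

```python
import itertools

# Hypothetical server IPs. With round-robin DNS, the name server returns
# these A records in rotating order so successive clients hit different hosts.
SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

def round_robin(servers):
    """Yield servers endlessly in rotating order, like a round-robin resolver."""
    yield from itertools.cycle(servers)

picker = round_robin(SERVERS)
first_cycle = [next(picker) for _ in range(3)]  # each server exactly once
wrap_around = next(picker)                      # rotation starts over
```

Note that plain round-robin spreads load but does not detect failures: a dead server keeps receiving its share of traffic until the record is removed, which is exactly why the DNS layer itself must be redundant.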

We hosted our domain zone with a large registrar, register.com, which serves more than 3 million domains. As expected, we had two independent name servers, which protects against the failure of either one. One morning, both failed at once. The register.com management console was unavailable. Timid user complaints began to appear on Twitter, and within an hour they gave way to an avalanche of screams, cries, moans, and promises to leave this provider immediately, just as soon as it brought the servers back up.

Since then, we have transferred our domain zone to Amazon, which provides four name servers located under different top-level domains: .com, .net, .org, and .uk. This gives an additional level of reliability: even if the entire .com zone becomes unavailable in DNS for some reason, clients will still be able to work with our service.

Conclusion: design the system knowing that sooner or later any component will fail. Remember Murphy's law: if there is a chance that something can go wrong, eventually it will.

2. You do not know where your application's bottleneck is

As the load grows, we constantly do two things: buy memory (RAM) and optimize the application. But how do you find out which function in the application is too slow? Synthetic measurements on a developer's machine are a poor guide. And running a profiler on a production server is almost impossible: it adds too much overhead and the service starts to lag.

You need to insert checkpoints into the code and judge the application's speed by the execution time between them.
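A minimal sketch of this checkpoint idea, assuming a Python codebase (our actual implementation is not shown in this article): a context manager records wall-clock time per named section and accumulates it in a dictionary, adding far less overhead than a full profiler.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock time per checkpoint name, in seconds.
timings = {}

@contextmanager
def checkpoint(name):
    """Measure time spent inside the block and add it to the running total."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start

# Usage: wrap the suspect stages of a request handler.
with checkpoint("serialize"):
    payload = ",".join(str(i) for i in range(10_000))
```

In production, the accumulated totals can be flushed to a log periodically; comparing the per-checkpoint shares then points directly at the hottest stage.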

This is how we found out that a third of our CPU time was spent on... serialization: packing data structures into JSON strings. Having studied the alternative serialization libraries, we made an unpopular decision: to write our own. Our implementation, tailored to our specific tasks, ran twice as fast as the fastest alternative available on the market.
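Comparing serializer candidates comes down to a simple micro-benchmark harness. The sketch below is illustrative, not the benchmark we actually used: the payload shape is invented, and the commented-out `custom.dumps` stands in for whatever library is under evaluation.

```python
import json
import time

def bench(serialize, data, reps=2_000):
    """Return total wall-clock seconds for `reps` serializations of `data`."""
    start = time.perf_counter()
    for _ in range(reps):
        serialize(data)
    return time.perf_counter() - start

# Sample payload roughly shaped like a task record (illustrative only).
task = {"id": 42, "title": "Ship release", "tags": ["urgent", "backend"],
        "steps": [{"n": i, "done": i % 2 == 0} for i in range(20)]}

t_stdlib = bench(json.dumps, task)
# A candidate library would be measured the same way:
# t_custom = bench(custom.dumps, task)
```

The key point is to benchmark on payloads shaped like your real traffic; a serializer that wins on tiny flat objects can lose on deeply nested ones.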

By the way, many people mistakenly believe that encryption consumes a lot of CPU resources. It used to: encryption could really eat up to 20% of CPU time. However, starting with the Westmere architecture, launched in January 2010, AES encryption instructions are part of the Intel processor instruction set. As a result, switching from HTTP to HTTPS barely changes the load on the processor.

Conclusion: do not optimize prematurely. Without accurate measurements, your guesses about what needs speeding up are likely to be wrong.

3. Test everything

Once we needed to change the structure of a table in the database. The procedure requires stopping the service, so we scheduled it for the quietest time: night on a weekend. Our tests showed a runtime of under one minute. We stopped the service and started the procedure on the production server, but it did not finish in one minute, or in ten.

It turned out that in some cases the procedure rebuilds the table's clustered index, and the table was about 1 TB at the time. We had not noticed this because we had tested on a small table. Without waiting for the procedure to finish, we had to restart the service. Fortunately, all the basic functions worked correctly, though somewhat more slowly than usual, except for attaching files to tasks. The procedure finished after a couple of hours, and full functionality was restored.

Conclusion: test all changes on data volumes close to production. We run about 500 automated tests on every build of the application to ensure there are no fatal errors.

4. Tests must run fast

We release app updates every week. A couple of times a year, despite the testing, a bug creeps into a release: small, but unpleasant. Typically such errors are detected within 10 minutes of the release. In those cases, we ship a hotfix.

No one likes rolling back releases, but sometimes it is necessary. The fix must be made quickly; we often find the cause of an error within half an hour. But before a release reaches the production servers, the source code must pass the automated build and the automated tests. Our 500 tests take more than 20 minutes to run, which is fast enough, but we plan to reduce this time further through more parallelization.

With slow tests we could not fix bugs this quickly, and without tests the number of bugs would be higher.

Conclusion: do not skimp on resources for developers. Buy fast servers for automated tests; their number will only keep growing.

5. Every product feature must be used

Good products require many iterations. New features are constantly added to the product, but rarely used ones often have to be cut out. They carry no value: they consume developers' time on support and take up extra space on the screen.

A good gardener prunes young shoots every spring to form a regular, healthy, and beautiful crown.

Does your product have features that no one uses? In Pyrus, we know of none.

Empirically, we arrived at a rule: every feature must be used by at least 2% of users. This means that when we turn a feature off, dozens or hundreds of people will be unhappy. We always provide another way to accomplish the same thing, but habit is stronger.
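The 2% rule reduces to a one-line check. A minimal sketch, with invented numbers chosen only to show the threshold in action:

```python
def keep_feature(feature_users, total_users, threshold=0.02):
    """Apply the 2% rule: keep a feature only if its share of users meets the bar."""
    return feature_users / total_users >= threshold

# With 10,000 users: 250 users (2.5%) clear the bar, 150 users (1.5%) do not.
verdict_popular = keep_feature(250, 10_000)
verdict_niche = keep_feature(150, 10_000)
```

The threshold itself is a product decision; the point is to make the cut-off explicit and measured rather than argued from intuition.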

Conclusion: development requires some sacrifices. Imagine how many people dislike every change in Google and Microsoft products.
