6 Ways to Kill Your Servers - Learning Scalability the Hard Way
Learning how to scale your application without any prior experience is very difficult. There are now plenty of sites devoted to these issues, but unfortunately there is no solution that fits every case. You still have to find the solutions that suit your own requirements, just like I did.
A few years ago my boss came to me and said: “We have a new project for you. We are taking over a site that already gets 1 million visitors per month. You need to migrate it and make sure the traffic can keep growing without any problems.” I was already an experienced programmer, but I had no experience with scalability. I had to learn it the hard way.
The site was a PHP CMS using MySQL and Smarty. First of all, we found a hosting company with experience in high-load projects and gave them our required configuration:
- Load balancer (with headroom)
- 2 web servers
- MySQL server (with headroom)
- Development machine
What we got (the hoster said this would be enough):
- Load balancer: single core, 1 GB RAM, Pound
- 2 web servers: dual core, 4 GB RAM, Apache
- MySQL server: quad core, 8 GB RAM
- Development machine: single core, 1 GB RAM
To synchronize files between the web servers, the hoster set up DRBD in an active-active configuration.
Finally, migration day arrived. Early in the morning we switched the domain to the new IPs and began monitoring our scripts. Traffic arrived almost immediately and everything seemed to be working fine. Pages loaded quickly, MySQL handled a pile of queries, and everyone was happy.
Then the phone rang unexpectedly: “We can’t access the website, what’s going on?!” We looked at our monitoring software and saw that the servers had crashed and the site was down. Of course, the first thing we did was call the hoster: “All of our servers have crashed. What’s going on?!” They promised to check the servers and call back. Some time later they called: “Your file system is hopelessly corrupted. What on earth were you doing?!” They stopped the load balancer and told me to look at one of the web servers. When I opened index.php, I was shocked: it contained incomprehensible fragments of C code, error messages, and something that looked like log output. After a little investigation, we found that our DRBD setup was the cause.
Lesson 1
Put the Smarty cache in an active-active DRBD cluster under high load and your site will crash.
While the hoster was restoring the web servers, I rewrote the part of the CMS so that the Smarty cache files were stored on the local file system. The problem was found and fixed, and we were back online.
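Roughly, the fix looked like the following sketch, assuming a standard Smarty setup; the paths and the bootstrap itself are illustrative, not the actual CMS code:

```php
<?php
// Hypothetical Smarty bootstrap: keep the cache and compiled templates on
// local disk instead of the DRBD-backed shared volume, so the two web
// servers never write the same cache files concurrently.
require_once 'Smarty.class.php';

$smarty = new Smarty();

// Templates themselves live wherever deployment puts them.
$smarty->template_dir = '/srv/www/templates/';

// Compiled templates and the page cache go to the local file system.
$smarty->compile_dir = '/var/cache/smarty/compile/';
$smarty->cache_dir   = '/var/cache/smarty/cache/';

$smarty->caching        = true;  // enable Smarty output caching
$smarty->cache_lifetime = 300;   // cache lifetime in seconds
```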
By now it was still early in the day. Traffic usually peaked in the late afternoon and lasted into the early evening; at night there were practically no visitors. We kept watching the system. The site was up, but as peak time approached the load grew and responses slowed down. I increased the Smarty cache lifetime, hoping it would help, but it didn’t. Soon the servers began returning timeout errors or blank pages. Two web servers could not cope with the load.
Our client was nervous, but he understood that a migration usually brings some problems with it.
We needed to reduce the load somehow, and we discussed this with the hoster. One of their admins had a good idea: “Your servers are running Apache + mod_php right now. Can we switch them to Lighttpd? It’s a small project, but even Wikipedia uses it.” We agreed.
Lesson 2
Run your web server with its out-of-the-box configuration, tune nothing, and your site will go down.
The administrator reconfigured our servers as fast as he could. He dropped Apache and switched to a Lighttpd + FastCGI + XCache configuration. How long would the servers hold out this time?
Surprisingly, the servers held up well. The load was significantly lower than before and the average response time was good. We decided to go home and get some rest: it was already late, and we agreed there was nothing more we could do for now.
In the following days the servers handled the load reasonably well, but at peak times they were close to going down. We found that MySQL was the bottleneck and called the hoster again. They suggested MySQL master-slave replication with a slave on each web server.
Lesson 3
Even a powerful database server has its limits, and when you reach them your site will crash.
This problem was not so easy to fix. The CMS was fairly simplistic in this respect and had no built-in support for splitting SQL queries between servers. The modification took some time, but the result was worth it.
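In essence the change boiled down to routing reads to the local slave and writes to the master. A rough sketch of the idea, with hypothetical class, credential, and host names rather than the CMS’s real database layer:

```php
<?php
// Hypothetical database helper: simple read/write splitting for
// master-slave replication. Reads hit the slave running on this web
// server; writes always go to the master.
class Database
{
    private $master;
    private $slave;

    public function __construct()
    {
        $this->master = new mysqli('db-master.internal', 'cms', 'secret', 'cms');
        $this->slave  = new mysqli('127.0.0.1', 'cms', 'secret', 'cms');
    }

    // SELECT queries are served by the local replication slave.
    public function select($sql)
    {
        return $this->slave->query($sql);
    }

    // INSERT / UPDATE / DELETE must always be sent to the master.
    public function write($sql)
    {
        return $this->master->query($sql);
    }
}
```

One caveat such a split has to handle is replication lag: a read issued right after a write may not see the new data on the slave yet, so those reads are usually sent to the master as well.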
MySQL replication really worked wonders and the site was finally stable. Over the following weeks the site kept gaining popularity and the number of users grew steadily. It was only a matter of time before the traffic outgrew our resources again.
Lesson 4
Fail to plan ahead and your site will go down sooner or later.
Fortunately, we kept observing and planning. We optimized the code, reduced the number of SQL queries, and then stumbled upon memcached. To start with, I added memcached caching to a few of the heaviest core functions. When we deployed the changes to production, we could not believe the results: it felt like we had found the Holy Grail. The number of queries per second dropped by at least 50%. Instead of buying another web server, we decided it was better to use memcached.
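The pattern behind this is plain cache-aside: check memcached first, fall back to MySQL only on a miss, then store the result for the next request. A sketch using PHP’s Memcached extension, with a made-up key scheme and a hypothetical loadArticleFromDb() helper:

```php
<?php
// Cache-aside lookup: serve from memcached when possible, otherwise query
// MySQL and populate the cache for subsequent requests.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

function getArticle(Memcached $mc, $articleId)
{
    $key = 'article_' . (int) $articleId;

    $article = $mc->get($key);
    if ($article !== false) {
        return $article;                      // cache hit: no SQL at all
    }

    $article = loadArticleFromDb($articleId); // hypothetical DB helper
    $mc->set($key, $article, 300);            // keep it for 300 seconds
    return $article;
}
```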
Lesson 5
Refuse to cache anything and spend money on new hardware instead, and your site will go down anyway.
Memcached helped us reduce the load on MySQL by 70-80%, which gave a huge performance boost. Pages loaded even faster.
Our configuration finally seemed perfect. Even at peak times we no longer had to worry about outages or long response times. But then one of the web servers suddenly started causing problems: error messages, blank pages, and so on. The load was fine, and the server worked correctly most of the time, but only “most of the time”.
Lesson 6
Put several hundred thousand small files in one directory, forget about inodes, and your site will go down.
Yes, exactly that. We had been so busy optimizing MySQL, PHP, and the web servers that we had not paid enough attention to the file system. The Smarty cache was stored on the local file system in a single directory. The solution was to move the Smarty cache to a separate partition formatted with ReiserFS. We also enabled the Smarty 'use_subdirs' option.
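In code the fix is essentially one setting. Assuming a standard Smarty installation, the option is the $use_sub_dirs property, which spreads cached and compiled files across a tree of subdirectories instead of a single huge one; the mount point below is illustrative:

```php
<?php
// Stop Smarty from dumping hundreds of thousands of files into a single
// directory: distribute them across subdirectories instead.
$smarty->use_sub_dirs = true;

// The cache also moved to its own ReiserFS partition, e.g. mounted at
// /var/cache/smarty, so the rest of the system is unaffected.
$smarty->cache_dir   = '/var/cache/smarty/cache/';
$smarty->compile_dir = '/var/cache/smarty/compile/';
```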
Over the following years we kept optimizing: we moved the Smarty cache into memcached, installed Varnish to reduce the load on the I/O subsystem, switched to Nginx (Lighttpd would randomly return 500 errors), bought better hardware, and so on.
Conclusion
Scaling a website is an endless process. As soon as you fix one bottleneck, you will most likely run into the next one. Never think, “That’s it, we’re done.” That attitude will kill your servers and possibly your business. Optimization is an ongoing process: if you cannot do the work yourself for lack of experience or resources, find a competent partner to work with. And never stop discussing current and future problems with your team and your partners.
About the author: Steffen Konerow, author of the High Performance Blog.