Horizontal scaling of PHP applications. Part 1
So you have built a website. It is always exciting to watch the visit counter slowly but surely creep up, showing better results every day. But one day, when you least expect it, someone will post a link to your site on Reddit or Hacker News (or on Habr - translator's note), and your server will go down.
Instead of gaining new regular users, you will be left with a blank page. At that point nothing will bring the server back quickly enough, and the traffic will be lost for good. How can you avoid such problems? In this article we will talk about optimization and scaling.
A little bit about optimization
The main tips are well known: upgrade to the latest version of PHP (OPcache has been bundled since 5.5), tidy up the database indexes, and cache static content (rarely modified pages such as “About Us”, “FAQ”, etc.).
One special aspect of optimization also deserves a mention: serving static content with a server other than Apache, such as Nginx. Configure Nginx to handle all static content (*.jpg, *.png, *.mp4, *.html ...) itself, and let it forward the requests that need server-side processing to the heavier Apache. This is called a reverse proxy.
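The reverse proxy setup described above might look roughly like the following Nginx configuration. This is only a sketch under stated assumptions: the domain, the document root, and the idea that Apache listens on port 8080 of the same machine are all hypothetical and need to be adapted to your setup.

```nginx
server {
    listen 80;
    server_name example.com;           # hypothetical domain

    # Static files are served by Nginx straight from disk.
    location ~* \.(jpg|png|mp4|html)$ {
        root /var/www/example/public;  # hypothetical document root
        expires 7d;                    # let browsers cache static assets
    }

    # Everything else is forwarded to Apache, assumed to listen on port 8080.
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```

With a layout like this, Apache only ever sees the requests that actually need PHP.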
There are two types of scaling - vertical and horizontal.
In my understanding, a site is scalable if it can handle increased traffic without changes to the software.
Imagine a server serving a web application. It has 4 GB of RAM, an i5 processor, and a 1 TB HDD. It does its job perfectly, but to cope better with higher traffic you decide to increase the RAM to 16 GB, install an i7 processor, and fork out for an SSD. Now the server is much more powerful and copes with high loads. This is vertical scaling.
Horizontal scaling means creating a cluster of interconnected (often not very powerful) servers that serve the site together. In this setup you use a load balancer: a machine or a program whose main job is to decide which server each request should be sent to. The servers in the cluster share the work of serving the application without knowing anything about each other, which significantly increases the throughput and fault tolerance of your site.
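To make the balancer's core decision concrete, here is a minimal sketch of round-robin selection, one of the simplest distribution strategies. It is a toy model, not how any particular balancer is implemented; the class name and backend addresses are hypothetical, and PHP 7.4 or later is assumed.

```php
<?php
declare(strict_types=1);

/**
 * Toy round-robin balancer: hands out backend addresses in rotation.
 * The class name and the addresses below are hypothetical.
 */
final class RoundRobinBalancer
{
    /** @var string[] */
    private array $backends;
    private int $next = 0;

    /** @param string[] $backends */
    public function __construct(array $backends)
    {
        if ($backends === []) {
            throw new InvalidArgumentException('At least one backend is required.');
        }
        $this->backends = array_values($backends);
    }

    /** Returns the backend that should receive the next request. */
    public function pick(): string
    {
        $backend = $this->backends[$this->next];
        $this->next = ($this->next + 1) % count($this->backends);
        return $backend;
    }
}

$balancer = new RoundRobinBalancer(['10.0.0.11:80', '10.0.0.12:80', '10.0.0.13:80']);
$picked = [];
for ($i = 0; $i < 4; $i++) {
    $picked[] = $balancer->pick();
}
echo implode(' -> ', $picked), PHP_EOL;
```

The fourth request wraps around to the first backend again. Real balancers usually add health checks and weighting on top of this basic rotation.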
There are two types of balancers: hardware and software. A software balancer is installed on a regular server and receives all traffic, passing it on to the appropriate backend servers. Nginx, for example, can act as such a balancer. In the optimization section above, it intercepted all requests for static files and served them itself, without burdening Apache. Another popular piece of load balancing software is Squid. Personally, I always use it, because it provides a user-friendly interface for controlling the deepest aspects of balancing.
A hardware balancer is a dedicated machine whose sole purpose is to distribute the load. Usually no software other than what the manufacturer provides runs on this machine. Read about hardware load balancers here.
Please note that these two methods are not mutually exclusive. You can vertically scale any machine (also called a node) in your system.
In this article, we discuss horizontal scaling because it is cheaper and more efficient, although more difficult to implement.
Scaling PHP applications raises several difficult problems. One of them is working with user session data. If you log in to the site and the balancer sends your next request to another machine, that machine will not know that you are already logged in. One way around this is sticky sessions (session persistence): the balancer remembers which node served the user's request last time and sends the next request to the same node. The downside is that the balancer becomes overloaded with responsibilities. In addition to processing hundreds of thousands of requests, it also has to remember exactly how it routed them, and as a result it becomes a bottleneck in the system.
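The bookkeeping cost of this approach is easier to see in code. The sketch below is a toy model of a balancer that pins each client to a node. The class name, node names, and client identifiers are hypothetical, and a real balancer would typically key on a cookie or the client IP. PHP 7.4 or later is assumed.

```php
<?php
declare(strict_types=1);

/**
 * Toy sticky balancer: remembers which node served each client
 * and keeps sending that client to the same node.
 */
final class StickyBalancer
{
    /** @var string[] */
    private array $nodes;
    /** @var array<string, string> client id => node */
    private array $assignments = [];
    private int $next = 0;

    /** @param string[] $nodes */
    public function __construct(array $nodes)
    {
        if ($nodes === []) {
            throw new InvalidArgumentException('At least one node is required.');
        }
        $this->nodes = array_values($nodes);
    }

    public function route(string $clientId): string
    {
        // First request from this client: assign the next node in rotation.
        if (!isset($this->assignments[$clientId])) {
            $this->assignments[$clientId] = $this->nodes[$this->next];
            $this->next = ($this->next + 1) % count($this->nodes);
        }
        return $this->assignments[$clientId];
    }

    /** Number of clients the balancer currently has to remember. */
    public function rememberedClients(): int
    {
        return count($this->assignments);
    }
}

$balancer = new StickyBalancer(['node-a', 'node-b']);
$first  = $balancer->route('alice');
$other  = $balancer->route('bob');
$second = $balancer->route('alice'); // same node as $first
echo "$first, $other, $second; remembered: ", $balancer->rememberedClients(), PHP_EOL;
```

The `$assignments` table grows with every new client, which is exactly the extra state that turns the balancer into a bottleneck. If a node dies, every client pinned to it also loses its session.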
Sharing local data
Sharing user session data between all nodes in the cluster seems like a good idea. Although this approach requires some changes to the architecture of your application, it is worth it: the balancer is relieved of that work, and the entire cluster becomes more fault tolerant. The death of one of the servers does not affect the operation of the whole system.
As we know, session data is stored in the $_SESSION superglobal array, which by default is written to and read from a file on disk. If this disk is located on one server, other servers obviously do not have access to it. How do we make it available on multiple machines?
First, note that you can override the session handler in PHP. You can implement your own class for working with sessions.
Using a database to store sessions
With our own session handler, we can store sessions in the database. The database can live on a separate server (or even a cluster). This method usually works fine, but under really heavy traffic the database becomes a bottleneck, because it has to serve all the servers, each of which is trying to write or read session data. It is also a single point of failure: if the database goes down, the whole site stops working.
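As an illustration of a database-backed handler, here is a minimal sketch built on PHP's SessionHandlerInterface and PDO. It is not a production implementation: the class name, table name, and column names are hypothetical, the `REPLACE INTO` statement assumes SQLite or MySQL, and the signatures assume PHP 8.0 or later. The demo uses an in-memory SQLite database (the pdo_sqlite extension) only so that it is self-contained.

```php
<?php
declare(strict_types=1);

/** Stores session data in a shared database table (hypothetical schema). */
final class PdoSessionHandler implements SessionHandlerInterface
{
    private PDO $pdo;

    public function __construct(PDO $pdo)
    {
        $this->pdo = $pdo;
        $this->pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
        $this->pdo->exec(
            'CREATE TABLE IF NOT EXISTS sessions (
                id TEXT PRIMARY KEY,
                data TEXT NOT NULL,
                updated_at INTEGER NOT NULL
            )'
        );
    }

    public function open(string $path, string $name): bool
    {
        return true; // the connection is already open
    }

    public function close(): bool
    {
        return true;
    }

    public function read(string $id): string|false
    {
        $stmt = $this->pdo->prepare('SELECT data FROM sessions WHERE id = ?');
        $stmt->execute([$id]);
        $data = $stmt->fetchColumn();
        return $data === false ? '' : (string) $data; // empty string = new session
    }

    public function write(string $id, string $data): bool
    {
        $stmt = $this->pdo->prepare(
            'REPLACE INTO sessions (id, data, updated_at) VALUES (?, ?, ?)'
        );
        return $stmt->execute([$id, $data, time()]);
    }

    public function destroy(string $id): bool
    {
        $stmt = $this->pdo->prepare('DELETE FROM sessions WHERE id = ?');
        return $stmt->execute([$id]);
    }

    public function gc(int $max_lifetime): int|false
    {
        // Remove sessions that have not been written to for too long.
        $stmt = $this->pdo->prepare('DELETE FROM sessions WHERE updated_at < ?');
        $stmt->execute([time() - $max_lifetime]);
        return $stmt->rowCount();
    }
}

// Demo: two handlers stand in for two web nodes sharing one database.
$pdo = new PDO('sqlite::memory:');
$nodeA = new PdoSessionHandler($pdo);
$nodeB = new PdoSessionHandler($pdo);
$nodeA->write('abc123', 'user_id|i:42;');
$seenByB = $nodeB->read('abc123');
echo $seenByB, PHP_EOL;

// In a real application you would register the handler before starting the session:
// session_set_save_handler(new PdoSessionHandler($pdo), true);
// session_start();
```

Note that this sketch does no locking, so two concurrent requests for the same session can overwrite each other's data. The default file-based handler avoids this by locking the session file.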
Distributed file system
Perhaps you are thinking that it would be nice to set up a network file system where all servers could write session data. Do not do this! It is a very slow approach that can lead to data corruption or even data loss. If, for some reason, you still decide to use this method, I recommend GlusterFS.
You can also use memcached to store session data in RAM. However, this is not reliable, because memcached evicts old data when it runs out of free space. You are probably wondering: isn't RAM local to each machine? How can it serve the entire cluster? Memcached can combine the RAM available on different machines into a single pool.
The more machines you have, the more memory you can allocate to this pool. You do not have to contribute all of a machine's memory: you can donate any amount from each machine. So you can leave most of the memory for normal use and set aside a piece for the cache, which can hold not only sessions but any other suitable data. Memcached is a great and widely used solution.
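The pooling happens on the client side: the memcached servers do not talk to each other, and the client decides which server holds a given key by hashing it. The following sketch shows the idea with a simple modulo hash. The function name and host names are hypothetical, and real client libraries use their own hashing schemes, so treat this as an illustration of the principle only.

```php
<?php
declare(strict_types=1);

/**
 * Picks the pool member responsible for a key by hashing the key.
 * Simplified illustration; real clients use their own hashing schemes.
 *
 * @param string[] $servers
 */
function serverForKey(string $key, array $servers): string
{
    if ($servers === []) {
        throw new InvalidArgumentException('The pool must contain at least one server.');
    }
    $servers = array_values($servers);
    // Mask the checksum so the result is non-negative on 32-bit builds too.
    $hash = crc32($key) & 0x7fffffff;
    return $servers[$hash % count($servers)];
}

// Hypothetical pool of three machines, each donating part of its RAM.
$pool = ['cache1.internal.example:11211', 'cache2.internal.example:11211', 'cache3.internal.example:11211'];

foreach (['sess_alice', 'sess_bob', 'sess_carol'] as $key) {
    echo $key, ' => ', serverForKey($key, $pool), PHP_EOL;
}
```

Because every web node applies the same rule to the same server list, they all agree on where a given session lives without any coordination. The weakness of plain modulo hashing is that adding or removing a server remaps most keys, which is why clients often offer consistent hashing as an option.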
To use this approach, you need to slightly modify php.ini:
session.save_handler = memcache
session.save_path = "tcp://path.to.memcached.server:port"
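To spread sessions over a pool of several memcached servers, the save path can list more than one server. The fragment below is a sketch with hypothetical host names. It assumes the `memcache` extension, and the commented-out lines show the form the newer `memcached` extension expects.

```ini
; memcache extension: comma-separated list of pool members (hypothetical hosts)
session.save_handler = memcache
session.save_path = "tcp://cache1.internal.example:11211, tcp://cache2.internal.example:11211"

; memcached extension: different handler name, no tcp:// prefix
; session.save_handler = memcached
; session.save_path = "cache1.internal.example:11211,cache2.internal.example:11211"
```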
Redis is a NoSQL data store that keeps its database in RAM. Unlike memcached, it supports persistent storage and more complex data types. At the time of writing, Redis does not support clustering, so using it for horizontal scaling is somewhat difficult. However, this is temporary: an alpha version of the cluster solution has already been released.
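For a single Redis server, the session configuration is similar to the memcached one. The fragment below is a sketch that assumes the phpredis extension is installed, since it is what provides the `redis` session handler. The host name is hypothetical.

```ini
; Assumes the phpredis extension is loaded; the host name is hypothetical.
session.save_handler = redis
session.save_path = "tcp://redis.internal.example:6379"
```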
ZSCM is a good alternative from Zend, but requires a Zend Server on each node.
If you are interested in other NoSQL stores and caching systems, try Scache, Cassandra, or Couchbase.
As you can see, horizontal scaling of PHP applications is not a simple matter. There are many difficulties, and most of the solutions are not interchangeable, so you have to choose one and stick with it to the end: once traffic goes through the roof, there is no way to switch smoothly to something else.
I hope this little guide helps you choose the scaling approach for your project.
In the second part of the article we will talk about scaling the database.