Disk Balancing in Nginx
In this article I describe an interesting Nginx-based solution for the case where the disk subsystem becomes the bottleneck when serving content (for example, video).
Formulation of the problem
The task: serve static files (video) to clients with a total outgoing bandwidth of tens of gigabits per second.
For obvious reasons such a bandwidth cannot be served straight from the origin storage; caching has to be used. The content that generates most of the traffic is several orders of magnitude larger than the RAM of a single server, so caching in RAM is not an option: the cache has to live on disks.
Network channels of sufficient capacity are assumed to be available; otherwise the task would be unsolvable.
Choosing a solution
In this situation the disks become the problem spot: for a server to push out 20 gigabits of traffic per second (two optical links in an aggregate), it has to read ~2400 megabytes of useful data per second from disk. On top of that, the disks are also busy writing to the cache.
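The ~2400 MB/s figure is simple arithmetic; the exact protocol overhead is my assumption (roughly 4% for Ethernet/IP/TCP framing):

```python
# 20 Gbit/s of line rate converted to required disk-read throughput.
line_rate_bits = 20e9
raw_mb_per_s = line_rate_bits / 8 / 1e6   # 2500 MB/s of raw bytes
# Assume ~4% of the line rate goes to protocol headers,
# leaving about 2400 MB/s of payload that must come off the disks.
payload_mb_per_s = raw_mb_per_s * 0.96
print(payload_mb_per_s)  # 2400.0
```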
To scale disk performance, striped RAID arrays are commonly used. The bet is that when a file is read, its blocks land on different disks, so the sequential read speed of the file is, on average, the speed of the slowest disk multiplied by the number of striped disks.
The problem with this approach is that it only works well in the ideal case: a sufficiently long read (the file is much larger than the stripe unit) of a file laid out in the file system without fragmentation. For parallel reads of many small and/or fragmented files this approach does not even come close to the combined speed of all the disks. For example, a RAID0 of six SSDs at 100% I/O queue utilization delivered roughly the speed of two disks.
Practice showed that it is more profitable to distribute whole files between the disks, each disk carrying its own separate file system. This guarantees that every disk is fully utilized, because the disks are independent.
Implementation
As mentioned above, we will cache with nginx. The idea is to spread the served files evenly between the disks. In the simplest case it is enough to use a hash function that maps the set of URLs onto the set of drives. That is what we will do, but first things first.
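In its simplest form (ignoring failed disks for now), such a mapping is just a hash of the URL taken modulo the number of disks. A minimal Python sketch, using the same byte-sum hash as the Lua code later in the article:

```python
def naive_disk_for(url: str, disks: int) -> int:
    """Map a URL to a disk number 0..disks-1 using a byte-sum hash."""
    return sum(url.encode()) % disks

# The mapping is deterministic: the same URL always lands on the
# same disk, so repeated requests hit the same cache.
print(naive_disk_for("/site1/video.mp4", 10))  # 4
```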
We define one cache zone per disk; in this example there are 10 of them. In the http section:
proxy_cache_path /var/www/cache1 levels=1:2 keys_zone=cache1:100m inactive=365d max_size=200g;
proxy_cache_path /var/www/cache2 levels=1:2 keys_zone=cache2:100m inactive=365d max_size=200g;
...
proxy_cache_path /var/www/cache10 levels=1:2 keys_zone=cache10:100m inactive=365d max_size=200g;
A separate disk is mounted in the directory of each caching zone.
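For example, the mounts might look like this (the device names and filesystem type are assumptions on my part):

```shell
# One independent filesystem per cache zone -- no RAID.
# /dev/sdb ... /dev/sdk stand in for the real cache disks.
mkfs.ext4 /dev/sdb
mount /dev/sdb /var/www/cache1
mkfs.ext4 /dev/sdc
mount /dev/sdc /var/www/cache2
# ...and so on up to /var/www/cache10
```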
The content sources will be three upstream groups, with two servers in each:
upstream src1 {
server 192.168.1.10;
server 192.168.1.11;
}
upstream src2 {
server 192.168.1.12;
server 192.168.1.13;
}
upstream src3 {
server 192.168.1.14;
server 192.168.1.15;
}
This part is not essential; it is included just to make the example realistic.
The server section:
server {
listen 80 default;
server_name localhost.localdomain;
access_log /var/log/nginx/video.access.log combined buffer=128k;
proxy_cache_key $uri;
set_by_lua_file $cache_zone /etc/nginx/cache_director.lua 10 $uri_without_args;
proxy_cache_min_uses 0;
proxy_cache_valid 1y;
proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504 http_404;
location ~* ^/site1/.*$ {
set $be "src1";
include director;
}
location ~* ^/site2/.*$ {
set $be "src2";
include director;
}
location ~* ^/site3/.*$ {
set $be "src3";
include director;
}
location @cache1 {
bytes on;
proxy_temp_path /var/www/cache1/tmp 1 2;
proxy_cache cache1;
proxy_pass http://$be;
}
location @cache2 {
bytes on;
proxy_temp_path /var/www/cache2/tmp 1 2;
proxy_cache cache2;
proxy_pass http://$be;
}
...
location @cache10 {
bytes on;
proxy_temp_path /var/www/cache10/tmp 1 2;
proxy_cache cache10;
proxy_pass http://$be;
}
}
The set_by_lua_file directive selects the drive appropriate for the URL by hashing. For each conditional “site” a backend is chosen and stored in $be. Then the director file redirects the request to an internal location, which serves it from the chosen backend and stores the response in the cache selected for this URL.
And here is the director file:
if ($cache_zone = 0) { return 481; }
if ($cache_zone = 1) { return 482; }
...
if ($cache_zone = 9) { return 490; }
error_page 481 = @cache1;
error_page 482 = @cache2;
...
error_page 490 = @cache10;
It looks awful, but this is the only way.
The whole crux of the configuration is the URL→drive hashing in cache_director.lua:
-- Decode the seed into a permutation of the shard numbers 0..base-1
-- (factorial number system, a.k.a. Lehmer code)
function shards_vector(base, seed)
  local result = {}
  local shards = {}
  for shard_n=0,base-1 do table.insert(shards, shard_n) end
  for b=base,1,-1 do
    local chosen = math.fmod(seed, b)+1
    table.insert(result, shards[chosen])
    table.remove(shards, chosen)
    seed = math.floor(seed / b)
  end
  return result
end
-- Returns 1 if the file exists (and is readable), 0 otherwise
function file_exists(filename)
  local file = io.open(filename)
  if file then
    io.close(file)
    return 1
  else
    return 0
  end
end
-- Arguments from the set_by_lua_file directive: disk count and URL
local disks = tonumber(ngx.arg[1])
local url = ngx.arg[2]
-- Hash the URL with a simple byte sum
local sum = 0
for c in url:gmatch"." do
  sum = sum + string.byte(c)
end
-- Walk the disks in the order given by the URL's permutation and
-- return the first one whose flag file says it is alive
local sh_v = shards_vector(disks, sum)
for _, v in pairs(sh_v) do
  if file_exists("/var/www/cache" .. (tonumber(v)+1) .. "/ready") == 1 then
    return v
  end
end
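To make the selection logic easier to inspect, here is an equivalent Python sketch of the same permutation scheme (illustrative only; `alive` stands for the set of disks whose flag file is present):

```python
def shard_order(base, seed):
    """Decode seed into a permutation of 0..base-1 (mirrors shards_vector)."""
    shards = list(range(base))
    order = []
    for b in range(base, 0, -1):
        order.append(shards.pop(seed % b))
        seed //= b
    return order

def pick_disk(url, disks, alive):
    """Return the first alive disk in the URL's permutation, as the Lua code does."""
    seed = sum(url.encode())
    for d in shard_order(disks, seed):
        if d in alive:
            return d
    return None  # no disk is alive

# Key property: when a disk dies, only the URLs that lived on it move;
# every other URL keeps its disk (and therefore its cache).
```

When a failed disk comes back, its URLs return to it for the same reason, so the caches on the surviving disks remain valid throughout.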
This code receives the number of disks and the URL from the set_by_lua_file directive mentioned above. The idea of mapping a URL directly to a drive works well only until a drive fails. URLs from a failed drive must be redirected to healthy ones consistently (the same URL must always go to the same replacement drive, otherwise there will be no cache hits), yet differently for different URLs, so that the load does not skew onto a single disk. Both properties must also hold if the replacement drive (and its replacement, and so on) fails in turn. Therefore, for a system of n disks, I map each URL onto the set of all permutations of those n disks, and then try the corresponding caches in the order the disks appear in the permutation. The criterion of a disk (cache) being alive is the presence of a flag file in its directory. I had to chattr these files so that nginx would not delete them.
Results
Spreading the content across the disks in this way really does let you use the full speed of the disk devices. A server with 6 inexpensive SSDs under production load managed to sustain an output of about 1200 MB/s, which matches the combined speed of the disks. For comparison, the speed of the RAID array on the same disks fluctuated around 400 MB/s.