Disk Balancing in Nginx

    In this article I will describe a neat Nginx-based solution for the case when the disk subsystem becomes the bottleneck while serving content (for example, video).

    Problem statement

    The task: serve static files (video) to clients with a total bandwidth of tens of gigabits per second.

    For obvious reasons, such bandwidth cannot be served directly from the origin storage; caching has to be used. The content that generates most of the traffic is several orders of magnitude larger than the RAM of a single server, so caching in RAM is not an option: the cache has to live on disks.

    Network channels of sufficient capacity are assumed to be available; otherwise the task would be unsolvable.

    Choosing a solution

    In this situation, the disks become the problem spot: for a server to push out 20 gigabits of traffic per second (two optical links in an aggregate), it must read ~2400 megabytes per second of useful data from the disks. On top of that, the disks may also be busy writing to the cache.
    To scale disk-subsystem performance, striped RAID arrays are commonly used. The bet is that the blocks of a file being read will land on different disks, so the sequential read speed of a file will on average equal the speed of the slowest disk multiplied by the number of striped disks.
    The problem with this approach is that it works well only in the ideal case: reading a sufficiently long file (much larger than the stripe unit) that lies in the file system without fragmentation. With many small and/or fragmented files read in parallel, this approach does not even come close to the combined speed of all the disks. For example, a RAID0 of six SSD disks at 100% I/O queue load delivered roughly the speed of two disks.
    Practice has shown that it is more profitable to distribute whole files between the disks, each with its own separate file system. This guarantees that every disk is fully utilized, because the disks are independent of one another.


    As mentioned above, we will cache with nginx. The idea is to spread the served files across the disks evenly. In the simplest case it is enough to hash URLs onto the set of drives. That is what we will do, but first things first.
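    The naive version of such a mapping could look like the following sketch (the scheme actually used in the article is more elaborate, so that it survives disk failures):

```python
def cache_zone(url: str, disks: int) -> int:
    """Map a URL to a disk number 0..disks-1 by summing its bytes."""
    return sum(url.encode()) % disks

# Example: the same URL always lands on the same disk.
print(cache_zone("/site1/video42.mp4", 10))
```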
    We define cache zones according to the number of disks; in my example there are 10.
    In the http section:
        proxy_cache_path  /var/www/cache1  levels=1:2  keys_zone=cache1:100m inactive=365d max_size=200g;
        proxy_cache_path  /var/www/cache2  levels=1:2  keys_zone=cache2:100m inactive=365d max_size=200g;
        # ... cache3 through cache9 are declared the same way ...
        proxy_cache_path  /var/www/cache10 levels=1:2  keys_zone=cache10:100m inactive=365d max_size=200g;

    A separate disk is mounted at each cache zone's directory.
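    A sketch of how this might look in /etc/fstab (the device names and the ext4 choice are assumptions for illustration):

```
# one independent file system per cache disk
/dev/sdb1  /var/www/cache1   ext4  noatime  0 2
/dev/sdc1  /var/www/cache2   ext4  noatime  0 2
# ... and so on up to /var/www/cache10
```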

    The content sources will be three upstream groups, two servers in each (the backend addresses here are placeholders):

    upstream src1 { server 192.168.0.11:8080; server 192.168.0.12:8080; }
    upstream src2 { server 192.168.0.13:8080; server 192.168.0.14:8080; }
    upstream src3 { server 192.168.0.15:8080; server 192.168.0.16:8080; }

    This detail is not essential to the scheme; it is included only for realism.

    The server section (the @cache3 … @cache9 locations follow the same pattern as the ones shown):

    server {
    	listen   80 default;
    	server_name  localhost.localdomain;

    	access_log      /var/log/nginx/video.access.log combined buffer=128k;

    	proxy_cache_key $uri;
    	set_by_lua_file $cache_zone /etc/nginx/cache_director.lua 10 $uri_without_args;
    	proxy_cache_min_uses 0;
    	proxy_cache_valid  1y;
    	proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504 http_404;

    	location ~* ^/site1/.*$ {
    		set $be "src1";
    		include director;
    	}
    	location ~* ^/site2/.*$ {
    		set $be "src2";
    		include director;
    	}
    	location ~* ^/site3/.*$ {
    		set $be "src3";
    		include director;
    	}

        location @cache1 {
                bytes on;    # ngx_http_bytes_filter module
                proxy_temp_path /var/www/cache1/tmp 1 2;
                proxy_cache cache1;
                proxy_pass           http://$be;
        }
        location @cache2 {
                bytes on;
                proxy_temp_path /var/www/cache2/tmp 1 2;
                proxy_cache cache2;
                proxy_pass           http://$be;
        }
        # ... @cache3 through @cache9 ...
        location @cache10 {
                bytes on;
                proxy_temp_path /var/www/cache10/tmp 1 2;
                proxy_cache cache10;
                proxy_pass           http://$be;
        }
    }

    The set_by_lua_file directive picks the appropriate drive for the URL by hashing. For each conditional "site", a backend is chosen and remembered in $be. Then, in the director file, the request is redirected to an internal named location, which serves it from the selected backend and stores the response in the cache assigned to this URL.

    And here is director:

    if ($cache_zone = 0) { return 481; }
    if ($cache_zone = 1) { return 482; }
    # ... and so on ...
    if ($cache_zone = 9) { return 490; }
    error_page 481 = @cache1;
    error_page 482 = @cache2;
    # ...
    error_page 490 = @cache10;

    It looks awful, but this is the only way to branch to a named location based on a variable.

    The heart of the whole configuration is the URL→disk hashing in cache_director.lua:
    -- Decode `seed` into a permutation of disks 0..base-1
    -- (factorial number system: each seed selects one of base! orderings).
    function shards_vector(base, seed)
        local result = {}
        local shards = {}
        for shard_n = 0, base - 1 do table.insert(shards, shard_n) end
        for b = base, 1, -1 do
            local chosen = math.fmod(seed, b) + 1
            table.insert(result, shards[chosen])
            table.remove(shards, chosen)
            seed = math.floor(seed / b)
        end
        return result
    end

    function file_exists(filename)
        local file = io.open(filename)
        if file then
            file:close()
            return 1
        end
        return 0
    end

    local disks = tonumber(ngx.arg[1])
    local url   = ngx.arg[2]

    -- Hash the URL by summing its bytes.
    local sum = 0
    for c in url:gmatch(".") do
        sum = sum + string.byte(c)
    end

    -- Walk the disks in the order given by this URL's permutation
    -- and use the first cache whose flag file is present.
    local sh_v = shards_vector(disks, sum)
    for _, v in pairs(sh_v) do
        if file_exists("/var/www/cache" .. (tonumber(v) + 1) .. "/ready") == 1 then
            return v
        end
    end

    In the set_by_lua_file directive mentioned above, this code receives the number of drives and the URL. The idea of mapping a URL directly to a drive works well only until at least one drive fails. Redirecting URLs from a failed drive to healthy ones must be deterministic for a given URL (otherwise there will be no cache hits), and at the same time different for different URLs, so that the load does not skew. Both of these properties must also hold if the replacement drive fails in turn, and so on. Therefore, for a system of n disks, I map each URL onto the set of all permutations of those n disks, and then try the caches in the order the disks appear in that permutation. The criterion for a disk (cache) being alive is the presence of a flag file in its directory. I had to chattr these files so that nginx does not delete them.
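    The same permutation decoding, re-expressed in Python as a self-check (mirroring the Lua shards_vector above; the example URL is illustrative):

```python
def shards_vector(base: int, seed: int) -> list[int]:
    """Decode `seed` into a permutation of disks 0..base-1."""
    result, shards = [], list(range(base))
    for b in range(base, 0, -1):
        chosen = seed % b           # pick one of the remaining disks
        result.append(shards.pop(chosen))
        seed //= b
    return result

# Every seed yields a full permutation: if the first disk is down,
# the next one in the ordering takes over, deterministically per URL.
perm = shards_vector(10, sum(b"/site1/video42.mp4"))
assert sorted(perm) == list(range(10))
print(perm)
```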


    This smearing of content across the disks really does let you use the full speed of the disk devices. A server with 6 inexpensive SSD disks under production load sustained about 1200 MB/s of output, which matches the combined speed of the disks; the RAID0 array, by comparison, fluctuated around 400 MB/s.
