Accelerating photo delivery

Published on December 09, 2010


    Every system administrator sooner or later runs into the problem of slow static content delivery.

    It shows up roughly like this: a 3 KB picture sometimes loads as if it weighed 3 MB, and CSS and JavaScript start to "stick" (load very slowly) out of the blue. You press Ctrl+Reload and the problem seems gone, only to repeat itself a few minutes later.

    The real cause of the slowdown is not always obvious, so we side-eye nginx, the hosting provider, a "clogged" channel, or a "slow" and "buggy" browser :)

    In fact, the problem is the imperfection of the modern hard drive, which has still not parted with its mechanical subsystems: spindle rotation and head positioning.

    In this article I will offer my solution to this problem, based on practical experience using SSD drives with the nginx web server.

    How do you tell that the disk is the bottleneck?

    On Linux, disk subsystem slowness shows up in the iowait metric (the percentage of CPU time spent idle waiting for I/O). Several commands let you monitor it: mpstat, iostat, sar. I usually run iostat 5 (a measurement every 5 seconds).
    I am relaxed about a server whose average iowait stays under 0.5%. On your "distribution" server this figure is most likely higher. If iowait > 10%, it makes sense not to put off optimization: the system is spending a lot of time moving the drive heads instead of reading data, and this slows down other processes on the server as well.
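    If sysstat is not installed, a rough iowait figure can also be read straight from /proc/stat (a sketch; it reports the share averaged since boot, so iostat 5 remains the better tool for interval measurements):

```shell
# Cumulative iowait share since boot, read from /proc/stat. The "cpu"
# line holds tick counters: user nice system idle iowait irq softirq ...
iowait_pct() {
  awk '/^cpu / {
    total = 0
    for (i = 2; i <= NF; i++) total += $i
    printf "%.1f\n", 100 * $6 / total
  }' /proc/stat
}
iowait_pct
```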

    What to do with a big iowait?

    Obviously, if you reduce the number of disk I/O operations, the hard drive's life gets easier and iowait falls.
    Here are some suggestions:
    • Disable the access_log.
    • Stop updating the last-access time on files and directories, and let the system cache writes to disk. To do this, mount the file system with the options async, noatime, barrier=0. (barrier=0 is an unjustified risk if a database lives on the same partition.)
    • You can increase the interval between flushes of dirty buffers via vm.dirty_writeback_centisecs in /etc/sysctl.conf. I have set vm.dirty_writeback_centisecs = 15000.
    • Have you by any chance forgotten the expires max directive?
    • Enabling file descriptor caching (open_file_cache) will not hurt either.
    • Apply client-side optimizations: CSS sprites, all CSS in one file, all JS in one file.
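    Taken together, the mount and sysctl tweaks above might look like this (a sketch: the device name and mount point are placeholders, and remember that barrier=0 is an unjustified risk if a database lives on the same partition):

```
# /etc/fstab: the static-content partition
/dev/sdb1   /var/www/img   ext4   async,noatime,barrier=0   0 0

# /etc/sysctl.conf: flush dirty buffers every 150 s instead of the default 5 s
vm.dirty_writeback_centisecs = 15000
```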

    This will help a little and buy time until a hardware upgrade. If the project keeps growing, iowait will soon remind you of itself. :)

    Upgrading the hardware

    • Add RAM.
      This is probably the place to start. Linux uses all "free" RAM for the disk cache.
    • Good old RAID
      You can assemble a software or hardware RAID from several HDDs. In some cases it makes sense to increase the number of hard drives but not gather them into a RAID (for example, when serving ISO images, large video files, ...).
    • Solid-state drives: let's try something new
      And, in my opinion, the cheapest upgrade option is to add one or more SSDs to the system. As you may have guessed, this is the acceleration method we will talk about today.

    A CPU upgrade will not affect delivery speed at all, because the CPU is not what is "slowing down"! :)

    Why SSD

    A year and a half ago, when I wrote the article "Tuning nginx", one of my suggestions for speeding up nginx was to use an SSD. The Habr community showed restrained interest in the technology: there were reports of SSDs possibly slowing down over time, and fears about the limited number of rewrite cycles.
    Very soon after that article was published, our company acquired a Kingston SNE125-S2/64GB, based on the Intel X25-E SSD, which is still in use today on one of our most loaded "distribution" servers.

    After a year of experiments a number of shortcomings emerged, which I would like to mention:
    • A marketing trick: if the SSD's advertising claims a maximum read speed of 250 MB/s, the average read speed will be about 75% of the declared maximum (~190 MB/s). This held for MLC and SLC, expensive drives and cheap ones alike.
    • The larger a single SSD, the higher the cost per megabyte on that drive.
    • Most file systems are not adapted for SSDs and can produce an uneven write load on the disk.
    • Only the most modern (and therefore most expensive) RAID controllers are adapted for connecting SSDs.
    • SSDs are still an expensive technology.

    Why I use SSDs anyway:
    • The advertising does not lie: seek time really does approach zero. This dramatically reduces iowait when "distributing" a large number of files in parallel.
    • Yes, the number of rewrite cycles is limited, but we know about it and can minimize the amount of rewritten data using the method described below.
    • Drives built on SLC (Single-Level Cell) technology with a smart controller are already available; their number of rewrite cycles is an order of magnitude higher than that of ordinary MLC SSDs.
    • Modern file systems (e.g. btrfs) can already work correctly with SSDs.
    • As a rule, a caching server needs a modest amount of cache space (100-200G in our case), which fits on a single SSD. This turns out to be significantly cheaper than a solution based on a hardware RAID array with several SAS disks.

    Setting up the SSD cache

    Choosing a file system
    At the start of the experiment, ext4 was installed on the Kingston SNE125-S2/64GB. The Internet is full of recommendations on how to "cut off" journaling, last-access times, and so on. Everything worked fine, and for a long time. Except for the most important thing: of a large set of small 1-5K pictures, less than half fit on the 64G SSD, about 20G. I began to suspect that my SSD was not being used rationally.

    I upgraded the kernel to 2.6.35 and decided to try the (still experimental) btrfs, which can be told at mount time that it sits on an SSD. The disk does not have to be split into partitions, as is customary; it can be formatted as a whole:

    mkfs.btrfs /dev/sdb

    When mounting, you can disable many features we do not need and enable compression of files and metadata. (In fact, JPEGs will not be compressed: btrfs is smart, and only the metadata will be compressed.) Here is what my mount line in fstab looks like (all on one line):

    UUID=7db90cb2-8a57-42e3-86bc-013cc0bcb30e /var/www/ssd btrfs device=/dev/sdb,device=/dev/sdc,device=/dev/sdd,noatime,ssd,nobarrier,compress,nodatacow,nodatasum,noacl,notreelog 1 2

    You can find out the UUID of a formatted disk with the command
    blkid /dev/sdb

    As a result, more than 41G fit on the disk (twice as much as with ext4), and delivery speed did not suffer (iowait did not grow).

    Assembling a RAID from SSDs
    The moment came when the 64G SSD grew too small. I wanted to gather several SSDs into one large partition, and at the same time to use not only expensive SLC drives but also ordinary MLC SSDs. A little theory is needed here:

    Btrfs stores 3 kinds of data on disk: data about the file system itself, metadata blocks (there are always 2 copies of the metadata on the disk), and the actual data (file contents). Experimentally I found that in our directory structure the "compressed" metadata occupies ~30% of all data in the partition. Metadata is the most intensively rewritten kind of block, because any file creation, file move, or permission change entails rewriting a metadata block. The area where the data itself is stored is rewritten far less often. This brings us to the most interesting feature of btrfs: it can build software RAID arrays and be told explicitly which disks to keep the data on, and which the metadata.

    mkfs.btrfs -m single /dev/sdc -d raid0 /dev/sdb /dev/sdd

    As a result, the metadata will be kept on /dev/sdc and the data on /dev/sdb and /dev/sdd, assembled into a striped RAID (raid0). Moreover, you can later attach more disks to a live system, rebalance the data, and so on.

    To find out the UUID btrfs RAID, run:
    btrfs device scan

    Attention, a peculiarity of working with a btrfs RAID: before each mount of the RAID array (and after loading the btrfs module), you must run the command btrfs device scan. For automatic mounting via fstab you can do without 'btrfs device scan' by adding device options to the mount line. Example:
    /dev/sdb     /mnt    btrfs    device=/dev/sdb,device=/dev/sdc,device=/dev/sdd,device=/dev/sde

    Caching on nginx without proxy_cache

    I assume you have a storage server holding all the content: it has plenty of space and ordinary "slow" SATA hard drives, which cannot withstand a heavy concurrent load.
    Between the storage server and the site's users stands a "distribution" server, whose job is to take the load off the storage server and ensure uninterrupted "distribution" of static files to any number of clients.

    We install one or more SSDs with btrfs on board into the distribution server. An nginx configuration based on proxy_cache immediately suggests itself. But it has several disadvantages for our system:
    • On every restart, nginx starts gradually scanning the entire contents of the proxy_cache. With several hundred thousand files this is quite acceptable, but if we put a really large number of files into the cache, this behavior of nginx is an unjustified waste of disk operations.
    • proxy_cache has no built-in system for "pruning" the cache, and third-party modules only allow purging the cache one file at a time.
    • There is a small CPU overhead: on every response, an MD5 hash is computed over the string given in the proxy_cache_key directive.
    • Most importantly for us: proxy_cache does not care about refreshing the cache with the fewest possible rewrite cycles. If a file "falls out" of the cache, it is deleted, and if requested again it is written to the cache anew.

    We will take a different approach to caching. The idea flashed by at one of the highload conferences. In the cache partition we create 2 directories, cache0 and cache1. All proxied files are saved into cache0 (using proxy_store). nginx is made to check for the file (and serve it to the client) first in cache0, then in cache1; if the file is found in neither, it goes to the storage server for it and then saves it into cache0.
    After some time (a week/month/quarter), we delete cache1, rename cache0 to cache1, and create an empty cache0. Then we analyze the access logs for the cache1 area and hard-link back into cache0 those files that are still being requested from it.

    This method dramatically reduces write operations on the SSD, because re-linking a file costs far less than rewriting it completely. Besides, you can assemble a RAID from several SSDs, one of them an SLC drive for the metadata and MLC SSDs for the ordinary data. (On our system the metadata occupies roughly 30% of the total data volume.) When linking, only metadata is rewritten!
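    The rotation and re-linking steps described above can be sketched in a few lines of shell (the paths and the promote helper are illustrative assumptions, not the author's actual ria_ssd_cache_mover.sh):

```shell
#!/bin/sh
# CACHE_ROOT is an assumed mount point for the btrfs cache partition.
CACHE_ROOT=${CACHE_ROOT:-/var/www/ssd}

# Drop the oldest generation, demote cache0 to cache1, start a fresh cache0.
rotate_cache() {
  rm -rf "$CACHE_ROOT/cache1"
  mv "$CACHE_ROOT/cache0" "$CACHE_ROOT/cache1"
  mkdir "$CACHE_ROOT/cache0"
}

# Re-link a still-requested file from cache1 into cache0. On btrfs the
# hard link rewrites only metadata; the file's data blocks stay in place.
promote() {  # $1 = request path from the access log, e.g. /img/a.jpg
  src="$CACHE_ROOT/cache1$1"
  dst="$CACHE_ROOT/cache0$1"
  [ -f "$src" ] || return 0
  mkdir -p "$(dirname "$dst")"
  [ -e "$dst" ] || ln "$src" "$dst"
}
```

    After a rotation, you would feed the request paths collected in cache1's access log to promote, one per call.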

    Nginx configuration example
    log_format cache0  '$request';
    # ...
    server {
      expires max;
      location / {
        root /var/www/ssd/cache0/;
        try_files $uri @cache1;
        access_log off;
      }
      location @cache1 {
        root /var/www/ssd/cache1;
        try_files $uri @storage;
        access_log /var/www/log_nginx/img_access.log cache0;
      }
      location @storage {
        proxy_pass http://img_storage;  # your storage server upstream
        proxy_store on;
        proxy_store_access user:rw  group:rw  all:r;
        proxy_temp_path /var/www/img_temp/; # must NOT be on the SSD!
        root /var/www/ssd/cache0/;
        access_log off;
      }
    }
    # ...

    Scripts for rotating cache0 and cache1
    I wrote several bash scripts that will help you implement the rotation scheme described above. If your cache size is measured in hundreds of gigabytes and the content in it in millions of files, then right after a rotation it makes sense to run the ria_ssd_cache_mover.sh script several times in a row:
    for i in `seq 1 10`; do ria_ssd_cache_mover.sh; done;
    Choose how long to let this command run experimentally: for me it ran for almost a day. Starting the next day, have cron launch ria_ssd_cache_mover.sh every hour.
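    The hourly launch can be a plain crontab entry (the script path here is an assumption; use wherever you installed ria_ssd_cache_mover.sh):

```
# /etc/crontab: run the cache mover every hour
0 * * * *   root   /usr/local/bin/ria_ssd_cache_mover.sh
```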

    Protecting the storage server from DoS
    If the storage server is weak and there are ill-wishers eager to strangle your system, you can use the secure_link module together with the solution described above.
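    A sketch of such protection using the secure_link module's secure_link_secret mode (the secret word and the /p/ prefix are placeholders; consult the ngx_http_secure_link_module documentation for your nginx version):

```
# Links look like /p/<md5-hash>/img/a.jpg; nginx validates the hash
# against the tail of the URI plus the secret word.
location /p/ {
    secure_link_secret my_secret_word;   # placeholder secret
    if ($secure_link = "") {
        return 403;
    }
    rewrite ^ /$secure_link;
}
```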

    Useful links

    UPD1: Still, I advise using a kernel >= 2.6.37, because on 2.6.35 I recently had a major cache crash when the SSD ran out of metadata space. As a result, I had to reformat several SSDs and reassemble the btrfs RAID. :(