
Distributing static content: every millisecond counts

Eight years ago, I wrote an article about accelerating the distribution of static content; it struck a chord with some Habr readers and stayed relevant for a long time.
Now we have decided to accelerate what was already working fast and, along the way, share how it turned out. Of course, I'll talk about the pitfalls we hit, about where HTTP/2 is unnecessary, about why we buy one 7.68 TB NVMe SSD instead of 8x1 TB SATA SSDs, and about plenty of other highly specialized matters.
Let's agree up front that storing content and distributing it are two different tasks, and here we will talk only about distribution (an advanced cache).
Let's start with the hardware...
NVMe SSD
As you may have guessed, we keep up with progress and store the cache on a modern 7.68 TB HGST Ultrastar SN260 (HUSMR7676BHP3Y1) NVMe HH-HL AIC SSD. Drive benchmarks paint a less beautiful picture than the marketing materials, but a rather optimistic one:
[root@4 www]# hdparm -Tt --direct /dev/nvme1n1
/dev/nvme1n1:
Timing O_DIRECT cached reads: 2688 MB in 2.00 seconds = 1345.24 MB/sec
Timing O_DIRECT disk reads: 4672 MB in 3.00 seconds = 1557.00 MB/sec
[root@4 www]# hdparm -Tt /dev/nvme1n1
/dev/nvme1n1:
Timing cached reads: 18850 MB in 1.99 seconds = 9452.39 MB/sec
Timing buffered disk reads: 4156 MB in 3.00 seconds = 1385.08 MB/sec
Of course, choose the capacity and manufacturer that suit you, and a SATA interface is also worth considering, but here we are writing about what one should strive for :)
If you do opt for NVMe, install the nvme-cli package to query the drive.
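Judging by the console prompt we run a CentOS-like system; the package usually ships in the standard repositories (the name may differ between distributions):

yum install nvme-cli          # RHEL/CentOS
# apt-get install nvme-cli    # Debian/Ubuntu equivalent

With the tool in place, let's look at the characteristics of our "workhorse":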
[root@4 www]# nvme smart-log /dev/nvme1n1
Smart Log for NVME device:nvme1n1 namespace-id:ffffffff
critical_warning : 0
temperature : 35 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 0%
data_units_read : 158,231,244
data_units_written : 297,968
host_read_commands : 45,809,892
host_write_commands : 990,836
controller_busy_time : 337
power_cycles : 18
power_on_hours : 127
unsafe_shutdowns : 14
media_errors : 0
num_err_log_entries : 10
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 35 C
Temperature Sensor 2 : 27 C
Temperature Sensor 3 : 33 C
Temperature Sensor 4 : 35 C
As you can see, the drive feels great; looking ahead, I'll note that under load the temperature stays in the same range.
At peak times we serve about 4,000 photos per second (photos are roughly 10-100 KB each), and at half that load iostat shows no more than 0.1% utilization. RAM also matters a great deal here, but more on that below.
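If you want to watch this yourself, iostat from the sysstat package prints per-device utilization; a minimal invocation for our drive:

# extended per-device statistics for the cache drive, refreshed every 5 seconds
iostat -x /dev/nvme1n1 5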
A few words about why we now bet on an expensive NVMe drive instead of a bag of cheap SATA SSDs. Our tests show that, with a similar server architecture, the same amount of RAM and the same load, the Ultrastar SN260 7.68TB NVMe runs with about 10 times less iowait than 8x Samsung SSD 850 PRO 1TB in a striped RAID on an Areca ARC-1882 PCI RAID controller. The servers differ slightly in core count (26 cores with the NVMe, 24 with the ARC-1882); both have 128 GB of RAM. Unfortunately, there was no way to compare power consumption on those two servers. I did, however, get to measure the power draw of the NVMe platform against a similar AMD system with a software striped RAID of 8x Intel SSDSC2BB480G4 480GB drives and an ARC-1680 PCI RAID controller on 24 AMD Opteron 6174 cores: under the same load, the new system eats 2.5 times less energy, 113 W versus 274 W on the AMD box. CPU load and iowait there are also an order of magnitude lower (the AMD has no hardware encryption support).
File system
Eight years ago we used btrfs and tried XFS, but ext4 behaves more briskly under heavy parallel load, so ext4 is our choice. Proper tuning can further enhance the already excellent performance of this fs.
Optimization starts at format time: for example, if you mostly serve 1-5 KB files, you can slightly reduce the block size when formatting:
mkfs.ext4 -b 2048 /dev/sda1
or even
mkfs.ext4 -b 1024 /dev/sda1
In order to find out the current block size on the file system, use:
tune2fs -l /dev/sda1 | grep Block
or
[root@4 www]# fdisk -l /dev/nvme1n1
Disk /dev/nvme1n1: 7681.5 GB, 7681501126656 bytes, 1875366486 sectors
Units = sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0x00000000
Since the sector size on our NVMe SSD is 4K, making the block size smaller than that is not optimal, so we format with the default:
mkfs.ext4 /dev/nvme1n1
Note that I did not partition the disk but formatted the entire block device; the OS lives on a different SSD, and that one is partitioned. The device can be mounted into the file system as /dev/nvme1n1.
Mount the cache disk so as to squeeze maximum speed out of it: turn off everything unnecessary and mount with the options noatime,barrier=0. If the atime attribute is important to you, kernels 4.0 and later offer the lazytime option, which keeps atime in RAM and partially solves the problem of frequent access-time updates on reads.
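As a minimal sketch, assuming the cache is mounted at /var/cache/ssd (the mount point is our example, reused in the nginx config below), the command and the matching /etc/fstab line could look like this:

mount -o noatime,barrier=0 /dev/nvme1n1 /var/cache/ssd
# /etc/fstab equivalent:
# /dev/nvme1n1  /var/cache/ssd  ext4  noatime,barrier=0  0 0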
RAM
If your static content fits in RAM, forget all of the above and enjoy serving files from memory.
When serving from RAM, the OS itself pulls frequently requested files into its cache, and you don't need to configure anything for that. But if the device holding the files is slow and there are very many of them (hundreds of thousands), the OS may request a lot of files from the file system at once, and the storage device will start to "crawl". One way around this is to load the statics onto a RAM disk and serve from it.
Example of creating a RAM disk:
mount -t tmpfs -o size=1G,mode=0700,noatime tmpfs /cache
If you forget which parameters you mounted with, check with findmnt:
findmnt --target /cache
You can remount without rebooting:
mount -o remount,size=4G,noatime /cache
You can also combine the two: keep part of the content in RAM (some frequently requested previews, say) and the rest on SSD.
In nginx, it will look something like this:
location / {
    root /var/cache/ram;
    try_files $uri @cache1;
}

location @cache1 {
    root /var/cache/ssd;
    try_files $uri @storage;
}
If the content does not fit into RAM, you have to trust the kernel: we run 128 GB of RAM with an active cache size of 3-5 GB and are thinking about increasing to 256 GB.
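"Trusting the kernel" here means relying on the page cache: the kernel keeps hot file data in otherwise free RAM automatically. You can see how much memory it currently spends on that with free:

# the 'buff/cache' column is the memory the kernel uses to cache file data
free -h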
CPU
There are no special requirements for processor frequency; rather, there is a requirement for functionality: if your traffic needs to be encrypted (e.g., served over https), it is important to choose a processor that supports hardware AES-NI (Intel Advanced Encryption Standard New Instructions) encryption.
On Linux, you can check whether the processor supports the AES-NI instructions with the command:
grep -m1 -o aes /proc/cpuinfo
aes
If there is no "aes" in the output, the processor lacks these instructions, and encryption will devour CPU performance.
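To feel the difference, you can benchmark OpenSSL with and without AES-NI; the OPENSSL_ia32cap mask below is the commonly cited way to switch the instructions off, and the exact effect may vary with the OpenSSL build:

# with AES-NI (picked up automatically when the CPU supports it)
openssl speed -evp aes-256-cbc
# with AES-NI masked out -- expect several times lower throughput
OPENSSL_ia32cap="~0x200000200000000" openssl speed -evp aes-256-cbc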
Configure nginx
The general ideas are described in the previous article, but a few more points of optimization remain; here they are:
- Avoid rewrites at the top level of the server directive, and don't welcome regular expressions in location sections. If you really cannot do without them, at least wrap the regex in a plain prefix location, for example:
location /example/dir {
    rewrite ^/example/dir(.*) /newexample/$1;
}
Otherwise the regex will be evaluated on every request to the distribution system; in our example, the rewrite regex is tried only when the photo path starts with /example/dir.
- When storing content in the cache, follow the rule of keeping no more than 255 files or folders in one folder (if you formatted the disk with the default 4K block size; 128 for 2K, and so on); see the sketch after this list.
- Run a fresh kernel, preferably 4.1 or later. In nginx, don't forget to enable SO_REUSEPORT support (the reuseport flag of the listen directive); it has a positive effect on parallel file downloads from the distribution server.
- Place the servers closer to your users: it makes users happier, they notice it, and search engines appreciate it.
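The sketch promised above: one hypothetical way to keep folders small is to derive the cache path from a hash of the file name, giving two directory levels of at most 256 entries each (this is an illustration, not our production layout):

#!/bin/bash
# place a photo under two hash-derived directory levels so that
# no folder accumulates more entries than the rule of thumb allows
name="12345.jpg"                                   # hypothetical file name
hash=$(printf '%s' "$name" | md5sum | cut -c1-4)   # first 4 hex chars of the md5
dir="/var/cache/ssd/${hash:0:2}/${hash:2:2}"
mkdir -p "$dir"
cp "$name" "$dir/"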
How about a CDN
A CDN is good if you work in different countries and have a lot of content but little traffic; if you have a lot of traffic, little content, and work for one specific country, it makes sense to run the numbers and see what is more profitable for you. For example, we work in the Ukrainian market, and many of the world's leading CDN providers have no servers in Ukraine, so delivery comes from Germany or Poland. Instead of +3-5 ms we thus get +30-50 ms of response time out of the blue. Colocating a 2U server in a good Ukrainian DC starts at $18, plus payment for the channel, e.g. $10 for 100 Mbps, $28 in total. The commercial price for CDN traffic delivered in Ukraine is about $0.05/GB, so once you distribute more than 560 GB/month you can already consider self-distribution. RIA.com services distribute several terabytes of statics per day, so we settled on self-distribution long ago.
How to make friends with search engines
For many search engines, the important characteristics are TTFB (time to first byte) and how close the content sits to whoever is looking for it; on top of that come the text of links to the content, descriptions in Exif tags, uniqueness, content size, and so on.
Everything I write here serves mainly to shorten TTFB and be closer to the user. You could resort to a trick: detect search bots by User-Agent and serve them from a separate server to avoid "congestion" or "slowdowns at peak times" (bots usually generate a uniform load), thereby keeping both search engines and users happy. We don't do this; besides, there is a suspicion that Google and Yandex trust the page-load-speed data that Google Chrome and Yandex Browser report from the client's position.
It is also worth noting that the load from various bots can be so significant that nearly half of your resources go to serving them. RIA.com projects serve about 10-15 million bot requests per day (including requests not only for statics but also for regular pages), which is not much less than the number of requests from real users.
Optimizing the content you distribute
Well, once the distribution process is set up, it's time to think about what you can do with the content itself so that it is more accessible, loads faster, takes up less space, and appeals to search engines.
- The first thing to look at is the photo format: it turns out that jpeg, so popular on many projects, already loses in size, at comparable quality, to newer Web formats, for example WebP, which Google has been promoting since 2010. According to various sources, it yields 20-30% smaller files at equal quality. On the client you can use a special tag that lists several formats the browser may display; if the browser does not support WebP, it loads, for example, the jpeg.
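For the conversion itself, the cwebp utility from Google's libwebp package does the job; the file name and quality setting below are illustrative:

# re-encode a jpeg as WebP at quality 80
cwebp -q 80 photo.jpg -o photo.webp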
A little about SEO requirements as well:
- You have already done part of the SEO optimization by placing the distribution server closer to the client and speeding up its responses.
- A few words about Exif tags: many of them get cut out when photos are scaled, and in vain! Google analyzes this information too, so why not state what the photo shows in the ImageDescription Exif tag, or write the copyright for your content in the Copyright Exif tag, provided, of course, the content is yours :)
- Do not forget the HTTP headers, which carry various useful meta information about the content file. For example, the Expires header specifies how long content may be kept in the browser cache.
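Two quick sketches for the last two points, with illustrative file and domain names: exiftool writes the Exif tags, and curl shows which caching headers the distribution server actually sends:

# describe the photo and claim authorship (tag values are examples)
exiftool -ImageDescription="Sunset over the Dnipro" -Copyright="example.com" photo.jpg
# inspect the caching headers returned for a photo
curl -sI https://img.example.com/photo.jpg | grep -iE '^(expires|cache-control)'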
If your site runs over HTTP/2, it is worth experimenting with the following optimizations:
- You can give up css sprites, since multiplexing small files over a single connection can compensate for the gain this trick usually brings over http/1.1.
- Try HTTP/2 server push; note that this download optimization does not know whether the pushed content is already in the browser cache, but that can be solved with cookies and a simple nginx configuration, as in the example below (a way to verify pushes is sketched after this list):
server {
    listen 443 ssl http2 default_server;

    ssl_certificate     ssl/certificate.pem;
    ssl_certificate_key ssl/key.pem;

    root /var/www/html;
    http2_push_preload on;

    location = /demo.html {
        add_header Set-Cookie "session=1";
        add_header Link $resources;
    }
}

# the pushed resource paths below are placeholders; the original URLs
# were lost in the article's formatting
map $http_cookie $resources {
    "~*session=1" "";
    default "</style.css>; as=style; rel=preload, </image1.jpg>; as=image; rel=preload, </style2.css>; as=style; rel=preload";
}
- In our experience, it is better to distribute content from a domain name different from the project's main domain.
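The verification sketch mentioned above: the nghttp client from the nghttp2 project can show whether resources really arrive as pushes (the domain is a placeholder):

# -a fetches linked assets, -n discards the data, -s prints statistics
# in which pushed resources are marked with '*'
nghttp -ans https://example.com/demo.html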
Mysterious HTTP/2
As you know, HTTP/2 multiplexes a single connection over which several files are transferred, with prioritization of which file to send, header compression, and the other advantages of the new protocol; but there are also drawbacks that few people write about. Let me start from afar: perhaps some old-timers remember the Internet before the uTorrent era, when many of us used download managers such as FlashGet or Download Master. Remember how they worked? They downloaded one file in 6 or 8 threads, opening 6-8 connections to the sending server. Why did they do that? In theory, the channel between the sending and receiving sides should not depend on the number of connections between them, but in practice this dependence exists when the channel is bad, with packet loss and transmission errors: in such conditions, downloading in several threads is faster. Moreover, when a channel is shared by several clients, opening more connections helps one client get more bandwidth and pull the "resource blanket" toward itself. Of course, this does not always happen, but there is still the threat of running into a "grabby" competitor in the form of a browser speaking http/1.1, which opens 6 connections to a site instead of 1 over http/2. In my practice there was a case with a "desktop wallpaper photo hosting" kind of site that gave up http/2: over http/2 the site slowed down noticeably with no visible load on the server itself; the guys stayed on https but switched back to http/1.1, and the situation was resolved.
I would also experiment with fetching content from different domains (that actually live on one server); this technique is called domain sharding, and it is a possible way out of the situation that keeps http/2 while making the browser open as many connections as the site administrator needs.
Instead of a conclusion
The moment a site starts to slow down can hardly be called unpleasant, because you can feel the traffic growing and the customers multiplying. We never set ourselves the task of avoiding "brakes" entirely; we learned to respond quickly to this growth. Performance optimization is an endless process, so do not deny yourself the pleasure of competing with your rivals in the art of being fast!