DmitryKoterov October 16, 2009 at 01:50

Pitfalls when using caching in nginx

The nginx web server and reverse proxy has very powerful features for caching HTTP responses. However, the documentation and examples are sometimes not enough, so not everything turns out as easy and simple as we would like. My own nginx configs, for example, are written in blood in places. With this article I will try to improve the situation a little.

In this article: a) pitfalls with full-page caching; b) caching with rotation; c) creating a dynamic “window” in a cached page.

I will assume that you are using nginx + fastcgi_php. If you are using nginx + apache + mod_php, just replace fastcgi_cache* in the directive names with proxy_cache*.
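As a rough illustration of that renaming, the full-page configuration discussed in the next section could be written for a proxied Apache backend roughly as follows. This is only a sketch: the backend address 127.0.0.1:8080 is an assumption, and the other fastcgi_* directives are renamed to their proxy_* counterparts in the same way.

```nginx
# Goes into the http context.
# Assumption: Apache with mod_php listens on 127.0.0.1:8080.
proxy_cache_path /var/cache/nginx levels= keys_zone=wholepage:50m;

server {
    location / {
        proxy_pass http://127.0.0.1:8080;
        # Same settings as in the FastCGI variant, with proxy_* names.
        proxy_cache wholepage;
        proxy_cache_valid 200 301 302 304 5m;
        proxy_cache_key "$request_method|$http_if_modified_since|$http_if_none_match|$host|$request_uri";
        proxy_hide_header "Set-Cookie";
        proxy_ignore_headers "Cache-Control" "Expires";
    }
}
```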

When I have to choose between caching a page on the PHP side and on the nginx side, I choose nginx. First, it lets you serve 5-10 thousand requests per second without any difficulty and without clever talk about "high load". Second, nginx keeps track of the cache size itself and cleans it up, both when entries become stale and when rarely used data has to be evicted.

Whole page caching

If the main page of your site is generated dynamically but rarely changes, you can greatly reduce the load on the server by caching it in nginx. With high traffic, even short-term caching (5 minutes or less) gives a huge performance gain, because the cache is very fast. Even if you cache the page for only 30 seconds, you still get a significant reduction in server load while keeping the data fresh (in many cases, updating every 30 seconds is enough).

For example, you can cache the main page like this:

fastcgi_cache_path /var/cache/nginx levels= keys_zone=wholepage:50m;
...
server {
  ...
  location / {
    ...
    fastcgi_pass 127.0.0.1:9000;
    ...
    # Turn on caching and carefully select the cache key.
    fastcgi_cache wholepage;
    fastcgi_cache_valid 200 301 302 304 5m;
    fastcgi_cache_key "$request_method|$http_if_modified_since|$http_if_none_match|$host|$request_uri";
    # We guarantee that different users will not receive the same session cookie.
    fastcgi_hide_header "Set-Cookie";
    # Make nginx cache the page anyway, regardless of
    # the caching headers set by PHP.
    fastcgi_ignore_headers "Cache-Control" "Expires";
  }
}

I will not be exaggerating much if I say that every line in this config is written in blood. There are many pitfalls; let's look at them all.

fastcgi_cache_path: ease of debugging is also important

fastcgi_cache_path /var/cache/nginx levels= keys_zone=wholepage:50m;

In the fastcgi_cache_path directive I set levels to an empty value. Although this slightly reduces performance (files are created directly in /var/cache/nginx, without being split into subdirectories), it makes debugging and diagnosing cache problems much easier. Believe me, you will have to dig into /var/cache/nginx and look at what is stored there more than once.
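For a setup where inspecting files by hand matters less, the same directive could use a two-level directory layout and an explicit size limit. This is only a sketch: levels=1:2 is the value commonly shown in the nginx documentation, while the max_size and inactive values are assumed numbers, not recommendations from this article.

```nginx
# Assumed values; tune max_size and inactive for your own site.
fastcgi_cache_path /var/cache/nginx levels=1:2 keys_zone=wholepage:50m
                   max_size=1g inactive=10m;
```

With levels=1:2, cache files end up in subdirectories named after the trailing characters of the file name (the MD5 hash of the cache key), so finding a specific entry by hand takes an extra step.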

fastcgi_cache_valid: cache response code 304 too

fastcgi_cache_valid 200 301 302 304 5m;

In the fastcgi_cache_valid directive we force nginx to cache not only the standard codes 200 OK, 301 Moved Permanently and 302 Found, but also 304 Not Modified. Why? Let's recall what 304 means. It is returned with an empty response body in two cases:

If the browser sent an “If-Modified-Since: date” header in which date is greater than or equal to the value of the “Last-Modified: date” response header. That is, the client asks: “Is there a version newer than date? If not, return 304 and save traffic. If there is, give me the body of the page.”
If the browser sent an “If-None-Match: hash” header in which hash matches the value of the “ETag: hash” response header. That is, the client asks: “Does the current version of the page differ from the one I requested last time? If not, return 304 and save traffic. If it does, give me the body of the page.”

In both cases, Last-Modified or ETag will most likely be taken from the nginx cache, and the check will be very fast. There is no need to “pull” PHP just so that the script can output these headers, especially since clients that receive a 200 response are served from the cache anyway.

fastcgi_cache_key: working carefully with dependencies

fastcgi_cache_key "$request_method|$http_if_modified_since|$http_if_none_match|$host|$request_uri";

Of particular note is the value in the fastcgi_cache_key directive. I have given the minimum working value of this directive. A step to the right, a step to the left, and in some cases you will begin to receive "incorrect" data from the cache. So:

We need the dependency on $request_method because HEAD requests are quite common on the Internet. The response to a HEAD request never contains a body. If you remove the dependency on $request_method, it may happen that someone requested the main page with the HEAD method before you, and then you get empty content on GET.
The dependency on $http_if_modified_since is needed so that a cached 304 Not Modified response is not accidentally given to a client making a regular GET request. Otherwise, the client may receive an empty response from the cache.
The same goes for $http_if_none_match. We must be insured against serving blank pages to clients!
Finally, the dependency on $host and $request_uri needs no comment.
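To check that the key behaves as described, one option is to expose the cache status in a response header while debugging. The sketch below relies on the $upstream_cache_status variable, which is present in current nginx versions; the header name X-Cache-Status is an arbitrary choice made for this example.

```nginx
location / {
    # ... fastcgi_pass and the caching directives from the example above ...
    fastcgi_cache wholepage;
    # X-Cache-Status is a hypothetical header name used only for debugging.
    # $upstream_cache_status takes values such as MISS, HIT or EXPIRED.
    add_header X-Cache-Status $upstream_cache_status;
}
```

Requesting the same URL with GET and HEAD, with and without If-Modified-Since, should then show a separate MISS for each combination before the hits begin.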

fastcgi_hide_header: resolving security issues

fastcgi_hide_header "Set-Cookie";

The fastcgi_hide_header directive is very important. Without it you seriously risk security: users can get other people's sessions through a session cookie stored in the cache. (True, in the latest versions of nginx something has been done to take this into account automatically.) Do you see how this happens? Vasya Pupkin visits the site and is given a session and a session cookie. Suppose the cache is empty at that moment, so Vasya's cookie is written into it. Then another user comes along and receives a response from the cache, and with it Vasya's cookie. And that means his session, too.

You can, of course, say: let's not call session_start() on the main page, and then there will be no problems with cookies. In theory this is true, but in practice the approach is very fragile. Sessions are often started lazily, and it is enough for any part of the code to “accidentally” call a function that needs access to the session, and we get a security hole. And security is such a thing that if a technique allows a hole to appear through mere carelessness, the technique is considered “leaky” by definition. Besides, there are cookies other than session cookies, and they should not be written to the cache either.

fastcgi_ignore_headers: do not let the site “lie down” from the load during a typo

fastcgi_ignore_headers "Cache-Control" "Expires";

The nginx server pays attention to the Cache-Control, Expires and Pragma headers that PHP produces. If they say that the page must not be cached (or that it is already stale), nginx does not write it to a cache file. This behavior, although it seems logical, creates a lot of difficulties in practice. Therefore we block it: thanks to fastcgi_ignore_headers, the contents of any page, regardless of its headers, will go into the cache files.

What are these difficulties? They are again related to sessions and the session_start() function, which in PHP by default sets the headers “Cache-Control: no-cache” and “Pragma: no-cache”. There are three solutions to the problem:

Do not use session_start() on the page that is supposed to be cached. We have already discussed one drawback of this method above: one careless move is enough, and your site, which receives thousands of requests per second to the cached main page, will instantly “go down” when the cache switches off. The second drawback is that the caching logic has to be managed in two places: in the nginx config and in the PHP code. That is, this logic ends up “smeared” across completely different parts of the system.
Set ini_set('session.cache_limiter', ''). This stops PHP from emitting any headers that restrict caching when working with sessions. The problem here is the same: “smeared” caching logic, whereas ideally we would like all caching to be managed from a single place.
Ignore the headers that forbid caching when writing to cache files, using fastcgi_ignore_headers. This seems to be a win-win solution, which is why I recommend it.
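On the newer nginx versions mentioned in the fastcgi_hide_header section, a response carrying a Set-Cookie header is not written to the cache at all unless nginx is told to ignore that header as well. Assuming such a version, the two directives might be combined as in the sketch below; check the fastcgi_ignore_headers documentation for your version before relying on it.

```nginx
location / {
    # ... fastcgi_pass and the other caching directives ...
    fastcgi_cache wholepage;
    # Cache the response even if PHP forbids caching or sets a cookie...
    fastcgi_ignore_headers "Cache-Control" "Expires" "Set-Cookie";
    # ...but never pass that cookie on to clients.
    fastcgi_hide_header "Set-Cookie";
}
```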

Caching with rotation

A static home page is not that interesting. What should you do if the site has a lot of material and the main page acts as a kind of “showcase” for it? On such a showcase it is convenient to display “random” items, so that different users see different things (and even a single user gets new content by reloading the page in the browser).

The solution is caching with rotation:

We make the script honestly return the elements of the main page in random order, performing the necessary database queries (even if slowly).
Then we store not one but, say, 10 variations of the page in the cache.
When a user visits the site, we show one of these variations. If the cache entry is empty, the script is run; if not, the result is returned from the cache.
We set a short cache lifetime (for example, 1 minute) so that over a day different users “look through” all the materials on the site.

As a result, the first 10 requests to the generator script are executed “honestly” and put load on the server. But after that they “settle” in the cache and are served quickly for the next minute. The more visitors the site has, the greater the performance gain.

Here is a piece of nginx config that implements caching with rotation:

fastcgi_cache_path /var/cache/nginx levels= keys_zone=wholepage:50m;
perl_set $rand 'sub { return int rand 10 }';
...
server {
  ...
  location / {
    ...
    fastcgi_pass 127.0.0.1:9000;
    ...
    # Turn on caching and carefully select the cache key.
    fastcgi_cache wholepage;
    fastcgi_cache_valid 200 301 302 304 1m;
    fastcgi_cache_key "$rand|$request_method|$http_if_modified_since|$http_if_none_match|$host|$request_uri";
    # We guarantee that different users will not receive the same session cookie.
    fastcgi_hide_header "Set-Cookie";
    # Make nginx cache the page anyway, regardless of
    # the caching headers set by PHP.
    fastcgi_ignore_headers "Cache-Control" "Expires";
    # We force the browser to reload the page each time (for rotation).
    fastcgi_hide_header "Cache-Control";
    add_header Cache-Control "no-store, no-cache, must-revalidate, post-check=0, pre-check=0";
    fastcgi_hide_header "Pragma";
    add_header Pragma "no-cache";
    # We always issue fresh Last-Modified.
    expires -1; # Attention!!! This expires line is needed!
    add_header Last-Modified $sent_http_Expires;
  }
}

You may notice that compared to the previous example, I had to add 6 more directives in location. They are all very important! But let's not get ahead of ourselves, we will consider everything in order.

perl_set: randomizer dependency

perl_set $rand 'sub { return int rand 10 }';

The perl_set directive is simple. We create a variable; whenever it is used, nginx calls a function in its built-in Perl interpreter. According to the author of nginx, this is a fairly fast operation, so we won't pinch pennies here. The variable takes a random value from 0 to 9 in each HTTP request.
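If nginx is built without the embedded Perl module, a similar randomizer can probably be put together from the split_clients directive. The sketch below is an alternative to perl_set, not the method used in this article, and it assumes nginx 1.11.0 or later, where the $request_id variable (a random identifier generated for each request) is available.

```nginx
# Goes into the http context, instead of the perl_set line.
# split_clients hashes the given string and picks a value by percentage;
# since $request_id is random, each request lands in one of 10 buckets.
split_clients "${request_id}" $rand {
    10%  0;
    10%  1;
    10%  2;
    10%  3;
    10%  4;
    10%  5;
    10%  6;
    10%  7;
    10%  8;
    *    9;
}
```

The resulting $rand can be mixed into fastcgi_cache_key exactly like the Perl-based variable.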

fastcgi_cache_key: randomizer dependency

fastcgi_cache_key "$rand|$request_method|...";

Now we are mixing the randomizer variable into the cache key. The result is 10 different caches for the same URL, which we needed. Due to the fact that the script called during a cache miss returns elements of the main page in random order, we get 10 varieties of the main page, each of which “lives” for 1 minute (see fastcgi_cache_valid).

add_header: force turn off browser cache

fastcgi_hide_header "Cache-Control";
add_header Cache-Control "no-store, no-cache, must-revalidate, post-check=0, pre-check=0";
fastcgi_hide_header "Pragma";
add_header Pragma "no-cache";

We said above that nginx is sensitive to the caching headers output by the PHP script. If the PHP script returns the headers "Pragma: no-cache" or "Cache-Control: no-store" (or some others, for example "Cache-Control: no-save, no-issue, I-was-never-here, I-didn't-say-whose-hat-this-is"), nginx will not save the result in cache files. We use fastcgi_ignore_headers specifically to suppress this behavior (see above).

What is the difference between Pragma: no-cache and Cache-Control: no-cache? Only that Pragma is a legacy of HTTP/1.0 and is now kept for compatibility with older browsers. HTTP/1.1 uses Cache-Control.

However, there is also the cache in the browser. In some cases the browser may not even try to make a request to the server to display the page; instead, it will take the page from its own cache. Since we have rotation, this behavior is inconvenient for us: every time a user visits the page, they should see new data. (In fact, if you do want one variation to be cached in the browser, you can experiment with the Cache-Control header.)

The add_header directive passes the header that forbids caching to the browser. To keep this header from being accidentally duplicated, we first use fastcgi_hide_header to remove from the HTTP response whatever the PHP script wrote there (and whatever was stored in the nginx cache). After all, when you write the nginx config, you do not know what PHP will decide to output (and if session_start() is used, it definitely will). What if it sets its own Cache-Control header? Then there would be two of them: the one from PHP and the one we added through add_header.

expires and Last-Modified: guarantee page reload

expires -1; # Attention!!! This expires line is needed!
add_header Last-Modified $sent_http_Expires;

Another trick: we must set Last-Modified to the current time. Unfortunately, nginx has no variable that stores the current time, but one magically appears if you specify the expires -1 directive.

Although this is not documented at the time of writing (October 2009), nginx creates a variable of the form $sent_http_XXX for each XXX response header sent to the client. We use one of them.

Why is it so important to set this header to the current time? It is all quite simple.

Let's imagine that PHP sent the header "Last-Modified: some_date".
This header will be written to the nginx cache file (you can check: in our example, the files are stored in /var/cache/nginx) and then sent to the client's browser.
The browser will remember the page and its modification date ...
... so the next time the user visits the site, the HTTP request will carry the header “If-Modified-Since: some_date”.
What will nginx do? It will take the page from its cache, parse its headers and compare Last-Modified with If-Modified-Since. If the values match (or the first is earlier than the second), nginx will return a “304 Not Modified” response with an empty body. And the user will not see any rotation: they will get what they have already seen before.
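As an additional safeguard against the scenario above, nginx also has an if_modified_since directive that controls this comparison. The sketch below is an assumption about how it could be applied here rather than part of the original recipe, and its interaction with cached FastCGI responses is worth verifying on your own version.

```nginx
location / {
    # ... fastcgi_pass and the caching directives from the rotation example ...
    # Treat every response as modified, so that nginx does not answer
    # If-Modified-Since requests with 304 Not Modified on its own.
    if_modified_since off;
}
```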

In fact, it is an open question how the browser behaves when both Last-Modified and Cache-Control: no-cache are present. Will it make an If-Modified-Since request? Different browsers seem to behave differently here. Experiment.

There is one more reason to set Last-Modified manually. The PHP function session_start() forces a Last-Modified header to be sent, but puts in it ... the modification time of the PHP file that first received control. So if all requests on your site go to the same script (a Front Controller), your Last-Modified will almost always equal the modification time of that single script, which is completely wrong.

Dynamic “window” in cached page

And finally, I will mention one technique that can be useful in connection with caching. If you want to cache the main (or any other) page of the site, but one small block that must stay dynamic gets in the way, use the SSI module.

In the part of the page that should be dynamic, insert the following HTML comment:

<!--# include virtual="/get_user_info/" -->

From the point of view of the nginx cache, this comment is plain text. It will be saved in the cache file as a comment. Later, however, when the cache is read, the nginx SSI module kicks in and requests the dynamic URL. Of course, the address /get_user_info/ must have a PHP handler that returns the contents of this block. This method is described in more detail in an article on Habr.

Well and, of course, do not forget to enable SSI for this page or even for the entire server:

ssi on;

The SSI include directive has another extremely important property. When several such directives are found on a page, they all start being processed at the same time, in parallel. So if you have 4 blocks on a page, each of which takes 200 ms to load, the user receives the whole page after 200 ms, not after 800.
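Put together, the cached page and its dynamic block might be configured roughly as follows. This is a minimal sketch: the /get_user_info/ location matches the address used above, the cache lifetime is an assumed value, and the remaining FastCGI settings are left out.

```nginx
server {
    # The cached page: SSI is processed each time the page is sent,
    # including when the body comes from the cache.
    location / {
        ssi on;
        fastcgi_pass 127.0.0.1:9000;
        # ... other fastcgi settings ...
        fastcgi_cache wholepage;
        fastcgi_cache_valid 200 5m;
    }

    # The dynamic block: requested by the SSI include on every page view,
    # so it is deliberately left without fastcgi_cache.
    location /get_user_info/ {
        fastcgi_pass 127.0.0.1:9000;
        # ... other fastcgi settings ...
    }
}
```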

The original of this article can be read here: http://dklab.ru/chicken/nablas/56.html
