Improving Django HTTP Caching

    This post focuses on HTTP caching and its use together with the Django framework. Few would argue with the claim that HTTP caching is a sound and sensible practice in web application development. However, it is precisely in this functionality that Django contains a number of bugs and inaccuracies that significantly limit the practical benefit of the approach. For example, bug #15855, filed back in April 2011 and capable of causing very unpleasant misbehaviour in a web application, is still open.

    Middleware vs. explicit decorator


    Django offers two standard ways to enable HTTP caching: either activate UpdateCacheMiddleware / FetchFromCacheMiddleware, or wrap a view function in the cache_page decorator. The first method has one significant drawback: it turns on HTTP caching for every view in the project. The second, however, suffers from the very bug #15855 mentioned above. Were it not for this bug, the cache_page option would be preferable; besides, it agrees nicely with the key postulate of The Zen of Python that "explicit is better than implicit".
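
    For reference, the two stock setups look roughly like this (a sketch along the lines of the Django documentation; values such as CACHE_MIDDLEWARE_SECONDS = 600 are arbitrary). UpdateCacheMiddleware must be listed first so that it runs last on the response, while FetchFromCacheMiddleware must be listed last so that it runs first on the request:

    # settings.py -- the middleware-based approach (caches every view in the project)
    MIDDLEWARE_CLASSES = [  # MIDDLEWARE in Django >= 1.10
        'django.middleware.cache.UpdateCacheMiddleware',     # first => runs last on responses
        'django.contrib.sessions.middleware.SessionMiddleware',
        'django.middleware.common.CommonMiddleware',
        'django.middleware.cache.FetchFromCacheMiddleware',  # last => runs first on requests
    ]
    CACHE_MIDDLEWARE_SECONDS = 600

    # views.py -- the per-view approach with the stock decorator
    from django.views.decorators.cache import cache_page

    @cache_page(600)
    def my_view(request):
        ...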

    The root cause of #15855 lies in the way Django processes requests through its middleware. The mechanism is shown schematically in the figure below.
    [Figure: request and response flow through the Django middleware stack]

    On the diagram, view decorators sit together with the views themselves (the view function), which means that after they have run, every middleware still has a chance to influence the final result (the HttpResponse). This is exactly what SessionMiddleware does: it adds a Vary header with the value "Cookie" to the response if the session was accessed inside the view function (a routine situation when working with authenticated users). Ignoring the Vary header values when saving a response to the cache can result in one user receiving data from another user's cache. Incidentally, the comments on the bug contain workarounds specifically for the SessionMiddleware case, but the problem is just as relevant with other middleware, for example LocaleMiddleware.
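
    To make the failure mode concrete, here is a minimal sketch of a view for which the stock cache_page is unsafe: the session is read inside the view, so SessionMiddleware adds "Vary: Cookie" only after the decorator has already stored the response, and the first visitor's page may later be served to other users:

    from django.http import HttpResponse
    from django.views.decorators.cache import cache_page

    @cache_page(60 * 15)
    def greeting(request):
        # the per-user value below ends up in a cache entry that was saved
        # without Vary: Cookie, so other users may receive it from the cache
        name = request.session.get('name', 'anonymous')
        return HttpResponse('Hello, %s' % name)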

    Fixing the bug


    To fix #15855 completely, the cached HttpResponse must be updated after all middleware have finished their work. It is now clear why the bug does not exist with UpdateCacheMiddleware / FetchFromCacheMiddleware: if UpdateCacheMiddleware is placed above all other middleware, it runs last on the response and therefore sees all of the response headers. The only non-middleware way to achieve something similar is to handle the request_finished signal. That approach, however, has two problems to solve: first, the signal handler receives no information about the current request/response, and second, the signal is sent after the response has already been delivered to the client. For updating the cache the second point hardly matters (we can store the response in the cache even after it has been sent), but we also need to add our own headers to the response, Expires and Cache-Control (the most important one!), which we cannot do once the request has already been processed.
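
    The first limitation is easy to see from the signal itself: the handler receives only the sender (plus keyword arguments), with no request or response, so we will have to stash them somewhere ourselves. A minimal illustration:

    from django.core.signals import request_finished

    def on_request_finished(sender, **kwargs):
        # neither the request nor the response is passed in here,
        # so the handler has to obtain them by some other means
        pass

    request_finished.connect(on_request_finished)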

    Before continuing, it is worth looking at the source code of the original cache_page decorator. As you can see, it is built on the same UpdateCacheMiddleware and FetchFromCacheMiddleware, which is hardly surprising, since they solve the same problems. We can do the same and write our own decorator that uses slightly modified versions of those middleware:
    cache_page.py
    from django.utils import decorators
    from .middleware import CacheMiddleware
    def cache_page(**kwargs):
        """
        used instead of the original django.views.decorators.cache.cache_page
        """
        cache_timeout = kwargs.get('cache_timeout')
        cache_alias = kwargs.get('cache_alias')
        key_prefix = kwargs.get('key_prefix')
        decorator = decorators.decorator_from_middleware_with_args(CacheMiddleware)(
            cache_timeout=cache_timeout,
            cache_alias=cache_alias,
            key_prefix=key_prefix,
        )
        return decorator
    


    middleware.py
    from django.middleware import cache as cache_middleware
    class CacheMiddleware(cache_middleware.CacheMiddleware):
        pass  # this is the middleware in which we will make our improvements
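
    Usage mirrors the stock decorator, except that the arguments are passed as keywords (the import path below is only an example and depends on where cache_page.py lives in your project):

    from myproject.caching.cache_page import cache_page  # hypothetical import path

    @cache_page(cache_timeout=600, cache_alias='default', key_prefix='news')
    def news_list(request):
        ...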
    


    Let's start by solving the two request_finished problems described above. We know for certain that only one request is processed per thread at a time, which means the current response can safely be kept in threading.local. We do this while the decorator is still in control, so that the stored response can be used later in the request_finished handler. This kills two birds with one stone: we can add the Expires and Cache-Control headers before the response is sent to the client, and we can defer caching until every possible change to the response has been made:
    middleware.py
    import threading
    from django.core import signals
    from django.middleware import cache as cache_middleware
    response_handle = threading.local()
    class CacheMiddleware(cache_middleware.CacheMiddleware):
        def __init__(self, *args, **kwargs):
            super(CacheMiddleware, self).__init__(*args, **kwargs)
            signals.request_finished.connect(update_response_cache)
        def process_response(self, request, response):
            response_handle.response = response
            return super(CacheMiddleware, self).process_response(request, response)
    def update_response_cache(*args, **kwargs):
        """
        request_finished signal handler
        """
        response = getattr(response_handle, 'response', None)  # the current response
        if response:
            try:
                pass  # save the response to the cache
            finally:
                response_handle.__dict__.clear()
    


    But in this simplest form the cache will be saved twice, the first time without taking all the Vary values into account. Technically this problem can be solved as well; for those interested, such a solution is given under the spoiler below.
    middleware.py
    import contextlib
    import threading
    import time
    from django.core import signals
    from django.core.cache.backends.dummy import DummyCache
    from django.middleware import cache as cache_middleware
    from django.utils import http, cache
    response_handle = threading.local()
    dummy_cache = DummyCache('dummy_host', {})
    @contextlib.contextmanager
    def patch(obj, attr, value, default=None):
        original = getattr(obj, attr, default)
        setattr(obj, attr, value)
        try:
            yield
        finally:
            # restore the original attribute even if the wrapped code raises
            setattr(obj, attr, original)
    class CacheMiddleware(cache_middleware.CacheMiddleware):
        def __init__(self, *args, **kwargs):
            super(CacheMiddleware, self).__init__(*args, **kwargs)
            signals.request_finished.connect(update_response_cache)
        def process_response(self, request, response):
            if not self._should_update_cache(request, response):
                return super(CacheMiddleware, self).process_response(request, response)
            response_handle.response = response
            response_handle.request = request
            response_handle.middleware = self
            with patch(cache_middleware, 'learn_cache_key', lambda *_, **__: ''):
                # replace the cache key calculation function with a stub (purely an optimization)
                with patch(self, 'cache', dummy_cache):
                    # use a dummy cache backend instead of the real driver so that
                    # saving the response to the cache is deferred until all the
                    # Vary header values are ready,
                    # see https://code.djangoproject.com/ticket/15855
                    return super(CacheMiddleware, self).process_response(request, response)
        def update_cache(self, request, response):
            with patch(cache_middleware, 'patch_response_headers', lambda *_: None):
                # we do not want to patch the response headers a second time
                super(CacheMiddleware, self).process_response(request, response)
    def update_response_cache(*args, **kwargs):
        middleware = getattr(response_handle, 'middleware', None)
        request = getattr(response_handle, 'request', None)
        response = getattr(response_handle, 'response', None)
        if middleware and request and response:
            try:
                CacheMiddleware.update_cache(middleware, request, response)
            finally:
                response_handle.__dict__.clear()
    


    Eliminating other inaccuracies


    At the beginning I mentioned that Django's HTTP caching mechanism contains several errors, and the bug fixed above is not the only one, though it is the most critical. Another inaccuracy is that when a saved response is read from the cache, the max-age parameter of the Cache-Control header is returned exactly as it was at the moment the response was stored, so max-age may no longer match the Expires header because of the time that has passed between those two events. For example, a response cached with max-age=3600 and served from the cache 1000 seconds later still announces max-age=3600, even though Expires allows only 2600 more seconds. And since browsers prefer Cache-Control over Expires, we get yet another error. Let's fix it. To do so, our middleware needs to override the process_request method:
    process_request
    def process_request(self, request):
        response = super(CacheMiddleware, self).process_request(request)
        if response and 'Expires' in response:
            # replace 'max-age' in the 'Cache-Control' header
            # with a value computed from 'Expires'
            expires = http.parse_http_date(response['Expires'])
            timeout = expires - int(time.time())
            cache.patch_cache_control(response, max_age=timeout)
        return response
    


    If there is no hard requirement to actually store every HTTP response in the cache (and only the HTTP caching headers are needed), then instead of everything described above you can simply replace the default cache backend with a dummy one in the project settings (this also protects against the consequences of #15855):
    CACHES = {
        'default': {
            'BACKEND': 'django.core.cache.backends.dummy.DummyCache',
        },
    }
    

    Furthermore, for no obvious reason UpdateCacheMiddleware adds Last-Modified and ETag headers on top of the standard Expires and Cache-Control, even though FetchFromCacheMiddleware does not process the corresponding conditional requests at all (those carrying If-Modified-Since, If-None-Match and so on). This violates the single responsibility principle. The assumption was presumably that the developer would not forget to enable ConditionalGetMiddleware, or at least CommonMiddleware, whose benefits are actually rather doubtful; I never enable them in my projects. Moreover, if something does return 304 Not Modified (which happens, for instance, with the last_modified or etag decorators), then the caching headers (Expires and Cache-Control) will not make it into that response, so the browser will keep coming back again and again (and keep getting 304 Not Modified), even though we supposedly enabled HTTP caching precisely to tell the browser there is no point in asking again within the given period. We eliminate this inaccuracy in process_response:
    process_response
    def process_response(self, request, response):
        if not self._should_update_cache(request, response):
            return super(CacheMiddleware, self).process_response(request, response)
        last_modified = 'Last-Modified' in response
        etag = 'ETag' in response
        if response.status_code == 304:
            # add the Expires and Cache-Control headers to the Not Modified response
            cache.patch_response_headers(response, self.cache_timeout)
        else:
            response_handle.response = response
            response_handle.request = request
            response_handle.middleware = self
            with patch(cache_middleware, 'learn_cache_key', lambda *_, **__: ''):
                # replace the cache key calculation function with a stub (purely an optimization)
                with patch(self, 'cache', dummy_cache):
                    # use a dummy cache backend instead of the real driver so that
                    # saving the response to the cache is deferred until all the
                    # Vary header values are ready,
                    # see https://code.djangoproject.com/ticket/15855
                    response = super(CacheMiddleware, self).process_response(request, response)
        if not last_modified:
            # remove the Last-Modified header if it was not there before this method ran
            del response['Last-Modified']
        if not etag:
            # remove the ETag header if it was not there before this method ran
            del response['ETag']
        return response
    


    A small clarification is in order here: if we want the Expires and Cache-Control headers to be added to 304 Not Modified responses, the last_modified and etag decorators must come after cache_page, otherwise the latter will not get a chance to process responses of that type:
    @cache_page(cache_timeout=3600)
    @etag(lambda request: 'etag')
    def view(request):
        pass
    

    Adding useful features


    Having eliminated all these shortcomings, you suddenly realize that the resulting solution really lacks one capability: setting the caching timeout as a computed (on-demand) value, especially when you look at the last_modified and etag decorators, where such a possibility exists.

    And that's not all. It would also be nice to be able to invalidate the cache, for example when the returned entity changes. The most convenient way to do this is to change the cache key automatically, which means the key should not be set statically either, but computed on demand.

    The simplest and most elegant way to satisfy both needs is to pass the required parameters as "lazy" expressions:
    from django.utils.functional import lazy
    @cache_page(
        cache_timeout=lazy(lambda: 3600, int)(),
        key_prefix=lazy(lambda: 'key_prefix', str)(),
    )
    def view(request):
        pass
    

    In this case the function passed to lazy is executed each time (and only when) the expression is accessed in the context of one of the types given as the subsequent arguments.
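
    The cache invalidation mentioned above can be built on top of this as well: keep a version counter somewhere, bump it whenever the entity changes, and feed it into key_prefix lazily so that every bump silently switches to a new cache key. A rough sketch (the counter name and its storage are made up for illustration):

    from django.core.cache import cache as cache_backend
    from django.utils.functional import lazy

    def current_article_version():
        # bump the 'article_version' counter elsewhere (e.g. in a post_save handler)
        # to invalidate every page cached with this prefix
        return 'articles-%s' % cache_backend.get('article_version', 0)

    @cache_page(
        cache_timeout=lazy(lambda: 3600, int)(),
        key_prefix=lazy(current_article_version, str)(),
    )
    def article_list(request):
        ...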

    Another, more flexible way is to allow cache_timeout and key_prefix to be ordinary functions whose signature matches that of the view function:
    @cache_page(
        cache_timeout=lambda request, foo: 3600,
        key_prefix=lambda request, foo: 'key_prefix',
    )
    def view(request, foo):
        pass
    

    This option makes it possible to compute cache_timeout and key_prefix from the request itself and its parameters, but it requires one more modification. So as not to burden the reader with large chunks of source code, I will simply link to the component where this and everything described above is already implemented as a separate Python module: django-cache .

    Conclusion


    I have not yet mentioned another feature that would be nice to have: the ability of the client to force the server to bypass the cache, so that the latest data is sent in response to the client's request. This is done with the Cache-Control: max-age=0 request header. django-cache does not support this yet, but perhaps such an option will appear in the future.
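
    For completeness, a sketch of how such a bypass could look in the middleware built above (my own guess, not necessarily how django-cache implements it): inspect the request's Cache-Control header in process_request and, if the client asked for fresh data, skip the cache lookup while still letting the fresh response be stored:

    def process_request(self, request):
        cache_control = request.META.get('HTTP_CACHE_CONTROL', '')
        if 'max-age=0' in cache_control or 'no-cache' in cache_control:
            # the client explicitly asked for fresh data: skip the lookup,
            # but keep the flag so the fresh response still updates the cache
            request._cache_update_cache = True
            return None
        return super(CacheMiddleware, self).process_request(request)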

    UPD : the mentioned option has since been added.

    Anticipating the question of why all these fixes and new features cannot simply be contributed to Django itself, I will answer that I plan to do so in the near future. But new features will only land in the next Django release, most likely 1.11, whereas django-cache already works with all recent versions (starting from 1.8). Bug fixes, on the other hand, are usually applied to all currently supported branches.

    Another bug


    While this note was being prepared for publication, I found yet another inaccuracy in Django's request caching on one of my projects. Its essence is that for so-called conditional requests (those containing If-Modified-Since and similar headers), cache_page always tries to fetch the result from the cache and, on success, returns a response with status 200. This behaviour is undesirable when the request handler could return 304 Not Modified. The fix code is here .

    UPD : in fact, you can do without threading.local and signals altogether if you append a special "callback" to the response._closable_objects list, which will save the response to the cache after all middleware have finished their work.
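
    A rough sketch of that idea (an assumption, not the exact django-cache code): anything appended to response._closable_objects has its close() method called once the response has been fully handled, i.e. after every middleware has run, which is exactly the moment we were waiting for:

    class DeferredCacheUpdate(object):
        """Hypothetical helper appended to response._closable_objects."""

        def __init__(self, middleware, request, response):
            self.middleware = middleware
            self.request = request
            self.response = response

        def close(self):
            # by now every middleware has added its headers (Vary included),
            # so the response can finally be stored in the cache
            self.middleware.update_cache(self.request, self.response)

    # inside CacheMiddleware.process_response, instead of threading.local:
    # response._closable_objects.append(DeferredCacheUpdate(self, request, response))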
