Optimizing an HTTP server through resource versioning: implementation details
- The essence of optimization
- Page load vs forced refresh
- Need automation
- Server side implementation
- Server side optimization
- Features of the Google App Engine
- The source code of the finished example
- Summary
The article walks through an example implementation for Google App Engine / Python.
The essence of optimization
In their well-known article, Yahoo engineers described an interesting technique for optimizing HTTP handling through file versioning. The essence is as follows. HTML usually references an image simply like this:
<img src="image.jpg">
Once image.jpg is in the cache, the browser reads the HTML again and finds a link to the same image. In general, the browser cannot tell on its own whether the image has been updated on the server, so it has to send a request to the server.
To avoid an unnecessary request, you can specify the version of the resource in its address, making the address unique:
<img src="image.v250.jpg">
The browser can then be sure that version 250 of the file will never change (and neither will version 251), so if version 250 is in the cache, it can be used without asking the server anything. Two HTTP headers give the browser this confidence:
// Rest assured, the picture will never be updated
Expires: Fri, 30 Oct 2050 14:19:41 GMT
// and it can be kept in the cache forever
Cache-Control: max-age=12345678, public
As a result, viewing a page for the umpteenth time requires downloading only the HTML; the many resources it references no longer need to be requested.
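As a minimal sketch of how a server might build these two headers: the helper name far_future_headers and the one-year lifetime are illustrative assumptions, not part of the original example.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Assumed lifetime for illustration; any sufficiently large value would do.
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def far_future_headers(now=None, max_age=ONE_YEAR_SECONDS):
    """Build Expires and Cache-Control headers for a versioned resource."""
    if now is None:
        now = datetime.now(timezone.utc)
    expires = now + timedelta(seconds=max_age)
    return {
        # usegmt=True renders the HTTP date format ending in "GMT";
        # it requires a timezone-aware UTC datetime.
        "Expires": format_datetime(expires, usegmt=True),
        "Cache-Control": "max-age=%d, public" % max_age,
    }
```

Computing Expires relative to the current time keeps the two headers consistent with each other, which a hard-coded date would not guarantee.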
Page Load vs. Forced Refresh
In its current form, this optimization works when following links and when pressing Ctrl + L, Enter. But if the user refreshes the current page with F5, the browser forgets that the resources were marked “do not ask again”, and “extra” requests are sent to the server, one per resource. You cannot change this browser behavior, but what you can and should do is avoid serving the full file every time: add logic that, whenever possible, answers “nothing has changed, take it from your cache”.
When the browser requests “image.v250.jpg” and has a copy in its cache, it sends the header “If-Modified-Since: Fri, 01 Jan 1990 00:00:00 GMT”. A browser requesting the picture for the first time does not send this header. Accordingly, the server should tell the first browser “nothing has changed” and honestly serve the picture to the second. In our case the date itself need not be analyzed: what matters is that the image is in the cache, and the copy there is guaranteed to be correct thanks to file versioning and unique URLs.
However, the “If-Modified-Since” header does not arrive at the server by itself, even if the picture is in the cache. To make the browser send it, the (chronologically) previous response must have included the header “Last-Modified: Fri, 01 Jan 1990 00:00:00 GMT”. In practice, this simply means the server should always send this header. You can give the real date of the last file change or any date in the past: the same date is later sent back to the server, where, as shown above, it is of no particular interest.
Strictly speaking, the optimization described in this section is not directly related to the Yahoo technique, but the two should be used together to avoid unnecessary load. Otherwise the effect will be incomplete.
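The decision described in this section can be sketched as a small framework-independent function. The name respond and the plain dict of request headers are assumptions made for illustration; a real framework would normally provide a case-insensitive header mapping.

```python
# Any fixed date in the past works; the value is never inspected.
LAST_MODIFIED = "Fri, 01 Jan 1990 00:00:00 GMT"


def respond(request_headers, body):
    """Return (status, headers, body) for a versioned, immutable resource."""
    # Always send Last-Modified so the browser echoes it back later
    # as If-Modified-Since.
    headers = {"Last-Modified": LAST_MODIFIED}
    if "If-Modified-Since" in request_headers:
        # The client has a cached copy, and the versioned URL guarantees
        # that the copy is correct, so no body is sent.
        return 304, headers, b""
    return 200, headers, body
```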
Need automation
The technique is sound, but assigning file versions by hand is impractical. In GAE / Django, the problem is solved with a custom template tag. The template contains:
<img src="{% static 'image.jpg' %}">
which is rendered to HTML as:
<img src="/never-expire/12345678/image.jpg">
And here is the implementation of such a tag:
def static(path):
    return StaticFilesInfo.get_versioned_resource_path(path)

register.simple_tag(static)
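What such a tag has to compute can be sketched as follows. This is not the author's StaticFilesInfo code: the function names and the name-to-version mapping passed in as an argument are hypothetical, and only the URL layout is taken from the HTML example above.

```python
URL_PREFIX = "/never-expire"  # prefix taken from the HTML example above


def versioned_path(name, versions):
    """Build '/never-expire/<version>/<name>' from a name -> version mapping."""
    return "%s/%d/%s" % (URL_PREFIX, versions[name], name)


def split_versioned_path(path):
    """Inverse operation: return (version, name) parsed from a versioned path."""
    version, name = path[len(URL_PREFIX) + 1:].split("/", 1)
    return int(version), name
```

Putting the version into a path segment rather than the file name keeps the real file names on disk unchanged.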
Server side implementation
This optimization is mainly useful for serving static files: pictures, CSS, JavaScript. However, App Engine itself serves files that are designated as static, so in this example the files are served by an ordinary request handler instead.
First, the GET request handler checks that the requested version of the file matches the latest one. If it does not, the handler redirects to the current address, for the sake of order:
# Some previous version of resource requested - redirect to the right version
correct_path = StaticFilesInfo.get_resource_path(resource)
if self.request.path != correct_path:
    self.redirect(correct_path)
    return
Then it sets the response headers:
- Content-Type according to the file extension
- Expires, Cache-Control, Last-Modified as already described.
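For the Content-Type step, one possible sketch uses the standard library's mimetypes module. The helper name content_type_for and the fallback value are assumptions, not taken from the original code.

```python
import mimetypes


def content_type_for(filename, default="application/octet-stream"):
    """Guess a Content-Type header value from the file extension."""
    guessed, _encoding = mimetypes.guess_type(filename)
    # guess_type returns None for unknown extensions, so fall back
    # to a generic binary type.
    return guessed or default
```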
If the request contains the If-Modified-Since header, we send no body and set the status code to 304, meaning the resource has not changed. Otherwise, the contents of the file are copied to the response body:
if 'If-Modified-Since' in self.request.headers:
    # This header means the client has its own copy of the resource,
    # so we do not need to send it again.
    # Just set the response code to 304 Not Modified.
    self.response.set_status(304)
else:
    time.sleep(1)  # TODO: artificial delay that makes resource loading noticeable
    abs_file = os.path.join(os.path.split(__file__)[0], WHERE_STATIC_FILES_ARE_STORED, resource)
    transmit_file(abs_file, self.response.out)
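The transmit_file helper is not shown in the article. A plausible minimal version, assuming the second argument is any writable binary stream, might look like this; the body is a guess, not the author's code.

```python
import io
import os
import shutil
import tempfile


def transmit_file(abs_file, out):
    """Copy the file at abs_file into the writable binary stream `out`."""
    with open(abs_file, "rb") as source:
        # copyfileobj copies in chunks, so large files are not
        # loaded into memory all at once.
        shutil.copyfileobj(source, out)


# Usage: write a temporary file and stream it into an in-memory buffer.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"fake image bytes")
buffer = io.BytesIO()
transmit_file(tmp.name, buffer)
os.remove(tmp.name)
```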
If the database in GAE turns out to be faster than the file system, it may be worth copying the contents of the file into the database on the first request and reading it only from there afterwards. For me this remains an open question.
Server side optimization
The file version can be either the revision from the VCS or the time of the file's last modification; there is no fundamental difference. I chose the second because it is simpler:
os.path.getmtime(file)
However, querying the file system on every request is not a good idea, since I/O is always slow. Instead, you can collect the current versions of all static files on the first request and put them in memcache. The result is a dictionary like this:
{ 'cover.jpg': 123456, 'style.css': 234567 }
which the custom tag then uses to look up the latest version. Naturally, you need something like a singleton that rebuilds the data in case the memcache entry is lost:
class StaticFilesInfo():
    @classmethod
    def __get_static_files_info(cls):
        info = memcache.get(cls.__name__)
        if info is None:
            info = cls.__grab_info()
            time = MEMCACHE_TIME_PRODUCTION if is_production() else MEMCACHE_TIME_DEV_SERVER
            memcache.set(cls.__name__, info, time)
        return info

    @classmethod
    def __grab_info(cls):
        """
        Obtain info about all files in managed 'static' directory.
        This is done rarely.
        """
        dir = os.path.join(os.path.split(__file__)[0], WHERE_STATIC_FILES_ARE_STORED)
        hash = {}
        for file in os.listdir(dir):
            abs_file = os.path.join(dir, file)
            hash[file] = int(os.path.getmtime(abs_file))
        return hash
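The get-or-rebuild pattern used by the class above can be tried without App Engine. In this sketch FakeCache is a hypothetical in-process stand-in for memcache (only get and set with a lifetime are modeled), and get_versions is an illustrative name.

```python
import time


class FakeCache(object):
    """In-process stand-in for memcache; NOT the App Engine API."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        value, expires_at = self._data.get(key, (None, 0))
        # Entries whose lifetime has elapsed behave as if they were missing.
        return value if time.time() < expires_at else None

    def set(self, key, value, ttl):
        self._data[key] = (value, time.time() + ttl)


def get_versions(cache, scan, ttl=3600):
    """Return cached versions, calling scan() only when the cache has no entry."""
    versions = cache.get("versions")
    if versions is None:
        versions = scan()  # the slow path: walks the file system
        cache.set("versions", versions, ttl)
    return versions
```

Because the cache may lose the entry at any moment, every read goes through the same function, which transparently rebuilds the data when needed.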
Features of the Google App Engine
You can collect information about all the static files, but what if the designer changes a picture? How does the server know it is time to refresh the cached file versions? In the general case I do not have a good answer: you need to either run a daemon that listens for file system changes, or remember to run a script after each deployment.
But App Engine is a special case. Here development is done on a local machine, after which the finished code (and static files) are deployed to the server. Importantly, the files on the server can no longer change until the next deployment. So it is enough to read the versions once and not worry that they might change.
The only catch is that during local development files can change often, and without special handling the browser will, for example, show the developer an old version of an image, which is inconvenient. Performance is not very important in that case, so you can keep the data in memcache for just a few seconds, or not cache it at all.
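One way to choose the cache lifetime per environment is sketched below. The constant values are illustrative, and the check relies on the assumption that production App Engine sets the SERVER_SOFTWARE environment variable to a value starting with "Google App Engine", while the local development server uses "Development/...".

```python
import os

# Illustrative values only; the original constants are not shown in the article.
MEMCACHE_TIME_PRODUCTION = 24 * 60 * 60  # files cannot change between deployments
MEMCACHE_TIME_DEV_SERVER = 5             # files may change at any moment


def is_production(environ=None):
    """Guess whether the code runs on App Engine rather than the dev server."""
    environ = os.environ if environ is None else environ
    # Assumption: production reports "Google App Engine/<version>" here.
    return environ.get("SERVER_SOFTWARE", "").startswith("Google App Engine")


def cache_lifetime(environ=None):
    """Pick the memcache lifetime, in seconds, for the current environment."""
    if is_production(environ):
        return MEMCACHE_TIME_PRODUCTION
    return MEMCACHE_TIME_DEV_SERVER
```

Treating an unknown environment as non-production errs on the side of a short lifetime, so a stale image is never shown for long during development.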
The source code of the finished example
code.google.com/p/investigations/source/browse/#svn%2Ftrunk%2Fnever-expire-http-resources
svn checkout investigations.googlecode.com/svn/trunk/never-expire-http-resources investigations
Summary
I haven't uploaded it to appspot yet, but locally everything works and flies. People, take advantage of client-server optimization, and do not blindly answer 200 OK every time :)
UPD. Commenters point out (and I confirm) that for static files the same effect can be achieved with App Engine's standard static file handling. In other words, this kind of “manual” code is hardly suitable for serving static files, since GAE handles that better itself. However, the approach may be useful for dynamically generated resources. In that context, ETag may be more convenient to implement than Last-Modified.
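For dynamically generated resources, the ETag variant might be sketched like this. The function names are illustrative, and the comparison is deliberately simplified: a real If-None-Match header may carry several tags or "*", which this sketch does not handle.

```python
import hashlib


def make_etag(body):
    """Derive a quoted ETag value from the response body bytes."""
    return '"%s"' % hashlib.sha1(body).hexdigest()


def respond_with_etag(request_headers, body):
    """Return (status, headers, body), answering 304 when the ETag matches."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    # The browser echoes the ETag it holds in If-None-Match.
    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""
    return 200, headers, body
```

Unlike the fixed Last-Modified date, the tag changes whenever the generated content changes, so no versioned URL is required for correctness.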