Caching Tutorial Part 1

Translation

A fairly detailed and interesting presentation of material about caches and how they are used. Part 2.

The author, Mark Nottingham, is a recognized expert on the HTTP protocol and web caching. He chairs the IETF HTTPbis Working Group and took part in editing HTTP/1.1, Part 6: Caching. He is currently involved in the development of HTTP/2.0.

The text is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 Unported License.

From the translator: please report typos and inaccuracies by private message. Thank you.

A web cache sits between one or more web servers and one or more clients and watches the requests that pass through it, keeping copies of the responses (HTML pages, images and files, collectively known as representations) for itself. (Translator's note: I will use the word “content” instead, since in my opinion it reads more naturally.) Then, if another request arrives for the same URL, the cache can use the response it saved earlier instead of asking the origin server again.
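The idea can be sketched in a few lines of Python. This is only an illustration: the `TinyCache` class and the `fake_origin` function are hypothetical names invented here, no network is involved, and the sketch deliberately ignores freshness, which is covered later.

```python
from typing import Callable, Dict


class TinyCache:
    """Toy cache that keeps one saved response per URL (illustration only)."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin      # stands in for a real HTTP request
        self._store: Dict[str, bytes] = {}   # URL -> saved response body
        self.origin_requests = 0             # how often we had to ask the server

    def get(self, url: str) -> bytes:
        # Same URL seen before: reuse the saved response.
        if url in self._store:
            return self._store[url]
        # Otherwise ask the origin server and keep a copy for next time.
        self.origin_requests += 1
        body = self._fetch(url)
        self._store[url] = body
        return body


def fake_origin(url: str) -> bytes:
    # Assumption: a stand-in for the origin server.
    return ("content of " + url).encode()


cache = TinyCache(fake_origin)
first = cache.get("http://example.com/logo.png")
second = cache.get("http://example.com/logo.png")
```

Both calls return the same bytes, but only the first one reaches the stand-in origin server.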

There are two main reasons why the web cache is used:

1. Reduced latency - since the requested data is taken from the cache (which is located “closer” to the client), it takes less time to receive and display content on the client side. This makes the Web more responsive (translator's note: “responsive” in the sense of answering requests quickly, not emotionally).

2. Reduced network traffic - reusing content reduces the amount of data transmitted to the client. This, in turn, saves money if the client pays for traffic, and keeps bandwidth requirements lower and easier to manage.
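To get a feel for both effects, here is a small back-of-the-envelope sketch. The function names and all numbers are hypothetical, and it makes the simplifying assumption that every response has the same size, so the share of requests answered from the cache equals the share of bytes saved.

```python
def average_latency_ms(hit_ratio: float, cache_ms: float, origin_ms: float) -> float:
    """Expected time per request when a fraction hit_ratio is served from cache."""
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * origin_ms


def origin_traffic_bytes(total_bytes: float, hit_ratio: float) -> float:
    """Bytes that still have to travel from the origin server."""
    return total_bytes * (1.0 - hit_ratio)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
latency = average_latency_ms(hit_ratio=0.5, cache_ms=10.0, origin_ms=200.0)
traffic = origin_traffic_bytes(total_bytes=1000.0, hit_ratio=0.5)
```

With half of the requests answered from the cache, the average latency in this example drops from 200 ms to 105 ms, and only 500 of every 1000 bytes still come from the origin server.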

Types of Web Caches

Browser Cache

If you examine the settings window of any modern web browser (for example, Internet Explorer, Safari or Mozilla), you will probably notice a “Cache” setting. This option lets you set aside part of your computer's hard drive to store content you have already viewed. The browser cache works according to fairly simple rules: it checks whether the content is “fresh”, usually once per session (that is, once during the current run of the browser).

This cache is especially useful when users click the back button or follow a link to a page they have just viewed. Also, if you use the same navigation images throughout your site, they will be served from the browser cache almost instantly.

Proxy cache

A proxy cache works on the same principle, but on a much larger scale. Proxies serve hundreds or thousands of users; large corporations and Internet service providers often set them up on their firewalls or run them as standalone devices (intermediaries).

Since proxies are not part of the client or the origin server, but sit out on the network, requests have to be routed to them somehow. One way is to use your browser's settings to tell it manually which proxy to use; another is to use an interception proxy. In that case, the network itself redirects web requests to the proxy, so that the client does not need to be configured for it or even know that it exists.

Proxy caches are a kind of shared cache: instead of serving one person, they work for a large number of users, and therefore they are very good at reducing latency and network traffic, mostly because popular content is requested many times.
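The manual approach is not limited to browsers; any HTTP client can be pointed at a proxy explicitly. As a minimal sketch, this is how it might look with Python's standard `urllib.request` module. The proxy address `proxy.example.com:3128` is a made-up placeholder, and nothing is actually sent over the network here.

```python
import urllib.request

# Assumption: a placeholder proxy address, not a real server.
PROXY_URL = "http://proxy.example.com:3128"

# Tell the client explicitly which proxy to use for plain HTTP requests.
proxy_handler = urllib.request.ProxyHandler({"http": PROXY_URL})
opener = urllib.request.build_opener(proxy_handler)

# From here on, opener.open("http://...") would send requests via the proxy.
# No request is made in this sketch.
```

With an interception proxy, none of this configuration exists on the client side; the redirection happens in the network.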
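Why sharing helps can be shown with a toy count of how many requests still have to reach the origin server. The function names, user names and URLs below are invented for the illustration, and the sketch assumes that cached content never expires.

```python
from typing import Iterable, Set, Tuple

Request = Tuple[str, str]  # (user, url)


def origin_fetches_private(traffic: Iterable[Request]) -> int:
    """Every user has their own cache, so each (user, url) pair misses once."""
    seen: Set[Request] = set()
    misses = 0
    for user, url in traffic:
        if (user, url) not in seen:
            seen.add((user, url))
            misses += 1
    return misses


def origin_fetches_shared(traffic: Iterable[Request]) -> int:
    """All users go through one shared cache, so each url misses only once."""
    seen: Set[str] = set()
    misses = 0
    for _user, url in traffic:
        if url not in seen:
            seen.add(url)
            misses += 1
    return misses


# Hypothetical traffic: three users each request the same two popular URLs.
traffic = [
    ("alice", "/index.html"), ("alice", "/logo.png"),
    ("bob", "/index.html"), ("bob", "/logo.png"),
    ("carol", "/index.html"), ("carol", "/logo.png"),
]
private_misses = origin_fetches_private(traffic)
shared_misses = origin_fetches_shared(traffic)
```

In this example the private caches send six requests to the origin server, while the shared cache sends only two, because the second and third user benefit from what the first one already fetched.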

Gateway Cache

Also known as “reverse proxy caches” or “surrogate caches”, gateway caches are also intermediaries, but instead of being deployed by network administrators to save bandwidth, they are usually deployed by webmasters themselves to make their sites more scalable, reliable and efficient.

Requests can be routed to gateways in a number of ways, but usually some form of load balancer is used.

Content delivery networks (CDNs) distribute gateway caches all over the Internet (or some part of it) and serve cached content on behalf of interested websites. Speedera and Akamai are examples of CDNs.

This tutorial focuses mainly on browser and proxy caches, but some of the information also applies to gateway caches.

Why should I use it

Caching is one of the most misunderstood technologies on the Internet. Webmasters in particular are afraid of losing control of their site, because a proxy can “hide” their users from them, making it difficult to monitor traffic.

Unfortunately for them (webmasters), even if web caches did not exist, there are too many variables on the Internet for site owners to get an accurate picture of how users use their site. If this is a big concern for you, this guide will teach you how to get the statistics you need without making your site cache-unfriendly.

Another concern is that a cache can serve content that is out of date, or stale.

On the other hand, if you plan your website carefully, caches can help it load faster and keep the load on your server and Internet connection within an acceptable range. The difference can be impressive: a site that does not work well with caches may take several seconds to load, while one that takes advantage of caching can seem instantaneous. Users will appreciate the short loading time and, perhaps, will visit the site more often.

Think of it this way: many large Internet companies spend millions of dollars setting up server farms around the world to replicate their content, so that their users can reach it as quickly as possible. A cache does the same thing for you, and it is much closer to the end user.

CDNs are an interesting development from this point of view, because, unlike many proxy caches, their gateways are aligned with the interests of the website being cached. However, even when you use a CDN, you still have to keep in mind that there will be proxy caches and browser caches downstream.

In summary, proxy and browser caches will be used whether you like it or not. Remember, if you do not configure your site to be cached correctly, caches will handle it according to their own default settings.

How web caches work

All types of caches have a specific set of rules that they use to decide when to serve content from the cache, if it is available. Some of these rules are set by the protocols (HTTP 1.0 and HTTP 1.1), and some by whoever administers the cache (the browser's user or the proxy's administrator).

Generally speaking, these are the most common rules (do not worry if you do not understand the details, they will be explained below):

1. If the response's headers tell the cache not to keep it, it will not.
2. If the request is authenticated or secure (that is, HTTPS), shared caches will not cache it.
3. Cached content is considered “fresh” (that is, it can be sent to the client without checking with the origin server) if:
- It has an expiration time or another header that controls its lifetime, and that lifetime has not run out yet.
- The cache has seen the content recently, and it was last modified relatively long ago.
Fresh content is served directly from the cache, without checking with the origin server.
4. If the content is stale, the origin server will be asked to validate it, that is, to tell the cache whether the copy it holds is still good.
5. Under certain circumstances - for example, when it is disconnected from the network - a cache can serve stale responses without checking with the origin server.
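The rules above can be sketched as a small decision function. This is a simplified illustration, not how any particular cache is implemented: the names `StoredResponse`, `may_store`, `is_fresh` and `decide` are invented here, `may_store` models a shared cache following the simplified rule from the list, and the 10% factor used for the “modified a long time ago” rule is an assumption about a typical heuristic, not a fixed requirement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredResponse:
    stored_at: float                       # when the cache saved its copy (seconds)
    max_age: Optional[float] = None        # explicit lifetime in seconds, if any
    last_modified: Optional[float] = None  # when the origin last changed the content


# Assumption: trust content for 10% of the time it had gone unchanged.
HEURISTIC_FRACTION = 0.1


def may_store(no_store: bool, authenticated: bool, secure: bool) -> bool:
    """First two rules: some responses are never kept (shared cache assumed)."""
    return not (no_store or authenticated or secure)


def is_fresh(resp: StoredResponse, now: float) -> bool:
    """Freshness rule: explicit lifetime first, then the heuristic."""
    age = now - resp.stored_at
    if resp.max_age is not None:
        return age < resp.max_age
    if resp.last_modified is not None:
        unchanged_for = resp.stored_at - resp.last_modified
        return age < HEURISTIC_FRACTION * unchanged_for
    return False


def decide(resp: Optional[StoredResponse], now: float, offline: bool = False) -> str:
    """What to do with a request, given the stored copy (if any)."""
    if resp is None:
        return "fetch from origin"
    if is_fresh(resp, now):
        return "serve from cache"
    if offline:
        return "serve stale copy"
    return "validate with origin"


explicit = StoredResponse(stored_at=1000.0, max_age=60.0)
heuristic = StoredResponse(stored_at=1000.0, last_modified=0.0)
```

For example, `decide(explicit, now=1030.0)` serves from the cache, while `decide(explicit, now=1100.0)` asks the origin server to validate the copy, unless `offline=True` is passed.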

If the response contains no validator (an ETag or Last-Modified header) and no explicit freshness information, the content will usually (but not always) be considered uncacheable.

Freshness and validation are the most important ways in which a cache works with content. Fresh content is available instantly from the cache; validated content avoids sending the whole response again if it has not changed.
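As a rough sketch of this check, the function below looks only at the headers named here plus the two standard HTTP headers that carry explicit freshness information, `Expires` and `Cache-Control: max-age`. The function name `looks_cacheable` is invented for the illustration, and the sketch covers this one rule only; it does not handle directives that forbid caching.

```python
from typing import Mapping


def looks_cacheable(headers: Mapping[str, str]) -> bool:
    """True if the response has a validator or explicit freshness information."""
    # Header names are case-insensitive, so compare them in lower case.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    has_validator = "etag" in lowered or "last-modified" in lowered
    has_freshness = (
        "expires" in lowered
        or "max-age" in lowered.get("cache-control", "")
    )
    return has_validator or has_freshness
```

A response carrying only `Content-Type`, for instance, gives the cache nothing to work with, so `looks_cacheable` returns `False` for it.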
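Validation can be sketched with an ETag. In HTTP, the cache sends the validator it holds in an `If-None-Match` request header, and the server answers `304 Not Modified` with no body if the content has not changed. The code below simulates this exchange without any network; `origin_respond`, `revalidate` and the constants are hypothetical names and values.

```python
from typing import Optional, Tuple

# Hypothetical origin server state: one resource with its current validator.
ORIGIN_ETAG = '"v2"'
ORIGIN_BODY = b"<html>new version</html>"


def origin_respond(if_none_match: Optional[str]) -> Tuple[int, bytes]:
    """Return (status, body); a 304 carries no body when the validator matches."""
    if if_none_match == ORIGIN_ETAG:
        return 304, b""
    return 200, ORIGIN_BODY


def revalidate(cached_etag: str, cached_body: bytes) -> Tuple[bytes, int]:
    """Return the body to serve and how many body bytes were transferred."""
    status, body = origin_respond(cached_etag)
    if status == 304:
        # The stored copy is still good: reuse it, nothing was re-sent.
        return cached_body, 0
    # The content changed: the full new body had to be transferred.
    return body, len(body)


unchanged_body, unchanged_bytes = revalidate('"v2"', ORIGIN_BODY)
changed_body, changed_bytes = revalidate('"v1"', b"<html>old version</html>")
```

When the validator still matches, the cache serves its own copy and no body bytes travel over the network; when it does not, the full new content is transferred and replaces the old copy.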
