
Cache theory
Introduction
Caching is an integral part of any complex system. It can significantly speed up an application by saving the results of expensive calculations for later reuse. I will not dwell on definitions, however: most developers know perfectly well what caching is.
In this article I will try to sort out the problem of caching, focusing primarily on websites and content management systems. Let me warn you right away: these are my personal considerations and do not claim to be the ultimate truth. The terminology is also my own; use it at your discretion if you find it useful. Constructive criticism is welcome.
Key Caching Issues
When caching data, saving and reusing it is usually not a problem. The main difficulty is determining when the cache is no longer valid, that is, whether the cached data can still be used or has to be recalculated because something may have changed. I will call this the cache relevance problem (my own term).
The second issue is performance: can our algorithm actually run faster without caching than with it? As absurd as the question may sound, this situation is quite real. Caching algorithms also consume resources, and it can easily happen that the cache consumes more resources than calculating the data itself would. This usually happens in two cases. The first is a ten-page website built on a heavyweight framework: the data set is so small that it is calculated very quickly, and caching only slows things down. The second case is the opposite: the site is so large that the cache grows to a size where searching through it becomes a significant bottleneck.
It is impossible to solve both problems completely. We can only choose the method that gives the best performance, and ideally that choice should be made automatically.
The performance problem is less pressing: it is solved either by disabling caching in the configuration, by periodically deleting the cached data physically, or by moving to faster hardware.
The relevance problem, however, is much more complicated...
Cache theory
I distinguish four main types of caching:
1. Independent, or static - occurs when there is no need to check whether the object has changed. The algorithm is simple: if the data is in the cache, return it, otherwise calculate it. It is effective within a single process, when the results of complex intermediate calculations are reused frequently. Once the process finishes, the static cache dies, and the next process computes all the data again. In essence, in any programming language all temporary variables are nothing more than a static cache. If you need to cache function results, you can use static variables, as in this PHP example:
function getData($id)
{
    static $data = array();
    // Return the cached result if this function was already called with the same $id.
    if (isset($data[$id])) return $data[$id];
    // Compute the data; computeExpensiveData() stands in for the real calculation.
    $newData = computeExpensiveData($id);
    $data[$id] = $newData;
    return $data[$id];
}
2. Explicitly dependent - occurs when the decision to update the cache is made based on some easily calculated attribute. This primarily covers caching data taken from a particular file, as well as caching by time. In these cases it is easy to check whether the file has changed or whether the cache has expired (and the cache explicitly depends on these factors). The algorithms here are also quite simple and come down to checking conditions, but unlike the static cache, this one can be used independently by different processes. A sketch of the time-based variant is shown below.
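Here is a minimal sketch of a time-based (TTL) cache in PHP; the file modification time serves as the easily calculated attribute, and computeExpensiveData() is again a placeholder for the real calculation:

// A time-based (TTL) cache: the cached data is considered stale after $ttl seconds.
function getDataWithTtl($id, $ttl = 300)
{
    $file = sys_get_temp_dir() . '/cache_' . md5($id);
    // If the cache file exists and has not expired yet, use it.
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return unserialize(file_get_contents($file));
    }
    $newData = computeExpensiveData($id); // placeholder for the expensive calculation
    file_put_contents($file, serialize($newData));
    return $newData;
}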
3. Implicitly dependent - occurs when a change in the cached object depends on many factors. For example, the object's structure includes many other objects, which can also change. That is, the object implicitly (not directly) depends on others, which in turn may depend on yet others, and so on. The totality of these factors forms the object's implicit dependencies, and to decide whether the object has changed you would have to "poll" them all, which is not always possible or advisable. Take, for example, a function that displays information taken from a database. We can cache the function's result, but we need to know when the data changes in order to refresh it, and checking that means querying the table, which costs roughly as much as executing the function itself. And what if there are many tables? The cache loses its point.
This problem is solved with a so-called "dependency table". The table has two columns: objects and the times they last changed. Whenever an object changes, its time in the table is updated. Access to the table itself is very fast (as a rule it lives in a static or explicitly dependent cache), so the change times of all the objects we need can be fetched very quickly. If the change time of at least one of them is later than the cache time, the cache must be invalidated. Thus implicit dependencies are turned into explicit ones: the object depends on a fixed set of times in the dependency table.
Let me give an example. Suppose a CMS module displays comments. The database has two tables, users and comments, related one-to-many. When a user's data changes, we set the current time in the "users" row of the dependency table; when the comments change, we do the same with the "comments" row. The module knows that it depends on "users" and "comments", and if the change time of at least one of them is later than the time the cache was written, the cache is invalidated; otherwise it is used. Note that "users" and "comments" are not necessarily the change times of the tables themselves, but rather the change times of some entity, which may consist of many tables and other parameters. A sketch of such a dependency table follows below.
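To make this concrete, here is a minimal sketch of such a dependency table in PHP, assuming it is kept in a small file (the function and file names are illustrative, not taken from any particular CMS):

// The dependency table maps an entity name ("users", "comments", ...) to its last change time.
// In practice it would live in a static or explicitly dependent cache, e.g. shared memory.
function touchDependency($name)
{
    $file  = sys_get_temp_dir() . '/deps.cache';
    $table = is_file($file) ? unserialize(file_get_contents($file)) : array();
    $table[$name] = time();
    file_put_contents($file, serialize($table));
}

function isCacheFresh($cacheTime, array $dependencies)
{
    $file  = sys_get_temp_dir() . '/deps.cache';
    $table = is_file($file) ? unserialize(file_get_contents($file)) : array();
    foreach ($dependencies as $name) {
        // If any dependency changed after the cache was written, the cache is stale.
        if (isset($table[$name]) && $table[$name] > $cacheTime) {
            return false;
        }
    }
    return true;
}

// Usage in the comments module: invalidate on write, check on read.
// touchDependency('comments');                                    // after saving a comment
// $ok = isCacheFresh($cacheCreatedAt, array('users', 'comments')); // before using the cache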
4. Conditionally dependent - takes place when an implicitly dependent cache can be reduced to an explicitly dependent or static one under certain conditions. For example, the time of the last change to the dependency table can be taken as the time of the last change of any data on the site. If a page contains no data unique to the visitor (read: the visitor is not logged in), the whole page depends only on the dependency table, so under this condition the entire page can be cached. This category also includes a cache that is simply deleted physically at the right moment. A rough sketch of the whole-page case is shown below.
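Below is a rough sketch of the whole-page case in PHP, assuming the file-based dependency table from the previous example and a session flag for logged-in users (both names are illustrative):

// Whole-page cache: if the visitor is not logged in, the page depends only on the
// last change of the dependency table, so a saved copy can be served as long as
// nothing on the site has changed. Assumes session_start() has been called earlier.
function tryServeCachedPage($url)
{
    if (!empty($_SESSION['user_id'])) {
        return false; // the page contains user-specific data, render it normally
    }
    $depsFile = sys_get_temp_dir() . '/deps.cache';
    $pageFile = sys_get_temp_dir() . '/page_' . md5($url);
    // The dependency table file's own mtime is the time of the last change on the site.
    $lastChange = is_file($depsFile) ? filemtime($depsFile) : time();
    if (is_file($pageFile) && filemtime($pageFile) >= $lastChange) {
        echo file_get_contents($pageFile);
        return true;
    }
    return false; // render the page, then save it with file_put_contents($pageFile, $html)
}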
Conclusion
As you can see, data caching is a complex process that requires considerable development effort. Keep in mind that nowadays hardware can cost less than the development of complex algorithms, so if a site has performance problems, it may be worth simply buying a more powerful server: in some cases that will be cheaper.
What I have described here is only the foundation. A lot more could be added, but I do not want to make the article too long. If it turns out to be interesting, I will try to write a sequel.