Data caching, maybe the last thing you should use
- Translation
Recently I had a rather heated run-in with a popular PHP e-commerce package. As a result, I want to talk about one common mistake in web application architecture.
What is this mistake?
The package I worked with relied heavily on caching. Without certain "optional" cache settings enabled, it could not serve more than 10 pages per second. With performance like that, those settings are clearly not optional but mandatory.
I think that when you have a tool as wonderful as memcached, you want to use it to solve every performance problem. But in many cases it should not be the first tool you reach for. Here is why:
- Caching may not work for all users - you open the page and it loads quickly. But is that true for every user? Caching very often improves load time for most visitors, but in reality you often need the page to load quickly for everyone, without exception (if you follow the six sigma principle). In practice, requests from the same user can miss the cache every time, which makes matters worse. (Translator's note: I know of a real case where an online store's cache worked for 99% of users and did not work for the 1% of visitors with a long purchase history; as a result, the store was slow precisely for its most active buyers.)
- Caching can lead you away from solving the problem - you look at the slowest-loading page and try to optimize it. But the catch is that the real performance problem may lie elsewhere (six sigma again). You "cure" the symptom by caching, say, the whole page, but the underlying performance problem does not go away; it stays hidden. (Translator's note: only to resurface on other pages again and again.)
- Managing a cache is not easy in practice - have you ever struggled with a "runaway cache" (cache stampede), or with a situation where a large number of cache entries are invalidated at the same time?
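To make the first point concrete, here is a small arithmetic sketch in Python. The numbers are assumptions chosen for illustration (20 ms for a cached page, 2000 ms for an uncached one, 1% of requests missing the cache, echoing the translator's 99%/1% store example); they are not measurements from the package discussed above.

```python
from statistics import mean, median

HIT_MS = 20     # assumed latency of a page served from the cache
MISS_MS = 2000  # assumed latency of a page that misses the cache


def latency_summary(total_requests, miss_count):
    """Summarize latencies for a mix of cache hits and cache misses."""
    latencies = [HIT_MS] * (total_requests - miss_count) + [MISS_MS] * miss_count
    return {
        "median_ms": median(latencies),
        "mean_ms": mean(latencies),
        "worst_ms": max(latencies),
        "slow_share": miss_count / total_requests,
    }


# 1000 requests, 10 of which (1%) miss the cache.
summary = latency_summary(total_requests=1000, miss_count=10)
print(summary)
```

With these assumed numbers the median stays at 20 ms and the mean is still under 40 ms, yet every hundredth request waits two full seconds. The aggregate figures say the page is fast; the users who miss the cache would disagree.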
Alternative approach
Caching should be seen as a burden that many applications cannot live without. Try to avoid taking on this burden until you have exhausted the whole arsenal of easily applied optimization methods.
What are these ways?
Before introducing caching, make sure you have gone through this fairly simple checklist:
- Do you understand the execution plan of every query? If not, set long_query_time = 0 and use mk-query-digest (today shipped as pt-query-digest in Percona Toolkit) to capture the complete list of queries. Run EXPLAIN on each of them and analyze the execution plan.
- Do you use SELECT * and then use only a small subset of the columns? Or do you fetch many rows from the database but use only some of them? If so, you are retrieving too much data and limiting DBMS-level optimizations such as the use of indexes.
- Do you know how many queries it takes to generate one page? Are they all really necessary? Can some of them be combined into a single query or removed altogether? (Translator's note: a very common problem. I know of a real case where a page displayed the list of students in a class and then, in a loop, requested additional information for each student, including the name of the class. After the rework, the number of queries dropped from 61 to 3.)
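The last question on the list can be sketched as follows. This is a hypothetical reconstruction of the students example, not the translator's actual code: the schema, the table and column names, and the CountingConnection helper are all invented for illustration, and Python's built-in sqlite3 stands in for whatever DBMS the real application used.

```python
import sqlite3


def build_db():
    """Create an in-memory database with one class of 30 students (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE classes  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE students (id INTEGER PRIMARY KEY, class_id INTEGER, name TEXT);
        CREATE TABLE profiles (student_id INTEGER PRIMARY KEY, email TEXT);
    """)
    conn.execute("INSERT INTO classes VALUES (1, '7B')")
    for i in range(1, 31):
        conn.execute("INSERT INTO students VALUES (?, 1, ?)", (i, f"student{i}"))
        conn.execute("INSERT INTO profiles VALUES (?, ?)", (i, f"s{i}@example.com"))
    return conn


class CountingConnection:
    """Hypothetical helper: counts how many queries the page code issues."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def query(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


def page_in_a_loop(db):
    """One query for the list, then two more queries per student."""
    rows = []
    students = db.query(
        "SELECT id, class_id, name FROM students WHERE class_id = ? ORDER BY id", (1,))
    for student_id, class_id, name in students:
        email = db.query(
            "SELECT email FROM profiles WHERE student_id = ?", (student_id,))[0][0]
        class_name = db.query(
            "SELECT name FROM classes WHERE id = ?", (class_id,))[0][0]
        rows.append((name, email, class_name))
    return rows


def page_with_join(db):
    """The same page data fetched with a single joined query."""
    return db.query("""
        SELECT s.name, p.email, c.name
        FROM students s
        JOIN profiles p ON p.student_id = s.id
        JOIN classes  c ON c.id = s.class_id
        WHERE s.class_id = ?
        ORDER BY s.id
    """, (1,))


loop_db = CountingConnection(build_db())
join_db = CountingConnection(build_db())
loop_rows = page_in_a_loop(loop_db)
join_rows = page_with_join(join_db)
print(loop_db.count, join_db.count)
```

With 30 students the loop in this sketch issues 1 + 30 × 2 = 61 queries, while the join needs one and returns the same rows. The rework in the translator's case ended at 3 queries rather than 1; the exact number matters less than the fact that it no longer grows with the number of rows on the page.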
I think we can conclude with this: "Optimization very rarely reduces the complexity of an application. Try to avoid added complexity by optimizing only what really needs to be optimized" - a quote from Justin's slides, instrumentation-for-php.
In the long run, many applications should keep their architecture simple and resist the temptation to solve problems "the way the big boys do it."
Translator's note: a completely real dialogue that took place not long ago:
- So, we have performance problems; we need to add caching, vertical partitioning, and a NoSQL database for logins.
- Guys, I looked at EXPLAIN: you have a query doing a full scan over 4,000 rows. I tried creating an index, and everything became 26 times faster.
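The kind of check described in this dialogue can be sketched like this. It is an illustration only: the orders table, its columns, and the index name are hypothetical, and SQLite's EXPLAIN QUERY PLAN (via Python's sqlite3) stands in for EXPLAIN on the real database, whose output format will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
# 4,000 made-up rows spread over 400 customers.
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 400, i * 1.5) for i in range(4000)],
)

QUERY = "SELECT id, total FROM orders WHERE customer_id = ?"


def plan(connection):
    """Return the human-readable steps of the query plan for QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (42,)).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the description


plan_before = plan(conn)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(conn)

print(plan_before)
print(plan_after)
```

Before the index exists the plan reports a scan of the whole table; afterwards it reports a search that uses idx_orders_customer. How much faster that is depends on the data and the DBMS; the 26-fold speedup in the dialogue is specific to that case.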
Some remarks on the translation
1. I translated the term "cache stampeding" as "runaway cache" (there was a temptation to translate it as "tearing", but that would be wrong). In short, it is the following situation: some query takes a long time to run and its results are cached; sooner or later that data leaves the cache, and if 10 pages that need it are rendered at the same moment, 10 slow queries are sent to the database instead of one. The usual remedy is to refresh the data before it is evicted from the cache. See for example
2. I want to stress that the article does not say you should not cache data. You should, but only after you have tried a few simple ways of optimizing your database queries. In other words, start with the simple things.
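The stampede described in remark 1, and the remedy of refreshing data before it leaves the cache, can be sketched as follows. This is a minimal single-process illustration under stated assumptions, not a production cache and not memcached's API: RefreshAheadCache, its parameters, slow_query, and the fake clock are all hypothetical names introduced here.

```python
import threading
import time


class RefreshAheadCache:
    """Single-value cache that reloads shortly before expiry (illustrative only)."""

    def __init__(self, loader, ttl, refresh_ahead, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl
        self._refresh_ahead = refresh_ahead
        self._clock = clock
        self._entry = None                    # (value, expires_at) or None
        self._reload_lock = threading.Lock()  # held by whoever is reloading

    def _state(self):
        """Return ('fresh' | 'aging' | 'missing', cached value or None)."""
        entry = self._entry
        if entry is None or self._clock() >= entry[1]:
            return "missing", None
        if self._clock() >= entry[1] - self._refresh_ahead:
            return "aging", entry[0]          # still valid, but close to expiry
        return "fresh", entry[0]

    def _reload(self):
        value = self._loader()
        self._entry = (value, self._clock() + self._ttl)
        return value

    def get(self):
        state, value = self._state()
        if state == "fresh":
            return value
        if state == "aging":
            # Only one caller refreshes; the rest keep serving the still-valid value.
            if not self._reload_lock.acquire(blocking=False):
                return value
        else:
            # Nothing usable is cached: callers wait for a single reload.
            self._reload_lock.acquire()
        try:
            state, value = self._state()      # someone may have reloaded meanwhile
            return value if state == "fresh" else self._reload()
        finally:
            self._reload_lock.release()


calls = []


def slow_query():
    calls.append(1)     # record every trip to the "database"
    time.sleep(0.05)    # stands in for a slow query
    return "report"


now = [0.0]             # fake clock so the demo does not wait for real time
cache = RefreshAheadCache(slow_query, ttl=60, refresh_ahead=10, clock=lambda: now[0])
cache.get()             # warm the cache at t = 0
now[0] = 55.0           # 5 seconds before expiry: inside the refresh window

barrier = threading.Barrier(10)
results = []


def render_page():
    barrier.wait()      # release all ten "pages" at the same moment
    results.append(cache.get())


threads = [threading.Thread(target=render_page) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(calls))       # one warm-up load plus one refresh
```

Ten pages rendered at the same moment trigger a single reload instead of ten, because the entry is refreshed while it is still valid and only one caller does the refreshing. In a multi-server setup the in-process lock would have to be replaced by something shared between servers, which is part of why cache management counts as a burden.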