Optimizing complex MySQL queries
Introduction
MySQL is a rather controversial product. On the one hand, it has an incomparable speed advantage over other databases on the simplest operations and queries. On the other hand, its optimizer is so underdeveloped that it loses outright on complex queries.
First of all, I would like to limit the range of optimization problems to "wide" and large tables: say, up to 10M records and up to 20GB in size, with a large number of varied queries against them. If your table has many millions of records, each 100 bytes in size, and only five simple possible queries against it, this article is not for you. NB: we consider the MySQL InnoDB / Percona engine, hereinafter simply MySQL.
Most queries are not very complex, so it is very important to know how to build an index for a given query and/or modify the query so that it uses the existing indexes. We will consider how the optimizer chooses an index for ordinary queries ( select_type = simple ), without joins or subqueries.
We discard the simplest cases of very small tables, for which the optimizer often uses type = all (full scan) regardless of available indexes - for example, a classifier with 40 entries. MySQL does have an algorithm for using multiple indexes ( index merge ), but it fires rather rarely, and only without order by. The only sensible way to try to get index merge is a selection over different columns combined with OR.
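For illustration, a minimal sketch of when index merge can fire (the table t1 and its keys are assumptions, not from the article):
alter table t1 add key k_a (A), add key k_b (B);
-- an OR across different indexed columns is the index_merge case
explain select * from t1 where A = 1 or B = 2;
-- expected plan: type = index_merge, Extra: Using union(k_a,k_b)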
Another digression: it is assumed that the reader is already familiar with explain. The optimizer often slightly rewrites the query itself, so to understand why a particular index was or was not used, you should call
explain extended select xxx;
and then
show warnings;
which will show the query as modified by the optimizer.
Covering Index - From Fat Tables to Indexes
So, the task: suppose we have a fairly simple query that is executed quite often but is relatively slow for such a frequent call. Let us consider the strategy of bringing our query to using index as the fastest option.
Why using index? Yes, MySQL has only B-tree indexes, but it nevertheless tries to keep indexes entirely in memory whenever possible (and can even build adaptive hash indexes on top of them) - this is precisely what gives MySQL its fabulous performance relative to other databases. In addition, the optimizer will often prefer an index that is not the best but is already loaded in memory over a better one that is still on disk (for type = index / range ). Hence several conclusions:
- indexes that are too heavy are evil: either they will not be used because they are not yet in memory, or they will not be loaded into memory because loading them would evict other indexes.
- if the size of an index is comparable to the size of the table, or if the set of indexes used by different frequent queries is significantly larger than the server's memory, no significant optimization can be achieved (a quick way to check this is sketched after this list).
- a subtle point: indexing or sorting by a TEXT column dooms you to a permanent using filesort.
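A quick practical check - a hedged sketch assuming a schema named mydb - is to compare index sizes from information_schema against the buffer pool:
select table_name, data_length, index_length
from information_schema.tables
where table_schema = 'mydb';
show variables like 'innodb_buffer_pool_size';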
One subtle point that is sometimes forgotten: MySQL creates only clustered indexes. A secondary index points not to the absolute position of the record in the table but (conditionally) to the primary key value, which in turn allows the desired record to be retrieved. MySQL, without further ado, does just that: it widens every key out to include the primary key. Thus, if you have primary key (ID), key (A, B, C) in the table, then in reality your second key is not (A, B, C) but (A, B, C, ID). Hence the moral: a fat primary key is evil.
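A small sketch to make this concrete (the table and key names are invented for illustration):
create table t2 (
  ID bigint not null,
  A int,
  B int,
  C int,
  primary key (ID),
  key k_abc (A, B, C)  -- InnoDB actually stores this as (A, B, C, ID)
) engine=InnoDB;
-- thanks to the hidden ID suffix, this is resolved entirely from k_abc (using index):
select ID from t2 where A = 1 and B = 2 order by C, ID;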
It is worth noting the difference in query caching between databases. Whereas PostgreSQL / Oracle cache query plans (like a prepare with a certain timeout), MySQL simply caches the query STRING (including parameter values) together with the query result. That is, if you run
select AAA from BBB where CCC=DDD
several times in a row - provided DDD contains no nondeterministic functions and table BBB has not changed (in the sense of the isolation level used) - the result will be taken directly from the cache. A rather controversial improvement.
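As a hedged illustration (names are the same placeholders as above): any nondeterministic function in the query text disables caching, and the SQL_NO_CACHE keyword skips the cache explicitly, which is handy when benchmarking.
select AAA from BBB where CCC = now();  -- never cached: now() is nondeterministic
select sql_no_cache AAA from BBB where CCC = 1;  -- explicitly bypass the query cache
In what follows, then, we assume that we are not simply repeating one and the same query: the query parameters change and the table data changes. The best option here is a covering index. What makes an index covering?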
- First, look at the order by clause. The index used must begin with the same columns that appear in order by, in the same or in exactly reversed sort order. If the sort order is neither direct nor fully reversed, the index cannot be used. There is one "but": MySQL still does not support indexes with mixed sort directions. An index is always asc. So if you have order by A asc, B desc - say goodbye to using index.
- Second, the columns being retrieved must be present in the covering index. Very often this condition is impossible to satisfy because the index would have to grow endlessly, which, as we know, is evil. Hence there is a workaround: a self join, that is, splitting the query into row selection and data extraction. First, under the given condition, we select only the primary key columns (which, as noted above, are present in every secondary index); second, we join the result back to the selection of all required columns by this same primary key. This gives a clean using index in the first select and eq_ref (essentially a multiple const ) in the second. We end up with something like the following (a fuller sketch follows after this list):
select AAA, BBB, CCC, DDD from tableName as a join tableName as b using (PK) <where conditions on table b>
- Next comes where. In the worst case we can scan the whole index ( type = index ), but where possible we should try to use conditions that stay within type = range ( >, >=, <, <=, like 'xxx%' and so on). The index used must include all fields from where in order to preserve using index. As noted above, you can try to rely on index_merge, but with complex conditions this is often simply impossible.
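Putting the pieces together - a hedged sketch in which the orders table, its columns, and the key name are all assumptions for illustration. The inner select runs entirely inside the covering key (using index), and the outer join fetches the fat columns by primary key (eq_ref):
create table orders (
  id bigint not null,
  status tinyint,
  created datetime,
  payload varchar(1000),
  primary key (id),
  key k_status_created (status, created)  -- covers the where and the order by; id is appended implicitly
) engine=InnoDB;

select b.id, b.created, b.payload
from (
  select id from orders
  where status = 1 and created >= '2012-01-01'  -- stays within type = range
  order by created
  limit 100
) as a
join orders as b on b.id = a.id;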
Actually, this is all that can be done when we have only one type of query. Unfortunately, the MySQL optimizer is not always able to pick a covering index for the query even when one exists. In that case you have to help the optimizer with the standard use / force index hints.
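For example (a minimal sketch with an assumed key name):
select A, B from tableName force index (k_ab) where A = 1 and B > 10 order by B;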
Isolating Fat Fields From a Covering Index - From Fat Indexes to Thin Ones
But what if we have several types of queries, or queries requiring different sort orders, that use fat fields ( varchar )? Just estimate the size of an index on a varchar(100) field over a million records: on the order of 100 MB. And what if this field is used in several types of queries, for each of which we have a different covering index? Is it possible to keep in memory only ONE index on this fat field while maintaining the same (or almost the same) performance for the different queries? Hence the last point.
- Fat and thin fields. Obviously, having several DIFFERENT keys that include fat fields is an inadmissible luxury. Therefore, whenever possible, we should try to have only one key starting with the fat field. Here it is appropriate to apply an artificial transformation: replace the condition on the fat field with a join against the result of that condition. For example, take the query:
select A from tableName where A=1 or fatB='test'
Instead of creating the key key(fatB, A), we create a thin key(A) and a fat key(fatB), and rewrite the query as follows:
create temporary table tmp as select PK from tableName where fatB='test';
select A from tableName left join tmp using (PK) where A=1 or tmp.PK is not null;
As a result, we can have many thin keys for different queries and only one fat key on fatB: a real memory saving while maintaining almost full performance.
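The same trick end to end, as a hedged sketch (tableName, PK, A, and fatB as above; the key names are illustrative). Note the primary key on tmp, which keeps the join back cheap:
alter table tableName
  add key k_a (A),        -- thin key, cheap to keep in memory
  add key k_fatb (fatB);  -- the single fat key
create temporary table tmp (primary key (PK))
  as select PK from tableName where fatB = 'test';
select a.A
from tableName as a
left join tmp using (PK)
where a.A = 1 or tmp.PK is not null;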
An Exercise for Self-Study
It is required to create a minimal (in terms of memory) set of keys and to optimize queries of the form:
select A,B,C,D from tableName where A=1 and B=2 or C=3 and D like 'test%';
select A,C,D from tableName where B=3 or C=3 and D='test' order by B;
Assume the queries are not reducible to type = range.
Bibliography
- High Performance MySQL, 2nd Edition: Optimization, Backups, Replication, and More. Baron Schwartz, Peter Zaitsev, Vadim Tkachenko, Jeremy D. Zawodny, Arjen Lentz, Derek J. Balling. O'Reilly Media, June 2008. 712 pages.
- www.mysqlperformanceblog.com