MySQL: optimizing the BETWEEN construct
Optimization is clearly not MySQL's strong point. The purpose of this article is to explain to developers who do not work with databases day to day, and who sometimes wonder why a query that runs fine in other DBMSs slows down shamelessly in MySQL, how the BETWEEN construct is handled by the MySQL optimizer.
MySQL uses a rule-based optimizer. Rudiments of cost-based optimization are certainly present in it, but not to the extent I would like to see. For this reason the cardinalities of the sets obtained after applying filters are often estimated incorrectly, which leads to optimizer mistakes and the choice of a wrong execution plan. Moreover, the plans the optimizer produces can be influenced explicitly in only two ways: by hinting the indexes used to execute the query and the table join order.
First, let's consider bug bugs.mysql.com/bug.php?id=5982 and possible ways to work around it.
To understand the essence of the bug, let's build a test data set of 125 million records.
drop table if exists pivot;
drop table if exists big_table;
drop table if exists attributes;
create table pivot
(
row_number int(4) unsigned auto_increment,
primary key pk_pivot (row_number)
)
engine = innodb;
insert into pivot(row_number)
select null
from information_schema.global_status g1, information_schema.global_status g2
limit 500;
create table attributes(attr_id int(10) unsigned auto_increment,
attribute_name varchar(32) not null,
start_date datetime,
end_date datetime,
constraint pk_attributes primary key(attr_id)
)
engine = innodb;
create table big_table(btbl_id int(10) unsigned auto_increment,
attr_attr_id int(10) unsigned,
record_date datetime,
record_value varchar(128) not null,
constraint pk_big_table primary key(btbl_id)
)
engine = innodb;
insert into attributes(attribute_name, start_date, end_date)
select row_number, str_to_date("20000101", "%Y%m%d"), str_to_date("20000201", "%Y%m%d") from pivot;
insert into big_table(attr_attr_id, record_date, record_value)
select p1.row_number,
date_add(str_to_date("20000101", "%Y%m%d"), interval p2.row_number + p3.row_number day),
p2.row_number * 1000 + p3.row_number
from pivot p1,
pivot p2,
pivot p3;
create index idx_big_table_attr_date on big_table(attr_attr_id, record_date);
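Before moving on, a quick sanity check of the generated volumes may be useful (a sketch; the expected counts follow directly from the 500-row pivot table and the cross joins above):
select count(*) from pivot;                              -- 500
select count(*) from big_table;                          -- 500 * 500 * 500 = 125,000,000
select count(*) from big_table where attr_attr_id = 1;   -- 500 * 500 = 250,000 rows per attribute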
The attributes table is essentially a dictionary table for big_table; it also contains two columns that limit the date range to one month.
attr_id | attribute_name | start_date | end_date |
1 | 1 | 1/1/2000 12:00:00 AM | 2/1/2000 12:00:00 AM |
2 | 2 | 1/1/2000 12:00:00 AM | 2/1/2000 12:00:00 AM |
3 | 3 | 1/1/2000 12:00:00 AM | 2/1/2000 12:00:00 AM |
For each attr_id the big table contains 250,000 entries. Let's find out how many records big_table contains for attribute 1 within the dates specified by the attributes table.
select attr_attr_id,
max(record_date),
min(record_date),
max(record_value),
count(1)
from big_table
where attr_attr_id = 1 and record_date between str_to_date("20000101", "%Y%m%d") and str_to_date("20000201", "%Y%m%d")
group by attr_attr_id;
attr_attr_id | max (record_date) | min (record_date) | count (1) |
1 | 2/1/2000 12:00:00 AM | 1/3/2000 12:00:00 AM | 465 |
We get about 500 records (the query execution time is negligible: 00.050 seconds). Since the data is distributed identically across all attribute values, it is logical to assume that if we join to the attributes table instead of supplying the bind variables explicitly, the query time should grow roughly linearly and not exceed 25 seconds (500 attributes at 0.05 seconds each). Let's check:
select b.attr_attr_id,
max(b.record_date),
min(b.record_date),
max(b.record_value),
count(1)
from attributes a
join
big_table b
on b.attr_attr_id = a.attr_id and b.record_date between a.start_date and a.end_date
group by b.attr_attr_id;
Execution time: more than 15 minutes (at that point I killed the query). Why so? The thing is that MySQL cannot perform a dynamic range scan, i.e. a range scan whose bounds are taken from another table's columns, which is exactly what bug #5982, filed back in 2004, tells us. Let's look at the execution plan:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
1 | SIMPLE | b | ALL | idx_big_table_attr_date | | | | 125443538 | Using temporary; Using filesort |
1 | SIMPLE | a | eq_ref | PRIMARY | PRIMARY | 4 | test.b.attr_attr_id | 1 | Using where |
The plan clearly shows that a full table scan over 125 million records is taking place. A strange decision. Neither straight_join to change the join order nor force index to dictate the index will rectify the situation.
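For reference, the hinted variant one might try looks roughly like this (a sketch; STRAIGHT_JOIN and FORCE INDEX are standard MySQL syntax, but the exact placement here is mine):
select straight_join b.attr_attr_id,
       max(b.record_date),
       min(b.record_date),
       max(b.record_value),
       count(1)
from attributes a
join big_table b force index (idx_big_table_attr_date)
on b.attr_attr_id = a.attr_id and b.record_date between a.start_date and a.end_date
group by b.attr_attr_id;
Even with such hints, at best we get a plan of the form: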
PRIMARY | b | ref | idx_big_table_attr_date | idx_big_table_attr_date | 5 | a.attr_id | 6949780 | Using where |
This tells us that big_table will be accessed through the correct index, but the index is not fully used: only its first column is involved (key_len = 5, i.e. just attr_attr_id). In some degenerate cases we can obtain the plan we need and use the index in full, but because of the optimizer's inconsistency and the impossibility of applying that trick universally (I will not show its code here; it fails in 90% of cases), we need a different approach.
That approach is suggested by MySQL itself: supply the bind variables explicitly. Of course this is not always effective, since a full scan is sometimes faster than an index scan, but clearly not in our case, where we need to pick roughly 500 rows out of every 250,000. To solve the problem we create the following procedure.
drop procedure if exists get_big_table_data;
delimiter $$
create procedure get_big_table_data(i_attr_from int(10))
main_sql:
begin
declare v_attr_id int(10);
declare v_start_date datetime;
declare v_end_date datetime;
declare ex_no_records_found int(10) default 0;
declare
attr cursor for
select attr_id, start_date, end_date
from attributes
where attr_id > i_attr_from;
declare continue handler for not found set ex_no_records_found = 1;
declare continue handler for sqlstate '42S01' begin
end;
create temporary table if not exists temp_big_table_results(
attr_attr_id int(10) unsigned,
max_record_date datetime,
min_record_date datetime,
max_record_value varchar(128),
cnt int(10)
)
engine = innodb;
truncate table temp_big_table_results;
open attr;
repeat
fetch attr
into v_attr_id, v_start_date, v_end_date;
if not ex_no_records_found then
insert into temp_big_table_results(attr_attr_id,
max_record_date,
min_record_date,
max_record_value,
cnt
)
select attr_attr_id,
max(record_date) max_record_date,
min(record_date) min_record_date,
max(record_value) max_record_value,
count(1) cnt
from big_table b
where attr_attr_id = v_attr_id and record_date between v_start_date and v_end_date
group by attr_attr_id;
end if;
until ex_no_records_found
end repeat;
close attr;
select attr_attr_id,
max_record_date,
min_record_date,
max_record_value,
cnt
from temp_big_table_results;
end
$$
delimiter ;
That is, at the beginning we open a cursor over the five hundred rows of the attributes table, and for each row we run a separate query against big_table. Let's look at the result:
call get_big_table_data(0);
Execution time: 0:00:05.017. In my opinion that is a much better result. Not perfect, but it works.
Now let's consider the reverse example, where the search runs not against the "transactions" (fact) table, but against the reference table that stores the ranges.
Now let's look at bug bugs.mysql.com/bug.php?id=8113.
This bug shows up if you:
- work with a GeoIP database
- analyze schedules
- record currency rates on Forex
- determine the city from a telecom operator's number ranges
and so on.
First, create a test data set of 25 million rows.
drop table if exists big_range_table;
create table big_range_table(rtbl_id int(10) unsigned auto_increment,
value_from int(10) unsigned,
value_to int(10) unsigned,
range_value varchar(128),
constraint pk_big_range_table primary key(rtbl_id)
)
engine = innodb;
insert into big_range_table(value_from, value_to, range_value)
select @row_number := @row_number + 1, @row_number + 1, p1.row_number + p2.row_number + p3.row_number
from (select *
from pivot
where row_number <= 100) p1,
pivot p2,
pivot p3,
(select @row_number := 0) counter;
create index idx_big_range_table_from_to
on big_range_table(value_from, value_to);
create index idx_big_range_table_from
on big_range_table(value_from);
We get a table of the form:
rtbl_id | value_from | value_to | range_value |
1 | 1 | 2 | 3 |
2 | 2 | 3 | 4 |
3 | 3 | 4 | 5 |
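Again, a quick volume check may be useful (a sketch; the count follows from the 100 x 500 x 500 cross join above):
select count(*) from big_range_table;                         -- 100 * 500 * 500 = 25,000,000
select min(value_from), max(value_to) from big_range_table;   -- the ranges cover 1 .. 25,000,001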
And straight away let's try to run a query that every DBMS except MySQL optimizes successfully:
select range_value
from big_range_table
where 10000000.5 >= value_from and 10000000.5 < value_to;
Execution time: 0:00:22.412. This is simply not acceptable, considering that we know such a query must return exactly one row. And the higher the chosen value of the variable, the more records have to be scanned and the longer the query runs.
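The reason is easy to see from the plan (a sketch; I am not reproducing the exact optimizer output here, only the statement and what to look for):
explain
select range_value
from big_range_table
where 10000000.5 >= value_from and 10000000.5 < value_to;
-- With ordinary B-tree indexes the optimizer can use only the condition on value_from
-- as the range boundary; the value_to predicate is checked row by row, so the rows
-- estimate covers roughly the first 10 million index entries.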
MySQL itself offers the following workaround to solve this problem:
select range_value
from big_range_table
where value_from <= 25000000
order by value_from desc
limit 1;
Execution time: 0:00:00.350. Not bad. But this solution has a number of drawbacks; in particular, you cannot join it to other tables, i.e. the query can only exist on its own. To make joins possible we will use the standard solution of R-tree (spatial) indexes (provided, of course, that your reference table does not need transactions, or that you enforce its integrity with triggers, since this type of index still works only with MyISAM). For those who do not know what geometric objects in MySQL are, here is an illustration of what is usually done in such cases:

Imagine a plane. The abscissa (x) axis carries the values we search by. The ordinate of the search point is zero, because in this case, for simplicity, we apply BETWEEN to a single criterion; with more criteria you would need multidimensional objects. The rectangle boundaries a and b are usually 1 and -1, respectively. Thus the values from our reference table cover the ray starting at 0, and they are bounded by the set of shaded rectangles. If the search point falls inside a rectangle, that rectangle's identifier gives us the identifier of the desired record in the table. Let's start the conversion:
alter table big_range_table engine = myisam, add column polygon_value polygon not null;
update big_range_table
set polygon_value =
geomfromwkb(polygon(linestring( /* going counterclockwise: four corners, then back to the first point to close the polygon */
point(value_from, -1), /* lower left */
point(value_to, -1), /* lower right */
point(value_to, 1), /* upper right */
point(value_from, 1), /* upper left */
point(value_from, -1) /* back to the first point */
)));
For those who dared to run this operation along with me, here is how to monitor the progress and estimate when the update will finish:
select (select @first_value := variable_value from information_schema.global_status where variable_name = 'HANDLER_UPDATE') updated,
sleep(10) lets_sleep,
(select @second_value := variable_value from information_schema.global_status where variable_name = 'HANDLER_UPDATE') updated_in_a_ten_second,
@second_value - @first_value myisam_updated_records,
25000000 / (@second_value - @first_value) / 6 estimate_for_update_in_minutes,
(select 25000000 / (@second_value - @first_value) / 6 - time / 60 from information_schema.processlist where info like 'update big_range_table%') estimate_time_left_in_minutes;
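Once the update has finished, a quick sanity check of one converted row does not hurt (a sketch; the expected polygon follows directly from the update statement above):
select rtbl_id, value_from, value_to, astext(polygon_value)
from big_range_table
where rtbl_id = 1;
-- should show something like POLYGON((1 -1,2 -1,2 1,1 1,1 -1))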
If you have reached this step, I advise you not to do it right away, since with improper database settings the creation of this index can take a week.
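If you do go ahead, the MyISAM buffers have a major impact on how long the build takes; something along these lines may help (a sketch with purely illustrative values, not recommendations, and assuming you have the RAM to spare):
set global key_buffer_size = 1024 * 1024 * 1024;                -- MyISAM key cache, 1 Gb
set global myisam_sort_buffer_size = 512 * 1024 * 1024;         -- buffer used when building indexes by sort
set global myisam_max_sort_file_size = 16 * 1024 * 1024 * 1024; -- allow large temporary sort files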
create spatial index idx_big_range_table_polygon_value
on big_range_table(polygon_value);
Now we can compare performance. First, let's look at the execution speed of the original query and of the "geometric" one, gradually increasing the limit value from 10 to 100.
select *
from (select row_number * 5000 row_number from pivot order by row_number limit 10) p, big_range_table
where mbrcontains(polygon_value, pointfromwkb(point(row_number, 0))) and row_number < value_to;
select *
from (select row_number * 5000 row_number from pivot order by row_number limit 10) p, big_range_table
where value_from <= row_number and row_number < value_to;

The vertical axis is time, the horizontal axis is the limit value. As the figure shows, the BETWEEN time (blue) keeps growing as we move from the beginning of the range toward its end, since each next value of the bind variable requires scanning more and more rows. The "geometric" solution (pink) is essentially constant at such small values.
Let's compare ORDER BY ... LIMIT 1 and the geometric approach on larger workloads. To level the playing field and use random sampling, we will use the following procedures:
drop procedure if exists pbenchmark_mbrcontains;
delimiter $$
create procedure pbenchmark_mbrcontains(i_repeat_count int(10))
main_sql:
begin
declare v_random int(10);
declare v_range_value int(10);
declare v_loop_counter int(10) unsigned default 0;
begin_loop:
loop
set v_loop_counter = v_loop_counter + 1;
if v_loop_counter < i_repeat_count then
set v_random = round(2500000 * rand());
select range_value
into v_range_value
from big_range_table
where mbrcontains(polygon_value, pointfromwkb(point(v_random, 0))) and v_random < value_to;
iterate begin_loop;
end if;
leave begin_loop;
end loop begin_loop;
select v_loop_counter;
end
$$
delimiter ;
drop procedure if exists pbenchmark_limit;
delimiter $$
create procedure pbenchmark_limit(i_repeat_count int(10))
main_sql:
begin
declare v_random int(10);
declare v_range_value int(10);
declare v_loop_counter int(10) unsigned default 0;
begin_loop:
loop
set v_loop_counter = v_loop_counter + 1;
if v_loop_counter < i_repeat_count then
set v_random = round(2500000 * rand());
select range_value
into v_range_value
from big_range_table
where value_from <= v_random order by value_from desc limit 1;
iterate begin_loop;
end if;
leave begin_loop;
end loop begin_loop;
select v_loop_counter;
end
$$
delimiter ;
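Running the benchmarks is then just a matter of calling the procedures with the desired iteration count, for example (a sketch):
call pbenchmark_mbrcontains(10000);
call pbenchmark_limit(10000);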

The graph shows how many seconds the corresponding operations take as the number of procedure iterations grows from 10,000 to 90,000. As you can see, the "geometric" solution (pink) is about twice as fast as the ORDER BY ... LIMIT 1 solution (yellow), and in addition it can be used in ordinary SQL statements.
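For example, unlike the ORDER BY ... LIMIT 1 trick, the geometric lookup can sit directly inside a join (a sketch; the some_events table and its ip_value column are hypothetical and exist only for illustration):
select e.event_id, r.range_value
from some_events e                      -- hypothetical fact table
join big_range_table r
  on mbrcontains(r.polygon_value, pointfromwkb(point(e.ip_value, 0)))
 and e.ip_value < r.value_to;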
I covered this topic only because these flaws are invisible on small data volumes, but once the database grows and more than 10 users start working with it, the performance degradation becomes simply monstrous, and queries of this kind can be found in almost every production database.
Good luck with your optimizations. If this article proves interesting, next time I will cover bugs that, far from getting in the way, actually improve query performance when used correctly.
P.S.
MySQL version 5.5.11
all queries were run after a MySQL restart so that results could not come from the cache.
The MySQL settings are far from the defaults, but the InnoDB buffers do not exceed 300 MB and the MyISAM buffers (except at the moment the spatial index was created) do not exceed 100 MB.
sizes of the used files:
big_range_table.ibd 1740M
big_table.ibd 5520M - without indices
big_table.ibd 8268M - with indices
i.e. the data could not already have been in the database cache before a query started.