Exploring MySQL JOIN Performance
I think it’s no secret to anyone that JOIN is considered a rather expensive operation, and many beginning programmers (who use MySQL) like to intimidate that JOIN is bad, and it is best to do without them if possible.
Let's examine this question in more detail and see if JOIN is really bad, and when to think about it at all.
To begin with, I would like to say right away that I will not do:
- MySQL tuning: all settings are taken by default (including innodb_buffer_pool_size = 8 Mb and more)
- integration with programming languages: all requests will be made through the MySQL Sequel Pro client , and time will be measured on the basis of his testimony
- obvious things, like joins when fetching 3 lines: the question is whether to save on matches or not, I don’t want to consider - we will consider saving tens of times, not tens of percent
We will have two plates that are simple to disgrace (the tables were made just for example and filled with random data):
Calls - 10 million lines:
Users - 100 thousand lines:
The names, I think, speak for themselves, indexes are only on the primary key (id). Purely, in principle, we could create some kind of indexes that would help us fulfill the queries that we will investigate, but our goal is different, namely, to investigate how fast JOIN works.
For research purposes, the Calls table was taken in two types - MyISAM and InnoDB, and the Users table , to which we make JOIN, of three types - MyISAM, InnoDB and MEMORY
All tests were carried out on my laptop, with MySQL version 5.5.9 on Mac OS X with default settings from MAMP. All tables fit into memory completely, queries were run several times to make sure everything got into the cache.
To begin with, let's just look at the speed of viewing rows in MyISAM and InnoDB by executing such a query (I recall that there are no indexes either by price or by user_id - we measure the FULL SCAN speed in MySQL):
And such a request:
Results (error less than 5%):
I don’t want to study in more detail why aggregate selections in MyISAM are so affected by the number of rows that fall under WHERE, but the fact remains that when I look at the table completely, MyISAM is 4.5 times faster than InnoDB. Hence the opinion that InnoDB is a “brake”, and that MySQL itself (with MyISAM) is very smart.
Mini-conclusion: at full sequential viewing of MyISAM is 5 times faster than InnoDB
Let's now connect the Users table to the case - not for nothing that we created it.
We will investigate requests of such a plan:
The parameter for cost is selected so that a certain percentage of entries in the Calls table fall under the selection.
If we make the query above to the muscle with the cost parameter that will correspond to N% of the rows, then MySQL will make a join in only N% of cases , and for the rest (100-N)% of the lines, he will not do JOIN. Here is such a clever MySQL.
Let's get down to testing with JOIN:
So far, times differ very little from FULL SCAN. It is understandable - after all, JOIN is made for a scanty number of lines.
Funny huh? Only 1% of the rows are joined, and the results for MyISAM + MyISAM are 2 times more than for MyISAM + InnoDB. It's pretty funny that the JOIN to InnoDB in this case is faster than the JOIN to MyISAM. And we have not started testing it yet :)!
It’s a shame for the Power (MyISAM), but what can I do ... It turns out that MyISAM is not so smart ... Or not? Let's look at the final test results.
Pay attention to the monstrous (!) Sample times with JOIN with MyISAM. But InnoDB was pleasantly surprised - due to its architecture, JOIN is not too expensive operation for InnoDB. To be honest, I was very surprised when I got such a result that the second fastest JOIN option is when InnoDB is connected to MyISAM.
Well, with MEMORY, I think, everything is clear - MEMORY gives an overhead of 525% (4,435 ms) per PK connection, InnoDB gives an overhead of 1,380% (11,655 ms), it’s a shame to talk about MyISAM.
The keen eye could notice that for our scenario (when we do JOIN to users to weed out all women from the call table) there is a way without JOIN, but with a simple listing of all user_id in IN ():
Such a query to a table of type MyISAM will work in 3,730 ms, and to InnoDB in 8,290 ms. The keen eye may notice that this method is faster than JOIN to MEMORY, although not by much. This method is suitable in cases where you have a very fast connection to MySQL (for example, a UNIX socket). In other cases, IMHO, with so many records, it is obvious that driving a huge amount of traffic between the MySQL server and the application server is not the best idea.
conclusionsYou can do it yourself: if you need a lot of joining, and for some reason you need high performance, use the combination MyISAM + InnoDB, or just InnoDB + InnoDB, if there are more than one joins. Well, if you are extreme, you can use MyISAM + MEMORY to get maximum performance in any scenario. There is another option with MEMORY + MEMORY, but for a large number of records I would not do that;).
UPD: I would like to thank the homm habrayuzer for very useful comments, for example this one . In general, I highly recommend reading the comments, they explain a lot of things that for some reason were not obvious to readers:
- the request cache is disabled
- JOIN is done on the primary key
- indexes on the Calls table are not created, because we do not set the task to optimize any particular query
Let's examine this question in more detail and see if JOIN is really bad, and when to think about it at all.
What I won’t write about
To begin with, I would like to say right away that I will not do:
- MySQL tuning: all settings are taken by default (including innodb_buffer_pool_size = 8 Mb and more)
- integration with programming languages: all requests will be made through the MySQL Sequel Pro client , and time will be measured on the basis of his testimony
- obvious things, like joins when fetching 3 lines: the question is whether to save on matches or not, I don’t want to consider - we will consider saving tens of times, not tens of percent
Initial conditions
We will have two plates that are simple to disgrace (the tables were made just for example and filled with random data):
Calls - 10 million lines:
id | INT PRIMARY KEY AUTO_INCREMENT |
user_id | INT |
cost | INT |
call_dt | Datetime |
tariff_id | INT |
Users - 100 thousand lines:
id | INT PRIMARY KEY AUTO_INCREMENT |
birthdate | DATE |
name | VARCHAR (10) |
sex | ENUM ('M', 'F') |
The names, I think, speak for themselves, indexes are only on the primary key (id). Purely, in principle, we could create some kind of indexes that would help us fulfill the queries that we will investigate, but our goal is different, namely, to investigate how fast JOIN works.
Table types
For research purposes, the Calls table was taken in two types - MyISAM and InnoDB, and the Users table , to which we make JOIN, of three types - MyISAM, InnoDB and MEMORY
First blood
All tests were carried out on my laptop, with MySQL version 5.5.9 on Mac OS X with default settings from MAMP. All tables fit into memory completely, queries were run several times to make sure everything got into the cache.
To begin with, let's just look at the speed of viewing rows in MyISAM and InnoDB by executing such a query (I recall that there are no indexes either by price or by user_id - we measure the FULL SCAN speed in MySQL):
SELECT MAX(cost) FROM calls WHERE user_id = 5000; -- запрос №1
And such a request:
SELECT MAX(cost) FROM calls; -- запрос №2
Results (error less than 5%):
No. | InnoDB, ms | Myisam ms |
---|---|---|
1 | 5 360 | 862 |
2 | 5,390 | 1,150 |
I don’t want to study in more detail why aggregate selections in MyISAM are so affected by the number of rows that fall under WHERE, but the fact remains that when I look at the table completely, MyISAM is 4.5 times faster than InnoDB. Hence the opinion that InnoDB is a “brake”, and that MySQL itself (with MyISAM) is very smart.
Mini-conclusion: at full sequential viewing of MyISAM is 5 times faster than InnoDB
Join
Let's now connect the Users table to the case - not for nothing that we created it.
We will investigate requests of such a plan:
SELECT MAX(calls.cost) FROM calls
JOIN users ON calls.user_id = users.id
WHERE calls.cost > %d AND users.sex = 'M'
The parameter for cost is selected so that a certain percentage of entries in the Calls table fall under the selection.
If we make the query above to the muscle with the cost parameter that will correspond to N% of the rows, then MySQL will make a join in only N% of cases , and for the rest (100-N)% of the lines, he will not do JOIN. Here is such a clever MySQL.
Let's get down to testing with JOIN:
FULL SCAN + JOIN 0.1% of lines
Users \ Calls | InnoDB, ms (JOIN, ms only) | MyISAM, ms (JOIN, ms only) |
---|---|---|
Innodb | 5,450 (~ 0) | 857 (~ 0) |
Myisam | 5,450 (~ 0) | 937 (~ 100) |
Memory | 5 350 (~ 0) | 845 (~ 100) |
So far, times differ very little from FULL SCAN. It is understandable - after all, JOIN is made for a scanty number of lines.
FULL SCAN + JOIN 1% of lines
Users \ Calls | InnoDB, ms (JOIN, ms only) | MyISAM, ms (JOIN, ms only) |
---|---|---|
Innodb | 5,660 (300) | 999 (140) |
Myisam | 6 530 (1 200) | 1 810 (950) |
Memory | 5,460 (100) | 911 (65) |
Funny huh? Only 1% of the rows are joined, and the results for MyISAM + MyISAM are 2 times more than for MyISAM + InnoDB. It's pretty funny that the JOIN to InnoDB in this case is faster than the JOIN to MyISAM. And we have not started testing it yet :)!
FULL SCAN + JOIN 10% of lines
Users \ Calls | InnoDB, ms | Myisam ms |
---|---|---|
Innodb | 7,230 (1,900) | 2 190 (990) |
Myisam | 16 100 (8 800) | 10,200 (9,000) |
Memory | 6,080 (700) | 1,440 (580) |
It’s a shame for the Power (MyISAM), but what can I do ... It turns out that MyISAM is not so smart ... Or not? Let's look at the final test results.
FULL SCAN + JOIN 100% of lines
Users \ Calls | InnoDB, ms | Myisam ms |
---|---|---|
Innodb | 18,000 (14,650) | 12,500 (11,655) |
Myisam | 100,000 (96,650) | 91 600 (90 750) |
Memory | 10 500 (7 150) | 5,280 (4,435) |
Pay attention to the monstrous (!) Sample times with JOIN with MyISAM. But InnoDB was pleasantly surprised - due to its architecture, JOIN is not too expensive operation for InnoDB. To be honest, I was very surprised when I got such a result that the second fastest JOIN option is when InnoDB is connected to MyISAM.
Well, with MEMORY, I think, everything is clear - MEMORY gives an overhead of 525% (4,435 ms) per PK connection, InnoDB gives an overhead of 1,380% (11,655 ms), it’s a shame to talk about MyISAM.
Replacing JOIN with IN (...)
The keen eye could notice that for our scenario (when we do JOIN to users to weed out all women from the call table) there is a way without JOIN, but with a simple listing of all user_id in IN ():
SET group_concat_max_len = 10000000;
SELECT GROUP_CONCAT(id) FROM users WHERE sex = 'M'; -- запрос отработает за 50 мс, выдав много-много килобайт текста
SELECT MAX(cost) FROM calls WHERE user_id IN(%s); -- %s из предыдущего запроса
Such a query to a table of type MyISAM will work in 3,730 ms, and to InnoDB in 8,290 ms. The keen eye may notice that this method is faster than JOIN to MEMORY, although not by much. This method is suitable in cases where you have a very fast connection to MySQL (for example, a UNIX socket). In other cases, IMHO, with so many records, it is obvious that driving a huge amount of traffic between the MySQL server and the application server is not the best idea.
conclusionsYou can do it yourself: if you need a lot of joining, and for some reason you need high performance, use the combination MyISAM + InnoDB, or just InnoDB + InnoDB, if there are more than one joins. Well, if you are extreme, you can use MyISAM + MEMORY to get maximum performance in any scenario. There is another option with MEMORY + MEMORY, but for a large number of records I would not do that;).
UPD: I would like to thank the homm habrayuzer for very useful comments, for example this one . In general, I highly recommend reading the comments, they explain a lot of things that for some reason were not obvious to readers:
- the request cache is disabled
- JOIN is done on the primary key
- indexes on the Calls table are not created, because we do not set the task to optimize any particular query