Some interesting MySQL features
In the not very distant past, I had to dig a bit into the source code of MySQL, and understand some aspects of its work. In the course of work with the spatula, and the experimenters, I came across several very interesting features, some of which are just funny, and in the case of some, it is very interesting to understand what the programmer was guided by, who decided to do just that.
Let's start with an interesting type like ENUM.
So, we have a table, it has two columns. The first, a , type ENUM, the second, b , INT. There are three rows in the table, all three have b equal to 1. I wonder what the minimum and maximum elements in column a are equal to .
It seems strange, it would be reasonable if the smallest was 'a' and the largest was 'c'.
But what if you choose the minimum and maximum only among those lines where b = 1? That is, among all the lines?
This is how we got MySQL to change its mind on how to compare fields in ENUM by simply adding a predicate.
The solution to this behavior is that in the first case, MySQL uses an index, and in the second, no. This, of course, does not explain why MySQL compares ENUMs differently for sorting in the index, and in the usual comparison.
The second example is simpler and more concise:
When I showed this query to my colleague who is developing the SQL parser, his question was not “why does this query return two strings”, but “how to write an SQL parser so that such a query is valid without writing a rule specifically allowing such a request. ”
Interestingly, not every SELECT in brackets will work, in particular, UNION in brackets is a syntax error:
Some more interesting examples under the cut
In general, with UNION and LIMIT you do not need to go far beyond the example of strange behavior:
Suddenly, only one row returned, although both tables are not empty. Because the second LIMIT belongs to the whole query, not just the right-hand side of UNION.
Here it is necessary to talk about such a thing as shift-reduce conflict. In modern open source databases, the parser is often written in bison. Such a parser is the so-called L1 parser, which means that the parser must understand the purpose of the next token by looking no further than one token forward. For example, in the query above, looking at the word LIMIT, the parser cannot understand whether this LIMIT belongs to the second query, or to the entire UNION. When the rules are written so that situations are possible in which it is impossible to understand the purpose of the token by looking only at the next token, this is called shift-reduce conflict. In this case, the parser will choose a solution based on a specific set of rules. This is very bad, because it leads to the fact that quite normal requests lead to errors. What,
This cannot be done, due to the shift-reduce conflict. Looking at the first LIMIT, the parser does not yet know that there will be a second ahead, and mistakenly believes that the first limit applies to the entire request.
In PostgreSQL, there are no shift-reduce conflicts at all in the parser. Specifically, this situation is resolved there due to the fact that LIMIT can only be with UNION, but not with SELECTs that it combines.
There are more than 160 such conflicts in MySQL. This is amazing, because it means that there are 160 places where the parser may not correctly understand what is expected of it.
Хороший пример такого конфликта — это соединения. Как известно, в MySQL поддерживаются CROSS JOINs, у которых нет предиката, и INNER JOINs, у которых предикат есть. Вообще говоря, CROSS JOIN и INNER JOIN — это разные вещи, но в MySQL это синонимы. То есть у INNER JOIN может не быть предиката, а у CROSS JOIN он может быть. В частности, это приводит к интересной ошибке:
В момент, когда парсер видит первое ON, он еще не знает, что впереди его ждет второе, и сталкивается с выбором: либо это ON для hru и baa, либо hru и baa соединяются без предиката, а текущий ON — это ON для moo и результата соединения hru и baa. Парсер ошибочно выбирает второе, что приводит к совершенно не нужной в этой ситуации ошибке. Если INNER JOIN заменить на LEFT JOIN, для которого варианта без предиката не сущестует, то запрос выполнится:
The most interesting thing here is that in Bison you have to specify the number of shift-reduce conflicts directly in the code, otherwise the code will not compile. That is, at some point in time, one of the programmers in MySQL made CROSS JOIN and INNER JOIN synonymous, which in itself does not make sense, after which he tried to collect the code, and he did not get together with a compilation error warning that the parser is now not will be able to parse certain requests. What the programmer, instead of doing everything right, found a constant indicating the number of errors in the parser, and increased it.
Although if we talk about some interesting decisions sometimes programmers in MySQL make, it’s best to recall this story:
http://bugs.mysql.com/bug.php?id=27877
In it, one of the programmers deliberately made the letter 's' equal to the character 'ß' in collation by default for utf8. This is very ironic, because the only language in which it could even make sense remotely is German, but it is this change that makes this collation completely inapplicable to the German language, because now lines that are completely different from each other become are equal.
This change was not only useless, it also made the process of switching from 5.0 to 5.1 for databases with utf8 rows in German very painful, because unique indexes suddenly began to contain duplicate elements.
Speaking of collations, I still really like this example:
Suppose we have a table with three rows with different collations:
We execute the following request:
MySQL reasonably complains that swedish and spanish cannot be compared because it is not clear how to compare them.
Let's write a completely identical query:
Suddenly, the query became valid, although it should still compare the swedish and spanish string. And if I want the other way around?
On the contrary, it is impossible.
If you delve into the code, you can understand that in MySQL BETWEEN is implemented in a very strange way: if the first or second parameter has binary collation, then all rows will be compared as binary, and collation will be ignored. But if binary collation has a third argument, then the same logic does not apply.
Speaking about how strange functions work in MySQL, we conclude this article with the most beautiful example.
No surprises here.
This is also reasonable, line 11 is less than 9. And what happens if 11 is added to 11?
Of course, 18. It turns out that the function returns a different value depending on the context! Is it possible to make the same LEAST return three different values depending on the context? Turns out yes
Although it must be said here that in one case we met a warning. But we still managed to force the same operator with the same arguments to return three different values.
To make an even more amazing discovery, you need to get acquainted with the NULLIF function. This function takes two arguments, and returns NULL if they are equal, or the value of the first argument if they are not equal. Putting aside the question of why such a function exists at all, let's look at the result of the following two queries:
In the first case, we got NULL, which means that LEAST is really equal to the string "11". In the second case, in the same query, with the same types of arguments, but with a different constant in NULLIF, we got the value 9! That is, with absolutely identical types of parameters in the first case, LEAST returned "11", and in the second - 9.
But you can do even better:
In this query, LEAST returned something other than string "9" (otherwise NULLIF would return NULL), but it returned string "9" at the same time!
If you look at the code, then this is really what happens. LEAST is executed twice, comparing the parameters for the first time as strings, and the second time as integers.
Let's start with an interesting type like ENUM.
mysql> CREATE TABLE enums(a ENUM('c', 'a', 'b'), b INT, KEY(a));
Query OK, 0 rows affected (0.36 sec)
mysql> INSERT INTO enums VALUES('a', 1), ('b', 1), ('c', 1);
Query OK, 3 rows affected (0.05 sec)
Records: 3 Duplicates: 0 Warnings: 0
So, we have a table, it has two columns. The first, a , type ENUM, the second, b , INT. There are three rows in the table, all three have b equal to 1. I wonder what the minimum and maximum elements in column a are equal to .
mysql> SELECT MIN(a), MAX(a) FROM enums;
+--------+--------+
| MIN(a) | MAX(a) |
+--------+--------+
| c | b |
+--------+--------+
1 row in set (0.00 sec)
It seems strange, it would be reasonable if the smallest was 'a' and the largest was 'c'.
But what if you choose the minimum and maximum only among those lines where b = 1? That is, among all the lines?
mysql> SELECT MIN(a), MAX(a) FROM enums WHERE b = 1;
+--------+--------+
| MIN(a) | MAX(a) |
+--------+--------+
| a | c |
+--------+--------+
1 row in set (0.00 sec)
This is how we got MySQL to change its mind on how to compare fields in ENUM by simply adding a predicate.
The solution to this behavior is that in the first case, MySQL uses an index, and in the second, no. This, of course, does not explain why MySQL compares ENUMs differently for sorting in the index, and in the usual comparison.
The second example is simpler and more concise:
mysql> (SELECT * FROM moo LIMIT 1) LIMIT 2;
+------+
| a |
+------+
| 1 |
| 2 |
+------+
2 rows in set (0.00 sec)
When I showed this query to my colleague who is developing the SQL parser, his question was not “why does this query return two strings”, but “how to write an SQL parser so that such a query is valid without writing a rule specifically allowing such a request. ”
Interestingly, not every SELECT in brackets will work, in particular, UNION in brackets is a syntax error:
mysql> (SELECT * FROM moo UNION ALL SELECT * FROM hru) LIMIT 2;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'UNION ALL SELECT * FROM hru) LIMIT 2' at line 1
Some more interesting examples under the cut
In general, with UNION and LIMIT you do not need to go far beyond the example of strange behavior:
mysql>
-> SELECT 1 FROM moo LIMIT 1
-> UNION ALL
-> SELECT 1 FROM hru LIMIT 1;
+---+
| 1 |
+---+
| 1 |
+---+
1 row in set (0.00 sec)
Suddenly, only one row returned, although both tables are not empty. Because the second LIMIT belongs to the whole query, not just the right-hand side of UNION.
Here it is necessary to talk about such a thing as shift-reduce conflict. In modern open source databases, the parser is often written in bison. Such a parser is the so-called L1 parser, which means that the parser must understand the purpose of the next token by looking no further than one token forward. For example, in the query above, looking at the word LIMIT, the parser cannot understand whether this LIMIT belongs to the second query, or to the entire UNION. When the rules are written so that situations are possible in which it is impossible to understand the purpose of the token by looking only at the next token, this is called shift-reduce conflict. In this case, the parser will choose a solution based on a specific set of rules. This is very bad, because it leads to the fact that quite normal requests lead to errors. What,
mysql> SELECT 1 FROM moo
-> UNION ALL
-> SELECT 1 FROM hru LIMIT 1
-> LIMIT 2;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'LIMIT 2' at line 4
This cannot be done, due to the shift-reduce conflict. Looking at the first LIMIT, the parser does not yet know that there will be a second ahead, and mistakenly believes that the first limit applies to the entire request.
In PostgreSQL, there are no shift-reduce conflicts at all in the parser. Specifically, this situation is resolved there due to the fact that LIMIT can only be with UNION, but not with SELECTs that it combines.
There are more than 160 such conflicts in MySQL. This is amazing, because it means that there are 160 places where the parser may not correctly understand what is expected of it.
Хороший пример такого конфликта — это соединения. Как известно, в MySQL поддерживаются CROSS JOINs, у которых нет предиката, и INNER JOINs, у которых предикат есть. Вообще говоря, CROSS JOIN и INNER JOIN — это разные вещи, но в MySQL это синонимы. То есть у INNER JOIN может не быть предиката, а у CROSS JOIN он может быть. В частности, это приводит к интересной ошибке:
mysql> SELECT * FROM
-> moo
-> INNER JOIN
-> hru
-> INNER JOIN
-> baa
-> ON hru.a = baa.a
-> ON moo.a = hru.a
-> ;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'ON moo.a = hru.a' at line 8
В момент, когда парсер видит первое ON, он еще не знает, что впереди его ждет второе, и сталкивается с выбором: либо это ON для hru и baa, либо hru и baa соединяются без предиката, а текущий ON — это ON для moo и результата соединения hru и baa. Парсер ошибочно выбирает второе, что приводит к совершенно не нужной в этой ситуации ошибке. Если INNER JOIN заменить на LEFT JOIN, для которого варианта без предиката не сущестует, то запрос выполнится:
mysql> SELECT * FROM
-> moo
-> LEFT JOIN
-> hru
-> LEFT JOIN
-> baa
-> ON hru.a = baa.a
-> ON moo.a = hru.a
-> ;
+------+------+------+------+
| a | a | b | a |
+------+------+------+------+
| 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 |
+------+------+------+------+
2 rows in set (0.00 sec)
The most interesting thing here is that in Bison you have to specify the number of shift-reduce conflicts directly in the code, otherwise the code will not compile. That is, at some point in time, one of the programmers in MySQL made CROSS JOIN and INNER JOIN synonymous, which in itself does not make sense, after which he tried to collect the code, and he did not get together with a compilation error warning that the parser is now not will be able to parse certain requests. What the programmer, instead of doing everything right, found a constant indicating the number of errors in the parser, and increased it.
Although if we talk about some interesting decisions sometimes programmers in MySQL make, it’s best to recall this story:
http://bugs.mysql.com/bug.php?id=27877
In it, one of the programmers deliberately made the letter 's' equal to the character 'ß' in collation by default for utf8. This is very ironic, because the only language in which it could even make sense remotely is German, but it is this change that makes this collation completely inapplicable to the German language, because now lines that are completely different from each other become are equal.
This change was not only useless, it also made the process of switching from 5.0 to 5.1 for databases with utf8 rows in German very painful, because unique indexes suddenly began to contain duplicate elements.
Speaking of collations, I still really like this example:
Suppose we have a table with three rows with different collations:
CREATE TABLE strings(
swedish VARCHAR(100) COLLATE utf8_swedish_ci,
spanish VARCHAR(100) COLLATE utf8_spanish_ci,
bin VARCHAR(100) COLLATE utf8_bin
);
We execute the following request:
mysql> SELECT * FROM strings WHERE swedish > bin AND swedish < spanish;
ERROR 1267 (HY000): Illegal mix of collations (utf8_swedish_ci,IMPLICIT) and (utf8_spanish_ci,IMPLICIT) for operation '<'
MySQL reasonably complains that swedish and spanish cannot be compared because it is not clear how to compare them.
Let's write a completely identical query:
mysql> SELECT * FROM strings WHERE swedish BETWEEN bin AND spanish;
Empty set (0.00 sec)
Suddenly, the query became valid, although it should still compare the swedish and spanish string. And if I want the other way around?
mysql> SELECT * FROM strings WHERE swedish BETWEEN spanish AND bin;
ERROR 1270 (HY000): Illegal mix of collations (utf8_swedish_ci,IMPLICIT), (utf8_spanish_ci,IMPLICIT), (utf8_bin,IMPLICIT) for operation 'between'
On the contrary, it is impossible.
If you delve into the code, you can understand that in MySQL BETWEEN is implemented in a very strange way: if the first or second parameter has binary collation, then all rows will be compared as binary, and collation will be ignored. But if binary collation has a third argument, then the same logic does not apply.
Speaking about how strange functions work in MySQL, we conclude this article with the most beautiful example.
mysql> SELECT LEAST(9, 11);
+--------------+
| LEAST(9, 11) |
+--------------+
| 9 |
+--------------+
1 row in set (0.00 sec)
No surprises here.
mysql> SELECT LEAST("9", "11");
+------------------+
| LEAST("9", "11") |
+------------------+
| 11 |
+------------------+
1 row in set (0.00 sec)
This is also reasonable, line 11 is less than 9. And what happens if 11 is added to 11?
mysql> SELECT LEAST("9", "11") + LEAST("9", "11");
+-------------------------------------+
| LEAST("9", "11") + LEAST("9", "11") |
+-------------------------------------+
| 18 |
+-------------------------------------+
1 row in set (0.00 sec)
Of course, 18. It turns out that the function returns a different value depending on the context! Is it possible to make the same LEAST return three different values depending on the context? Turns out yes
mysql> SELECT LEAST("9e1", "110");
+---------------------+
| LEAST("9e1", "110") |
+---------------------+
| 110 |
+---------------------+
1 row in set (0.00 sec)
mysql> SELECT LEAST("9e1", "110") + 0;
+-------------------------+
| LEAST("9e1", "110") + 0 |
+-------------------------+
| 90 |
+-------------------------+
1 row in set (0.00 sec)
mysql> SELECT LEAST("9e1", "110") & -1;
+--------------------------+
| LEAST("9e1", "110") & -1 |
+--------------------------+
| 9 |
+--------------------------+
1 row in set, 1 warning (0.00 sec)
mysql> SHOW WARNINGS;
+---------+------+------------------------------------------+
| Level | Code | Message |
+---------+------+------------------------------------------+
| Warning | 1292 | Truncated incorrect INTEGER value: '9e1' |
+---------+------+------------------------------------------+
1 row in set (0.00 sec)
Although it must be said here that in one case we met a warning. But we still managed to force the same operator with the same arguments to return three different values.
To make an even more amazing discovery, you need to get acquainted with the NULLIF function. This function takes two arguments, and returns NULL if they are equal, or the value of the first argument if they are not equal. Putting aside the question of why such a function exists at all, let's look at the result of the following two queries:
mysql> SELECT NULLIF(LEAST("9", "11"), "11") + 0;
+------------------------------------+
| NULLIF(LEAST("9", "11"), "11") + 0 |
+------------------------------------+
| NULL |
+------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT NULLIF(LEAST("9", "11"), "12") + 0;
+------------------------------------+
| NULLIF(LEAST("9", "11"), "12") + 0 |
+------------------------------------+
| 9 |
+------------------------------------+
1 row in set (0.00 sec)
In the first case, we got NULL, which means that LEAST is really equal to the string "11". In the second case, in the same query, with the same types of arguments, but with a different constant in NULLIF, we got the value 9! That is, with absolutely identical types of parameters in the first case, LEAST returned "11", and in the second - 9.
But you can do even better:
mysql> SELECT NULLIF(LEAST("9", "11"), "9") + 0;
+-----------------------------------+
| NULLIF(LEAST("9", "11"), "9") + 0 |
+-----------------------------------+
| 9 |
+-----------------------------------+
1 row in set (0.00 sec)
In this query, LEAST returned something other than string "9" (otherwise NULLIF would return NULL), but it returned string "9" at the same time!
If you look at the code, then this is really what happens. LEAST is executed twice, comparing the parameters for the first time as strings, and the second time as integers.