PostgreSQL evangelist memo: criticizing MySQL competently
Hello, Habr! This post is an attempt to dispel some popular myths and legends about MySQL. I was not mistaken with the hub, because the reason for writing was the publication of varanio PostgreSQL features that are not in MySQL, and vice versa from here. The publication itself, in terms of criticism of MySQL, although imperfect, is quite correct, but the comments on it suggest sad thoughts.
Generally speaking, I was going to write a post about MySQL features that are not implemented or are worse implemented in PostgreSQL. But in order not to interfere with a lot of topics in one publication, and considering the rather hard work compared to what I know very well (MySQL) and what I know very poorly (PostgreSQL), I decided to postpone this publication until later To begin with, immediately answer many of the comments from the varanio publication .
Why do I need it? Well, firstly, on the Internet again, someone is wrong . That the problem is not big in itself, but unfortunately I have to meet many of these cool stories not only on the Internet, but also hear at various Russian conferences, including at elite conferences from elitespeakers. It is especially strange to hear not quite correct statements about MySQL from representatives of the Postgres Professional company, to which I am very sympathetic and wish her all kinds of success and prosperity. Therefore, as part of a cultural exchange and to increase the technical level of the discussion about the advantages and disadvantages of PostgreSQL and MySQL, for a start I would like to not only analyze typical misconceptions, but also propose the correct wording and also outline really serious (in my opinion) problems in MySQL on this moment.
So, the school of the young caretaker. Lesson one: "We criticize MySQL competently." Tink!
How not to criticize MySQL?
Let's start with the simplest myths: near-technical.
"MySQL is legacy"
I’ll just give a short and far from complete list of well-known companies where MySQL plays key roles in infrastructure and business: Github, Wikipedia, Google, Facebook, Twitter, LinkedIn, Alibaba, Taobao, Booking.com, AirBnB, Dropbox, Pinterest, GroupOn , Yelp.
I know that there are large and well-known projects on PostgreSQL. But, firstly, few can compare in scale with these projects. And secondly, this list is simply useful to remember, so as not to lose touch with reality when someone gives out something thoughtful in the style of "MySQL is not needed."
“MySQL has no community!”
The criteria for assessing the extent and activity of the community can be argued for a long time, but the criteria at http://db-engines.com/en/ranking seem pretty reasonable to me .
On the other hand, in Russia, the popularity of PostgreSQL and the activity of the community are indeed slightly higher than the “world average” trends. I don’t quite understand what caused this, but I don’t see anything wrong with that.
“MySQL has a community, but it's fragmented!”
In general, I never could understand what this is about. It sounds like Oracle MySQL users need to be completely retrained to work with MariaDB. Or the DBA working with MariaDB knows absolutely nothing about Percona Server. In reality, 99% of the skills, books, articles, utilities, tips obtained somewhere, etc. can be safely used on any version of MySQL.
"MySQL has a lot of forks and they have a mess"
For the first time I heard this strange statement several years ago at a Russian conference. Since then, I often hear this from people who are clearly poorly aware of what they are talking about. Including in the mentioned post varanio . I could never get clear answers to two simple questions:
- What is the "confusion"?
- How many MySQL forks can you count?
It was not possible to get clear answers this time either, but I no longer really hoped. In the development of free and open source software, “forks” are encountered all the time. Moreover, it is usually considered that it is good, for that it is free and open. In addition, MySQL has even fewer forks and brunches than many other well-known projects. Less than PostgreSQL itself , for that matter.
By the way, this argument is strikingly similar to Microsoft's FUD . This is how Microsoft touted its edge over Linux at the beginning of this century:
"MySQL belongs to Oracle (and PostgreSQL to nobody)"
This, of course, is true, but what important conclusions can be drawn from this, I do not quite understand, and the flight of thought of critics usually stops there.
Purely theoretically, Oracle has the right to minimize the project / close the source / make it paid. How real it is - everyone can decide for himself. For me, this is absolutely unrealistic - this could be feared at the time of the purchase of Sun Microsystems by Oracle, but 5 years have passed (Karl!) And during this time MySQL developed faster than ever before the purchase. In addition, there are no fools sitting in Oracle, and they are well aware that closing MySQL will hit Oracle itself most of all.
“Subjectively, PostgreSQL has fewer bugs”
Maybe it is, but how would you compare objectively in the absence of a tracker bug in PostgreSQL?
On this with non-technical myths we end and move on to more substantial ones.
"PostgreSQL is many times faster than MySQL"
For many years I have been optimizing MySQL code and, as a result, often do load tests, though mostly between different versions and versions of MySQL. Here is what I can say about such statements: I am absolutely sure that there are loads / queries / configurations where PostgreSQL will work many times faster than MySQL, just as I am sure of the opposite - there are loads / queries / configurations in which MySQL will be several times faster than PostgreSQL. In the case of MySQL, there is another important criterion - the used engine (storage engine), which can also change the situation not only at times, but also by orders of magnitude.
Every time someone complains about performance (it doesn't matter, MySQL or PostgreSQL), they ask him to show the schema / queries / explain / configuration. And this is no coincidence - all talk about performance can be conducted only in this context. General statements are not quoted.
“Replication in PostgreSQL is smart. But in MySQL there is no "
MySQL implements logical replication (regardless of statement-based or row-based formats). In PostgreSQL, replication is physical (not “binary,” but physical).
Each approach has its pros and cons. You don’t even need to be a specialist in one or another DBMS in order to understand this. MySQL users have gained great experience in circumventing or smoothing out the negative effects of logical replication, but most likely someday physical will appear, because one and the other are needed. I suspect that a similar process is going on in PostgreSQL, but in the opposite direction.
This topic is very extensive and raises so many questions that I am thinking about a separate post.
“MySQL is loosely working with data”
Options: "MySQL quietly divides by zero!"
A technically correct, but boring version of this statement is: “In MySQL versions <5.7, you must remember to enable the correct SQL modes ”
In 5.7, the “strictness” settings are pretty reasonable by default. The reason why this has not been done for so long is banal: as for any popular project, the main competitor of MySQL is its own old versions. We have to pull backward compatibility with old crookedly written applications. In 5.7 they decided to end this, which is good.
“MySQL has nontransactional engines like MyISAM”
Yes, but it’s not very clear what the criticism is. Talking about MyISAM in 2015 is only possible in terms of supporting legacy applications. Honestly, the last time I met a live MyISAM user five years ago.
Although there are nontransactional engines, from which no transaction is required. For example CSV . To make it easier for a PostgreSQL user to understand, the CSV engine is something like file_fdw . It is more likely to exchange data with other applications, and not for actually storing data.
“In MySQL, transactions are somehow screwed to the side ”
By this, everyone understands something of their own. Someone thinks so, because you cannot wrap DDL in a transaction. Which, of course, is true, but the problem does not grow from “transactions from the side”, but from DDL curves (more on this later in the section “How to criticize MySQL”).
A boring, technically correct option: “MySQL has non-transactional DDLs.”
Someone says that an error inside a transaction does not automatically roll back. Yes, there are different errors (for example, lock wait timeout), and for them the application has the ability to repeat the last statement, and not roll back the entire transaction. As far as I know, Oracle and SQL Server behave the same by default - this is not a violation of any standards. In PostgreSQL, to be able to repeat the last statement in case of an error, you would have to wrap each statement inside the transaction in SAVEPOINT. Which, for example, led to the emergence of ON_ERROR_ROLLBACK for psql. Those. As usual, the approaches are different, with their pros and cons.
“MySQL has very expensive DDLs”
This refers to re-creating the table for certain operations, for example, when deleting constraint. Already in 5.6, you can do ALTER without re-creating the table for almost all operations . In particular, constraint can be removed without re-creating. In 5.7, the list of in-place operations is even wider.
I don’t know how PostgreSQL deals with this (a quick search shows that there are certain problems). But for MySQL there is a utility pt-online-schema-change , which allows you to bypass many of the limitations of ALTER TABLE in MySQL and which you often don’t know or forget about.
“MySQL has some kind of wrong MVCC”
InnoDB’s MVCC engine is well implemented. From the user's point of view, it is almost identical to the MVCC implementation in Oracle, and is very similar to the implementation in PostgreSQL. But differences in the behavior of many people are surprised that it gives rise to similar myths. The fact is that the standard very indistinctly defines many subtle moments at different levels of isolation. As a result, each DBMS interprets these “gaps” in its own way. A good comparison of MVCC in Oracle, PostgreSQL, and in MySQL / InnoDB can be found here . But even there, not all the nuances are reflected.
Over the past 10 years, a tremendous amount of effort has been put into optimizing and scalable MVCC at InnoDB. I also attached to this.
Updated 02.04.2017: a more formal attempt to understand the differences in the implementation of MVCC in different DBMSs, including PostgreSQL and MySQL.
“Because of the plug-in engines, MySQL writes data to disk 2, 3, 4, and 5 times in wartime”
Fantastically naive statement. The logic is simple:
- write to data files
- write to transaction log
- binary logging
- and there’s some kind of doublewrite buffer, which, judging by the name, writes two more times!
Here we need to talk about the "write amplification" feature. I could not find a good term in Russian, but in essence it is the ratio of the total amount of data written to disk to the total amount of data transmitted by the client. Let's see what it consists of.
Firstly, writing to data files occurs only when resetting the updated pages to disk. How long a page can be updated in memory before it is flushed to disk (and accordingly, how much write amplification will change) depends on a large number of parameters: the size of the shared buffer (buffer pool in InnoDB), the size of the transaction log, the algorithm that controls page flushing server settings and of course the type of load. This all applies to both MySQL and PostgreSQL.
Secondly, the transaction log contains not only updates to the records in the table, but also all physical changes in the data files. Using InnoDB as an example, this is managing index trees (splitting pages when filling, combining when deleting, rebuilding a tree, etc.), deleting old versions of records (purge operation), operations on change buffer and other internal accounting. This also applies to both MySQL and PostgreSQL.
Thirdly, the binary log contains only logicalchanges in the data: he does not “know” anything about the format of data files and all their internal accounting. Write amplification for a binary log depends on many parameters (statement / row-based, binlog_row_image and other settings). In addition, the binary log can be disabled - it is needed often, but not always.
Well, finally, doublewrite buffer is also used only when a page is flushed to disk, and not every INSERT / DELETE / UPDATE. An absolutely similar mechanism, only in profile, is called “full page write” in PostgreSQL. And there and there it can be turned off under certain conditions.
It is also necessary to take into account the redundancy of data formats (i.e. overhead costs for service information), page size (you can specify it in InnoDB when creating the database), compression, and much more.
I hope from all this it is clear that it is impossible to calculate the write amplification characteristic - there are too many parameters. It can be measured for a specific load, a specific configuration, and (in the case of MySQL) a specific engine. For example, in the case of recording-optimized engines like TokuDB or MyRocks, this characteristic will be much lower than in InnoDB, because they were created for this.
All that can be said is that binary log (when enabled) leads to additional write overhead. How much the costs are, and whether the total write amplification will be greater than the same characteristics in PostgreSQL, can not be said even without approximately measuring specific loads and configurations.
Such a boring truth.
“MySQL has too many logs”
I will not paint all types of magazines; their purpose is more or less clear to everyone. The questions are mostly raised by the binary log. Binary log is a "fee" for the ability to have plug-in engines. “Connected” not in the sense that they can connect dynamically as plugins (there wasn’t such an opportunity before), but in the sense that there can be several of them at all, and the server works with different engines via the API.
Since the physical representation of the data on the disk can vary depending on the engines used, and a single serialized representation of all the changes is necessary for replication, such a representation is written in a logical form abstracted from the physical data. As I already wrote, a logical log does not necessarily duplicate entries in physical logs when updating data.
With a logical log, as with logical replication, you can do all sorts of interesting things. In particular, it is actively used in Galera technology. I am going to give a brief overview of all this economy in the next post.
“MySQL has a bad console client”
As I understand it, this means the absence of context-sensitive autocompletion. I agree, it can be convenient. But in order to feel this, you need to spend a lot of time in the console client. As a developer, I spend more time in the editor. It seems to me that most of the work of DBA is also performed by scripts, and not manually in the console client.
But for those who really need it, the mycli project has recently appeared with smart auto-completion and even syntax highlighting. And for lovers of beautiful GUIs there is MySQL Workbench, where in my opinion there is everything at all . Never used one or the other.
Updated 05.10.2017: In the beta version of MySQL Shell, Oracle’s new command-line client, has added support for auto-completion for SQL.
In any case, it would be more correct to say “there is no contextual autocompletion in the MySQL console client”, because all the criticism seems to come down to this.
How could you criticize MySQL, but no longer relevant?
Separately, I would like to highlight statements that are generally true, but lose relevance in the light of the imminent release of MySQL 5.7.
"There is no JSON support in MySQL"
Limited support was previously in svetasmirnova's JSON UDF . There is native support in 5.7. How comparable is it to PostgreSQL, I can’t say, but in any case, the statement in this form is already outdated.
“There are no functional indexes in MySQL”
In 5.7, functional indexes are implemented as indexable virtual / generated columns. They have been in MariaDB for a long time, but indexing is only possible for materialized virtual columns.
How to criticize MySQL?
We turn to probably the most interesting part. There really are real and serious problems, but it’s very rarely said about them in holivors. In fact, judging by blogs and reports at conferences, MySQL users are more interested in not the absence of, say, window functions, but the issues of horizontal scaling, sharding, clustering, high availability, cloud platforms and related issues of automation, monitoring, information protection and other things. As a result, it is in these areas that the efforts of developers are concentrated.
But, since we are talking here about criticism from the PostgreSQL community, but as a rule it does not touch on all these issues, here I will list the fair, in my opinion, critical comments from those heard or read earlier, but formulated technically correctly.
Lack of transactional data dictionary
In MySQL, the “data dictionary” is a collection of nontransactional files (.frm, .par, etc.). This is a difficult legacy from the most ancient times and it creates a huge number of problems.
This is not only the absence of transactional DDLs, for which a data dictionary is a prerequisite. These are problems with expensive queries to INFORMATION_SCHEMA, problems with extensibility of the metadata format, problems with physical backups that are forced to block the server in essence only to ensure the consistency of transactional data with a non-transactional dictionary, and others.
Work on all of this has begun , but we will definitely not see the results in 5.7.
For Oracle, nothing but InnoDB exists
Unfortunately, Oracle is not particularly worried about the fate of third-party engines for obvious reasons. I do not think that the concept of “plug-in” engines will be phased out in the near future, but they are clearly not going to take care of the support of other engines.
As a result, many features are only implemented in InnoDB. Examples: foreign keys, full text search, spatial indexes, virtual columns. It is unlikely that we will see all these functions soon in TokuDB or MyRocks. And if they are implemented, they will not be compatible in functionality with that of InnoDB.
Incomplete query optimizer
The optimizer has always been one of the weaknesses of MySQL. Although I suspect that if you start detailed discussions, then most critics will begin to talk about problems fixed 5-10 years ago. There are useful improvements in each new version, but still progress is slower than many would like. There are historical reasons for this, although the point here is not so much in the concept of plug-in engines as in the very imperfect architecture of the optimizer. It requires serious revision and refactoring, and work in Oracle is going on, but such things cannot be done quickly.
Weak support for advanced SQL
This problem is partly related to the previous one. But yes, there are no window functions, CTE and much, much more. How important this is for MySQL users is an open question. As I already wrote, you can’t say for blogs and reports at conferences that SQL functionality is really lacking. Judging by the changes in MySQL releases, for Oracle this is also far from the highest priority.
Updated 05.10.2017: Support for CTE and window functions appeared in MySQL 8.0. Interestingly, according to the test results of the famous PostgreSQL evangelist Marcus Vinanda, MySQL currently has the most comprehensive CTE support among all popular DBMSs, including PostgreSQL.
Weak GIS Support
This is also true criticism, but in my opinion at Oracle this is taken seriously. In 5.7, GIS was rewritten from scratch, spatial indexes were added to InnoDB, Boost.Geometry was used instead of self-written code, and work began on compliance. I suspect that in future versions there will be a lot of interesting things.
Updated 05.10.2017: GIS support has been greatly expanded in MySQL 8.0. Of the key points - support for non-Cartesian SRS and an expanded set of functions for processing spatial data. Details can be found in this presentation . It would also be interesting to compare this with PostGIS. Judging by my communication with people versed in this topic, at some points PostGIS is already inferior in capabilities to the built-in GIS support in MySQL 8.0.
Conclusion
Surely I missed something in each section, but I am sure that they will supplement it in the comments . It is impossible to cover everything with a single publication, but if someone makes this publication make the criticism of MySQL at Habré and conferences thoughtful and skeptical, I pressed the buttons for a reason. To write separately about the logical / physical replication and functions of MySQL that are not implemented in PostgreSQL or are implemented worse, I would be interested, and I would probably risk it if, of course, they do not bother me completely.