zabivator January 27, 2015 at 18:21

PostgreSQL vs MySQL

In anticipation of my presentation at the PGCONF.RUSSIA 2015 conference, I will share some observations about the important differences between MySQL and PostgreSQL. This material will be useful to all those who are no longer satisfied with the capabilities and features of MySQL, as well as those who are taking their first steps in Postgres. Of course, you should not consider this post as an exhaustive list of differences, but to make a decision in favor of one or another DBMS, it will be quite enough.

Replication

The topic of my talk is “Uncensored Asynchronous Replication, or Why PostgreSQL Will Conquer the World”, and replication is one of the most painful topics for busy projects using MySQL. There are many problems - correct operation, stability, performance - and at first glance they look unrelated. If we look in the historical context, we get an interesting conclusion: MySQL replication has so many problems because it was not thought out, and the point of no return was support for the storage engine (plug-in engines) without answers to the questions “what about the journal?” And “ how different storage engines participate in replication. " In 2004, in the PostgreSQL mailing list, the user tried to “find” the storage engine in the PostgreSQL source code and was very surprised that they were not there. During the discussion, someone suggested adding this feature to PostgreSQL,

The problem is that many storage management systems ... often do their own WAL and PITR. Some do their own buffer management, locking and replication / load management too. So, as you say, its hard say where an interface should be
abstracted.

link to this letter in postgresql mailing list

More than 10 years have passed, and what do we see? MySQL has annoying problems with transactions between tables of different storage engines, and MySQL has replication problems. Over the past ten years, PostgreSQL has got plug-in data types and indexes, and it also has replication - that is, the advantage of MySQL was leveled, while the architectural problems of MySQL remained and prevented from living. In MySQL 5.7, they tried to solve the replication performance problem by parallelizing it. Since a project at work is very sensitive to replication performance due to its size, I tried to test whether it got better. I found that concurrent replication in 5.7 works slower than single-threaded in 5.5, and only in some cases - about the same. If you are currently using MySQL 5.5 and want to upgrade to a more recent version, please note

After the report on highload, Oracle took note of the test I developed and reported that they would try to fix the problem; recently they even wrote to me that they were able to see parallelism on their tests, and sent the settings. If I am not mistaken, with 16 threads there was a slight acceleration compared to the single-threaded version. Unfortunately, I still have not repeated my tests on the provided settings, in particular because with such results our problems still remain relevant.

The exact reasons for this performance regression are unknown. There were several assumptions - for example, Christian Nelsen, one of the developers of MariaDB, wrote on his blogthat there may be problems with the performance scheme, with thread synchronization. Because of this, a regression of 40% is observed, which is visible in conventional tests. Oracle developers refute this, and they even convinced me that it is not there, apparently, I see some other problem (and how many of them all?).

In MySQL replication, storage engine problems are exacerbated by the selected level of replication — they are logical, while in PostgreSQL they are physical. In principle, logical replication has its own advantages, it allows you to do more than any interesting things, I will also mention this in the report. But PostgreSQL, even as part of its physical replication, is already nullifying all these benefits. In other words, almost everything that is in MySQL can already be done in PostgreSQL (or it will be possible in the near future).

You can not rely on the implementation of low-level physical replication in MySQL. The problem is that there instead of one log (as in PostgreSQL) there are two or four of them - depending on how to count. PostgreSQL simply commits queries, they get to the log, and this log is used in replication. PostgreSQL replication is superstable because it uses the same logbook as during crash recovery operations. This mechanism has long been written, well tested and optimized.

In MySQL, the situation is different. We have a separate InnoDB journal and replication journal, and we need to commit both there and there. This is a two-phase commit between logs, which by definition is slow. That is, we can’t just take it and say that we are repeating the transaction from the InnoDB log - we have to figure out what kind of request it is and run it again. Even if this is logical replication, at the level of lines, then these lines need to be searched in the index. And not only do you have to do a lot of work to complete the request - it will again be written to its InnoDB log on a replica, which is clearly not good for performance.

In PostgreSQL, in this sense, the architecture is much more thought out and better implemented. Recently, it announced an opportunity called Logical Decoding - which allows you to do all sorts of interesting things, which are very difficult to do in the framework of a physical journal. In PostgreSQL, this is an add-on from above, logical decoding allows you to work with the physical log as if it were logical. It is this functionality that will soon take away all the advantages of MySQL replication, except, perhaps, the size of the log - statement-based MySQL replication will win - but statement-based MySQL replication has absolutely wild problems in the most unexpected places, and you should not consider it a good solution (about that’s all I will also say in the report).

PostgreSQL also has trigger replication - this is Tungsten, which allows you to do the same. Trigger replication works as follows: triggers are set, they fill in tables or write files, the result is sent to the replica and applied there. It is through Tungsten, as far as I know, that they migrate from MySQL to PostgreSQL and vice versa. In MySQL, logical replication works right at the engine level, and you can’t do another one now.

Documentation

PostgreSQL has much better documentation. In MySQL, it formally seems to even exist, but the meaning of individual options can be difficult to understand. It seems that it’s written what they are doing, but in order to understand how to configure them correctly, you need to use informal documentation and look for articles on these topics. Often you need to understand the architecture of MySQL, without this understanding, the settings look like some kind of magic.

For example, the Percona company “fired” this way: they ran the MySQL Performance Blog, and there were many articles on this blog that addressed particular aspects of MySQL operation. This brought wild popularity, led clients to consulting, and allowed us to attract resources to start developing our own fork of Percona-Server. The existence and relevance of the MySQL Performance Blog proves that official documentation is simply not enough.

PostgreSQL has virtually all the answers in the documentation. On the other hand, I have heard a lot of criticism when comparing PostgreSQL documentation with the "adult" Oracle. But this is, in fact, a very important indicator. No one tries to compare MySQL with an adult Oracle at all - that would be ridiculous and ridiculous - and PostgreSQL is already starting to compare quite seriously, the PostgreSQL community hears this criticism and is working on improving the product. This suggests that, in its capabilities and performance, it begins to compete with such a powerful system as Oracle, which runs mobile operators and banks, while MySQL remains in the niche of websites. And giant projects that have grown to a large amount of data and users slurp the grief with MySQL with a big spoon, constantly resting on its limitations and architectural problems,

An example of such large projects on PostgreSQL is 1C: PostgreSQL comes as an option instead of Microsoft SQL, and Microsoft SQL is really a fantastic DBMS, one of the most powerful. PostgreSQL can replace MS SQL, and trying to replace it with MySQL ... let's lower the veil of pity over this scene, as Mark Twain wrote.

Standards

PostgreSQL complies with SQL-92, SQL-98, SQL-2003 standards (all its reasonable parts are implemented) and is already working on SQL-2011. It is very cool. By comparison, MySQL does not even support SQL-92. Someone will say that in MySQL this goal was simply not set by the developers. But you need to understand that the difference between versions of the standard is not in small changes - these are new functionalities. That is, at the moment when MySQL said: “We will not follow the standard”, they not only made some minor differences, which made MySQL difficult to maintain, they also closed the way to the implementation of many necessary and important features. There is still no normal optimizer. What is called optimization in PostgreSQL is called a “parser” plus normalization. In MySQL, this is just a query execution plan, without separation. And MySQL will come to support standards very soon, because they are under the pressure of backward compatibility. Yes, they want to, but in about five years, maybe something will come up with them. PostgreSQL has it all now.

Performance and administration complexity

In terms of ease of administration, the comparison is not in favor of PostgreSQL. MySQL is much easier to administer. And not because in this sense it is better thought out, but simply much less able to do. Accordingly, it is easier to configure it.

MySQL has a problem with complex queries. For example, MySQL does not know how to lower grouping into separate parts of union all. The difference between the two queries - in our example, grouping by separate tables and union all from above worked 15 times faster than union all and then grouping, although the optimizer should bring both queries into the same, effective plan for query execution. We will have to do the generation of such queries by hand - that is, spend the developers time on what the base should do.

The "simplicity" of MySQL stems, as can be seen above, from extremely poor capabilities - MySQL works just worse and requires more time and effort during development. In contrast, PostrgreSQL has histograms and a normal optimizer, and it will execute such queries efficiently. But if there are histograms, then there are their settings - at least the bucket size. You need to know about the settings and in some cases change them - therefore, you need to understand what kind of setting it is, what it is responsible for, be able to recognize such situations, see how to select the optimal parameters.

Occasionally, PostrgreSQL's skill can hinder rather than help. In 95% of cases, everything works well - better than MySQL - and some one stupid query is much slower. Or everything works well, and then suddenly (from the point of view of the user), as the project grew, some requests began to work poorly (there was more data, another plan for executing the request began to be selected). Most likely, to fix it, just run analyze or twist the settings a little. But you need to know what to do and how to do it. At a minimum, you need to read the PostgreSQL documentation on this topic, and for some reason they do not like to read the documentation. Maybe because MySQL has little help from it? :)

I emphasize that PostgreSQL in this sense is no worse, it just allows you to postpone problems, and MySQL immediately throws them out and you have to spend time and money to solve them. In this sense, MySQL always works stably badly, and even at the development stage, people take these features into account: they do everything in the simplest way possible. This applies only to productivity, more precisely, to methods of achieving it and to its predictability. In terms of correctness and convenience, PostgreSQL is a cut above MySQL.

So what to choose?

To decide between MySQL and PostgreSQL for a particular project, you first need to answer other questions.

First, what experience does the team have? If the whole team has 10 years of experience working with MySQL and you need to start as quickly as possible, then it is not a fact that it is worth changing a familiar tool to an unfamiliar one. But if the timing is not critical, then you should try PostgreSQL.

Secondly, we must not forget about the problems of operation. If you do not have a highly loaded project, then in terms of performance, there is no difference between the two DBMSs. But PostgreSQL has another important advantage: it is more strict, does more checks for you, gives less opportunity to make mistakes, and this is a huge advantage in the future. For example, in MySQL you have to write your own tools to verify the normal referential integrity of the database. And even this can be a problem. In this sense, PostgreSQL tool is more powerful, more flexible, it is more pleasant to develop on it. But this largely depends on the experience of the developer.

To summarize: if you have a simple online store, there is no money for the admin, there are no serious ambitions to grow into a big project and you have experience working with MySQL, then take MySQL. If you assume that the project will be popular, if it is large, it will be difficult to rewrite it, if it has complex logic and relations between tables - take PostgreSQL. Even out of the box, it will work for you, help in development, save time, and it will be easier for you to grow.

Tags: