Goodbye, MongoDB, Hello, PostgreSQL

Original author: Yorick Peterse
  • Transfer
Our startup Olery was founded almost 5 years ago. We started with the only product, Olery Reputation, which was created by a Ruby development agency. All this has grown into a set of different products. Today we also have Olery Feedback, the API for Hotel Review Data, widgets for embedding on websites and much more.

In total, we have 25 applications (all in Ruby) - some of them on the web (Rails or Sinatra), but mostly these are background applications for data processing.

Although we have something to be proud of, we have one problem that all the time hung somewhere in the background - the database. Initially, we used MySQL for important data (users, contracts, etc.) and MongoDB to store reviews and other data that could easily be restored in case of loss. At first, everything worked well, but as we grew, we began to experience problems, especially with MongoDB. Some of them arose in the field of interaction between the database and applications, some - directly at the database itself.

For example, at some point we needed to delete a million documents from MongoDB, and then insert it. As a result, the work of the base stalled for several hours. Then we had to run repairDatabase. And the repair itself also took several hours.

Another time, we noticed the brakes and determined that the cause was the MongoDB cluster. But we could not figure out what exactly slows down in the database. No matter what tools we have tried to debug and collect statistics. We had to (here I do not know how to translate, tell me in the comments: "until we replaced the primaries of the cluster"), so that the performance will return to normal.

These are just two examples, and there were many. The problem was not only that the database was capricious, but also that we could not understand why this was happening.

The problem of the lack of circuit

Another fundamental problem we encountered is the main feature of MongoDB, namely, the lack of a circuit. In some cases, this provides benefits. But in many cases, this leads to the problem of implicit schemes. They are determined not by the data storage engine, but based on the behavior of applications and forecasts.

For example, you may have a set of pages in which the application expects to find a title field of type string. There is a scheme, although not explicitly defined. Problems begin if the data structure changes over time, and the old data is not transferred to the new structure (which is quite difficult to do in the case of circuit-free databases). Let's say you have a code like this:

post_slug = post.title.downcase.gsub(/\W+/, '-')

This will work for all documents that have a title field that returns String. If the documents have a field with a different name, or no string field at all, this will break. To handle such cases, you need to rewrite the code:

if post.title
  post_slug = post.title.downcase.gsub(/\W+/, '-')
  # ...

Another way is to set the schema in the database. For example, Mongoid, the popular MongoDB ODM for Ruby, allows this. But why specify a schema through such tools, if you can specify a schema in the database itself? This would be reasonable for reuse. If you have only one application working with a database, this is not scary. And if there are a dozen of them, then all this quickly turns into a mess.

Storing data without schemas is designed to make your life easier. You do not need to come up with schemes. In fact, in this case, you shift the responsibility for maintaining data connectivity. In some cases, this works, but I bet that in most cases it only causes more difficulties.

Good DB Requirements

And we come to the question of what a good DB should be. At Olery, we value the following:

- connectivity
- data and system behavior are visible from the outside
- correctness and ambiguity
- scalability

Connectivity helps the system ensure that what is expected of it is fulfilled. If the data is always stored in a certain way, then the system becomes simpler. If any field is necessary, then the application should not check for its presence. The BB should guarantee the completion of certain operations even under load. There is nothing more disappointing than inserting data and expect them to appear in the database within a few minutes.

External visibility describes both the system itself and how easy it is to extract data from it. If the system is buggy, it should just be debugging. Data requests should also be simple.

Correctness implies that the system lives up to expectations. If the field is defined as numeric, you should not be able to insert text there. MySQL is extremely weak in this regard. You can insert text into a number field and get some nonsense in the data.

Scalability is important not only for speed, but also from a financial point of view, and in terms of how the system responds to changing requirements. The system should work well without unreasonable financial costs, and not slow down the development cycle of systems that depend on it

Avoiding MongoDB

After thinking about all this, we went in search of a replacement for MongoDB. Since our queries clearly suited traditional relational databases, we looked at two candidates: MySQL and PostgreSQL.

MySQL was the first, in particular because we already used it in some cases. But he also has his own problems. For example, setting the field as int (11), you can paste the text there, and MySQL will try to convert it. Examples:

mysql> create table example ( `number` int(11) not null );
Query OK, 0 rows affected (0.08 sec)
mysql> insert into example (number) values (10);
Query OK, 1 row affected (0.08 sec)
mysql> insert into example (number) values ('wat');
Query OK, 1 row affected, 1 warning (0.10 sec)
mysql> insert into example (number) values ('what is this 10 nonsense');
Query OK, 1 row affected, 1 warning (0.14 sec)
mysql> insert into example (number) values ('10 a');
Query OK, 1 row affected, 1 warning (0.09 sec)
mysql> select * from example;
| number |
|     10 |
|      0 |
|      0 |
|     10 |
4 rows in set (0.00 sec)

Although MySQL generates warnings, warnings are often simply ignored.

The second problem is that any change in the table leads to its reading and writing. So, after each change operation, we have to expect its end. For large tables, this can take hours, which will slow down the entire project. Companies like SoundCloud have developed special tools like lhm because of this.

Therefore, we began to look closely at PostgreSQL. She has many advantages. For example, you cannot insert text in a numeric field:

olery_development=# create table example ( number int not null );
olery_development=# insert into example (number) values (10);
olery_development=# insert into example (number) values ('wat');
ERROR:  invalid input syntax for integer: "wat"
LINE 1: insert into example (number) values ('wat');
olery_development=# insert into example (number) values ('what is this 10 nonsense');
ERROR:  invalid input syntax for integer: "what is this 10 nonsense"
LINE 1: insert into example (number) values ('what is this 10 nonsen...
olery_development=# insert into example (number) values ('10 a');
ERROR:  invalid input syntax for integer: "10 a"
LINE 1: insert into example (number) values ('10 a');

PostgreSQL also has the ability to modify tables, which does not lead to their locking. For example, the operation of adding a column that does not have a default value and that can be filled with NULL does not lock the table.

There are other interesting features, namely: trigram-based index and search, full-text search, support for JSON requests, support for requests and storage of key-value pairs, pub / sub support, and much more.

And most importantly, PostgreSQL has a balance between speed, reliability, correctness and connectivity.

Transition to PostgreSQL

So, we decided to stop at PostgreSQL. The migration process from MongoDB was not an easy task. We divided it into three stages:

- preparation of the PostgreSQL database, migration of a small part of the data
- updating applications that work with MongoDB to work with PostgreSQL, including any refactoring
- production migration to a new database and placement on a new platform

Migrating a small piece of data

Although there are tools for migration, due to the nature of our data, we had to make such tools ourselves. These were one-time Ruby scripts, each of which was engaged in a separate task - transferring reviews, cleaning encodings, editing main keys, and more.

Application update

Most of the time was spent updating applications, especially those that were highly dependent on MongoDB. It took several weeks. The process was as follows:

- replacing the driver / code / model of MongoDB with the code for PostgreSQL
- running the tests
- fixing the tests
- repeat step 2

For applications that did not work on Rails, we stopped using Sequel. For Rails they took ActiveRecord. Sequel is a handy toolkit that supports almost all of PostgreSQL's special features. In the construction of queries, he outperforms ActiveRecord, although in places it turns out too much text.

For example, you need to calculate the number of users using a certain locale as a percentage of the total. In plain SQL, a query like this might look like this:

SELECT locale,
count(*) AS amount,
(count(*) / sum(count(*)) OVER ()) * 100.0 AS percentage
FROM users
GROUP BY locale
ORDER BY percentage DESC;

In our case, we get the following result:

 locale | amount |        percentage
 en     |   2779 | 85.193133047210300429000
 nl     |    386 | 11.833231146535867566000
 it     |     40 |  1.226241569589209074000
 de     |     25 |  0.766400980993255671000
 ru     |     17 |  0.521152667075413857000
        |      7 |  0.214592274678111588000
 fr     |      4 |  0.122624156958920907000
 ja     |      1 |  0.030656039239730227000
 ar-AE  |      1 |  0.030656039239730227000
 eng    |      1 |  0.030656039239730227000
 zh-CN  |      1 |  0.030656039239730227000
(11 rows)

Sequel allows you to write such a request in pure Ruby without string fragments (which are sometimes required for ActiveRecord):

star = Sequel.lit('*')
    .select_append { count(star).as(:amount) }
    .select_append { ((count(star) / sum(count(star)).over) * 100.0).as(:percentage) }

If you do not want to use Sequel.lit ('*'), you can do this:
    .select_append { count(users.*).as(:amount) }
    .select_append { ((count(users.*) / sum(count(users.*)).over) * 100.0).as(:percentage) }

The text is a bit much, but they can be easily redone in the future.

In the plans we want to transfer applications working with Rails to Sequel. But it’s not yet clear whether it’s worth the time spent.

Work data migration

There are two ways to do this:

- stop the entire project, migrate data, pick up the project
- migrate in parallel with a working project

The first option implies that the project will not work for some time. The second is complicated in execution. It is necessary to take into account all the new data that will be added during the migration, so as not to lose them.

Fortunately, everything is so cleverly set up in Olery that database operations occur at approximately equal intervals of time. Data that changes more often is easier to transfer because it is very small.

The plan is this:

- transfer critical data - users, contracts, etc.
- transfer less critical data (which can then be recounted or restored)
- check that everything works on a set of separate servers
- transfer production to new servers
- transfer all critical data that appeared from the first step The

second step took a day. The first and fifth - 45 minutes each.


Almost a month has passed since we transferred all the data, and everything suits us. Changes have occurred only for the better. Increased application performance. API review time has decreased.


We moved on January 21. The peak in the graph is the restart of the application, which led to a temporary increase in response time. After the 21st, the response time was almost halved.

Where we also noticed a serious increase in performance is in the application, which saved data from reviews and ratings.


Our data collectors have also accelerated.


The difference turned out not so strong, but collectors do not use the database so much.

And finally, the application that distributes the data collectors work schedule (“scheduler”):


Since it works at certain intervals, the graph is difficult to read, but in general there is a clear decrease in response time.

So, we are quite satisfied with the results of the move, and we are not going to miss MongoDB. The performance is excellent, the tools for working with the database are very convenient, querying the data has become much easier, in comparison with MongoDB. The only service that still uses it is Olery Feedback. He works on his own separate cluster. But we are also going to transfer it to PostgreSQL in the future.

Also popular now: