Why we chose MongoDB

    This article was born after reading “Why You Should Never Use MongoDB”. Below is the story of how we gradually abandoned MySQL and came to use MongoDB as our primary data store.

    It all started back in 2008, when we decided to write a second version of our site. For some time we had wanted to create a multilingual database of games, game-industry companies, characters, and so on, since the existing solution seemed obsolete to us.

    First, the requirements were formulated. Basically, they are requirements for the database; the significant ones are:

    Requirement 1. Multilingual fields

    Each record can have one or more multilingual fields, containing both relatively short names and long descriptions. We considered several options:

    Option 1. Two tables for the entire database

    create table name (
      id          int          not null primary key,
      `table`     varchar(32)  not null,  -- `table` is a reserved word in MySQL, hence the backticks
      field       varchar(32)  not null,
      object_id   int          not null,
      name        varchar(255) not null,
      lang        varchar(2)   not null
    );

    create table description (
      id          int          not null primary key,
      `table`     varchar(32)  not null,
      field       varchar(32)  not null,
      object_id   int          not null,
      description text,
      lang        varchar(2)   not null
    );

    Accordingly, if a game (the game table) has a multilingual name name, an alternative name alt_name and a description desc, then in addition to the game itself we would get three more entries per language.

    Example entries in the name table:
    id | table | field    | object_id | name      | lang
    1  | game  | name     | $game_id  | $name     | en
    2  | game  | alt_name | $game_id  | $alt_name | en

    An example entry in the description table:
    id | table | field | object_id | description | lang
    1  | game  | desc  | $game_id  | $desc       | en

    The two tables could also be combined into one by using the text type to store names, but I did not like that solution; why, I no longer remember.

    Option 2. A separate multilingual table for each main table

    For the same game table, you get something like this:
    create table game_i18n (
      id          int          not null primary key,
      game_id     int          not null,
      name        varchar(255) not null,
      alt_name    varchar(255) not null,
      description text,
      lang        varchar(2)   not null
    );

    Example records:
    id | game_id  | name     | alt_name     | description | lang
    1  | $game_id | $name    | $alt_name    | $desc       | en
    2  | $game_id | $name_ru | $alt_name_ru | $desc_ru    | ru

    Option 3. Store the multilingual fields as a json array in a separate field of the main table

    create table game (
      id   int not null primary key,
      i18n text
    );

    Example record:
    id | i18n
    1  | {"name": [{"lang": "en", "value": $name}, {"lang": "ru", "value": $name_ru}], "alt_name": [...], "desc": [...]}

    The third option is the most flexible, but we abandoned it almost immediately: sorting and filtering by name were needed, and for that you would still have to build something similar to option 1 or 2. Besides, if you need to, say, delete the English name in several games at once, that is hard to do in plain SQL.
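To make the last point concrete, here is a sketch (in Python, purely as an illustration; the article's own code was SQL/Perl) of what "delete the English name in several games" turns into with option 3 — every affected row has to be loaded, decoded, edited, and written back by the application:

```python
import json

def delete_lang(i18n_json, field, lang):
    """Remove all values of `field` in language `lang` from a serialized i18n blob."""
    data = json.loads(i18n_json)
    data[field] = [v for v in data.get(field, []) if v["lang"] != lang]
    return json.dumps(data)

# one game row's i18n column
row = json.dumps({"name": [{"lang": "en", "value": "Pac-Man"},
                           {"lang": "ru", "value": "Пакман"}]})
row = delete_lang(row, "name", "en")
print(json.loads(row)["name"])  # only the Russian name remains
```

With options 1 or 2 the same operation is a one-line DELETE or UPDATE.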

    As a result, we settled on option 2. The argument against option 1 was that multilingual fields could have additional fields of their own. For example, in games you need the ability to mark one of the names as the primary one; other objects may have their own sets of additional fields which, quite possibly, will also need filtering and sorting. Having chosen option 1, I did not want to find myself in a couple of years with something like this:
    create table name (
      id           int          not null primary key,
      `table`      varchar(32)  not null,
      field        varchar(32)  not null,
      object_id    int          not null,
      name         varchar(255) not null,
      lang         varchar(2)   not null,
      field1       int,
      field2       varchar(32),
      field9       datetime
    );

    And then remember in the code what field3 means for the company table. Besides, it is somehow uncomfortable, when creating yet another table with five records, to dump its translations into a table with a million records. The latter problem, however, showed itself in all its glory here:

    Requirement 2. Links between objects

    We wanted to be able to link any object from any table to any other object, while preserving the direction of the link.
    The options are roughly the same as for the first requirement:

    Option 1. One table for all links in the database

    create table link (
      id          int         not null primary key,
      table1      varchar(32) not null,
      object1_id  int         not null,
      table2      varchar(32) not null,
      object2_id  int         not null
    );

    Example entries:
    id | table1  | object1_id | table2 | object2_id |
    1  | game    | $game_id   | genre  | $genre_id  |
    2  | game    | $game_id   | game   | $game2_id  |
    3  | game    | $game2_id  | game   | $game_id   |

    Records #2 and #3 implement a bidirectional link; record #1 is a unidirectional link from game to genre.
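The single link table can be modeled in a few lines to show how direction works: a query follows links only in their stored direction, which is why a bidirectional link needs two rows (illustrative Python, not code from the article):

```python
# In-memory model of the single `link` table from option 1.
links = [
    {"id": 1, "table1": "game", "object1_id": 10, "table2": "genre", "object2_id": 7},
    {"id": 2, "table1": "game", "object1_id": 10, "table2": "game",  "object2_id": 11},
    {"id": 3, "table1": "game", "object1_id": 11, "table2": "game",  "object2_id": 10},
]

def linked_from(table, object_id):
    """Follow links only in their stored direction (table1 -> table2)."""
    return [(l["table2"], l["object2_id"])
            for l in links
            if l["table1"] == table and l["object1_id"] == object_id]

print(linked_from("game", 10))   # [('genre', 7), ('game', 11)]
print(linked_from("game", 11))   # [('game', 10)] -- the reverse row makes it bidirectional
print(linked_from("genre", 7))   # [] -- link #1 is one-way, nothing points back
```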

    Option 2. A separate table for each link type

    For example, links between similar games would look like this:
    create table similar_games_link (
      id       int not null primary key,
      game1_id int not null,
      game2_id int not null
    );

    and so on for each link type.

    Option 3. Store the links in text form in the object itself

    create table game (
      id    int not null primary key,
      links text
    );

    id | links                            |
    1  | #game:$game2_id#genre:$genre_id# |

    Searches could then be done with something like:
    select id from game where links like '%#game:$game2_id#%'
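The # delimiters here are not decoration: without them, a substring search for game 12 would also match game 123. A short Python illustration of the difference (an aside of mine, not from the article):

```python
links_field = "#game:123#genre:7#"   # the stored links column for one game

# naive substring search: false positive, "game:12" is a prefix of "game:123"
print("game:12" in links_field)      # True  -- wrong match

# delimiter-wrapped search, the equivalent of LIKE '%#game:12#%'
print("#game:12#" in links_field)    # False -- correctly no match
print("#game:123#" in links_field)   # True  -- the real link is found
```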

    Option 4. The same as option 3, but storing json

    id | links                                                                                  |
    1  | [{"table": "game", "object_id": $game2_id}, {"table": "genre", "object_id": $genre_id}] |

    Options 5 and 6. A separate field for each link type

    Variations of options 3 and 4, but with the links spread across different fields.
    create table game (
      id      int not null primary key,
      similar text,
      genres  text
    );

    As a result, we decided to store everything in one table (it is convenient), and in some cases additionally duplicate the link information in fields of the object itself (options 3 through 6). I did not want to spawn a separate table for every link type, and option 3 could have helped with that. Of course, a condition like links like '%#game:$game2_id#%' is horrible, but I would have survived it. We rejected this option because deleting records turned into a non-trivial task. Options 4 and 6 on their own are generally useless, since they cannot be queried at all.

    Requirement 3. Objects with a different set of fields in one table

    For example, we want to keep news, articles, and videos in one table because, first, they need to be displayed on the site in a common chronological feed, and second, these post types have a lot in common (name, creation/modification dates, text). But there are also differences between them; let's call those metadata. For example, a review (one of the subtypes of articles) can carry ratings that it would be nice to sort by; a video has the resolution of the original file, its duration, whether gameplay is shown, and so on. The way a post is displayed on the site also depends on its type.

    The options for storing the metadata are the same as above. A single table holds the fields common to all record types, and then there are a few choices. You can store all the metadata directly in the post text (or in a separate text field) with special tags, as Wikipedia does, and scatter it into auxiliary tables on save. You can create auxiliary tables for each post type up front and save the metadata there (we chose this option, especially since the admin panel had separate editing forms for the different post types anyway). Or you can store the metadata as json or in some other serialized form (with the same problems as before: it is hard to modify serialized data with SQL, plus sorting and filtering suffer).

    Requirement 4. Complex Objects

    A game can be released on different platforms, and on each platform it can have several editions. A platform release and an edition share a set of fields with the game itself. An example of such a field is the name: the name of an edition on a particular platform may differ from the canonical name of the game. A platform release and an edition also have fields the game itself does not have; for the platform release it is the platform itself, for the edition, its release date. How do you store all this? As three separate tables? By analogy with how objects with different sets of fields were handled in requirement 3? Or store the game as a single record, with all platforms and editions as json in a separate field of that record? And how do you edit such a thing - with three different forms? Pac-Man alone has 27 platforms and more than 30 editions; editing a monster like that can turn into torture. And how do you display it? To show an edition you have to load its platform release and the game itself, because the edition may not have a name of its own; then you look at the name of the game on that platform, and if that is missing too, at the name of the game itself. At the same time, copying the same name into every edition in advance is not great either.

    At first I settled on almost the same option as in requirement 3 - one table for games. But since there were only three record types, we decided to store the differing fields in that same table and not spawn metadata tables.


    Having defined the requirements and preliminary solutions, we began developing the admin panel. And then (as always) the problems began. For example, a company may have a name and a description. We create a single company_i18n table with name and description fields. So far, so good. The HTML editing form looks like this:
    Name                         Description
    en: Bad Pixel [x]            ru: A long description in Russian [x]
    ru: ООО “Бед Пиксел” [x]     [add description]
    [add name]

    You specify a separate set of names in different languages and a separate set of descriptions, but that is not a problem: on save, names and descriptions are merged by language and one record per language is created in company_i18n.

    Entries in company_i18n after saving:
    id | lang | name             | description
    1  | en   | Bad Pixel        | NULL
    2  | ru   | ООО “Бед Пиксел” | A long description in Russian

    Then it turned out that there can be several descriptions in one language, while the name must be strictly one per language, and we arrive at something like:
    id | lang | name             | description
    1  | en   | Bad Pixel        | NULL
    2  | ru   | ООО “Бед Пиксел” | Description 1
    3  | ru   | NULL             | Description 2

    This no longer looks good, especially if you need to delete “Description 2” with SQL: you have to check whether the name field is filled, and if it is, update the record, and if not, delete it. Then a “primary” flag appears on the company name, a field appears for legal names (of which there may be several in the same language, for different periods of time), and it dawns on you that names and descriptions will apparently have to be stored in different tables.
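The branching delete logic that this schema forces can be sketched like so (illustrative Python over in-memory rows, not the project's actual code):

```python
# Rows of company_i18n as in the example above.
rows = [
    {"id": 2, "lang": "ru", "name": "ООО «Бед Пиксел»", "description": "Description 1"},
    {"id": 3, "lang": "ru", "name": None,               "description": "Description 2"},
]

def delete_description(rows, row_id):
    """Deleting a description must not take a name stored in the same row with it."""
    for row in list(rows):
        if row["id"] == row_id:
            if row["name"] is None:
                rows.remove(row)           # nothing else in the row: delete it outright
            else:
                row["description"] = None  # keep the row for the sake of the name

delete_description(rows, 3)  # row 3 held only a description: the whole row goes
delete_description(rows, 2)  # row 2 also holds a name: only the field is cleared
print(rows)
```

Every mutation of one logical value drags in a check on its neighbors in the row - exactly the smell that pushed names and descriptions into separate tables.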

    Links between objects almost immediately acquired a link strength. That was not a problem; the problems began when different link types needed different sets of additional fields. For example, genres can be linked to other genres as their subgenres, and that link must be bidirectional; among tags, some tags are characters from the universe of another tag; and in games the same company can be both the developer and the publisher. Although this can be solved by adding new fields to the link table, it is more correct to create separate auxiliary tables for the individual link types.

    Of course, I understand that life is pain and a developer in particular must not forget to suffer, so development continued slowly but surely, and auxiliary tables kept appearing with enviable regularity. Still, I wanted to somehow automate creating those auxiliary tables and assembling the complete object back out of them. With such thoughts, in early 2010 I came across the article “How FriendFeed Uses MySQL to Store Schema-less Data”.

    MySQL and schema-less data

    The idea of building NoSQL on top of MySQL does not seem so crazy even now. At the time, MySQL had been in development for years and was a reliable production solution, while specialized NoSQL solutions were only beginning to appear, and I did not want to bet on something that in a couple of years might leave me alone with an unsupported product. I think those who, like me at one point, bet on prototype.js will understand.

    At that time we did not even consider MongoDB, for various reasons; in particular, it was not yet recommended for production (the first production-ready release came at the end of the first quarter of 2010). I am still a supporter of this approach: use relatively well-established solutions and bring in self-made analogues as little as possible. But back then there were no established solutions (or so it seemed), and for one of my side projects I wrote something similar to what FriendFeed had. Do not get me wrong: I am not proud of it. The idea of building your own can be tempting exactly until you have to support it, fix bugs, optimize, grow the functionality, and adapt to new versions of the language, libraries, and services. The one thing I regret is not downloading and trying out Mongo back then: a matter of half an hour, an hour at most, and benefits for years. Actually, that applies to all new technologies: they appear for a reason, and knowing the current trends simply broadens your horizons.

    So, a library was written for working with schema-less data and storing it all in MySQL.

    A brief description of what came out of it.
    Objects were stored in serialized form in a blob field of the entity table. An additional category field (an analogue of tables in MySQL and collections in MongoDB) was introduced to keep objects such as games, companies, etc. apart (we almost immediately rejected the variant where, for example, all messages are stored inside a topic object: messages get their own category, topics get theirs), plus two fields for the object's creation and update times. Since everything was written in Perl, the Storable library was used to serialize the data structures (usually combinations of hashes and arrays) between the internal representation and a binary form. This was done, first, for speed: converting the data to json was an order of magnitude slower. Second, for the more compact representation compared to json.

    As the id, a uuid was used: its 16 bytes were Base64-encoded into a 22-byte text string.
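For reference, encoding 16 UUID bytes with Base64 and dropping the two padding characters yields exactly a 22-character string. A Python sketch of the scheme (not the original Perl; the article does not say which Base64 alphabet was used, the URL-safe variant here is an assumption):

```python
import base64
import uuid

def short_uuid():
    raw = uuid.uuid4().bytes  # 16 random bytes
    # 16 bytes -> 24 Base64 chars incl. "==" padding; strip it -> 22 chars
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

key = short_uuid()
print(key, len(key))  # e.g. 'mJ1d0R3kQkCkq0T0m3T5lQ' 22
```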

    It was possible to hang “triggers” on any change to an object: functions that could modify both the object itself and other objects related to it. For example, when a comment is written, a trigger can update the total comment count in the topic object.
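The trigger mechanism can be pictured roughly like this (a Python toy model with hypothetical names, not the library's actual API): a callback registered per category runs on every save and may touch related objects.

```python
# Toy object store with per-category save "triggers" (all names hypothetical).
store = {}      # id -> object
triggers = {}   # category -> [callback]

def on_save(category, fn):
    triggers.setdefault(category, []).append(fn)

def put(obj):
    store[obj["id"]] = obj
    for fn in triggers.get(obj["category"], []):
        fn(obj)  # triggers may modify the saved object or related ones

# Keep a comment counter on the parent topic up to date.
def bump_comment_count(comment):
    topic = store[comment["topic_id"]]
    topic["comment_count"] = topic.get("comment_count", 0) + 1

on_save("comment", bump_comment_count)
put({"id": "t1", "category": "topic"})
put({"id": "c1", "category": "comment", "topic_id": "t1"})
print(store["t1"]["comment_count"])  # 1
```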

    Since no queries could be made against the entity table itself (other than by primary key and category), indexes were introduced: regular MySQL tables built from the fields of the stored objects.
    CREATE TABLE IF NOT EXISTS `index_platform` (
      `generation` int(10) unsigned NOT NULL,
      `path_id`    varchar(255)     NOT NULL,
      `entity_id`  binary(22)       NOT NULL,
      KEY `generation` (`generation`)
    );

    The following was written in the config:
        'table'      => 'index_platform',
        'properties' => ['generation', 'path_id'],
        'shard_on'   => 'entity_id',
        'categories' => ['platform'],

    Then, when saving an object:
    my $uuid = $db->put({
        'generation' => 0,
        'path_id'    => 'pc',
        'name'       => [
            {'lang' => 'en', 'value' => 'PC'},
        ],
    });

    An entry was created automatically in index_platform:
    generation | path_id | entity_id
    0          | pc      | $uuid
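The automatic index entry is essentially a projection of the configured properties out of the saved object. A Python sketch of the idea (the original was Perl; this just mirrors the config shown above):

```python
# Index config, mirroring the Perl config above.
index_config = {
    "table":      "index_platform",
    "properties": ["generation", "path_id"],
    "shard_on":   "entity_id",
    "categories": ["platform"],
}

def index_row(config, entity_id, obj):
    """Project the configured properties of an object into an index-table row."""
    row = {prop: obj[prop] for prop in config["properties"]}
    row["entity_id"] = entity_id
    return row

obj = {"generation": 0, "path_id": "pc",
       "name": [{"lang": "en", "value": "PC"}]}
print(index_row(index_config, "some-uuid", obj))
# {'generation': 0, 'path_id': 'pc', 'entity_id': 'some-uuid'}
```

Nested fields like name simply do not make it into this index; they would need an index table of their own, such as the index_name mentioned below.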

    Selections could now be made against these indexes (the argument to the query call):
        path_id    => {'in' => ['pc', 'xbox']},
        generation => {'in' => [0, 1]},
        order_by   => 'generation desc',
        limit      => 10,
        page       => 2,
        join       => 1, # with this, any field containing an entity-table primary key was expanded into the object it references

    An alternative form of the same query:
        path_id    => {'in' => ['pc', 'xbox']},
        generation => {'in' => [0, 1]},
    })->order_by('generation desc')->limit(10)->page(2)->join(1)->all();

    There were only two kinds of interaction with the database: modifying the objects themselves in the entity table (including deletion), and querying the indexes as in the example above. JOINs existed only in application code.

    When an object was deleted, it was only marked as deleted rather than physically removed from the entity table.

    Around the middle of 2010, we tried to switch to this method of data storage.

    Objects could now be saved in this form:
    $company = {
        'name'        => [
            {'lang' => 'ru', 'value' => 'ООО “Бед Пиксел”', 'is_primary' => 1},
            {'lang' => 'en', 'value' => 'Bad Pixel'},
        ],
        'description' => [
            {'lang' => 'ru', 'value' => 'A long description in Russian'},
        ],
        'link'        => [
            {'category' => 'tags', 'id' => $tag_uuid,  'degree' => 3},
            {'category' => 'game', 'id' => $game_uuid, 'role'   => 'developer'},
        ],
    };

    Entries in the index_name and index_link tables were created automatically from the name and link fields. Objects could be of any complexity and nesting, with different sets of fields even for objects of the same category. Index tables still had to be created by hand, but things became much simpler. If a field was missing, it was enough to change the code; if selections by that field were needed, an additional index table was created or an existing one was modified. If an index went wrong, you could simply drop it and build a new one. In the future I wanted index tables to be created automatically from a description of their structure in the code.

    This storage, along with its pluses, had significant drawbacks.
    Among the advantages:
    • transactions (though for me this was not so important)
    • the existence of a patch for MySQL that allowed up to 750 thousand reads per second by primary key
    • scalability: the architecture did not prevent sharding the data by primary key, and nothing changed for the application working with the storage
    • since a uuid was used as the primary key, any number of such databases could be merged seamlessly, with no primary-key collisions
    • online generation of index tables

    The main drawbacks:
    • you have to develop and maintain the storage code yourself
    • no support for other programming languages
    • no atomic data operations: to do $company->{'link'}[0]{'degree'}++, you have to load the object into the application, change it, and save it back to the database
    • it is impossible to modify several records with a single query, only via a map function in an external application
    • transactions, scaling, and index-table generation still had to be implemented
    • no console

    During operation, various bugs of the storage itself surfaced; the most unpleasant was that the serialization algorithm of the Storable library depended on the operating system. This was solved by switching to storing objects as gzip-compressed json. Incidentally, it was while fixing this bug that I clearly realized it does not matter how the objects themselves are stored. It can be a table in a database, or you can simply save them as json files named after the primary key and scattered across subfolders (though that deprives the application of scalability, creates race conditions, and so on; on the other hand you could try Hadoop, but frankly that would be overkill). What matters is the ability to build indexes over the objects, as is done, for example, in the Sphinx search engine. Why not use MySQL in much the same way as Sphinx? Why not store the data as a key-value store, and, for searching, sorting, and showing the user various lists, build the appropriate indexes in services suited to the task? Of course, if you are building billing, this approach is, to put it mildly, not very justified; but web applications are mostly far less demanding of full ACID, yet with them you suffer almost the same way as with billing.
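The json-plus-gzip format that replaced Storable round-trips cleanly and does not depend on the operating system. A minimal Python equivalent of that storage format:

```python
import gzip
import json

obj = {"name": [{"lang": "en", "value": "PC"}], "generation": 0}

# what would go into the blob column
blob = gzip.compress(json.dumps(obj).encode("utf-8"))
# what the application reads back
restored = json.loads(gzip.decompress(blob))

print(restored == obj)  # True
```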

    Gradually, however, the minuses of the self-written storage began to outweigh; as before, a separate form had to be drawn in the admin panel for every object type - in short, there was none of the universality I wanted. On top of that, 2012 brought many political events at AG, such as a change of the site's owner, leadership, and management, and it was decided to write the second version in Python, since the programmers at the new company wrote in that language. There were two options: either turn the current storage into a standalone service, or use some existing key-value storage.


    How would you feel if you learned one day that someone had created an analogue of your library (or vice versa, you had unknowingly created an analogue of an existing one), and that analogue kept the significant advantages of your solution while lacking its drawbacks? Personally, I was glad. A magnificent console with full JavaScript support, atomic operations, sharding, automatic creation of indexes on selected fields, libraries for the major programming languages... By then, Python frameworks that supported MongoDB, or were written specifically for it, already existed. And none of it had to be supported or developed by us. On top of that, its API was similar to the API of our storage.

    As a result, starting development from scratch in Python in 2013 (by then as Riot Pixels), having chosen the right tools (one of which was MongoDB), we did more in a quarter than we had done in the previous two years. Another decision that seems right to me was the choice of an admin panel that could edit objects of any nesting; thanks to it, almost no time was spent on admin development.

    I would like to finish with this. It is probably wrong to pit MongoDB against MySQL: they are different tools for different tasks, and it simply happened that MongoDB suited this project of ours better. If it ever turns out that MongoDB's speed or functionality is not enough, nothing prevents you from using MySQL as a caching layer / index over the data: configure it once and forget it.
