
Why you should never say never
This publication of mine is a more than complete answer to the translated article "Why you should never use MongoDB". That article, which essentially recommends staying away from MongoDB, is the highest-rated one in the hub, and that sounds like a verdict. So the logical options are either to close the hub and never read it again, or to write an even more emphatic rebuttal. Naturally, I chose the second option, risking my rating and karma (given how much of a flame war the comments are bound to become).


[Image: a picture of self-irony]
Database selection
Citing such authoritative sources as "some argue" and "others say", the author justifies her choice of a database. Perhaps I am asking too much, but it would be great to back these "sources" with links to official recommendations from the developers of the databases in question.
"But what are the alternatives? Some argue that graph databases are best suited, but I will not consider them, since they are too niche for mass-market projects. Others say document databases are ideal for social data, and they are mainstream enough for real-world use. Let's look at why people think that MongoDB, rather than PostgreSQL, is much better for social data."

The assertion that graph databases are too niche (???) for mass-market projects (???) working with social graphs is very hard to justify, even with the most authoritative source. From my own experience I can say that graph databases, as their name suggests, work perfectly well with graphs (including social graphs). Of course, you should not use a graph database as your primary data store; for that purpose, tools such as MySQL, MongoDB, Postgres and the like remain indispensable. In the same way, when we need, say, full-text search, we use Sphinx alongside the primary data store, rather than making a painful choice between it, Redis and MariaDB.
Data modeling
You can read about how to model data correctly in MongoDB in the official guide. It is usually advisable to do this before you sign up for a project and authoritatively propose architectural solutions and the tools to implement them. Yes, even if the document is 40 pages long, you still need to read it. Otherwise you can easily sink a project through an inept choice, or inept use, of good tools, and then write long articles with funny pictures about how awkward it is to eat compote with a fork.
"We could also model this data as a set of nested objects (sets of key-value pairs). All the information about a particular show is one big structure of nested key-value sets. Inside the show there are many seasons, each of which is also an object (a set of key-value pairs). Within each season is an array of episodes, each of which is an object, and so on. This is how data is modeled in MongoDB. Each show is a document that contains all the information about that one show."

If Sarah had read the manual on data modeling in MongoDB, she would know that this database supports a bit more than one data model. But in fact I have one very simple question: has the author ever held a calculator in her hands? I won't even ask about Google and Wikipedia. Why do I ask? Let's google the longest-running TV show. That masterpiece, called Guiding Light, comprises 18,262 episodes of variety and philosophy. Against this background, Santa Barbara with its 2,137 episodes is a white dwarf in the starry sky of cinematic art. But that is all lyrical digression. Now let me be a bore again and insist that before you authoritatively choose a database, you need to familiarize yourself not only with its recommended data models, but also with its limitations. One of those limitations is the maximum document size, which cannot exceed 16 MB.
Recall the model: a show has a name and an array of seasons; each season is an object with metadata and an array of episodes; in turn, each episode has metadata plus arrays of reviews and actors. Now we pick up the calculator and perform the magic calculation I found in the DBA architect's spellbook:
16 MB / 18,262 episodes = 876 // bytes per episode
Eight hundred seventy-six bytes to store each episode. That is not enough even for a brief recap of what happened in the previous episode! Yet we also need to cram into this very document a list of actors, reviews (!!!) and God knows what else. No, seriously, did the author think about this for even a minute? In principle, this one miscalculation is already enough to gauge the level of the article I am criticizing.
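To make the limit tangible, here is a minimal sketch for the classic mongo shell (the shows collection and its fields are hypothetical) that checks how close an all-in-one show document gets to the ceiling:

// Object.bsonsize() reports the BSON size of a document in bytes (classic mongo shell)
var show = db.shows.findOne({ title: "Guiding Light" })
var limit = 16 * 1024 * 1024  // the hard 16 MB document size limit
print(Object.bsonsize(show) + " of " + limit + " bytes used")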
With the reader's permission, I will now criticize a couple of the author's specific statements.
"After we started making ugly joins manually in the Diaspora code, we realized that this was only the first sign of trouble. It was a signal that our data was actually relational, that there was value in that connection structure, and that we were going against the basic idea of document DBMSs."

Frankly, it is a mystery to me how the author manages, time after time, to follow the worst of all possible scenarios. She tries to fit all the data into one document without bothering to check the elementary limits. She takes "best practice" from the murky Diaspora, infamous only for the suicide of one of its co-founders, and now ugly joins as well. Why not simply read what the official documentation has to say about joins? After all, the very first sentence there states in black and white that joins are not supported, period. The attempt to bolt on home-grown joins conjures up a vivid image of dear colleagues from the banks of the great Ganges, looking at me over Skype with their kind eyes and explaining that they implemented part of the shopping-cart functionality directly in the framework core, because it was faster and easier. Holy Shiva! But that is no reason to blame the framework, is it?
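For contrast, a minimal sketch of the reference-based model the official guide describes (all collection and field names here are illustrative): related entities live in their own collections and point to each other by id, no joins required:

// Each entity in its own collection; documents reference each other by id
db.actors.insertOne({ _id: "actor1", name: "Sisi" })
db.episodes.insertOne({
    title: "Episode 42",
    show: "Guiding Light",
    airDate: new Date("1952-06-30"),
    actorIds: ["actor1"]   // references instead of embedded actor documents
})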
At the same time, the absence of joins does not deprive the developer of the ability to work with several documents. If the documents are stored in one collection, you can build very capable and expressive queries with the Aggregation Framework. You can, for example, easily obtain even such an exotic metric as the average number of films in which actors named Sisi starred:
db.actors.aggregate( [
    // assumes each document is one film credit: { actorId, movieId, name, ... }
    { $match : { name: "Sisi" } },                                     // only actors named Sisi
    { $group : { _id : "$actorId", count: { $sum : 1 } } },            // films per actor
    { $group : { _id : null, avgMovieCount : { $avg : "$count" } } }   // average across those actors
] )
If you need to pull in related data, that is done in a separate query. This approach is not inferior in efficiency to a complex query with joins, and it is what all the official MongoDB ORMs and ODMs use. The main advantage of separate queries is that related data is requested only when it is actually needed (lazy loading). If you need truly hardcore queries, including across different collections, MapReduce comes to the rescue. But personally, in my practice, an entire application would accumulate at most a handful of such queries.
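A minimal sketch of that two-query pattern in the mongo shell, reusing the illustrative collections from above: the related documents are fetched only when the caller actually needs them:

// Step 1: load the episode itself
var episode = db.episodes.findOne({ title: "Episode 42" })

// Step 2, executed lazily: pull in the related actors on demand
var actors = db.actors.find({ _id: { $in: episode.actorIds } }).toArray()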
"After three months of development, everything worked great with MongoDB. But one Monday at a planning meeting, the client said that one of the investors wanted a new feature. He wanted to be able to click on an actor's name and follow his career in television: a chronological list of all the episodes, across all the shows, in which this actor had starred."

This is hardly Web 3.0, yet Sarah seriously designed the system in such a way that the client's wish to click on an actor's name and see a list of shows plunged her into genuine shock. In fact, this is a perfectly logical reckoning for all the miscalculations made earlier: payback for the reluctance to read the official documentation and recommendations.
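Incidentally, with the reference-based model sketched above, the investor's feature costs a single query (field names are still illustrative):

// All episodes a given actor starred in, in chronological order;
// a multikey index keeps this cheap: db.episodes.createIndex({ actorIds: 1 })
db.episodes.find({ actorIds: "actor1" }).sort({ airDate: 1 })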
"Good. But what about social data?"

Unfortunately, the author keeps stepping on her favorite rake. Instead of studying the experience of people who actually build on this database, she improvises cases simplified to the point of absurdity, which naturally stop working in combat conditions. In fact, designing a social network news feed is a very broad topic, worthy of more than one article; I published just such an article quite recently.
"Right. When you get to a social network, there is only one important part of the page: your activity feed. The activity feed query fetches all the posts from your friends, sorted by date. Each post contains attachments such as photos, likes, reposts and comments."
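As a minimal illustration in the mongo shell (all names hypothetical), one common approach is to embed the lightweight attachments in the post document and fetch friends' posts with a single indexed query:

// Post with embedded attachments, read together with the post itself
db.posts.insertOne({
    authorId: "user42",
    createdAt: new Date(),
    text: "Hello",
    photos: [], likes: [], comments: []
})

// The feed: friends' posts, newest first
// (an index on { authorId: 1, createdAt: -1 } is assumed)
var friendIds = ["user42", "user43"]
db.posts.find({ authorId: { $in: friendIds } }).sort({ createdAt: -1 }).limit(20)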
From the comments
To complete the picture, I also decided to include in my article answers to a couple of the highest-rated comments containing constructive criticism of MongoDB.
Comment by cleg:
"...and then the banal task of 'deleting a show', in the absence of integrity support, turns into a multi-stage process:
- delete all related records from the episodes collection
- delete all related records from the reviews collection
- delete the record from the shows collection
and if there are many records and the process fails somewhere along the way, we are left with inconsistency."

Delete a show? How would that even happen? All its copies were burned and the actors were sent off to Alpha Centauri? What is the real-world case for deleting a show? Even if there is one, you do not need to delete anything: just mark the show as "deleted". This approach is used in every large system. Moreover, on some projects I took part in, the requirement was not only never to delete data physically, but also to record every change. If anything in a TV-show service may be deleted at all, it is perhaps comments on reviews; and even that option our beloved Habr does not offer, positioning its absence as a feature. This attitude to deletion is explained, first, by the value of the data and, second, by the fact that in MySQL itself, at large volumes, with sharding and replication, cascading deletes do not work as smoothly as one might expect.
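A minimal sketch of such a soft delete in the mongo shell (field names are illustrative): the document is flagged rather than removed, and normal reads filter the flag out:

// Mark the show as deleted instead of physically removing anything
db.shows.updateOne(
    { title: "Guiding Light" },
    { $set: { deleted: true, deletedAt: new Date() } }
)

// Normal reads simply exclude flagged documents
db.shows.find({ deleted: { $ne: true } })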
Comment by gandjustas:
"This is an illusory advantage. The program will still have a schema (types); in a live application you cannot just take and change the schema. There is an example right in the article: there were TV shows where the actors were embedded in the show documents. Then the schema had to be changed, pulling the actor entities out into separate documents. How did the lack of a schema help? Apparently, in no way."

I totally agree. The absence of a schema in the database does not remove the need to think through data storage models at the application design stage. This is indeed a complete misunderstanding of schemaless, and unfortunately quite a common one; Sarah's sad experience is yet another confirmation. The only thing the absence of a database schema gives you is freedom from duplicating that schema in your application code, in its data models. And personally, I like that a lot. Why, for example, state both in the database and in the user model's validation rules that the name is a string with a maximum length of 100 characters? MongoDB lets you describe data models in the application alone, without forcing you to move part of those rules into a database schema. And that is all schemaless gives you. MongoDB cannot magically paper over a developer's mistakes in data model design. Alas.
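A short sketch with Mongoose (one of the official ODMs mentioned above; the model itself is hypothetical): the 100-character rule lives only in application code, while the database stores plain schemaless documents:

// The schema exists only in the application; MongoDB stores plain documents
const mongoose = require('mongoose')

const userSchema = new mongoose.Schema({
    // declared once, in the model - not duplicated in any database schema
    name: { type: String, required: true, maxlength: 100 }
})
const User = mongoose.model('User', userSchema)

// Fails application-side validation; the database knows nothing about "100 characters"
new User({ name: 'x'.repeat(101) })
    .validate()
    .catch(err => console.log(err.errors.name.message))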
Conclusion
In conclusion, I want to emphasize once again that it makes no sense to blame the hammer for the bruised finger, or the fork for the spilled compote. Instead, it is best to use the right tools, and to use them correctly. And for that you need to read the documentation and study the official best practices. That is all from me. All the best, and sharding out of the box to everyone.