MongoDB Survival Guide

    All good startups either die quickly or grow to the point where they need to scale. We will model such a startup: first it is all about features, then about performance. We will improve performance with MongoDB, a popular NoSQL storage solution. MongoDB is easy to get started with, and many problems have out-of-the-box solutions. However, when the load grows, you start stepping on rakes that nobody warned you about ... until today!


    The modeling is done by Sergey Zagursky, who is responsible for backend infrastructure in general, and MongoDB in particular, at Joom. Before that he worked on the server side of the MMORPG Skyforge. As Sergey describes himself, he is "a professional collector of bumps, earned with his own forehead and rake." Under the microscope is a project that uses an accumulation strategy to manage technical debt. In this text version of the HighLoad++ talk, we move in chronological order from the moment a problem appears to its solution with MongoDB.

    First difficulties


    We are modeling a startup that is learning the hard way. In the first stage of its life, our startup ships features and, unexpectedly, users arrive. Our tiny MongoDB server gets a load we never dreamed of. But we are in the cloud, we are a startup! We do the simplest things possible: we look at the queries - oh, here we read the entire collection for every user, so let's build indexes here, add hardware there, and put a cache over here.
    That's it - we live on!

    If problems can be solved by such simple means, that is how they should be solved.

    But the path ahead for a successful startup is a slow, painful postponement of the moment of horizontal scaling. I will try to give advice on how to survive this period, make it to scaling, and not step on the rakes.

    Slow writes


    This is one of the problems you may run into. What do you do if you encounter it and the methods above do not help? The answer lies in MongoDB's default durability guarantees. In short, they work like this:

    • We come to the primary replica and say: "Write!"
    • The primary replica writes the data.
    • The secondary replicas then read it from the primary and report: "We have written it!"

    Once the majority of the replicas have done this, the request is considered complete and control returns to the driver in the application. Such guarantees let us be sure that once control has returned to the application, the data will not go anywhere even if MongoDB goes down, barring absolutely terrible disasters.

    Fortunately, MongoDB is a database that lets you relax the durability guarantees for each individual request.

    For important requests, we can leave the maximum durability guarantees by default, and for some requests we can reduce them.

    Request Classes


    The first layer of guarantees we can drop is waiting for the write to be acknowledged by the majority of the replicas. This saves latency but does not add throughput. Sometimes latency is exactly what you need, especially if the cluster is somewhat overloaded and the secondary replicas are not as fast as we would like.

    {w:1, j:true}

    If we write with such guarantees, then at the moment control returns to the application we no longer know whether the record will survive some kind of failure. But usually it is still alive.

    The next guarantee, which affects both throughput and latency, is disabling journal acknowledgment. The journal entry is still written either way; the journal is one of MongoDB's fundamental mechanisms. If we turn off acknowledgment of writing to it, we skip two things: we do not fsync the journal and we do not wait for that fsync to finish. This can save a lot of disk resources and give a multiple increase in throughput simply by changing the durability guarantee.

    {w:1, j:false}

    The loosest durability guarantee is to disable acknowledgments altogether. We will only know that the request reached the primary replica. This saves latency and does not increase throughput in any way.

    {w:0, j:false} - disable any acknowledgments.

    We also will not receive various other notifications, for example that the write failed because of a conflict on a unique key.
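
    For illustration, here is roughly how a write concern can be set per operation in the mongo shell; the collection and field names are hypothetical, and drivers expose the same option:

    // A minimal sketch. Background batch write: acknowledged by the primary only,
    // no wait for the journal fsync.
    db.events.insertOne(
        { userId: 42, kind: "view", ts: new Date() },
        { writeConcern: { w: 1, j: false } }
    )

    // Important user-facing write: keep the strict majority guarantee.
    db.orders.insertOne(
        { userId: 42, total: 100 },
        { writeConcern: { w: "majority", j: true } }
    )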

    What operations does this apply to?


    Let me describe how this applies to the setup at Joom. Besides the load from users, where we make no durability concessions, there is a load that can be described as background batch load: updates, rating recalculation, collecting analytics data.

    These background operations can take hours, but they are designed so that if something breaks, for example a backend crashes, they do not lose all the work they have done and instead resume from a point in the recent past. Reducing the durability guarantees is useful for such tasks, especially since an fsync of the journal, like any other operation, also increases read latency.

    Read scaling


    The next problem is insufficient read throughput. Recall that our cluster has not only a primary replica but also secondaries, and you can read from them. So let's do that.
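
    As a rough sketch, a read can be pointed at secondaries in the mongo shell like this (the collection and filter are hypothetical; drivers expose the same readPreference setting):

    // Allow this query to run on a secondary replica.
    db.products.find({ category: "toys" }).readPref("secondaryPreferred")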

    You can read from them, but there are nuances. Data from secondary replicas arrives slightly stale, by 0.5–1 seconds. In most cases this is fine, but the behavior of a secondary replica differs from the behavior of the primary.

    On a secondary there is the process of applying the oplog, which does not exist on the primary. This process was not exactly designed for low latency; the MongoDB developers simply did not bother with it. Under certain conditions, applying the oplog from the primary on a secondary can cause delays of up to 10 seconds.

    Secondary replicas are poorly suited for user queries: the user experience takes a brisk step into the trash.

    On unsharded clusters these spikes are less noticeable, but they are still there. Sharded clusters suffer more, because the oplog is especially affected by deletions, and deletion is part of the balancer's job. The balancer reliably and tastefully deletes documents by the tens of thousands within a short period of time.

    Number of connections


    The next factor to consider is the limit on the number of connections to MongoDB instances. By default there are no restrictions other than OS resources: you can keep connecting as long as the OS allows it.

    However, the more concurrent requests there are, the slower they run. Performance degrades nonlinearly. So if a spike of requests arrives, it is better to serve 80% of them than to fail to serve 100%. The number of connections to MongoDB must be limited.

    But there are pitfalls that can cause trouble here. In particular, the connection pool on the MongoDB side is shared between user connections and service intra-cluster connections. If the application "eats up" all the connections in this pool, cluster integrity can be compromised.

    We learned this when we were about to rebuild an index. We needed to remove uniqueness from the index, and the procedure had to go through several stages, because in MongoDB you cannot build an identical index next to an existing one, only without uniqueness. So we wanted to:

    • build a similar temporary index without uniqueness;
    • remove the unique index;
    • build the index without uniqueness in place of the removed one;
    • delete the temporary one.

    While the temporary index was still being built on a secondary, we started deleting the unique index. At that point the secondary announced a lock. Some metadata got locked, and all majority writes stopped: they hung in the connection pool, waiting for confirmation that the write had gone through. All reads on the secondary also stopped, because the global lock had been taken.

    In this interesting state the cluster also lost connectivity. It would occasionally come back, and when two replicas managed to connect to each other, they tried to hold an election that they could not complete because of the global lock.

    Moral of the story: the number of connections must be monitored.
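
    A simple way to watch this number is the serverStatus command; this is just a sketch of where to look, not a full monitoring setup:

    // Current, available and total created connections on this mongod instance.
    db.serverStatus().connections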

    There is a well-known MongoDB rake that people still step on so often that I decided to take a short walk over it as well.

    Do not lose documents


    If you send an indexed query to MongoDB, the query may fail to return all the documents that satisfy the condition, and in completely unexpected situations. This happens because, while we are iterating from the beginning of the index, a document that is still ahead of us gets updated and moves into the part of the index we have already passed, so we never see it. The cause is purely the mutability of the indexed field. For reliable iteration, use indexes on immutable fields and there will be no such difficulties.
    MongoDB also has its own views on which index to use. The solution is simple: with $hint we force MongoDB to use the index we specify.
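
    A minimal sketch of forcing an index with hint; the collection, filter and index here are hypothetical, and the hinted index is assumed to exist:

    // Iterate users via an index on an immutable creation date instead of
    // whatever index the planner would otherwise choose.
    db.users.find({ status: "active" }).hint({ createdAt: 1 })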

    Collection Sizes


    Our startup keeps growing, there is a lot of data, and we do not want to add disks: we have already added them three times in the last month. Let's look at what is stored in our data and at the sizes of the documents. How do you figure out where a collection's size can be reduced? By two metrics (see the example below):

    • the size of a specific document: Object.bsonsize();
    • the average document size in a collection: db.c.stats().avgObjectSize.
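
    For example (the collection name here is hypothetical):

    // Size in bytes of one specific document.
    Object.bsonsize(db.users.findOne({ _id: 42 }))

    // Average document size across the collection, in bytes.
    db.users.stats().avgObjectSize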

    How to affect the size of the document?


    I have a few non-specific answers to this question. First, long field names increase document size. Every document stores a copy of all of its field names, so if a document has a long field name, the length of that name has to be added to the size of every document. If you have a collection with a huge number of small documents with just a few fields, give the fields short names: "A", "B", "CD", two letters at most. On disk this is offset by compression, but in the cache everything is stored as is.

    The second tip: sometimes a field with low cardinality can be moved into the collection name. For example, such a field could be the language. If we have a collection of translations into Russian, English and French, plus a field storing the language, the value of that field can be put into the collection name instead. That reduces document size and can reduce the number and size of the indexes: pure savings! This cannot always be done, because sometimes there are indexes that will not work if the collection is split into separate collections.

    The last tip on document size: use the _id field. If your data has a natural unique key, put it directly into _id. Even if the key is composite, use a composite _id; it indexes perfectly well. There is just one small rake: if your marshaller sometimes changes the order of fields, then _id values with the same fields but in a different order are treated as different _ids by MongoDB's unique index. In some cases this can happen in Go.
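
    A small sketch of a natural composite key placed straight into _id; the collection and field names are made up:

    // The natural key (userId, productId) is the primary key itself,
    // so no separate unique index is needed.
    db.favorites.insertOne({ _id: { userId: 42, productId: 1005 }, addedAt: new Date() })

    // Lookups go directly by _id; the field order must always be the same.
    db.favorites.findOne({ _id: { userId: 42, productId: 1005 } })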

    Index Sizes


    An index stores a copy of the fields included in it. The size of an index is made up of the data being indexed. If you try to index large fields, be prepared for the index to be large as well.

    The second thing that strongly inflates indexes: array fields in an index multiply the other fields of the document within that index. Be careful with large arrays in documents: either do not index anything else alongside the array, or play with the order in which the index fields are listed.

    The order of the fields matters, especially if one of the index fields is an array. If the fields differ in cardinality, that is, the number of possible values in one field is very different from the number of possible values in another, it makes sense to order them by increasing cardinality. You can easily save 50% of the index size just by swapping fields with different cardinality; reordering the fields can give an even more significant reduction in size.
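
    To see whether a reordering actually helps, you can compare per-index sizes before and after; the collection name here is hypothetical:

    // Per-index sizes in bytes for the collection.
    db.orders.stats().indexSizes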

    Sometimes a field contains a large value that we never need to compare for greater or less, only for exact equality. In that case an index on the heavy field can be replaced with an index on a hash of that field. The index then stores copies of the hash instead of copies of the field itself.
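
    A sketch of such a replacement using MongoDB's built-in hashed index; the names are hypothetical, and a hashed index supports only equality lookups, which is exactly the case described:

    // Instead of indexing a long URL string directly, index its hash.
    db.pages.createIndex({ url: "hashed" })

    // Equality lookups still use the hashed index.
    db.pages.find({ url: "https://example.com/some/long/path" })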

    Delete documents


    I already mentioned that deleting documents is an unpleasant operation and it is better not to delete at all if possible. When designing a data schema, try either to minimize the deletion of individual documents or to arrange the data so that it can be dropped as entire collections. Dropping a collection is a cheap operation; deleting thousands of individual documents is an expensive one.

    If you still need to delete many documents, be sure to throttle the deletion, otherwise the mass deletion will hurt read latency, and that is unpleasant. It is especially bad for latency on secondaries.

    It is worth adding some kind of knob to tune the throttling; it is very hard to pick the right level on the first try. We have been through this so many times that we now guess the throttling level on the third or fourth attempt. Build in the ability to adjust it from the start.
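
    A very rough sketch of throttled deletion in the mongo shell; the batch size and pause are exactly that tunable knob, and all the names and numbers are made up:

    // Delete old documents in small batches with a pause between batches.
    var cutoff = new Date(Date.now() - 30 * 24 * 3600 * 1000);  // e.g. older than 30 days
    var batch = 500;       // tune this knob
    var pauseMs = 200;     // and this one
    while (true) {
        var ids = db.events.find({ ts: { $lt: cutoff } }, { _id: 1 })
                           .limit(batch).toArray().map(function (d) { return d._id; });
        if (ids.length === 0) break;
        db.events.deleteMany({ _id: { $in: ids } });
        sleep(pauseMs);    // mongo shell built-in, milliseconds
    }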

    If you are deleting more than 30% of a large collection, copy the live documents into a neighboring collection instead and drop the old collection as a whole. Obviously there are nuances, since the load has to be switched from the old collection to the new one, but do it this way whenever possible.

    Another way to delete documents is a TTL index: an index on a field that holds the date and time at which the document should die. When that time comes, MongoDB deletes the document automatically.

    The TTL index is convenient, but its implementation has no throttling. MongoDB makes no effort to smear these deletions out over time. If you try to delete a million documents at the same moment, for a few minutes you will have a barely functional cluster that does nothing but delete. To prevent this, add some randomness: spread the TTL out as much as your business logic and the tolerable impact on latency allow. Smearing the TTL is mandatory if your business logic naturally concentrates expirations at a single point in time.
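
    A sketch of a TTL index with per-document expiry smeared by some randomness; the collection, the field and the one-hour spread are made up:

    // Expire each document at the time stored in its expireAt field.
    db.sessions.createIndex({ expireAt: 1 }, { expireAfterSeconds: 0 })

    // Spread expirations over an extra random hour instead of one exact moment.
    var jitterMs = Math.floor(Math.random() * 3600 * 1000);
    db.sessions.insertOne({
        userId: 42,
        expireAt: new Date(Date.now() + 24 * 3600 * 1000 + jitterMs)
    })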

    Sharding


    We tried to postpone this moment, but it has come: we have to scale horizontally after all. In MongoDB this means sharding.

    If you doubt that you need sharding, then you do not need it.

    Sharding complicates life for developers and devops in a variety of ways. Inside the company we call this the sharding tax. When we shard a collection, its specific performance drops: MongoDB requires a separate index for sharding, and additional parameters must be passed with a query for it to execute efficiently.

    Some things simply work poorly with sharding. For example, it is a bad idea to use queries with skip, especially if you have a lot of documents. You issue the command: "Skip 100,000 documents."

    MongoDB reasons like this: "First, second, third ... one hundred thousandth, now we can continue. And this is what we return to the user."

    In a non-sharded collection, MongoDB does this counting somewhere inside itself. In a sharded one, it actually reads all 100,000 documents and sends them to the sharding proxy, mongos, which then filters them on its side and throws away the first 100,000. An unpleasant feature to keep in mind.

    Your code will certainly become more complicated with sharding: you will have to drag the shard key into many places, which is not always convenient and not always possible. Some queries will go out as broadcasts or multicasts, which does not add scalability either. Approach the choice of the shard key carefully.

    The count operation breaks on sharded collections. It starts returning a number larger than reality; it can lie by a factor of two. The reason lies in the balancing process, when documents are being poured from one shard to another: documents that have already been copied to the neighboring shard but have not yet been deleted from the original one are counted anyway. The MongoDB developers do not call this a bug; it is a feature. I do not know whether they will ever fix it.
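
    One workaround worth mentioning (it is not from the talk itself): newer shells and drivers provide countDocuments, which counts through an aggregation rather than collection metadata and so should avoid the overcounting described above, at the cost of being slower. The collection name is hypothetical:

    // count() may overcount on a sharded collection while chunks are moving.
    db.products.count()

    // countDocuments() counts via an aggregation; slower, but accurate.
    db.products.countDocuments({})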

    A sharded cluster is also much harder to administer. Devops will stop greeting you, because taking a backup becomes radically more complicated. With sharding, the need for infrastructure automation flares up like a fire alarm; before that you might have managed without it.

    How sharding works in MongoDB


    We have a collection and we want to spread it across the shards somehow. To do this, MongoDB splits the collection into chunks by the shard key, trying to cut the shard key space into equal pieces. Then the balancer comes along and diligently lays these chunks out across the shards of the cluster. The balancer does not care how much these chunks weigh or how many documents they contain, since balancing is done chunk by chunk.
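
    You can watch the result of the balancer's work with the standard status helper, run through mongos:

    // Shows the shards, the sharded collections and the chunk distribution per shard.
    sh.status()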

    Sharding Key


    So you have decided to shard after all? The first question is how to choose a shard key. A good key has several properties: high cardinality, immutability, and a good fit with your frequent queries.

    The natural choice of shard key is the primary key, the _id field. If the _id field is suitable for sharding, it is better to shard directly on it. It is an excellent choice: it has good cardinality and it is immutable; how well it fits into frequent queries depends on your business specifics. Judge from your own situation.
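
    A minimal sketch of sharding on _id; the database and collection names are hypothetical:

    // Enable sharding for the database and shard a collection by _id.
    sh.enableSharding("shop")
    sh.shardCollection("shop.users", { _id: 1 })

    // A hashed key is a common alternative when _id grows monotonically
    // (e.g. ObjectId), so that inserts do not all land on the last chunk.
    sh.shardCollection("shop.events", { _id: "hashed" })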

    Let me give an example of a bad shard key. I already mentioned the collection of translations. It has a field that stores the language. Say the collection supports 100 languages and we shard by language. That is bad: the cardinality, the number of possible values, is only 100, which is low. But that is not the worst part; perhaps that cardinality would even be enough. Worse, as soon as we shard by language we discover that we have three times more English-speaking users than all the rest. The unlucky shard holding English receives three times more requests than all the other shards combined.

    So keep in mind that a shard key can naturally tend toward an uneven load distribution.

    Balancing


    We come to sharding when the need has fully ripened: our MongoDB cluster creaks and crunches with its disks, its CPU, with everything it has. There is nowhere else to go, so we heroically shard a handful of collections. We shard, launch, and suddenly discover that balancing is not free.

    Balancing goes through several stages. The balancer chooses chunks and shards: what to move, from where, and to where. The work then proceeds in two phases: first the documents are copied from the source to the target, and then the copied documents are deleted from the source.

    Our source shard is overloaded, it holds all the collections, yet the first part of the operation is still easy for it. The second part, the deletion, is quite unpleasant: it flattens a shard that is already suffering under load.

    The problem is compounded by the fact that if we balance many chunks, say thousands, then with the default settings all these chunks are copied first, and only then does the deleter come along and start removing them in bulk. At that point you can no longer influence the procedure; all you can do is sadly watch what is happening.

    Therefore, if you are about to shard an overloaded cluster, plan ahead, because balancing takes time. It is advisable to spend that time not in prime time but during periods of low load. The balancer is a component that can be switched off. You can handle the initial balancing in manual mode: turn the balancer off during prime time and turn it back on when the load has dropped enough that you can afford more.
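
    A hedged sketch of controlling the balancer by hand and with a balancing window; the times are arbitrary, and the commands are run through mongos:

    // Turn the balancer off for prime time and back on when the load drops.
    sh.stopBalancer()
    sh.startBalancer()

    // Or restrict balancing to a nightly window.
    db.getSiblingDB("config").settings.updateOne(
        { _id: "balancer" },
        { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
        { upsert: true }
    )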

    If the cloud still allows you to scale vertically, it is better to beef up the source shard in advance in order to soften these side effects a little.

    Sharding must be carefully prepared.

    HighLoad++ Siberia 2019 will take place in Novosibirsk on June 24 and 25. HighLoad++ Siberia is a chance for developers from Siberia to hear the talks, discuss highload topics, and plunge into an environment "where everyone is one of us" without flying three thousand kilometers to Moscow or St. Petersburg. Out of 80 submissions the Program Committee approved 25; we announce all further program changes, talk announcements, and other news in our mailing list. Subscribe to stay informed.
