MongoDB: too many fields to index? Use a generic index
The essence of the problem
There are situations when documents have many different fields and you need efficient queries on them. For example, here is a document describing a person:
{
    _id: 123,
    firstName: "John",
    lastName: "Smith",
    age: 25,
    height: 6.0,
    dob: Date,
    eyes: "blue",
    sign: "Capricorn",
    ...
}
With such documents you can select people by eye color, by a certain height, by last name, and so on. But what if a document consists of dozens of fields, or the fields are not known in advance, or each document has its own set of fields? How do you keep such queries fast with indexes without building one per field, which would be far too expensive?
Solution #1: Compound Index on Field Names and Values
Let's design the document schema using the ability to store document fields as a list of objects:
{
    _id: 123,
    props: [
        { n: "firstName", v: "John" },
        { n: "lastName", v: "Smith" },
        { n: "age", v: 25 },
        ...
    ]
}
To solve the problem, a compound index is created on the name and the value of the objects inside the list. For the demonstration, let's create 5 million documents with fictitious properties prop0 through prop9, each holding a random integer from 0 to 999:
> for (var i = 0; i < 5000000; ++i) { var arr = []; for (var j = 0; j < 10; ++j) { arr.push({n: "prop" + j, v: Math.floor(Math.random() * 1000)}) }; db.generic.insert({props: arr}) }
> db.generic.findOne()
{
"_id": ObjectId("515dd3b4f0bd676b816aa9b0"),
"props": [
{
"n": "prop0",
"v": 40
},
{
"n": "prop1",
"v": 198
},
...
{
"n": "prop9",
"v": 652
}
]
}
> db.generic.ensureIndex({"props.n": 1, "props.v": 1})
> db.generic.stats()
{
"ns": "test.generic",
"count": 5020473,
"size": 1847534064,
"avgObjSize": 368,
"storageSize": 2600636416,
"numExtents": 19,
"nindexes": 2,
"lastExtentSize": 680280064,
"paddingFactor": 1,
"systemFlags": 1,
"userFlags": 0,
"totalIndexSize": 1785352240,
"indexSizes": {
"_id_": 162898624,
"props.n_1_props.v_1": 1622453616
},
"ok": 1
}
In this case the index size is 1.6 GB, because the index stores both the property name and its value. Now let's try to find the documents where prop1 is 0:
> db.generic.findOne({"props.n": "prop1", "props.v": 0})
{
"_id": ObjectId("515dd4298bff7c34610f6ae8"),
"props": [
{
"n": "prop0",
"v": 788
},
{
"n": "prop1",
"v": 0
},
...
{
"n": "prop9",
"v": 788
}
]
}
> db.generic.find({"props.n": "prop1", "props.v": 0}).explain()
{
"cursor": "BtreeCursor props.n_1_props.v_1",
"isMultiKey": true,
"n": 49822,
"nscannedObjects": 5020473,
"nscanned": 5020473,
"nscannedObjectsAllPlans": 5020473,
"nscannedAllPlans": 5020473,
"scanAndOrder": false,
"indexOnly": false,
"nYields": 0,
"nChunkSkips": 0,
"millis": 252028,
"indexBounds": {
"props.n": [
[
"prop1",
"prop1"
]
],
"props.v": [
[
{
"$minElement": 1
},
{
"$maxElement": 1
}
]
]
},
"server": "agmac.local:27017"
}
Such a solution did not give the expected result: ~50,000 documents were found in 252 seconds. The reason is that the conditions n = "prop1" and v = 0 do not have to hold for the same nested document: the result contains every document that has some element with n = "prop1" and some, possibly different, element with v = 0, which is not what was intended. The query can be refined with $elemMatch:
> db.generic.findOne({"props": { $elemMatch: {n: "prop1", v: 0} }})
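The difference between the two queries can be sketched in plain JavaScript, independent of MongoDB (the function names are made up for illustration):

```javascript
// Without $elemMatch: each condition may be satisfied by a DIFFERENT array element.
function matchesSeparately(doc) {
  var hasN = doc.props.some(function (p) { return p.n === "prop1"; });
  var hasV = doc.props.some(function (p) { return p.v === 0; });
  return hasN && hasV;
}

// With $elemMatch: one and the same element must satisfy BOTH conditions.
function matchesElemMatch(doc) {
  return doc.props.some(function (p) { return p.n === "prop1" && p.v === 0; });
}

var doc = { props: [{ n: "prop0", v: 0 }, { n: "prop1", v: 500 }] };
matchesSeparately(doc); // true: n === "prop1" and v === 0 hold, but in different elements
matchesElemMatch(doc);  // false: no single element has both n === "prop1" and v === 0
```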
Now let's check how the index is used and how long the query runs in MongoDB v2.2:
> db.generic.find({"props": { $elemMatch: {n: "prop1", v: 0} }}).explain()
{
"cursor": "BtreeCursor props.n_1_props.v_1",
"isMultiKey": true,
"n": 5024,
"nscannedObjects": 5020473,
"nscanned": 5020473,
"nscannedObjectsAllPlans": 5020473,
"nscannedAllPlans": 5020473,
"scanAndOrder": false,
"indexOnly": false,
"nYields": 0,
"nChunkSkips": 0,
"millis": 278784,
"indexBounds": {
"props.n": [
[
"prop1",
"prop1"
]
],
"props.v": [
[
{
"$minElement": 1
},
{
"$maxElement": 1
}
]
]
},
"server": "agmac.local:27017"
}
The query now returns the correct result, 5024 documents, but it is still slow! The explain output shows the reason: the full [$minElement, $maxElement] range is still scanned for the v field. To understand why this happens, let's analyze the example in more detail. Without $elemMatch, any combination of array elements that satisfies the query conditions individually can fall into the final selection. Maintaining an index over all such combinations would be impossible, because there are far too many of them. MongoDB therefore builds the B-Tree from the values of the embedded documents and ignores the combinations (which is exactly the behavior $elemMatch relies on). But why is the query with $elemMatch still so slow? This was due to a bug that was fixed in SERVER-3104 for MongoDB v2.4. Let's check the same query in the fixed version:
> db.generic.find({"props": { $elemMatch: {n: "prop1", v: 0} }}).explain()
{
"cursor": "BtreeCursor props.n_1_props.v_1",
"isMultiKey": true,
"n": 5024,
"nscannedObjects": 5024,
"nscanned": 5024,
"nscannedObjectsAllPlans": 5024,
"nscannedAllPlans": 5024,
"scanAndOrder": false,
"indexOnly": false,
"nYields": 0,
"nChunkSkips": 0,
"millis": 21,
"indexBounds": {
"props.n": [
[
"prop1",
"prop1"
]
],
"props.v": [
[
0,
0
]
]
},
"server": "agmac.local:27017"
}
The query completed in 21 milliseconds!
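Why the exact bounds help can be illustrated with a simplified model of a multikey index: conceptually, one (n, v) key is emitted per array element, so an $elemMatch lookup becomes a single equality search on the pair. (This is a sketch of the idea, not MongoDB internals.)

```javascript
// Simplified model of a multikey index on {"props.n": 1, "props.v": 1}:
// one [n, v] key is emitted for every element of the props array.
function indexKeys(doc) {
  return doc.props.map(function (p) { return [p.n, p.v]; });
}

var doc = { props: [{ n: "prop0", v: 40 }, { n: "prop1", v: 0 }] };
indexKeys(doc); // [["prop0", 40], ["prop1", 0]]

// An $elemMatch query {n: "prop1", v: 0} can then use the exact bounds
// ["prop1", 0] .. ["prop1", 0] instead of scanning all values under "prop1".
```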
Solution #2: A Single Generic Index
Another way to solve the problem is to store the fields in the list as objects of the form property: value. This solution works in both MongoDB v2.2 and v2.4. Create documents of the form:
> for (var i = 0; i < 5000000; ++i) { var arr = []; for (var j = 0; j < 10; ++j) { var doc = {}; doc["prop" + j] = Math.floor(Math.random() * 1000); arr.push(doc) }; db.generic2.insert({props: arr}) }
> db.generic2.findOne()
{
"_id": ObjectId("515e5e6a71b0722678929760"),
"props": [
{
"prop0": 881
},
{
"prop1": 47
},
...
{
"prop9": 717
}
]
}
Build the index:
> db.generic2.ensureIndex({props: 1})
> db.generic2.stats()
{
"ns": "test.generic2",
"count": 5000000,
"size": 1360000032,
"avgObjSize": 272.0000064,
"storageSize": 1499676672,
"numExtents": 19,
"nindexes": 2,
"lastExtentSize": 393670656,
"paddingFactor": 1,
"systemFlags": 1,
"userFlags": 0,
"totalIndexSize": 2384023488,
"indexSizes": {
"_id_": 162269072,
"props_1": 2221754416
},
"ok": 1
}
The index turned out to be ~2.2 GB in size, about 40% larger than in solution #1, because the BSON of the embedded documents is stored in the index as opaque blobs. Now run the query:
> db.generic2.find({"props": {"prop1": 0} }).explain()
{
"cursor": "BtreeCursor props_1",
"isMultiKey": true,
"n": 4958,
"nscannedObjects": 4958,
"nscanned": 4958,
"nscannedObjectsAllPlans": 4958,
"nscannedAllPlans": 4958,
"scanAndOrder": false,
"indexOnly": false,
"nYields": 0,
"nChunkSkips": 0,
"millis": 15,
"indexBounds": {
"props": [
[
{
"prop1": 0
},
{
"prop1": 0
}
]
]
},
"server": "agmac.local:27017"
}
The query completed in 15 ms, which is even faster than the first solution! But there is one condition: the query must describe the embedded document in its entirety. To select the documents where prop1 is between 0 and 9, you have to write:
> db.generic2.find({"props": { $gte: {"prop1": 0}, $lte: {"prop1": 9} } })
This is somewhat inconvenient, and if the embedded documents contain other fields, those fields must also participate in the query (because the embedded documents are stored and compared as opaque blobs).
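In plain JavaScript terms, matching {"props": {"prop1": 0}} is whole-element equality, which is why every field of the embedded document must be listed. A simplified sketch (MongoDB actually compares the raw BSON bytes; JSON.stringify stands in for that here and is order-sensitive):

```javascript
// Simplified model: an element matches only if it equals the query document
// in its entirety, not if it merely contains the queried field.
function matchesWhole(doc, query) {
  return doc.props.some(function (p) {
    return JSON.stringify(p) === JSON.stringify(query);
  });
}

var doc  = { props: [{ prop1: 0 }] };
var doc2 = { props: [{ prop1: 0, extra: 1 }] };
matchesWhole(doc, { prop1: 0 });  // true: exact match
matchesWhole(doc2, { prop1: 0 }); // false: the extra field breaks whole-element equality
```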
There is also one more limitation: it is impossible to index the field values on their own. In solution #1 you can build an index on props.v and search, for example, for all documents where any property has the value 10; solution #2 does not allow this.
Conclusion
As you can see, MongoDB v2.4 offers a simple and effective way to build generic indexes over documents with a large number of fields, which you can use in your "Big Data" projects.