MongoDB: too many fields to index? Use a generic index

Original author: Antoine Girbal
  • Translation

The essence of the problem


There are situations where documents have many different fields and you need efficient queries on them. For example, here is a document describing a person:

{
    _id: 123,
    firstName: "John",
    lastName: "Smith",
    age: 25,
    height: 6.0,
    dob: Date,
    eyes: "blue",
    sign: "Capricorn",
    ...
}


With such documents you can select people by eye color, a certain height, last name, and other characteristics. But what if a document consists of dozens of fields, the fields are not known in advance, or each document has its own set of fields? How do you solve this quickly with indexes without building one per field, which would be far too expensive?

Solution #1: A Compound Index on Field Names and Values


Let's redesign the document schema to store the fields as a list of name/value objects:

{
    _id: 123,
    props: [
    { n: "firstName", v: "John"},
    { n: "lastName", v: "Smith"},
    { n: "age", v: 25},
    ...
    ]
}
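
Converting an arbitrary flat document into this name/value list is straightforward. Here is a minimal sketch (the helper name toProps is ours, not from the original article):

```javascript
// Convert a flat document into the generic { n, v } list form.
// _id stays at the top level; every other field becomes a list entry.
function toProps(doc) {
  var props = [];
  for (var key in doc) {
    if (key === "_id") continue;
    props.push({ n: key, v: doc[key] });
  }
  return props;
}

var person = { _id: 123, firstName: "John", lastName: "Smith", age: 25 };
var generic = { _id: person._id, props: toProps(person) };
// generic: { _id: 123, props: [ { n: "firstName", v: "John" }, ... ] }
```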


To solve the problem, a compound index is created on the name and value of the objects inside the list. For clarity, we will create 5 million documents consisting of fictitious properties prop0 through prop9, each holding a random value from 0 to 1000.

> for (var i = 0; i < 5000000; ++i) { var arr = []; for (var j = 0; j < 10; ++j) { arr.push({n: "prop" + j, v: Math.floor(Math.random() * 1000) }) }; db.generic.insert({props: arr}) }
> db.generic.findOne()
{
  "_id": ObjectId("515dd3b4f0bd676b816aa9b0"),
  "props": [
    {
      "n": "prop0",
      "v": 40
    },
    {
      "n": "prop1",
      "v": 198
    },
...
    {
      "n": "prop9",
      "v": 652
    }
  ]
}
> db.generic.ensureIndex({"props.n": 1, "props.v": 1})
> db.generic.stats()
{
  "ns": "test.generic",
  "count": 5020473,
  "size": 1847534064,
  "avgObjSize": 368,
  "storageSize": 2600636416,
  "numExtents": 19,
  "nindexes": 2,
  "lastExtentSize": 680280064,
  "paddingFactor": 1,
  "systemFlags": 1,
  "userFlags": 0,
  "totalIndexSize": 1785352240,
  "indexSizes": {
    "_id_": 162898624,
    "props.n_1_props.v_1": 1622453616
  },
  "ok": 1
}


In this case the index size is 1.6 GB, because the index stores both the property name and its value. Now let's try to find documents in which prop1 is 0:

> db.generic.findOne({"props.n": "prop1", "props.v": 0})
{
  "_id": ObjectId("515dd4298bff7c34610f6ae8"),
  "props": [
    {
      "n": "prop0",
      "v": 788
    },
    {
      "n": "prop1",
      "v": 0
    },
...
    {
      "n": "prop9",
      "v": 788
    }
  ]
}
> db.generic.find({"props.n": "prop1", "props.v": 0}).explain()
{
  "cursor": "BtreeCursor props.n_1_props.v_1",
  "isMultiKey": true,
  "n": 49822,
  "nscannedObjects": 5020473,
  "nscanned": 5020473,
  "nscannedObjectsAllPlans": 5020473,
  "nscannedAllPlans": 5020473,
  "scanAndOrder": false,
  "indexOnly": false,
  "nYields": 0,
  "nChunkSkips": 0,
  "millis": 252028,
  "indexBounds": {
    "props.n": [
      [
        "prop1",
        "prop1"
      ]
    ],
    "props.v": [
      [
        {
          "$minElement": 1
        },
        {
          "$maxElement": 1
        }
      ]
    ]
  },
  "server": "agmac.local:27017"
}


This did not give the expected result: ~50,000 documents were found, and the query took 252 seconds. The reason is that the conditions n = "prop1" and v = 0 do not have to be satisfied by the same nested document: a document matches as soon as one element has n = "prop1" and some element (possibly a different one) has v = 0, which is not what was intended. The query can be refined with $elemMatch:

> db.generic.findOne({"props": { $elemMatch: {n: "prop1", v: 0} }})
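
The difference between the two queries is easy to reproduce outside MongoDB. Below is a plain-JavaScript sketch (our own illustration, not MongoDB internals) of the two matching behaviors:

```javascript
// A dotted query on an array matches if EACH condition is satisfied
// by SOME element -- not necessarily the same one.
function matchesDotted(doc, n, v) {
  var hasN = doc.props.some(function (p) { return p.n === n; });
  var hasV = doc.props.some(function (p) { return p.v === v; });
  return hasN && hasV;
}

// $elemMatch requires a SINGLE element to satisfy all conditions.
function matchesElemMatch(doc, n, v) {
  return doc.props.some(function (p) { return p.n === n && p.v === v; });
}

var doc = { props: [ { n: "prop0", v: 0 }, { n: "prop1", v: 788 } ] };
// prop1 is not 0 here, yet the dotted query matches anyway:
matchesDotted(doc, "prop1", 0);    // true  (prop1 exists, and prop0 has v = 0)
matchesElemMatch(doc, "prop1", 0); // false (no single element has both)
```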


Now let's check how the index is used and how long the query takes in MongoDB v2.2:

> db.generic.find({"props": { $elemMatch: {n: "prop1", v: 0} }}).explain()
{
  "cursor": "BtreeCursor props.n_1_props.v_1",
  "isMultiKey": true,
  "n": 5024,
  "nscannedObjects": 5020473,
  "nscanned": 5020473,
  "nscannedObjectsAllPlans": 5020473,
  "nscannedAllPlans": 5020473,
  "scanAndOrder": false,
  "indexOnly": false,
  "nYields": 0,
  "nChunkSkips": 0,
  "millis": 278784,
  "indexBounds": {
    "props.n": [
      [
        "prop1",
        "prop1"
      ]
    ],
    "props.v": [
      [
        {
          "$minElement": 1
        },
        {
          "$maxElement": 1
        }
      ]
    ]
  },
  "server": "agmac.local:27017"
}


The query now returns the correct result of 5024 documents, but it is still slow! The explain output shows why: the full range is still scanned for the props.v field. To understand what is happening, look at the example more closely. Without $elemMatch, any document whose elements satisfy the query conditions individually can fall into the final result, so the index cannot be bounded by combinations of conditions; enumerating them would lead to a huge number of possible combinations. MongoDB therefore builds the B-tree bounds from the values of the embedded documents independently and ignores combinations (the default behavior without $elemMatch). But why does the query with $elemMatch still run so slowly? This was due to a bug, fixed by SERVER-3104 in MongoDB v2.4. Let's check the same query in the fixed version:

> db.generic.find({"props": { $elemMatch: {n: "prop1", v: 0} }}).explain()
{
  "cursor": "BtreeCursor props.n_1_props.v_1",
  "isMultiKey": true,
  "n": 5024,
  "nscannedObjects": 5024,
  "nscanned": 5024,
  "nscannedObjectsAllPlans": 5024,
  "nscannedAllPlans": 5024,
  "scanAndOrder": false,
  "indexOnly": false,
  "nYields": 0,
  "nChunkSkips": 0,
  "millis": 21,
  "indexBounds": {
    "props.n": [
      [
        "prop1",
        "prop1"
      ]
    ],
    "props.v": [
      [
        0,
        0
      ]
    ]
  },
  "server": "agmac.local:27017"
}


The query completed in 21 milliseconds!

Solution #2: A Single Generic Index


Another way to solve the problem is to store the fields in the list as { property: value } objects. This solution works in both MongoDB v2.2 and v2.4. Create documents of the form:

> for (var i = 0; i < 5000000; ++i) { var arr = []; for (var j = 0; j < 10; ++j) { var doc = {}; doc["prop" + j] =  Math.floor(Math.random() * 1000); arr.push(doc) }; db.generic2.insert({props: arr}) }
> db.generic2.findOne()
{
  "_id": ObjectId("515e5e6a71b0722678929760"),
  "props": [
    {
      "prop0": 881
    },
    {
      "prop1": 47
    },
...
    {
      "prop9": 717
    }
  ]
}


Build the index:

> db.generic2.ensureIndex({props: 1})
> db.generic2.stats()
{
  "ns": "test.generic2",
  "count": 5000000,
  "size": 1360000032,
  "avgObjSize": 272.0000064,
  "storageSize": 1499676672,
  "numExtents": 19,
  "nindexes": 2,
  "lastExtentSize": 393670656,
  "paddingFactor": 1,
  "systemFlags": 1,
  "userFlags": 0,
  "totalIndexSize": 2384023488,
  "indexSizes": {
    "_id_": 162269072,
    "props_1": 2221754416
  },
  "ok": 1
}


The index size turned out to be ~2.2 GB, about 40% more than in solution #1, because the BSON of the embedded documents is stored in the index as a BLOB. Now run the query:

> db.generic2.find({"props": {"prop1": 0} }).explain()
{
  "cursor": "BtreeCursor props_1",
  "isMultiKey": true,
  "n": 4958,
  "nscannedObjects": 4958,
  "nscanned": 4958,
  "nscannedObjectsAllPlans": 4958,
  "nscannedAllPlans": 4958,
  "scanAndOrder": false,
  "indexOnly": false,
  "nYields": 0,
  "nChunkSkips": 0,
  "millis": 15,
  "indexBounds": {
    "props": [
      [
        {
          "prop1": 0
        },
        {
          "prop1": 0
        }
      ]
    ]
  },
  "server": "agmac.local:27017"
}


The query completed in 15 ms, even faster than the first solution! But there is one catch: when composing a query, you must describe the embedded document in its entirety. To select documents where prop1 can be anywhere from 0 to 9, you have to issue the query:

> db.generic2.find({"props": { $gte: {"prop1": 0}, $lte: {"prop1": 9} } })


This is a little inconvenient, and if the embedded documents contain other fields, those fields must also appear in the query (because the embedded documents are compared as whole BLOBs).
There is one more limitation: it is impossible to index field values alone. In solution #1 you can build an index on props.v to find, for example, all documents where any property has the value 10; solution #2 does not allow this.
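
Because the embedded documents are compared as whole values, an exact match requires the same fields in the same order. Here is a small sketch of that comparison (our illustration of the idea, not MongoDB's actual BSON comparator):

```javascript
// Embedded documents in solution #2 match as whole values:
// field names, values, AND field order must all agree exactly.
function sameSubdoc(a, b) {
  var aKeys = Object.keys(a), bKeys = Object.keys(b);
  if (aKeys.length !== bKeys.length) return false;
  for (var i = 0; i < aKeys.length; i++) {
    if (aKeys[i] !== bKeys[i]) return false;       // order-sensitive
    if (a[aKeys[i]] !== b[bKeys[i]]) return false; // value must match too
  }
  return true;
}

sameSubdoc({ prop1: 0 }, { prop1: 0 });        // true
sameSubdoc({ prop1: 0 }, { prop1: 0, x: 1 });  // false: extra field
sameSubdoc({ a: 1, b: 2 }, { b: 2, a: 1 });    // false: different field order
```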

Conclusion


As you can see, MongoDB v2.4 offers a simple and effective way to build generic indexes for documents with a large number of fields, which you can use in your "Big Data" projects.
