Full Text Search in MongoDB
- Tutorial
This article covers one of the new features of MongoDB 2.4: full-text search. Most of it is a loose translation of the documentation, which is very detailed but fragmented; here everything is put together in one place. Since that alone did not seem enough for a full article, I decided to compare MongoDB with another popular text search engine, Sphinx. The comparison is quite superficial, since I had not worked with Sphinx before: I create a table with 16,000,000 records and see which one is faster.
If you are only interested in the comparison, it is at the very end; first I will show how to create a text index in Mongo and what can be done with it.
Previously, to search text in Mongo you could use either regular expressions or self-built, indexed arrays of words. The main drawback of regular expression search was that it could not use indexes effectively for all queries, and non-trivial regular expressions are hard to write. The index was used well when the pattern was anchored to the very beginning of the string, and in a few other cases. The other option, an index built over an array of words obtained by splitting sentences, avoided this drawback but was inconvenient.
Now almost nothing needs to be done to get fast text search. You create a text index and run a query, and the system itself removes stop words, tokenizes the text, performs stemming, and assigns a numeric score indicating the relevance of each result. The Porter stemmer is used for stemming. The stop word lists can be viewed on GitHub, for example for the Russian language.
List of supported languages:
danish
dutch
english
finnish
french
german
hungarian
italian
norwegian
portuguese
romanian
russian
spanish
swedish
turkish
Let's start with the result you can get:
db.text.runCommand( "text", { search: "меч", project: { text: 1, _id: 0 }, limit: 3 } )
Result
{
"queryDebugString" : "меч||||||",
"language" : "russian",
"results" : [
{
"score" : 1,
"obj" : {
"text" : "Мой меч был отбит его щитом; его меч наткнулся на мой"
}
},
{
"score" : 0.85,
"obj" : {
"text" : "С этим странным выкриком скелет несколько раз махнул мечом; с каждым взмахом меч оставлял в воздухе голубоватый след"
}
},
{
"score" : 0.8333333333333334,
"obj" : {
"text" : "Лоуренс тоже слышал, что рыцари, клянясь кому-то в верности, передают ему щит и меч, ибо щит и меч являют собой саму душу рыцаря"
}
}
],
"stats" : {
"nscanned" : 168,
"nscannedObjects" : 0,
"n" : 3,
"nfound" : 3,
"timeMicros" : 320
},
"ok" : 1
}
As you can see, Mongo found the sentences where the word "меч" ("sword") occurs most often, including sentences where the word appears in an inflected form.
Enabling text search
The text search is still in test mode, so you must explicitly specify the corresponding mongod option at startup:
mongod --setParameter textSearchEnabled=true
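If I read the 2.4 docs correctly, the same parameter can also be flipped on an already running instance through the setParameter command (a sketch, assuming admin access):
db.adminCommand( { setParameter: 1, textSearchEnabled: true } )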
Create a text index
The main command to create the index:
db.collection.ensureIndex( {subject: "text", content: "text"} )
after which all the text in the subject and content fields of the selected collection will be indexed.
By default the index is created for English; to change this, set the default_language option:
db.collection.ensureIndex( { content : "text" }, { default_language: "russian" })
It is also possible to create an index that uses, for each document, the language specified in a designated field of that document (see the documentation).
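For example, the per-document language field can be named with the language_override option (a sketch; the field name idioma is arbitrary):
db.quotes.ensureIndex( { content: "text" }, { language_override: "idioma" } )
A document such as { content: "mañana", idioma: "spanish" } will then be indexed with Spanish stemming and stop words.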
You can also create an index that covers every field in the document, as shown below.
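A sketch using the $** wildcard field specifier from the documentation:
db.collection.ensureIndex( { "$**": "text" }, { name: "TextIndex" } )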
To have the score weighted by per-field weights specified at creation time, use the weights option:
db.blog.ensureIndex( { content: "text", keywords: "text", about: "text" }, { weights: { content: 10, keywords: 5 }, name: "TextIndex" } )
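Fields indexed without an explicit weight (about in this example) get the default weight of 1, so a match in content here is worth ten times a match in about.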
More about creating indexes can be found in the documentation.
Search query execution
A new command, "text", was added for performing text searches:
db.collection.runCommand( "text", { search: "меч" } )
Here "text" is the command and "меч" ("sword") is the search term.
If you specify several space-separated words, they are combined with a logical OR (there is no option for logical AND); see the example below.
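For example, the following query returns documents containing either of the two words:
db.quotes.runCommand( "text", { search: "сегодня завтра" } )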
To find an exact match for a given word or phrase, quote it:
db.quotes.runCommand( "text", { search: "\"сегодня завтра\"" } )
To exclude texts containing a specific word from the results, prefix that word with "-" in the query, for example:
db.quotes.runCommand( "text" , { search: "сегодня -завтра" } )
The limit on the number of results is set by the limit option:
db.quotes.runCommand( "text", { search: "tomorrow", limit: 2 } )
The returned fields are specified by the project option:
db.quotes.runCommand( "text", { search: "tomorrow", project: { "src": 1 } } )
To restrict the search to documents matching additional conditions, set the filter option:
db.quotes.runCommand( "text", { search: "tomorrow", filter: { speaker : "macbeth" } } )
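The options can also be combined; a sketch mixing filter, project and limit:
db.quotes.runCommand( "text", { search: "tomorrow", filter: { speaker: "macbeth" }, project: { src: 1 }, limit: 2 } )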
More details are in the documentation.
Parsing the result
Consider a search result (part of the braces has been trimmed for brevity):
{
"queryDebugString" : "долг|хабр|чест||||||",
"language" : "russian",
"results" :
"score" : 1.25,
"obj" : {
"text" : "- Накормить долг долгом"
"score" : 0.9166666666666667,
"obj" : {
"text" : "В результате это я окажусь перед тобой в долгу, и этот долг мне никогда не выплатить"
"score" : 0.8863636363636365,
"obj" : {
"text" : "Оставить реальный мир и полететь прямо в эту крепость… долго-долго это была моя единственная мечта"
"stats" : {
"nscanned" : 145,
"nscannedObjects" : 0,
"n" : 3,
"nfound" : 3,
"timeMicros" : 155
},
"ok" : 1
}
Here:
queryDebugString - the documentation does not say what this is, but it is most likely the query words after stemming
language - the language used for the search
results - the list of found documents:
score - a measure of how closely the document matches the query
stats - a dictionary with additional information:
nscanned - the number of documents found using the index
nscannedObjects - the number of documents scanned without using the index (the smaller, the better)
n - the number of results returned
nfound - the number of matches
timeMicros - the search duration in microseconds
More information is available in the documentation.
Text search vs $regex + index
db.text.runCommand( "text", { search: "находить", project: { text: 1, _id: 0 } } ).stats
{
"nscanned" : 77,
"nscannedObjects" : 0,
"n" : 77,
"nfound" : 77,
"timeMicros" : 153
}
db.text2.find( { text: { $regex: 'находить'} }).explain();
{
"cursor" : "BtreeCursor text_1 multi",
"n" : 5,
"nscannedObjects" : 5,
"nscanned" : 15821,
"nscannedObjectsAllPlans" : 5,
"nscannedAllPlans" : 15821,
"indexOnly" : false,
"millis" : 31,
"indexBounds" : {
"text" : [["",{}],
[
/находить/,
/находить/
]
]
},
}
The text and text2 collections contain the same data (the result counts differ because the text index matches stemmed word forms, while the regex matches only the literal substring):
Their statistics:
> db.text.stats()
{
"ns": "text_test.text",
"count": 15821,
"size": 3889044,
"avgObjSize": 245.8153087668289,
"storageSize": 6983680,
"numExtents": 5,
" nindexes ": 2,
" lastExtentSize ": 5242880,
" paddingFactor ": 1,
" systemFlags ": 0,
" userFlags ": 1,
" totalIndexSize ": 7358400,
" indexSizes ": {
" _id_ ": 523264,
" text_text ": 6835136
},
"ok": 1
}
> db.text2.stats()
{
"ns": "text_test.text2",
"count": 15821,
"size": 2735244,
"avgObjSize": 172.8869224448518,
"storageSize": 5591040,
"NumExtents": 6,
"nindexes": 2,
"lastExtentSize": 4194304,
"paddingFactor": 1,
"systemFlags": 0,
"userFlags": 0,
"totalIndexSize": 3008768,
"indexSizes": {
"_id_" : 523264,
"text_1": 2485504
},
"ok": 1
}
The difference in size is due to the different indexes; the data itself is exactly the same.
As the results show, the regular expression search took 31 milliseconds, while the text index search took 153 microseconds, about 200 times faster.
MongoDB vs Sphinx
The comparison was carried out on Ubuntu 12.10 (Core i5, 8 GB RAM, hard drive without RAID). The contenders: MongoDB 2.4.1 and Sphinx 2.0.6. In Mongo and MySQL, tables of the form (id, text) were created; the tables were identical and contained 16 million records. A text index was created in Mongo. A full-text index was likewise configured in Sphinx, additionally enabling the stop word list and the stemming algorithm. Python clients were used to interact with the contenders: sphinxapi and pymongo.
The test consisted of searching for a thousand words in the table, with a warm-up pass and several repeated runs. No additional Sphinx settings were enabled apart from stemming, stop words and an increase in available memory. Memory consumption was roughly equal: Sphinx used 2.2 GB, Mongo 2.5 GB.
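For illustration, a minimal sketch of the benchmark idea in the mongo shell (the real test used the Python clients; the word list here is hypothetical):
// hypothetical word list; the real test used a thousand words via pymongo/sphinxapi
var words = [ "меч", "долг", "честь" ];
var start = new Date();
words.forEach(function (w) {
    // run the text command for each word and discard the result, timing the whole loop
    db.text.runCommand( "text", { search: w, limit: 10 } );
});
print("total ms: " + (new Date() - start));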
As the results show, Mongo loses. Because of how Mongo works with RAM, it searches faster on the second run than on the first: Mongo keeps only the requested data in memory, and on the first run the index had not yet been loaded. But even with the index loaded, Mongo is more than 20 times slower.
On smaller tables the gap narrows, but Sphinx still retains roughly a 10-fold advantage.
It is also worth noting that the text index in Mongo takes roughly twice as much memory as the data it indexes.
Conclusion
Sphinx wins the text search contest by an impressive margin.
In defense of Mongo, we can say that:
- Mongo has many more features besides text search
- it is easier to scale horizontally while increasing performance
- text search is still in test mode
- text search in Mongo is simpler than in Sphinx and takes a couple of hours less to learn
The new MongoDB feature will not greatly change the balance of power in the data storage field.