opium July 16, 2012 at 18:16

Improving search relevance in sphinxsearch


Sphinx (sphinxsearch) is a search engine for fast full-text search. It can pull data from MySQL, Oracle and MSSQL, and it can also act as a data store itself (real-time indexes). Sphinx can be used through its API or through SphinxQL, an analogue of the SQL protocol (with some restrictions), which lets you plug Sphinx search into a site with minimal code changes. It is one of the few great, large, open projects developed in Russia. I have personally seen Sphinx handle about 100-200 search queries against 2 million records from MySQL while the server ran comfortably, whereas MySQL starts to die at just 10 queries per second on a similar configuration.

The main problem with the Sphinx documentation is, in my opinion, the small number of examples for the most interesting settings, so today I will try to explain them with examples. The options I cover mostly concern search algorithms and search variations. Those who work closely with Sphinx will not learn anything new, but I hope beginners will be able to improve the quality of search on their sites.

Sphinx consists of two independent programs, indexer and searchd. The first builds indexes from data taken from the database; the second searches the built index. Now let's move on to the search settings in Sphinx.

morphology

Lets you specify how words are normalized morphologically; I use only stemming. Using a set of rules for the language, the stemming algorithm truncates endings and suffixes. Stemming does not use ready-made word dictionaries; it relies on truncation rules for the language. That makes it small and fast, but it is also its weakness, because the rules can make mistakes.

An example of normalizing a word by stemming in Russian:
The different case forms of the Russian word for “apple” are all truncated to the same stem. Any search query containing a form of that word is normalized the same way, so it finds entries containing any of those forms.

For English, the words “dogs” and “dog” will both be normalized to “dog”.
Likewise, when Sphinx is given one form of the Russian word for “curly” to index, the truncated stem goes into the index, and the entry is then found by the other forms of the word (feminine, plural, and so on).
You can enable stemming for Russian, English, or both languages:

morphology = stem_en
morphology = stem_ru
morphology = stem_enru
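
To make the idea of rule-based truncation more concrete, here is a toy sketch in Python. It is not the algorithm Sphinx uses: the suffix list and the function name toy_stem are made up for illustration, and real stemmers have far more elaborate rules.

```python
# Toy suffix-stripping stemmer: illustration only, NOT the algorithm Sphinx uses.
# The suffix list is a hypothetical, deliberately tiny rule set.
SUFFIXES = ("ing", "es", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(toy_stem("dogs"))  # dog
print(toy_stem("dog"))   # dog
print(toy_stem("news"))  # new  (a rule-based mistake)
```

The last call shows the downside mentioned above: with no dictionary to consult, the rules happily turn “news” into “new”.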

You can also use the Soundex and Metaphone options; they work for English and take the sound of words into account. I do not use these algorithms in my work, so if someone knows them well, I will be glad to read about it. For Russian, such algorithms would make it possible to reduce the word for “sun” and its phonetic misspelling to the same normalized form, derived from how the words sound and are pronounced.

morphology = stem_enru, soundex, metaphone

You can connect external engines for morphology or write your own.

wordforms

Lets you plug in your own word-form dictionaries. It works well on specialized, single-topic sites, and there is a good example in the documentation.

core 2 duo > c2d
e6600 > c2d
core 2duo > c2d

This lets you find an article on Core 2 Duo for any search query, from the model number to variations of the name.

hemp > grass
nonsense > grass
my charm > grass
grass freedom > grass
che smoke > grass
there che > grass

And this dictionary will let your users easily find information about grass on the site.

Word forms are read from files in ispell or MySpell format (which can be produced in OpenOffice):

wordforms = /usr/local/sphinx/data/wordforms.txt
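
As a rough illustration of what such a dictionary does to a query, here is a small Python sketch. It only mimics the idea of mapping “source > destination” forms onto tokens; the function names parse_wordforms and normalize are hypothetical, and Sphinx applies word forms inside its own tokenizer rather than like this.

```python
# Sketch of how a "source > destination" word-form table can normalize text.
# Illustration only; this is not how Sphinx implements word forms internally.
def parse_wordforms(text: str) -> dict:
    """Parse lines of the form 'source form > destination form'."""
    forms = {}
    for line in text.splitlines():
        if ">" not in line:
            continue  # skip blank or malformed lines
        source, destination = line.split(">", 1)
        forms[tuple(source.split())] = destination.strip()
    return forms


def normalize(query: str, forms: dict) -> list:
    """Replace the longest matching token sequence at each position."""
    tokens = query.lower().split()
    longest = max((len(key) for key in forms), default=1)
    result, i = [], 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + size])
            if key in forms:
                result.append(forms[key])
                i += size
                break
        else:
            result.append(tokens[i])  # no word form applies, keep the token
            i += 1
    return result


FORMS = parse_wordforms("core 2 duo > c2d\ne6600 > c2d\ncore 2duo > c2d")
print(normalize("Core 2 Duo review", FORMS))  # ['c2d', 'review']
print(normalize("e6600 price", FORMS))        # ['c2d', 'price']
```

Both queries end up containing the same token, c2d, which is why either one finds the same article.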

enable_star

Lets you use asterisks in queries. For example, the query *pr* finds words that contain “pr”, such as “prospectus”, “approximation”, and so on.

enable_star = 1
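
The effect of asterisks can be imitated in Python with the standard fnmatch module. This is only a stand-in for the matching behaviour, under the assumption that * means “any run of characters”; the word list and the helper star_match are invented for the example, and Sphinx itself matches against its index rather than scanning a list.

```python
from fnmatch import fnmatchcase

# Stand-in for asterisk matching using Python's fnmatch; illustration only.
# Note: fnmatch also treats '?' and '[...]' as special, which this sketch ignores.
WORDS = ["prospectus", "approximation", "search"]


def star_match(pattern: str, words: list) -> list:
    """Return the words that match a pattern containing asterisks."""
    return [word for word in words if fnmatchcase(word, pattern)]


print(star_match("*pr*", WORDS))  # ['prospectus', 'approximation']
print(star_match("pr*", WORDS))   # ['prospectus']
```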

expand_keywords

Automatically expands each keyword in the search query into three variants:

running -> ( running | *running* | =running )

That is: the word with morphology applied, the word with asterisks, and an exact match of the word. This option did not exist before, and to search with asterisks I had to issue an additional query by hand; now it is all enabled with one option. In addition, exact matches automatically rank higher in the results than matches found through asterisks or morphology.

expand_keywords = 1
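
Before this option existed, the expansion had to be built by hand on the application side. A minimal Python sketch of that manual step might look as follows; the function names are hypothetical, and the output format simply copies the expansion shown above.

```python
# Sketch of building the expanded query by hand, as one had to do
# before expand_keywords existed. Function names are made up for this example.
def expand_keyword(word: str) -> str:
    """Combine the plain word, the starred word and the exact-form word."""
    return f"( {word} | *{word}* | ={word} )"


def expand_query(query: str) -> str:
    """Expand every whitespace-separated keyword of the query."""
    return " ".join(expand_keyword(word) for word in query.split())


print(expand_keyword("running"))
# ( running | *running* | =running )
```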

index_exact_words

Stores the original word in the index alongside its morphologically normalized form. This greatly increases the index size, but combined with the previous option it produces more relevant results.

For example, take three forms of the Russian word for “melon”. Without this option, all three are stored in the index as the same stem, and a query for any one of them returns the results in the order they were added to the index.
If you enable both expand_keywords and index_exact_words, the same query produces more relevant output: the entry with the exact form from the query comes first, followed by the other forms.

index_exact_words = 1
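
The difference in ordering can be shown with a toy model in Python. Everything here is an assumption made for illustration: the one-rule stemmer, the word list, and the idea of ranking as a simple sort. Sphinx computes real relevance weights; the sketch only shows why keeping the exact form makes it possible to put exact matches first.

```python
# Toy model of result ordering with and without exact word forms.
# Illustration only; Sphinx's real ranking is based on computed weights.
def toy_stem(word: str) -> str:
    """Hypothetical one-rule stemmer: drop a trailing 's'."""
    return word[:-1] if word.endswith("s") else word


def search(query: str, indexed_words: list, exact_forms: bool) -> list:
    """Return stem matches; optionally move exact matches to the front."""
    matches = [w for w in indexed_words if toy_stem(w) == toy_stem(query)]
    if exact_forms:
        # sorted() is stable, so non-exact matches keep their index order
        matches = sorted(matches, key=lambda w: w != query)
    return matches


INDEX = ["melons", "melon", "lemon"]
print(search("melon", INDEX, exact_forms=False))  # ['melons', 'melon']
print(search("melon", INDEX, exact_forms=True))   # ['melon', 'melons']
```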

min_infix_len

Lets you index parts of words (infixes) and search for them with *, as in search*, *search and *search*.
For example, with min_infix_len = 2, indexing the word “test” stores “te”, “es”, “st”, “tes”, “est” and “test” in the index, so the query “es” will find this word.

I usually use

min_infix_len = 3

A lower value generates too much garbage. Also remember that this option greatly increases the index size.
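
To see where the extra index size comes from, the infixes of a word can be enumerated with a few lines of Python. This is only a sketch of the idea; the function name infixes is made up, and it says nothing about how Sphinx stores infixes internally.

```python
# Enumerate the infixes of a word; illustration of the idea only.
def infixes(word: str, min_len: int) -> list:
    """All distinct substrings with at least min_len characters, shortest first."""
    found = []
    for size in range(min_len, len(word) + 1):
        for start in range(len(word) - size + 1):
            piece = word[start:start + size]
            if piece not in found:
                found.append(piece)
    return found


print(infixes("test", 2))  # ['te', 'es', 'st', 'tes', 'est', 'test']
print(infixes("test", 3))  # ['tes', 'est', 'test']
```

Raising the minimum length from 2 to 3 halves the number of stored parts even for this four-letter word, and the gap grows quickly for longer words.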

min_prefix_len

It is closely related to min_infix_len and does almost the same thing, except that it stores only the beginnings of words (prefixes).
For example, with min_prefix_len = 2, indexing the word “test” stores “te”, “tes” and “test” in the index, so the query “te” will find this word.

min_prefix_len = 3
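
The same kind of sketch for prefixes is even shorter. Again, this is an illustration in Python with a made-up function name, not a description of Sphinx internals.

```python
# Enumerate the prefixes of a word; illustration of the idea only.
def prefixes(word: str, min_len: int) -> list:
    """All prefixes of the word with at least min_len characters."""
    return [word[:size] for size in range(min_len, len(word) + 1)]


print(prefixes("test", 2))          # ['te', 'tes', 'test']
print("es" in prefixes("test", 2))  # False
```

The second line shows the trade-off: far fewer parts are stored than with infixes, but a fragment from the middle of the word no longer matches.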

min_word_len

The minimum length of a word to be indexed. The default is 1, which indexes all words.
I usually use

min_word_len = 3

Shorter words usually carry no meaning of their own.
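
A quick way to get a feel for what such a threshold throws away is to filter a sample sentence in Python. The helper indexable_words and the sample text are invented for the example, and splitting on whitespace is a simplification of real tokenization.

```python
# Show which words survive a minimum-length threshold; illustration only.
def indexable_words(text: str, min_word_len: int) -> list:
    """Keep only the words that are at least min_word_len characters long."""
    return [word for word in text.split() if len(word) >= min_word_len]


print(indexable_words("to be or not to be a question", 3))
# ['not', 'question']
```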

html_strip

Strips all HTML tags and HTML comments. This option is relevant if you are building your own Google or Yandex on top of Sphinx: you launch a spider, parse the site, load it into the database, and point the indexer at it. This option then gets rid of the garbage in the form of HTML tags, so that search runs only over the site's content.

Unfortunately, I have not used it myself, but the documentation says it can trip over all sorts of XML and non-standard HTML (for example, stray or mismatched opening and closing tags).

html_strip = 1
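
If you want to preview roughly what text is left once the markup is gone, the standard Python html.parser module is enough for a sketch. This is not Sphinx's stripper and may behave differently on broken markup; the class TextExtractor and the function strip_html are names made up for this example.

```python
from html.parser import HTMLParser


# Rough preview of "text without tags"; NOT the stripper Sphinx uses.
class TextExtractor(HTMLParser):
    """Collects text nodes; tags and comments are simply not collected."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup: str) -> str:
    """Return the text content of the markup with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


print(strip_html("<p>Hello <b>world</b><!-- note --></p>"))  # Hello world
```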

I welcome any questions and corrections.
Official site: sphinxsearch.com.
If you found the article interesting, don't be too lazy to upvote it.
