How does the ranking work?

    Over time, Sphinx has accumulated quite a pile of search and ranking modes. Questions about them come up regularly (from "how do I force a document into 1st place" to "how do I show 1 to 5 stars depending on match quality"), and these are really questions about the internal workings of those modes. In this post I'll tell you everything I can remember: how search modes and ranking modes are organized, what ranking factors are, how exactly the factors and the final weight are computed, all of that. And, of course, about the stars!


    About search and ranking modes


    First of all, let's figure out what those modes actually do. Two different API methods deal with them, SetMatchMode() and SetRankingMode(). They look unrelated, but internally they now do the same thing. Up to and including version 0.9.8, only the search modes existed, i.e. SetMatchMode(). Each of them was implemented by its own code branch, and each branch both searched for documents and ranked them, each in its own way, of course. In version 0.9.8, development of a new unified document matching engine began. To preserve compatibility, however, that engine was only exposed through the SPH_MATCH_EXTENDED2 mode. By 0.9.9 it had become clear that the new engine was stable and fast enough (and, just in case something was overlooked, version 0.9.8 was still there). This allowed us to remove a bunch of old code, and starting from 0.9.9, all search modes are handled by the new engine. For compatibility, the legacy search modes use a simplified query parser (which ignores the operators of the full-text query language), and the ranker corresponding to the legacy mode is selected automatically, but that is where the differences end. So, in fact, the document weight (weight) depends only on the ranking mode (ranker). Hence, the following two queries will produce identical weights:

    // variant 1
    $cl->SetMatchMode ( SPH_MATCH_ALL );
    $cl->Query ( "hello world" );
    // variant 2
    $cl->SetMatchMode ( SPH_MATCH_EXTENDED2 );
    $cl->SetRankingMode ( SPH_RANK_PROXIMITY );
    $cl->Query ( "hello world" ); 
    


    In the second case, you can also write @title hello (see the query language). In the first, you cannot (see compatibility).

    Rankers compute the weight of a document from several internal factors available to them (and only to them). One could say that the different rankers simply combine those factors into the final weight in different ways. The two most important factors are 1) the classic statistical factor BM25, used by most (if not all) search engines since the 1980s, and 2) the Sphinx-specific phrase weight factor.

    About BM25


    BM25 is a real number in the range from 0 to 1 that depends on the frequencies of the keywords in the current document on the one hand and in the overall document set (the collection) on the other, and only on those frequencies. The current BM25 implementation in Sphinx is computed from the total frequency of each word in the document, not just the frequency of the occurrences that actually matched the query. For example, for the query @title hello (which matches one copy of the word hello in the title), the BM25 factor will be computed exactly the same as for the query @(title,content) hello. This simplified implementation is intentional: in the standard ranking mode the BM25 factor is secondary, the ranking differences during testing turned out to be insignificant, while the speed difference was quite measurable. The exact calculation algorithm looks like this:

    BM25 = 0
    foreach ( keyword in matching_keywords )
    {
        n = total_matching_documents ( keyword )
        N = total_documents_in_collection
        k1 = 1.2
        TF = current_document_occurrence_count ( keyword )
        IDF = log((N-n+1)/n) / log(1+N)
        BM25 = BM25 + TF*IDF/(TF+k1)
    }
    BM25 = 0.5 + BM25 / ( 2*num_keywords ( query ) )
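
    As a sanity check, here is the pseudocode above transcribed into plain Python. This is a sketch: the function name and the dict-based inputs are mine, not part of the Sphinx API.

```python
import math

def sphinx_bm25(doc_tf, doc_freqs, total_docs, num_query_keywords, k1=1.2):
    """Sketch of Sphinx's simplified BM25 per the pseudocode above.

    doc_tf:    {keyword: occurrence count in the current document}
    doc_freqs: {keyword: number of collection documents containing it}
    """
    bm25 = 0.0
    for keyword, tf in doc_tf.items():
        n = doc_freqs[keyword]   # documents matching this keyword
        N = total_docs           # documents in the collection
        idf = math.log((N - n + 1) / n) / math.log(1 + N)
        bm25 += tf * idf / (tf + k1)
    # squeeze the sum into the 0..1 range
    return 0.5 + bm25 / (2 * num_query_keywords)
```

    For instance, a single-keyword query matching a word that occurs in exactly one document out of 1000 (with TF = 1) yields roughly 0.5 + 0.4545/2 ≈ 0.727, which matches the TF range discussed below.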
    


    TF stands for Term Frequency (the keyword's frequency in the ranked document). IDF stands for Inverse Document Frequency. The IDF is lower for keywords that occur frequently in the collection and higher for rarer words. The extreme values are IDF = 1 for a word that occurs in exactly one document, and IDF ≈ -1 for a word that occurs in every document. The TF term theoretically ranges from 0 to 1, depending on k1; with the chosen k1 = 1.2, it actually varies from 0.4545... to 1.

    BM25 grows when the keywords are rare and occur in the document many times, and falls when the keywords are frequent. It is highest when all the query keywords match the document, the words are extremely rare (found in this one document out of the entire collection), and they also occur in it many times. It is lowest when the document matches only one extremely frequent word that occurs in every document... and that word, too, appears in the document many times.

    Note that overly frequent words (occurring in more than half of the documents) actually reduce BM25! Indeed, if a word occurs in every document but one, then that one document without the super-frequent word is still special and deserves more weight.

    About phrase weight


    The phrase weight (also known as the degree of proximity to the query, or query proximity) is computed in a completely different way. This factor ignores frequencies entirely, but it does take into account the relative positions of the keywords in the query and in the document. To compute it, Sphinx analyzes the keyword positions in each field of the document, finds the longest contiguous match between field and query, and measures that match's length in keywords. Formally speaking, it finds the longest common subsequence (LCS) of keywords between the query and the field being processed, and sets the phrase weight for that field to the LCS length. In plain words, the phrase weight of a field is the number of keywords that occurred in the field in exactly the same order as in the query. Here are some examples:

    1) query = one two three, field = one and two three
    field_phrase_weight = 2 (the subphrase "two three", 2 keywords long, matched)

    2) query = one two three, field = one and two and three
    field_phrase_weight = 1 (individual words match, but there is no subphrase)

    3) query = one two three, field = nothing matches at all
    field_phrase_weight = 0
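
    The three examples can be checked with a short sketch of this factor (a hypothetical helper, not Sphinx code: it finds the longest run of keywords that is contiguous in both the query and the field):

```python
def field_phrase_weight(query, field):
    """Length, in keywords, of the longest subphrase of the query
    that appears in the field in the same contiguous order."""
    q = query.split()
    f = field.split()
    best = 0
    # prev[j] = length of the common run ending at q[i-2], f[j-1]
    prev = [0] * (len(f) + 1)
    for i in range(1, len(q) + 1):
        cur = [0] * (len(f) + 1)
        for j in range(1, len(f) + 1):
            if q[i - 1] == f[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

    Running it on the three examples above reproduces the weights 2, 1, and 0.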

    To get the final phrase weight of a document, the per-field phrase weights are multiplied by the user-specified field weights (see the SetFieldWeights() method) and the products are summed. (By the way, field weights default to 1 and cannot be set below 1.) In pseudocode:

    doc_phrase_weight = 0
    foreach ( field in matching_fields )
    {
       field_phrase_weight = max_common_subsequence_length ( query, field )
       doc_phrase_weight += user_weight ( field ) * field_phrase_weight
    }

    For instance:

    doc_title = hello world
    doc_body = the world is a wonderful place

    query = hello world
    query_title_weight = 5
    query_body_weight = 3

    title_phrase_weight = 2
    body_phrase_weight = 1
    doc_phrase_weight = 2*5+3*1 = 13
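
    The combination step of this worked example can be spelled out as follows (a sketch; the function name and dict inputs are mine):

```python
def doc_phrase_weight(field_phrase_weights, user_field_weights):
    """Sum the per-field phrase weights scaled by the user field weights."""
    return sum(user_field_weights[field] * weight
               for field, weight in field_phrase_weights.items())

# the worked example: title weighted 5, body weighted 3
total = doc_phrase_weight({"title": 2, "body": 1}, {"title": 5, "body": 3})
```

    Here `total` works out to 2*5 + 1*3 = 13, as in the example.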

    About a brighter future


    The two factors described today are the basic ones, but by no means the only possible ones. Technically, other text factors could be computed as well. For example: a "proper" BM25 based on actual matches; smarter subphrase weights that take into account the frequencies of the constituent words; an additional count of the words matched in each field; and so on. All sorts of non-text factors could also be taken into account at the ranker level; in other words, certain attributes could be used during the weight calculation, rather than after it.

    About rankers


    Finally, about the ranking modes, rankers for short. They are what combine all the various factors into the final weight. The weights a ranker outputs are integers.

    The default ranker (in the extended/extended2 modes) is called SPH_RANK_PROXIMITY_BM25 and combines the phrase weight with BM25. The BM25 factor goes into the three lowest decimal digits; the phrase weight occupies the fourth digit and up. There are two related rankers, SPH_RANK_PROXIMITY and SPH_RANK_BM25. The first simply returns the phrase weight factor itself as the weight. The second honestly computes only BM25 and, instead of the lengthy calculations, quickly takes the phrase weight of each matching field to be one.

    // SPH_RANK_PROXIMITY_BM25 ranker (the default)
    rank_proximity_bm25 = doc_phrase_weight*1000 + doc_bm25*999

    // SPH_RANK_PROXIMITY ranker (forced in SPH_MATCH_ALL mode)
    rank_proximity = doc_phrase_weight

    // SPH_RANK_BM25 ranker (fast, since the expensive subphrase weights need not be computed)
    rank_bm25 = sum ( matching_field_weight )*1000 + doc_bm25*999
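
    The digit-packing can be illustrated with a minimal Python sketch (helper names are mine; doc_bm25 is the 0..1 factor described earlier):

```python
def rank_proximity_bm25(doc_phrase_weight, doc_bm25):
    """SPH_RANK_PROXIMITY_BM25: phrase weight from the 4th decimal digit up,
    BM25 scaled to 0..999 in the lowest three digits."""
    return doc_phrase_weight * 1000 + int(doc_bm25 * 999)

def rank_bm25(matching_field_weights, doc_bm25):
    """SPH_RANK_BM25: every matching field counts as phrase weight 1."""
    return sum(matching_field_weights) * 1000 + int(doc_bm25 * 999)
```

    Note that any phrase weight advantage dominates: a document with phrase weight 2 and the worst BM25 still outranks one with phrase weight 1 and the best BM25.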

    SPH_RANK_PROXIMITY_BM25 is selected by default because, in our opinion, it gives the most relevant results of the rankers available. The default may change in the future; there are quite a few plans for smarter and generally better rankers. The SPH_RANK_PROXIMITY ranker effectively emulates the SPH_MATCH_ALL mode (the very first one, by the way, from which everything started back in 2001). SPH_RANK_BM25 should be used when the phrase weight is unimportant for whatever reason, or simply to speed up queries. By the way, the choice of ranker significantly affects query speed! The most expensive part of the calculation is usually the analysis of word positions within the document. Therefore SPH_RANK_PROXIMITY_BM25 will always be slower than SPH_RANK_BM25, which in turn is always slower than SPH_RANK_NONE (which computes nothing at all).

    Another ranker, used to handle the legacy search modes, is SPH_RANK_MATCHANY. It also computes subphrase weights, just like the two proximity rankers above, but in addition it counts the number of matching keywords in each field and combines that count with the subphrase weights so that a) a longer subphrase in any field ranks the whole document higher; b) given equal subphrase lengths, a document with more matching words ranks higher. In plain terms: if document A contains a more exact (longer) subphrase of the query than document B, document A should rank higher; otherwise (if the subphrases are of equal length), we simply compare the word counts. The algorithm is as follows:

    k = 0
    foreach ( field in all_fields )
       k += user_weight ( field ) * num_keywords ( query )

    rank = 0
    foreach ( field in matching_fields )
    {
       field_phrase_weight = max_common_subsequence_length ( query, field )
       field_rank = ( field_phrase_weight * k + num_matching_keywords ( field ) )
       rank += user_weight ( field ) * field_rank
    }
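
    The same algorithm in runnable form (a sketch; it takes precomputed per-field phrase weights and matching-word counts rather than parsing text):

```python
def rank_matchany(num_query_keywords, all_field_weights, matched_fields):
    """Sketch of SPH_RANK_MATCHANY.

    all_field_weights: {field: user weight} over ALL indexed fields
    matched_fields:    {field: (phrase_weight, num_matching_keywords)}
    """
    # k is large enough that one extra unit of subphrase length always
    # outweighs any possible number of matching keywords
    k = sum(w * num_query_keywords for w in all_field_weights.values())
    rank = 0
    for field, (phrase_weight, num_words) in matched_fields.items():
        field_rank = phrase_weight * k + num_words
        rank += all_field_weights[field] * field_rank
    return rank
```

    For example, with a 3-keyword query and two fields weighted 1, a document with a 2-word subphrase in the title outranks one where all 3 words matched but only scattered, which is exactly the intended behavior.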

    The SPH_RANK_WORDCOUNT ranker simply sums up the number of matched keyword occurrences in each field, multiplied by the field weight. Only SPH_RANK_NONE, which computes nothing at all, is simpler.

    rank = 0
    foreach ( field in matching_fields )
       rank += user_weight ( field ) * num_matching_occurrences ( field )

    Finally, SPH_RANK_FIELDMASK returns a bitmask of the fields that matched the query. (Also uncomplicated, yes...)

    rank = 0
    foreach ( field in matching_fields )
       rank |= ( 1 << index_of ( field ) )
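
    In runnable form (a sketch; fields are identified by their zero-based index):

```python
def rank_fieldmask(matching_field_indexes):
    """Sketch of SPH_RANK_FIELDMASK: set bit N when field number N matched."""
    rank = 0
    for index in matching_field_indexes:
        rank |= 1 << index
    return rank
```

    So a document matching in fields 0 and 2 gets weight 0b101 = 5.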

    About stars


    The question regularly comes up of what the maximum possible weights are, and how to correctly convert them into stars (usually on a 5-point scale, though variations are possible), percentages, or points on an intuitive scale from 1 to 17. As you can see, the maximum weight depends heavily both on the ranker and on the query. For example, the absolute maximum weight output by SPH_RANK_PROXIMITY_BM25 depends on the number of keywords and the field weights:

    max_proximity_bm25 = num_keywords * sum ( field_weights ) * 1000 + 999

    But this absolute maximum is only reached when, first, every field contains an exact copy of the query, and second, the query searches across all fields. What if the query is restricted to one specific field (for example, @title hello world)? The absolute maximum becomes unattainable in principle; for this particular kind of query, the practical maximum weight is:

    max_title_query_weight = num_keywords * title_field_weight * 1000 + 999

    So it is quite difficult to accurately compute the "correct" maximum weight analytically. If you cannot live without stars, you can either fairly easily compute the "absolute" maximum (which will almost never be reached), or simply take the maximum weight within each specific result set and normalize everything against it. With the multi-query mechanism this comes almost for free, but that is a topic for a separate article.
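
    A minimal sketch of the per-result-set normalization (the helper and the 5-star mapping are mine, not part of the Sphinx API):

```python
def weight_to_stars(weight, max_weight, stars=5):
    """Map a raw ranker weight onto a 1..stars scale, given some maximum,
    e.g. the highest weight actually observed in this result set."""
    if max_weight <= 0:
        return 1
    return max(1, min(stars, round(stars * weight / max_weight)))

# absolute maximum for SPH_RANK_PROXIMITY_BM25, per the formula above:
# 2 keywords, field weights 5 and 3
max_weight = 2 * (5 + 3) * 1000 + 999  # = 16999
```

    Clamping the low end to 1 star keeps any document that matched at all from showing zero stars.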

    About complete matches (update)


    A good question came up in the comments: why aren't complete matches of a field against the query ranked higher, and how can that be achieved?

    The point is precisely in how the rankers work. When the query matches a field completely, the default proximity and proximity_bm25 rankers assign that field the maximum phrase weight (and that is their main ranking factor). However, the length of the field itself is not taken into account, and neither is the fact that the field coincides with the query completely. Such is the historically established behavior; apparently, the situation where a query exactly equals one field or another simply was not very common before.

    In the current trunk (version 0.9.10) work is already underway: an experimental ranker, SPH_RANK_SPH04, has been added, which ranks complete field matches higher. The technical possibility appeared in 0.9.9, since in 0.9.8 the index format does not provide the necessary data (for the curious: the stored positions have no "this was the end of the field" flag).

    Some things can be done even without the new ranker.

    For example, you can try to boost the weight manually in the case of a complete match. To do that, strip all punctuation and upper case by hand from the field itself (at indexing time) and from the query, compute crc32 of the field, and store it in an index attribute. Then, at search time, compute the expression weight + IF(fieldcrc==querycrc,1000,0) and sort the results by that expression. Rather crude, but in some cases it can help.
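
    The crc trick might look like this on the indexing-script side (a sketch; the normalization regex and the helper names are mine, and the same transform must be applied to both the field and the query):

```python
import re
import zlib

def normalized_crc(text):
    """Strip punctuation and case, then take crc32 of what remains."""
    cleaned = " ".join(re.findall(r"\w+", text.lower()))
    return zlib.crc32(cleaned.encode("utf-8"))

def boosted_weight(weight, field_crc, query_crc):
    """Python mirror of a search-time expression like
    weight + IF(fieldcrc==querycrc,1000,0)."""
    return weight + (1000 if field_crc == query_crc else 0)
```

    The crc of "Hello, World!" and of "hello world" then coincide, so an exact-match document gets a flat 1000-point boost.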

    You can also restate the task slightly and take into account not the fact of a complete match but the length of the document (or of an individual field). To do that, store LENGTH(myfield) in an attribute at indexing time, and at search time rank by an expression along the lines of (this is just an example!) weight + ln(max(len,1))*1000.

    In some cases (such as the example from the comments about indexing song titles), it may make sense to index the fields not separately but glued together, so that a match on the phrase "artist - song" adds phrase weight in the "glued" field. Otherwise all fields are treated separately, and a match "across a field boundary" will not rank higher.

    About the file (as in, filing things down)


    Can this whole kitchen be filed down further? And how. Once you understand how the existing rankers work, which factors they compute and how they combine them, you can consciously adjust the weight right away (more precisely, define a new arithmetic expression involving weight and sort the output by it). More interestingly, it is technically possible to add new specialized rankers (a minute of prime-time advertising: the SPH_RANK_WORDCOUNT and SPH_RANK_FIELDMASK rankers were not my invention; commercial users requested them). From the C++ code of a ranker there is immediate access to the query, the keywords, the document being ranked, and (most importantly) the list of all keyword occurrences matching the query, together with their positions in the document... Yes, from these parts of a Soviet steam locomotive you can still assemble a Soviet fighter jet; the main thing is to wield the file masterfully.
