About the fight for quality

    In exactly three days we will be revealing a whole bunch of secrets to everyone: about tuning, optimization, search quality, and scaling Sphinx (which is, still, a full-text search engine and then some) in various directions. Details at the very end of the post.

    And one of the secrets, about search quality, I will start revealing right here and now. It is a new thing called the expression ranker, added in version 2.0.2-beta (a proper Russian translation has not been invented yet), and I will tell you a bit more about it under the cut. In short, it lets you set your own ranking formula on the fly, even a separate one for each query. In essence, a kind of construction kit that gives everyone a chance to try building their own MatrixNet, with four-dimensional chess and opera singers.

    Right off the bat


    Emulating the default ranking mode (for the extended query syntax) in SphinxQL looks, for example, like this:

    SELECT *, WEIGHT() FROM myindex WHERE MATCH('hello world')
    OPTION ranker=expr('sum(lcs*user_weight)*1000+bm25')

    And through SphinxAPI, respectively:

    $client->SetRankingMode(SPH_RANK_EXPR, "sum(lcs*user_weight)*1000+bm25");
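
    Since the ranker is just a per-query option, every query can carry its own formula. A minimal sketch (myindex and both formulas are illustrative; the second one simply ranks by bm25 alone):

    SELECT *, WEIGHT() FROM myindex WHERE MATCH('hello world')
    OPTION ranker=expr('sum(lcs*user_weight)*1000+bm25')

    SELECT *, WEIGHT() FROM myindex WHERE MATCH('hello world')
    OPTION ranker=expr('bm25')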


    How does it work


    In the ranking formula you can use any document attributes and mathematical functions, just as in "ordinary" expressions. But in addition to them, the ranking formula (and only it) has several more values and functions specific to ranking: namely, document-level factors, field-level factors, and functions that aggregate over the set of fields. All these additional factors are text factors, i.e. numbers that depend on the text of the document and of the query and are computed from them on the fly. For example, the number of unique words matched in the current field, or in the whole document, is exactly such a text factor. Non-text factors also exist in nature, i.e. all sorts of numbers that do not depend on the texts: something like the number of page views, the price of a product, and so on. But those can simply be put into an attribute and then used both inside and outside the ranking. For reference, factors are also called signals; the two terms mean the same thing.
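
    For example, here is a sketch of a formula that mixes text factors with a non-text one stored as an attribute. It assumes the index has a numeric views attribute; the coefficients are made up and would need tuning on real data:

    SELECT *, WEIGHT() FROM myindex WHERE MATCH('hello world')
    OPTION ranker=expr('sum(lcs*user_weight)*1000 + bm25 + ln(views+1)*100')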

    What are the factors


    Document-level factors are as follows:
    • bm25, a rough (!) fast estimate of the statistical function BM25 for the whole document. The pedant in me cannot stay silent here: after all the coarsening done for optimization, it is actually closer to BM15 than to canonical BM25. A more accessible explanation for everyone else: this is a magic integer in the range from 0 to 999 that grows when the document has many rare words and falls when it has many frequent ones.
    • max_lcs, the maximum possible value of sum(lcs*user_weight). Used to emulate SPH_RANK_MATCHANY and, in general, handy for any kind of normalization (see the sketch right after this list).
    • field_mask, a 32-bit mask of matched fields.
    • query_word_count, the number of unique "included" keywords in the query; in other words, the total number of unique keywords adjusted for "excluded" words. For example, in the query (one !two) it equals 1, because the word (two) is excluded and will never match. For the query (one one one !two) it is also 1, because there is still only one unique included word. And for the query (one two three) the factor value is, accordingly, 3.
    • doc_word_count, the number of unique query words matched in the current document. Obviously, it can never exceed query_word_count.
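
    A quick sketch of the normalization mentioned above: dividing the proximity sum by max_lcs keeps the phrase-proximity part of the weight within a fixed range regardless of the query length (the coefficient 1000 is illustrative):

    SELECT *, WEIGHT() FROM myindex WHERE MATCH('hello world')
    OPTION ranker=expr('sum(lcs*user_weight)*1000/max_lcs + bm25')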

    Field-level factors are as follows:
    • lcs, the same magic factor that measures the degree of "phrase matching" between the query and the field. Formally, it is the length of the longest common subsequence of words between the query and the field. It equals 0 if nothing matched in the field; 1 if at least something matched; and, at the limit, it equals the number of (included) query words when the query matches the field perfectly.
    • user_weight, the user weight of the field, assigned via SetFieldWeights() or OPTION field_weights, respectively.
    • hit_count, the number of matched keyword occurrences. One keyword can produce several occurrences. For example, if the query (hello world) matches a field where hello occurs 3 times and world 5 times, hit_count will be 8. If, however, the field contains the phrase hello world exactly once (even though the individual words occur 3 and 5 times) and the query was "hello world" in quotation marks, then hit_count will be 2: the total number of word occurrences is still 8, but only 2 of them match the phrase.
    • word_count, the number of unique words matched in the field (NOT occurrences). In both previous examples it equals 2.
    • tf_idf, the sum of TF*IDF over all matched keywords. TF is simply the number of occurrences in the field (Term Frequency), while IDF is another magic metric that accounts for the "rarity" of a word: for super-frequent words (found in every document) it is 0, and for a unique word (1 occurrence in 1 document across the entire collection) it is 1.
    • min_hit_pos, the position of the very first matched keyword in the field. Numbering starts from 1. Useful, for example, to rank matches near the beginning of a field higher.
    • min_best_span_pos, the first position of the "best" (largest-lcs) span of keywords. Numbering, again, starts from 1.
    • exact_hit, a boolean flag that is set when the query matches the field exactly, in its entirety. The values are, respectively, 0 and 1.

    At the moment there is exactly one aggregation function, SUM. The expression inside the function is computed for every matched field, then the results are summed. You can use several such SUM() calls, with different expressions inside.
    For obvious reasons, field-level factors may occur strictly inside an aggregation function. In the end we must compute exactly one number and, accordingly, without being "bound" to a specific field such factors have no physical meaning. Document-level factors and attributes can, of course, be used anywhere in the expression.
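
    For example, here is a sketch with two separate SUM() calls: the first is the usual proximity part, the second rewards exact field matches and matches at the very start of a field (the boosts 100 and 50 are made-up placeholders to tune for your data):

    SELECT *, WEIGHT() FROM myindex WHERE MATCH('hello world')
    OPTION ranker=expr('sum(lcs*user_weight)*1000 + sum(exact_hit*100 + (min_hit_pos==1)*50) + bm25')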

    How to emulate existing rankers


    All previously existing rankers turn out to be extremely simple when written as the new fancy formulas: at most a couple of factors per ranker. Here is the list, with a usage sketch right after it:
    • SPH_RANK_PROXIMITY_BM25 = sum(lcs*user_weight)*1000+bm25
    • SPH_RANK_BM25 = bm25
    • SPH_RANK_NONE = 1
    • SPH_RANK_WORDCOUNT = sum(hit_count*user_weight)
    • SPH_RANK_PROXIMITY = sum(lcs*user_weight)
    • SPH_RANK_MATCHANY = sum((word_count+(lcs-1)*max_lcs)*user_weight)
    • SPH_RANK_FIELDMASK = field_mask
    • SPH_RANK_SPH04 = sum((4*lcs+2*(min_hit_pos==1)+exact_hit)*user_weight)*1000+bm25
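
    So, for instance, emulating SPH_RANK_MATCHANY from this list through the expression ranker looks like this (myindex and the query text are placeholders):

    SELECT *, WEIGHT() FROM myindex WHERE MATCH('hello world')
    OPTION ranker=expr('sum((word_count+(lcs-1)*max_lcs)*user_weight)')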

    Emulation will, of course, work slower than the corresponding built-in ranker: compiled code is still faster than our expression evaluator! However, the slowdown, which never ceases to amaze me, is often insignificant. Even when a search matches and has to rank hundreds of thousands or millions of documents, I got differences of about 30-50% in "micro" benchmarks (literally, about 0.5-0.6 seconds with emulation instead of 0.4 seconds with the built-in ranker). I suspect that when fewer than 1-10K documents match, the difference cannot be discerned at all.

    What's next!?


    So what can you do with all this? From the standpoint of improving search quality, quite a lot has become possible. In effect, you can now tweak the ranking any way you like. There is a bunch of new factors that were never computed before and can now be tuned right on the fly. And there is now a technical ability to quickly and easily add new factors upon request (do get in touch); a number of factors were added exactly that way, at the request of commercial customers.
    Clearly, this is only the tip of the iceberg, and a whole lot of questions immediately arise: how to measure that very "quality", how to tweak the formulas, and so on. But I write slowly and talk quickly, so if you want to hear this post live and in more detail, learn twice as much about relevance and search quality, and also catch a few more talks about everything else announced at the beginning of the post, then welcome to the conference. (Saint Petersburg, Sunday, December 04. It is free, but registration is required. There are still a few seats left, but you need to hurry right now.)

    Bye everyone, and good luck in the fight for search quality :)
