A word about video search

    Today, on this fine spring day, I want to write not only about video search as such, but also about the technical
    details of working with Sphinxsearch in a high-load Django project.


    Search


    It's probably worth starting by stating the business problem:


    • You need to search for relevant videos by title, description and other text data
    • For each video you need to find related videos
    • The right videos must appear in the search results for the right queries, in the right places.

    And non-functional requirements:


    • A Django project with a ton of views and constantly updated video descriptions
    • The work with the search engine must be encapsulated in a library and stay compatible with the other libraries used on the site (primarily the Django REST Framework)

    This story is about how sphinxsearch is used at Rutube.


    About video relevance and math


    When people talk about search on the Internet, they usually mean searching textual information. With video, everything is much worse. A person usually has a very specific visual image in mind, which they translate into the text of a query. The people who uploaded the videos translated their contents into titles and descriptions, and it's lucky if those aren't "test", "sdfsdf" or "111". Either way, there is only a bare minimum of textual information, plus occasionally some metadata added by the editors and partner users. So if you are a search programmer, questions like "why doesn't the query 'rp' find 'Real Boys'?" will haunt you at night. A special test utility helps us answer such questions.
    It is a page that, for a given search query, returns not only the videos with all the fields stored in the index, but also information from the ranker, with the values of every ranking factor for each document. For this it requests PACKEDFACTORS(), SNIPPETS() and CALL KEYWORDS. The data on this page is usually enough to mathematically justify (there it is, the magic word that only programmers understand in the mathematical sense, and everyone else in the spiritual one!) why an irrelevant video ended up above a relevant one.
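
    To give a feel for it, here is a minimal sketch of the kind of query such a debug page runs. The host, port and the "video" index name are assumptions for illustration; PACKEDFACTORS() only works together with an expression-based ranker.

      # Sketch only: connect to searchd over the MySQL protocol and dump
      # per-document ranking factors. "video" is a hypothetical index name.
      import pymysql

      conn = pymysql.connect(host='127.0.0.1', port=9306)  # searchd SphinxQL port
      with conn.cursor() as cur:
          cur.execute(
              "SELECT id, WEIGHT(), PACKEDFACTORS() FROM video "
              "WHERE MATCH(%s) "
              "OPTION ranker=expr('sum(lcs*user_weight)*1000+bm25')",
              ('real boys',))
          for doc_id, weight, factors in cur.fetchall():
              # factors: per-field lcs, hit_count, word_count, idf, etc.
              print(doc_id, weight, factors)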


    A Django database backend for search


    Since Sphinxsearch speaks the MySQL protocol, why not write a database backend for Django, build the search queries we need, and get the results back as Django models? I am not suggesting that everyone try this, but it really is not that scary. Those who are still scared, or simply not interested, may skip straight to the next section.


    As usual, there is something useful on github. django-sphinx-db, for example, helped us start working with the engine directly from Django models. In Django 1.8 the private part of the database backend implementation was changed significantly, which made porting django-sphinx-db problematic. As a result the django-sphinxsearch project appeared; it could use a bit more attention from its developers, but it is already used in our production. This, by the way, illustrates the difficulty of maintaining a custom backend: a new Django version comes out and everything falls apart, because "it's my internals, I change them however I like". So you get to start over from the beginning.


    It looks something like this:


    1. Find a PEP-0249-compatible database connector: MySQL-python, psycopg, python_monetdb - depending on what Django is to be bolted onto.
    2. Take the backend closest in spirit and inherit from it. For sphinxsearch that is MySQL, i.e. django.db.backends.mysql.
    3. The hardest part: teach SQLCompiler to generate code compatible with the database the backend is written for. This covers quoting, the ability to specify table names without a schema, LIMIT/OFFSET syntax, and the like. Here I want to send a separate ironic "thank you" to the Django developers for the fact that SQLCompiler.as_sql, the method that assembles a string from QuerySet.query, is a monolith of almost 100 lines; as a result, to turn LIMIT ... OFFSET ... into LIMIT start, end you have to keep post-processing the result of calling the base class method.
    4. Add QuerySet methods that provide search-specific functionality. For example, SphinxQuerySet.match adds to self.query a self.match structure containing the data needed to build a SphinxQL expression. The match field is cloned and modified along with the QuerySet, and is finally used in SphinxWhereNode.make_atom to generate part of the query string (a minimal sketch follows this list). Nothing complicated, you just have to write tests and keep a good debugger at hand.
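
    For point 4, here is a minimal sketch of such a match() method, assuming a custom Query subclass whose clone() preserves the extra attribute; the names are illustrative, not the actual django-sphinxsearch code:

      # Sketch only: carry MATCH() data on the query object so it survives
      # QuerySet cloning and can later be rendered into SphinxQL.
      from django.db import models

      class SphinxQuerySet(models.QuerySet):
          def match(self, expression):
              """Attach a full-text MATCH() expression to this queryset."""
              clone = self._clone()
              # assumes the Query subclass copies `match` in its clone()
              clone.query.match = getattr(clone.query, 'match', []) + [expression]
              return clone

      # A custom WhereNode/SQLCompiler then turns query.match into the
      # "WHERE MATCH('...')" fragment when assembling the final SphinxQL.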

    Search Results Ranking


    Search results are usually sorted by how relevant they are to the search query. How do you measure that? For example, you can count the words present in both the document and the query: the more words in the intersection, the better the result fits the query. You can go beyond counting matching words and also take their order into account. And if you additionally weigh each word by its "rarity", it gets really good: prepositions and conjunctions shared by the query and the document stop influencing the results. Many such characteristics have been invented, some useful and some less so, and in the general case it is reasonable to use a weighted sum of the values of all the characteristics that the engine computes.


    In addition to the characteristics that relate a specific document to a search query, you can add extra weight to documents with certain attributes regardless of the query: for example, boost videos uploaded in good quality, or give more weight to newer or more viewed videos.


    So we add to the query:


    SELECT weight() + a * view_count + b * age AS my_weight,
    ...
    ORDER BY my_weight DESC
    OPTION ranker=expr('...');

    and the sort order of the results is fully under your control.


    In this straightforward way you can tune:


    • the general shape of the ranking formula
    • the weights of individual fields (the title being, say, 10 times more important than the description; see the sketch after this list)
    • the weights of individual ranking factors
    • extra bonuses for results that "match" the search query in specific ways
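
    As an illustration of the field weights, a hedged sketch, reusing the pymysql connection from above (the index name and the weights are made up; user_weight in the ranker expression is exactly where field_weights enters the formula):

      # Sketch: make title matches count 10x as much as description matches.
      with conn.cursor() as cur:
          cur.execute(
              "SELECT id, WEIGHT() AS w FROM video WHERE MATCH('real boys') "
              "ORDER BY w DESC "
              "OPTION field_weights=(title=10, description=1), "
              "ranker=expr('sum(lcs*user_weight)*1000+bm25')")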

    QuerySet.iterator()


    [Picture: (c) Soyuzmultfilm - the editors hinting that these "tweaks" are not enough for them]


    If tuning the search query is not enough, you can "nail" results in place. For some queries this is downright critical functionality, so whether you want to or not, you have to implement mechanisms for manipulating the search results.


    1. Check whether there are videos that the editors want to see in the results for the current query, and get the positions they should occupy.
    2. Run the search query, excluding the "nailed" results.
    3. Modify QuerySet.iterator() so that in the "normal" state it yields results from sphinxsearch, and in certain positions yields those "nailed" videos (for example, returning an episode of "Real Boys" for the query "paso doble". No comment.) A sketch follows this list.
    4. If the search returns irrelevant results, you can fall back to the database entirely: for example, instead of similar videos, show the list of episodes of the same series. For this it is enough that the main Video model has the same fields as the SearchVideo search-result model.
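
    A minimal sketch of step 3, assuming the pinned videos arrive as a {position: video} mapping (all names here are illustrative, not the production code):

      # Sketch: splice editor-pinned videos into the stream of search results.
      from django.db import models

      class SphinxQuerySet(models.QuerySet):
          def iterator(self):
              pinned = getattr(self, '_pinned', {})  # {position: Video}
              results = super().iterator()
              position = 0
              while True:
                  if position in pinned:
                      yield pinned[position]  # a "nailed" video
                  else:
                      try:
                          yield next(results)
                      except StopIteration:
                          return  # plain results exhausted
                  position += 1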

    Technical limitations


    I'll say a little about what you cannot do in sphinxsearch. The strangest, yet explainable, "cannot": you cannot return the entire result set minus one or several documents. A plain fullscan is fine, but not a fullscan WHERE MATCH('~document_id'): the software forbids it, on the grounds that it would be inefficient.


    There are two restrictions related to limits. First, SELECT * without an explicit LIMIT returns 20 results, something like repr(queryset). Second, to reach the 100500th element you have to add OPTION max_matches=100500 to the query: the engine does partial sorting internally with a window whose default size is 1000, so requesting a larger offset without raising it is an error.
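
    For example, paginating past the default window might look like this (an illustrative query over the same hypothetical index):

      # Sketch: fetch 10 results starting at offset 100490; without the
      # OPTION, anything past the 1000-document window is an error.
      with conn.cursor() as cur:
          cur.execute(
              "SELECT id FROM video WHERE MATCH('cats') "
              "LIMIT 100490, 10 OPTION max_matches=100500")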


    There are many strange restrictions on numeric operations with attributes. For example, you can write float_field <> 3.1415 in SELECT, but not in WHERE. What can you do - quirks of the parser. It can be worked around via QuerySet.extra(), as sketched below.
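
    A hedged sketch of that workaround: compute the comparison as a SELECT column and filter on the alias, which SphinxQL allows (the model and field names are made up):

      # Sketch: move the forbidden WHERE expression into the SELECT list.
      qs = SearchVideo.objects.extra(
          select={'cond': 'float_field <> 3.1415'},
          where=['cond = 1'],
      )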


    The most unpleasant "cannot": you cannot rely on the search not crashing at the worst possible moment. We had a case where searchd restarted immediately after receiving a query containing the number "13". This is especially unpleasant on a page where search results are not the main content. We got by with a generator that, upon catching an OperationalError, quietly and peacefully returns an empty answer.
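
    Roughly along these lines (a sketch, not the production wrapper):

      # Sketch: degrade to an empty result instead of a 500 if searchd dies.
      from django.db import OperationalError

      def fail_safe(queryset):
          try:
              for item in queryset:
                  yield item
          except OperationalError:
              return  # searchd crashed or timed out: pretend there are no results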


    Under load


    When a site has a lot of data that changes very often, you can't just reindex the whole site every 5 minutes. You have to be smarter. For the old hands of search this section won't hold much interest, since these things are mostly well known, but I'll describe them briefly and to the point:


    1. main + delta + killlist. main is the primary index: it contains the entire site and is rebuilt once a day. delta contains only the documents updated since the last indexation of the main index. killlist is a list of documents that must be excluded from the previous indexes.


      # get the IP of the indexing server
      sql_query_pre  = set @ip = substring_index(user(), '@', -1);
      # for the delta index, fill the kill-list with all documents
      # that are part of the delta index, plus the deleted ones -
      # those also have to be removed from main
      sql_query_killlist = select id from ... where ... and last_updated_ts > @last_index_ts

    2. global_idf. If there are several local indexes, it is worth setting global_idf = 1; otherwise the Inverse Document Frequency is computed separately for each index, and the "rare" words of the delta index push up results that have no business being there.
    3. Several search servers. We didn't get clever with replicating the index files; each server simply indexes the database on its own. Data does occasionally go out of sync, and an even more robust solution has been designed, but hands haven't reached it yet. The solution: an RT index that is updated on all search servers simultaneously when a message about a document change arrives. Pros: instant indexing, no data divergence in the normal state. Cons: wildly complex code for triggering the messages, because a search document gathers data from roughly 15 database tables, plus the need for message handlers on every search server.
    4. Server load. Of course, it is better not to drive CPU load to 100%, but if that has become the norm, there is the max_predicted_time option, which limits the theoretical query execution time. Relevance suffers, but some of the problems can be cut off. It is possible, but better avoided, because "God sees everything": God here is the editorial team, and "everything" is, for example, the appearance of very strange "similar" videos on a page. To cope with temporary overloads it makes sense to set CONN_TIMEOUT on the Django connection, so that search doesn't slow everything else down (see the settings sketch after this list).
    5. Administration of the synonym and exception lists. We publish them over WebDAV and apply them at the next indexation (during index rotation).
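
    A hedged sketch of the connection settings mentioned in point 4, assuming the django-sphinxsearch backend (the alias and the timeout value are illustrative):

      # settings.py sketch: a separate connection alias for searchd.
      DATABASES = {
          # ... the regular 'default' database ...
          'sphinx': {
              'ENGINE': 'sphinxsearch.backend.sphinx',
              'HOST': '127.0.0.1',
              'PORT': '9306',
              'OPTIONS': {'connect_timeout': 1},  # fail fast when searchd is overloaded
          },
      }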

    About modifying RT indexes


    Even so, sphinxsearch is not a database. Still, there is some ability to modify the data in it.


    • UPDATE allows updating attributes that have a fixed length - and this, by the way, works even for on-disk indexes.
    • In other cases REPLACE is used, which marks the old version of the document as deleted and appends a new one at the end.
    • Hence the difficulties for the developers of the Django backend: firstly, queryset.update(field=value) works only for numeric attributes, while REPLACE has to be issued as a bulk insert; secondly, REPLACE is syntactically closer to INSERT, which means it has to be built with SQLInsertCompiler. In general, there is something to think about (a small sketch of the two write paths follows).
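
    The two write paths side by side, as a sketch against a hypothetical "video" RT index (reusing the pymysql connection from above):

      # Sketch: fixed-length attributes update in place; text changes
      # require replacing the entire document.
      with conn.cursor() as cur:
          cur.execute("UPDATE video SET view_count = 1000 WHERE id = 42")
          cur.execute(
              "REPLACE INTO video (id, title, description) VALUES (%s, %s, %s)",
              (42, 'new title', 'new description'))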

    After three years of using sphinxsearch in production, the entire search team has come to love it with a deep and sincere passion. It is, perhaps, the only project in which all the problems are so strange and entertaining :)


    • DSPH-146 Baby and Window Sill
    • DSPH-115 Hamster is broken (search results)
    • DSPH-129 agreed to remove the same videos from similar ones
    • DSPH-118 Bad similar, very bad
    • DSPH-131 Again similar and again disgusting

    Sphinxsearch is also used at Rutube to store and process logs in our home-grown Kibana counterpart - which, by the way, works pretty fast. When there is time, we'll write about it too.

