(Illustration: "I think I'm starting to understand what you had in mind!")
Making a typo is easy; making a typo in a search query, doubly so. Just about every big web search engine today can, one, correct errors in keywords and, two, suggest queries as you type; naturally, smaller search setups want the same. Both features can be implemented quite neatly with the open-source search engine nicknamed Sphinx; in this post I'll explain exactly how.
So: about did you mean ("what did you mean?") and other query completion ("are you perhaps looking for Vasya?").
(Screenshot: "Listen to your favorite song 'Valenki'!")
Let's start with the simpler one: query completion. The user starts typing a search string, and as he types we want to show suggestions right away (so that he doesn't go hunting for who knows what, but listens to "Valenki" like everyone else), perhaps even with result counts. How do we do it, and where do we get the data?
There are, roughly, two options: you can suggest whole user queries, or you can suggest individual keywords.
Suggesting keywords is especially easy. The indexing tool with the unexpected name indexer has two equally unexpected switches: --buildstops, which makes indexer build a list of the N most frequent words instead of doing its usual indexing (i.e. a dictionary), and its companion --buildfreqs, which also makes it count the frequencies (i.e. a frequency dictionary). We take indexer and build the 10 thousand (100 thousand is fine too) most frequent words in our document collection:
$ indexer myindex --buildstops dict.txt 10000 --buildfreqs
We get something like this file:
i 9533843
and 5427048
to 5418872
the 5371581
a 4282225
you 2877338
...
The rest is routine. We create an SQL table, write an import script of a dozen lines, use a plain LIKE for the suggestions, and sort the results by word frequency. LIKE over an index turns out reasonably fast, since we are matching on the beginning of the key. For 1-letter queries it would be wise to precompute and cache the results, so as not to shovel through thousands of rows on every keystroke. The MySQL query cache, though, should save the day if it is enabled.
CREATE TABLE keywords
(
    keyword VARCHAR(255) NOT NULL,
    freq INTEGER NOT NULL,
    INDEX (keyword, freq)
);

SELECT * FROM keywords WHERE keyword LIKE 'valen%' ORDER BY freq DESC LIMIT 10
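The import script really is about a dozen lines. Here is one possible sketch, assuming PDO, a local MySQL database, and the dict.txt format shown above; the connection details are placeholders, not anything prescribed by Sphinx:

<?php
// hypothetical import sketch: load "word frequency" pairs into the keywords table
$db = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$ins = $db->prepare('INSERT INTO keywords (keyword, freq) VALUES (?, ?)');

foreach (file('dict.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line)
{
    list($word, $freq) = preg_split('/\s+/', trim($line));
    $ins->execute(array($word, (int)$freq));
}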
Whole queries are suggested in almost the same way. The table and the LIKE are not going anywhere; only instead of single words we now want full query strings. You can and should take them from your query log, and compute the frequencies from there. The log will need a little post-processing first: as far as search is concerned, the line "vasya pupkin" is the same as "Vasya! Pupkin!", so counting them as different queries is not quite comme il faut. The Sphinx API's BuildKeywords() method handles this: feed it an arbitrary query string, let it build the keywords (with case folding and all that), then reassemble the normalized query string from them. It is best to set up a persistent connection with the Open() method first, otherwise everything will run many times slower. After that it's the usual drill: log, processing script, sort, uniq -c, import script, SQL table, LIKE, profit. The processing script should look something like this:
$cl = new SphinxClient();
$cl->Open();
foreach ($log as $entry)
{
    $keywords = $cl->BuildKeywords($entry, "myindex", false);
    foreach ($keywords as $keyword)
        print $keyword["tokenized"] . " ";
    print "\n";
}
The import script that loads this two-field text file into the SQL database is left as homework for the reader.
Got it? That was the suggestion part.
With suggestions sorted out, we move on to error correction. As is well known, the name Britney alone can be typed, with typos, in a little under 600 different ways. And she also has a surname. And people might not be searching for her at all! A query with typos will quite likely find nothing; the page will come up empty; AdSense/YDirect/Unameit will show poor ads; nobody will click; your startup will go bust; there will be nobody left to buy commercial Sphinx services, and the project will die too. This is unacceptable: the keywords urgently need fixing!
Of course, there is always the option of bolting on ispell, aspell, hunspell, or whichever-spell is in fashion today. Clearly, that always runs into either the quality of the xxxspell dictionary or, bluntly, its absence for the language you need. Clearly, it does not help at all with neologisms (preved), special terms (acidium acetylosalicilium), geographical names, and so on. Because of that, we want more. Besides, this blog is not about ispell, so we have to live up to the name.
Once again a frequency dictionary is needed, though a larger one than 10K keywords: we do not want to "correct" a rare but correct word into a similar frequent one. A dictionary of 1 million words should usually be enough, 10 million definitely will, i.e. the command becomes indexer --buildstops dict.txt 10000000 --buildfreqs MYINDEXNAME (which runs at over 20 MB/sec on a C2D E8500, by the way). Unlike with suggestions, SQL searches over such a dictionary will no longer help: there is more data, and the queries are of a different kind. But Sphinx will help.
The main idea is as follows. For every word in the dictionary we generate its set of trigrams, i.e. runs of 3 consecutive characters. We index the trigrams with Sphinx. To look for replacement candidates, we build trigrams for the misspelled word too and search for them in that index. We get a number of candidates. The more trigrams match, the smaller the difference in word length, and the more frequent the candidate, the better. Now let's walk through all of this in more detail, with a live example.
The dictionary built by indexer --buildstops looks the same as before (I picked a different chunk so that the words are more lifelike and the example clearer):
...
deal 32431
created 32429
light 32275
needed 32252
mood 32185
death 32140
behind 32136
usually 32113
action 32053
line 32052
pissed 32043
...
For each word we need to generate a unique ID, store the word itself and its frequency, build its trigrams, and save all of this to the database. If the indexed collection itself contains typos, it also makes sense to drop words that are too rare, since there is a good chance those are typos themselves.
CREATE TABLE suggest
(
    id INTEGER PRIMARY KEY AUTO_INCREMENT NOT NULL,
    keyword VARCHAR(255) NOT NULL,
    trigrams VARCHAR(255) NOT NULL,
    freq INTEGER NOT NULL
);

INSERT INTO suggest VALUES
...
(735, 'deal', '__d _de dea eal al_ l__', 32431),
(736, 'created', '__c _cr cre rea eat ate ted ed_ d__', 32429),
(737, 'light', '__l _li lig igh ght ht_ t__', 32275),
(738, 'needed', '__n _ne nee eed ede ded ed_ d__', 32252),
(739, 'mood', '__m _mo moo ood od_ d__', 32185),
(740, 'death', '__d _de dea eat ath th_ h__', 32140),
(741, 'behind', '__b _be beh ehi hin ind nd_ d__', 32136),
(742, 'usually', '__u _us usu sua ual all lly ly_ y__', 32113),
(743, 'action', '__a _ac act cti tio ion on_ n__', 32053),
(744, 'line', '__l _li lin ine ne_ e__', 32052),
(745, 'pissed', '__p _pi pis iss sse sed ed_ d__', 32043),
(746, 'bye', '__b _by bye ye_ e__', 32012),
...
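The trigrams above are simply the word padded with two underscores on each side, sliced into 3-character windows. A possible helper (buildTrigrams() is my name for it, not something Sphinx provides); the import then follows the same pattern as the keywords import above, with one extra column and a frequency cutoff to drop likely typos:

<?php
// hypothetical helper: pad the word and emit all 3-character windows
function buildTrigrams($keyword)
{
    $t = "__" . $keyword . "__";
    $trigrams = "";
    for ($i = 0; $i < strlen($t) - 2; $i++)
        $trigrams .= substr($t, $i, 3) . " ";
    return rtrim($trigrams);
}

// example: buildTrigrams("light") returns "__l _li lig igh ght ht_ t__"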
Only the field with the trigrams needs to be full-text indexed, but for ranking the candidates (i.e. choosing the best correction) we will also need the word length and its frequency in the collection.
sql_query = SELECT id, trigrams, freq, LENGTH(keyword) AS len FROM suggest
sql_attr_uint = freq
sql_attr_uint = len
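In a full sphinx.conf these lines live inside a source block, paired with an index block; a minimal sketch with placeholder credentials and paths (not copied from the misc/suggest demo):

source suggest
{
    type     = mysql
    sql_host = localhost
    sql_user = user
    sql_pass = pass
    sql_db   = test

    sql_query     = SELECT id, trigrams, freq, LENGTH(keyword) AS len FROM suggest
    sql_attr_uint = freq
    sql_attr_uint = len
}

index suggest
{
    source  = suggest
    path    = /var/data/suggest
    docinfo = extern
}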
We spot suspicious words from the results of an ordinary search query: if there are too few results (or none at all), we look at the $result["words"] section of the response and check the document count for each keyword. If a word matches too few documents, we try to correct it. For example, for the query "green liight" in my test index, "green" occurs in 34421 documents and "liight" in only 2; which of the two is due for corrective treatment is immediately obvious. The specific thresholds for "too few" are highly individual for different document collections and query streams: look at your dictionary and query log and pick your magic constants.
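In code the check might look roughly like this; SUGGEST_DOCS_THRESHOLD is a made-up constant you would tune to your own data:

// run the user's query as usual
$res = $cl->Query($cl->EscapeString($userQuery), 'myindex');

// collect keywords that matched suspiciously few documents
$suspects = array();
if ($res && isset($res["words"]))
    foreach ($res["words"] as $word => $info)
        if ($info["docs"] < SUGGEST_DOCS_THRESHOLD)
            $suspects[] = $word;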
We build the trigrams and run a query against the special trigram index. Since the word was typed with errors, it is unlikely that all the trigrams will match. On the other hand, a candidate where only one trigram matches is of little interest either: that can only happen when exactly three letters in the middle of the word coincide (and nothing else), or one letter at the edge (and nothing else). So we use the quorum operator, which does exactly what we need: it returns all documents where at least 2 trigrams matched. We also add a length restriction, assuming that the correct variant differs in length by no more than 2 characters.
$len = strlen("liight");
$cl->SetFilterRange("len", $len - 2, $len + 2);
$cl->Query('"__l _li lii iig igh ght ht_ t__"/2', 'suggest');
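Rather than hardcoding the trigrams, you would normally build the query string with the same helper used at import time (the hypothetical buildTrigrams() from the sketch above):

$word = "liight";
$len = strlen($word);
$cl->SetFilterRange("len", $len - 2, $len + 2);
// the quorum operator needs extended matching; see the ranking setup below
$res = $cl->Query('"' . buildTrigrams($word) . '"/2', 'suggest');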
The pile of candidates we get back needs to be ranked so we can pick the best one. Recall the factors we have:
- the more trigrams matched, the better;
- the smaller the difference in word length, the better;
- the more frequent the candidate word is in the collection, the better.
The latest version of Sphinx can compute all these factors and sort by them entirely on the server side. The number of matched trigrams is computed by the SPH_RANK_WORDCOUNT ranker (since in our special search every trigram acts as a separate keyword). The difference in word length is abs(len - $len), and the frequency is stored in the freq attribute. We compute the factors, fold them together, and pick the best candidate:
$cl->SetMatchMode(SPH_MATCH_EXTENDED2);
$cl->SetRankingMode(SPH_RANK_WORDCOUNT);
$cl->SetSelect("*, @weight + 2 - abs(len - $len) AS myrank");
$cl->SetSortMode(SPH_SORT_EXTENDED, "myrank DESC, freq DESC");
Hooray, it works! For the word liight the correction light is found successfully. (More precisely, Sphinx finds its ID, which is then used to fetch the string "light" from the database.)
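That final lookup is a one-liner against the same suggest table; a hypothetical sketch, reusing a PDO handle like the one from the import sketch above:

if (!empty($res["matches"]))
{
    reset($res["matches"]);
    $bestId = key($res["matches"]); // matches are keyed by document ID, best match first
    $row = $db->query("SELECT keyword FROM suggest WHERE id=" . (int)$bestId)->fetch();
    $suggestion = $row ? $row[0] : null; // "light"
}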
This is exactly how the demo bundled with Sphinx 0.9.9-rc2 works (see the misc/suggest directory inside the archive); you can try it on your own data right away without writing a single extra line of code :-)
The demo, understandably, is not perfect and needs some finishing with a file. (Sorry, couldn't resist.) There is a risk that out of the box it will not work with Russian, because UTF-8 is assumed, substr is used, and therefore without mbstring.overload=7 you will not get far. You will almost certainly have to tune FREQ_THRESHOLD, i.e. the minimum number of occurrences below which a word is considered a typo and does not make it into the special index: lower it for small data collections, raise it for large ones. For the same reasons (so that rare junk does not rank above frequent words), you may also need to tweak the formula that computes myrank, for example by adding an extra unit to the matched-trigram count for every thousandfold difference in frequency:
$cl->SetSelect("*, @weight + 2 - abs(len - $len) + ln(freq)/ln(1000) AS myrank");
Besides, the trigram trick, effective as it is, is very simple and ignores a lot. It ignores the order of the trigrams, though for words of reasonable length that is generally not a problem. More interestingly, it ignores how people actually make mistakes: swapping 2 adjacent letters is, by trigram count, indistinguishable from replacing those letters with any (!) others (for example light / lihgt / limit); the proximity of letters on the keyboard is not considered at all; neither is phonetic proximity (actually / akshully). A fairly obvious way to improve correction quality on the cheap: instead of the single best candidate, pull out 10-20 of them, compute the Levenshtein distance on the client, and adjust the ranking accordingly (see the sketch below). With more effort, you can re-rank a dozen or two candidates with any other hand-written algorithm.
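A possible sketch of that cheap improvement, using PHP's built-in levenshtein(); the $idToKeyword map (ID back to keyword, e.g. fetched from the suggest table) is illustrative:

// pull 20 candidates instead of 1
$cl->SetLimits(0, 20);
$res = $cl->Query('"' . buildTrigrams($word) . '"/2', 'suggest');

// re-rank on the client: edit distance first, collection frequency second
$candidates = array();
foreach ($res["matches"] as $id => $match)
{
    $keyword = $idToKeyword[$id];
    $candidates[] = array(
        'keyword'  => $keyword,
        'distance' => levenshtein($word, $keyword),
        'freq'     => $match["attrs"]["freq"],
    );
}
usort($candidates, function ($a, $b) {
    if ($a['distance'] != $b['distance'])
        return $a['distance'] - $b['distance'];
    return $b['freq'] - $a['freq'];
});
$best = $candidates ? $candidates[0]['keyword'] : null;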
All in all, the demo works out of the box. But it is still a demo, and there is plenty of room left for further creativity. Create, invent, write blogs, send patches!