Doing the Right Did You Mean

    Zatakt: this is my first post, and the first post as always pancake :).

    Recently, the task was received to modernize the search on the site, and it so happened that it was necessary to make the “Did You Mean” functionality.

    By the way, thank you very much comrade alexblack alexblack for his article Yandex-like do-it- yourself search , without it I would be without hands :)

    Now I will start listing how I did all this. PHP, MySQL database, site language - English.
    (the correct solution is at the end :)

    0) Creating indexes I

    wrote a small script that runs across the entire content of the site (since it is all in the database) and creates a guess_word table with fields word - the word itself, search_index- the word index (soundex or metaphone, depending on the settings) and count - the frequency of occurrence of the word.

    All words were previously converted to lowercase and everything except letters was cut out of them.

    1) soudex + frequency of meeting

    If the word was not found, then we take its soudex index and look in the database, sorting back by frequency of meeting.
    It works, but it’s bad. By no means always the most frequently occurring word is the right hint.

    2) soundex + levenshtein

    Find all the words with the same soundex index, then choose from them the one with the smallest Levenshtein distance.
    Much better, but still have problems. To a standard typo, teh finds tea (and not the, for the distance between teh and the is greater than between teh and tea).

    3) metaphone

    Refused metaphone, as it gives worse results. (Although strange, of course).

    4) We calculate the correct distance between words.

    Levenshtein distance allows for letter substitution, deletion and insertion, but does not take into account the exchange of places.

    Idea A: count how many different letters there are in the word, and add the result to the distance. at the same time, in the Levenshtein distance, we make the price for removal / insertion lower than for the replacement, because removal / insertion has already been calculated.
    Result A: not that. now the closest to teh is te

    Idea B: play with the prices in levenshtein.
    Result B:a little not what i expected. For example, when raising the price of a replacement, the algorithm decides that teh is the most profitable to convert through deletion and insertion. Good, of course, but not suitable.

    Idea C: abandon Levenshtein in favor of a self-written function. Link to the code without comments (just in case) Result C: works perfectly. (for now :), maybe later there will be various inaccuracies) 5) We correct the search by the soudex index, the index does not always give the same indexes for similar words. For example, stcok and stages have the same indices, while stcok and stock are different (theaters and tetrachloride are still funny). Idea A: Divide the index into 2 parts (letter and number) and search from [letter] [number-2] to [letter] [number + 2]
    function words_dist($str_a, $str_b)
        if ($str_a == $str_b) return 0;

        // (begin of) считаем количество различных символов. в отличии от btlfsa
        // (смотреть в комментах на считает разницу в их количестве
        // например btlfsa teh:tehh = 0, words_dist teh:tehh = 1

        $arr_a = array();
        for ($i = 0; $i < strlen($str_a); $i++) {
            $arr_a[$str_a{$i}] = (array_key_exists($str_a{$i}, $arr_a) ? $arr_a[$str_a{$i}] : 0) + 1;

        $arr_b = array();
        for ($i = 0; $i < strlen($str_b); $i++) {
            $arr_b[$str_b{$i}] = (array_key_exists($str_b{$i}, $arr_b) ? $arr_b[$str_b{$i}] : 0) + 1;

        foreach($arr_a as $k=>$v) {
            if (!array_key_exists($k, $arr_b)) $arr_b[$k] = 0;

        $dst = 0;
        foreach ($arr_b as $k=>$v) $dst += abs((array_key_exists($k, $arr_a) ? $arr_a[$k] : 0) - $v);

        // (end of) считаем количество различных символов

        // считаем удаление/добавление символа более дорогостоящей операцией нежели обмен символов местами
        $dst *= 2;

        if (strlen($str_a) < strlen($str_b))
            $l = strlen($str_b)-strlen($str_a);
            for ($i = 0; $i < $l; $i++) $str_a .= ' ';
        elseif (strlen($str_b) < strlen($str_a))
            $l = strlen($str_a)-strlen($str_b);
            for ($i = 0; $i < $l; $i++) $str_b .= ' ';

        // считаем сколько символов не на своём месте
        for ($i = 0; $i < strlen($str_a); $i++) { if ($str_a{$i} != $str_b{$i}) $dst++; }

        return $dst;

    Result A: I didn’t do it, because I thought that they might be mistaken in the first letter, and then soudex is powerless.

    Idea B: Add the same first letter to the words, such as 'L'. since now the first letter is the same, entering it into the index does not make sense. $ index = substr (soundex ('L'. $ word), 1);
    Result B: exceeded my expectations. I didn’t even have to do a range search (-2 ... +2).

    *) Summary

    a) index the soundex
    b) add the same first letter to the words before indexing, $ index = substr (soundex ('L'. $ Word), 1)
    c) select from the list of possible words a samopis function, and do not use Levenshtein distance

    And also as a bonus:make a guess from the guessed word (highlight the wrong letters in red). The code is quite obvious and voluminous, so I’ll just give a link to it:


    upd: probably you should still search by range, because soundex ('Lteom') is not like soundex (' Lterm ') (but still much closer than soundex (' teom ') and soundex (' term '))

    Also popular now: