michoello May 6, 2013 at 12:57

Processing and classification of requests. Part Three: Correcting Typos

Typos are sometimes helpful in entertaining the reader. Search engines are not yet able to evaluate humor, and words typed with errors lead them into confusion, which upsets the user. To prevent these phenomena, there are automatic “correctors” of typos, they are also spellcheckers.

More than enough has already been written about various approaches to correcting typos, so in this article I will not repeat what is already known, but I will show how to write a spellchecker from scratch - simple, but quite capable. All that is needed is a list of correct words and a bit of C ++.

I took the list of words here , but any other dictionary will do. We will store it in a prefix tree (aka Trie). Here is its somewhat exotic , not quiteoptimal , rather dangerous , but perhaps the most concise implementation:

struct trie_t: map
{
}

A structure trie_tinherits all methods map. The keys of this one map contain the letters of the words, and the values are again the structures trie_tthat inherit ... and so on. Somewhat meditatively, but for a prototype it will do.

Add the weight of the node to the structure, it will come in handy later on:

struct trie_t: map< char, trie_t >
{
    size_t w;
    trie_t() : w(0) {}
}

We will write the words into the tree using this function:

trie_t & add( trie_t & t, const string & s )
{
    t.w++;
    return !s.empty() ? add( t[ s[0] ], s.substr(1) )
                      : t['$'];
}

It works like this: we take the first character of the word, write it in the map key, we get a link to the corresponding subtree (if necessary, it is created immediately). Recursively write the rest of the line into this subtree. When the line is exhausted, add to the last node a “stop sign” - the key ' $' with an empty subtree.

After reading the words from the given file, we will enter them all into the dictionary:

trie_t glossary;
ifstream fi( argv[1] );
string word;
while( getline( fi, word ) )
    add( glossary, word );

As a result w, the number of words that passed through this node will be recorded in the field of each node of the tree.

What can be done with all this? You can print the contents of the tree by sorting through the keys and subtrees of each node with the inherited iterator:

void print( const trie_t & t, const string & prefix = "" )
{
    for( map::const_iterator
                               it = t.begin(); it != t.end(); ++it )
        if ( it->first == '$' )
            cout << prefix << endl;
        else
            print( it->second, prefix + it->first );
}

You can determine the presence or absence of a given word in the tree:

bool is_known( trie_t & t, const string & s )
{
    return !s.empty() ? is_known( t[ s[0] ], s.substr(1) )
                      : t.find('$')  != t.end();
}

In this function, although it works correctly, an error has been made for the sake of brevity. I will not correct it, since such a “clear” search will no longer be useful to us anyway.

Fuzzy search

If you imagine a tree as a kind of multiple forked road, then the word search function is a traveler walking along this road. The traveler is given a route (the search word), and in accordance with it, he goes from node to node, crossing out the letters of the route passed. If the letters are over, it remains to check whether the "stop sign" is visible.

Now imagine that a traveler, as it were, is simultaneously sent in all directions at once. Each of its clones at each fork behaves in exactly the same way, and so on, until someone comes to each of the final stops (that is, the leaves of the tree). The total number of travelers will be equal to the number of words written in the tree, and each of them will follow its own individual trajectory.

Suppose all of them were initially given the same word-route. At each transition, each traveler, as before, crosses out one letter in his copy of the route. But if the real direction of the transition does not coincide with the crossed out letter, the traveler receives a fine. As a result, at the end of the journey, everyone will have some debt. We arrange all in a row, sorted by increasing this value.

If the search word is present in the dictionary, then one traveler will go all the way without penalties - he will be the first in the sorted list. If the word is not in the dictionary, then the leader will be the one who has come the way with the minimum number of violations.

Actually, this is almost the entire algorithm. The trick is that the path traveled by the leader will be the closest correction of the given word (or the word itself, if there are no typos), and the number of violations (deviations from the given route) - the number of typos.

On your marks

It would be more correct, of course, to call the road a tree, the forks as nodes, the travelers as hypotheses, and deviations from the route as editing operations. But let them stay as they are, since they’re somehow more alive. We introduce a structure that stores all the information about the traveler:

struct rambler_t
{
    string         todo;
    string         done;
    size_t         cost;
    const trie_t * road;
    rambler_t( const string & t, const string & d, 
               const size_t c,   const trie_t * r ) 
         :  todo( t ), done( d ), cost( c ), road( r ) {}
};

The following is stored in the fields:

todo - the way to go. Initially, this is the same travel route, that is, the corrected word. At the end there is an empty string.
done - traveled part of the path. At the beginning - an empty line, at the end - a list of nodes passed, that is, a corrected word.
cost - the amount of fines imposed. At the start - zero, at the finish - how it goes.
road - location of the traveler. At the beginning - the root of the tree, at the end - one of the leaves.

So, the traveler is at the starting point, you can go:

rambler_t R( word, "", 0, &glossary );

First step

All possible directions can be sorted out using an iterator, already useful when printing a dictionary:

for( map::const_iterator
             it = R.road->begin(); it != R.road->end(); ++it )
{
    char           dest   = it->first;  // направление движения
    const trie_t * road   = it->second; // следующее распутье
    // Место для шага вперёд
}

After completing the step, the values of the structure fields will change as follows:

R.done = R.done + dest;
R.todo = R.todo.substr(1);
R.road = R.road[ dest ];

With fare payment, two options are possible. If the selected direction coincides with the "general line", then the transition is free, otherwise +1:

R.cost = R.cost + ( dest == todo[0] ? 0 : 1 );

Ok, the first step has been taken. It’s more correct to say that there are a lot of first steps at once - instead of a lone wanderer a whole team appeared:

typedef vector team_t;

It remains only to repeat step by step, increasing the team until then, until everyone reaches their point for the purpose ... Stop. There is a nuance, and not one.

The “wrong turn” described above, corresponding to replacing a letter of a word with another letter, is just one of the varieties of typos. There are at least two more: delete and insert.

Delete

This situation is “paid, but did not go”: we delete the first letter from todo, we fine ourselves, but we don’t make the transition and add nothing to done.

R.done = R.done;
R.todo = R.todo.substr(1);
R.road = R.road;
R.cost = R.cost + 1;

As a result, the letter from is todo simply lost along the way, which is equivalent to removing it.

Insert

Here, on the contrary, “running in place”: we move deep into the tree, add the letter to done, but do not delete anything from todo.

R.done = R.done + dest;
R.todo = R.todo;
R.road = R.road[ dest ];
R.cost = R.cost + 1;

The letter passed in done appears outside the plan, out of nowhere, which is equivalent to its insertion.

Fathers and Sons

Since at each step there is some ~~parallelization of the worlds,~~ an increase in the number of objects, we will not modify the parameters of the traveler, but create new ones in his image and likeness. The following function unites everything: it receives a traveler at the input, filling in the output two vectors of travelers in accordance with the given route, the keys of the tree node and the rules described above. Those who still continue to move into the
vector team , and those who have finished reached the end, that is, those who have been todo emptied, and the current tree node contains ~~bucks~~ marking the end of the word.

void step_forward( const rambler_t & R, team_t & team, team_t & finished )
{
    char           next  =       R.todo[0];    // План
    const string & todo = next ? R.todo.substr(1) : "";
    for( map::const_iterator
            it = R.road->begin(); it != R.road->end(); ++it )
    {
        const trie_t * road  = &it->second;    // Поддерево
        char           dest  =  it->first;     // Ключ
        if ( next  ==  dest )
            team.push_back( rambler_t( todo,   // Всё по плану
                                       R.done + dest,  
                                       R.cost,
                                       road ));  
        else {
            team.push_back( rambler_t( todo,   // Замена
                                       R.done + dest,  
                                       R.cost + 1,   
                                       road ));  
            team.push_back( rambler_t( R.todo, // Вставка
                                       R.done + dest,  
                                       R.cost + 1,   
                                       road ));  
            if ( !next && dest == '$' )
                finished.push_back( R );      // Приехали!                                              
        }
    }
    if ( next )  
        team.push_back( rambler_t( todo,      // Удаление
                                   R.done,         
                                   R.cost + 1, 
                                   R.road ));  
}

Natural selection

And the second nuance: avalanche-like generating at each step of new and new participants in the process, we act irrationally. Crowds idly staggering over the tree are poorly controlled, for the most part they are useless and only devour resources in vain. We introduce restrictions and arrange a competition for them - let them run around.

First, we will assign the maximum allowable amount of fines, and we will throw out everyone who exceeded it, deviating too far from the given path.

Secondly, we will drop from the distance all those lagging behind. True, there is a risk that among the expelled outsiders, someone might sprint leaders overtake the leaders at the very end of the race, but for the sake of optimization they will have to put up with it.

In order to identify those who are lagging behind, we will estimate the chances of everyone to succeed, that is, that the rest of the way will be covered without disqualification for exceeding the maximum amount. The chance will be estimated as follows:

double ramb_chances( const rambler_t & a )                
{ 
    return log( ( a.road->w + 1 ) * 
                ( a.done.size() + 1 ) ) / ( 1 << a.cost ); 
}

Why is there a logarithm? What kind of bit shifts? Why is the function like this? You can try to explain. Obviously, the greater the amount of fines already received, the less likely it is to get to the end. And the more words in the subtree, the higher the probability of finding at least one suitable word among them. And the length of the already traveled path, as it were, testifies to "endurance."

I must admit, the latter is very cruelly far-fetched. But in any heuristic, the main thing is to work. This one works. And, one way or another, we are nearing completion. Denote the maximum amount of fines as max_cost, the maximum team size as team_size.

Put the pioneer at the beginning of all roads:

team_t walkers, leads;
walkers.push_back( rambler_t( word, "", 0, &glossary ) );

We take a step, generating the following “generation”:

team_t next_g;
for( size_t i = 0; i < walkers.size(); ++i )
    step_forward( walkers[i], next_g, leads );

The leadsrecorded have completed your way to next_gget more not reached. We sort of their chances of reaching yet not descending:

sort( next_g.begin(), next_g.end(), ramb_chance_cmp );

And we repeat all this until there is still someone to move on. We ignore those whose fines have exceeded max_cost, and those who are not included in the team_sizebest:

while ( !walkers.empty() )
{
    team_t next_g;
    for( size_t i = 0; i < min( walkers.size(), team_size ); ++i )
        if ( walkers[i].cost < max_cost )
            step_forward( walkers[i], next_g, leads );
    walkers.swap( next_g );
    sort( walkers.begin(), walkers.end(), ramb_chance_cmp );
}

By the way, leadsseveral structures with the same lines may appear in done. Guess where they come from?
Leave only unique elements:

sort( leads.begin(), leads.end(), ramb_done_cmp );  // сортируем по done
leads.resize( distance( leads.begin(), 
   unique( leads.begin(), leads.end(),
           ramb_uniq ) ) );                          // убираем дубли
sort( leads.begin(), leads.end(), ramb_chance_cmp ); // сортируем обратно

That’s almost all.

Add all the above to the function spellcheck, pass our word into it, the maximum penalty and the number of candidates. You can use:

while( getline( cin, word ) )
{
    team_t fixed = spellcheck( word, word.size()/2, 512 );
    cout << word << endl;
    for( size_t i = 0; i < fixed.size(); ++i )
        cout << '\t' << fixed[i].done << ' ' << fixed[i].cost << endl;
    cout << endl;
}

The limit of 512 candidates and the maximum deviation of half the length of a given word I picked up manually. With these parameters, the algorithm performance is approximately equal to 10 words per second, he copes with the words корован, инцыклапедея, печмоданand жаднокластник, but does not turn хлебin пиво.

Now that's all. Let's get started. And here is an example of the work of our simple but effective mechanism for correcting typos:

нисложый
        несложный 2
нно
эфентиыный
        эффективный 3
михонезм
        механизм 3
спровлени
        исправление 3
опечатог
        оператор 2
        отпечаток 2
        опечатка 2

It can be seen that not all of his results are perfect, but this is fixable.

The results of the journey

I am sure that many people learned in the description of one of the varieties of informed search - the A * algorithm , or rather, its variation - the A * algorithm with iterative deepening and memory limitation. Of course, the described algorithm is only the first, rough approximation to what is actually used in battle.

What else should a real combat spellchecker be able to do?

In our reference dictionary, only the initial forms of words are present, so that мишки гаммиit will correct for мишка гаммаwhat is wrong. One must have an understanding of morphology, or include all forms of words in the dictionary. The first is better, but harder.

All words are equal. This leads to inaccuracies in choosing the best hypothesis:
```
перат
	перст 1
	пират 1
```
The problem will be solved if we add information about the frequency of use of words.

All errors are equal. From here come such uncertain results:
```
заец
	заем 1
	заяц 1
```
Corrected by gradation of fines in accordance with the probability of each possible error. Replacing яwith is еmore likely than мwith ц, therefore more likely заяц.

"Punto Switching." Everything is simple here. Adding a table of type q-й, w-ц, a-фand one other type of typing errors - Change the layout. Done.

Gluing and gluing words. This is a bit more complicated. It is necessary to teach the “travelers” to jump from the leaf nodes of the tree to its root and separately handle the gaps in their “routes”.

Contextual fixes. The same words in different situations should be corrected in different ways. For example, пришла веспаand сушите веспа, пиковый балетand проездной балет, футбольный мечand джедайский мяч. Correcting such typos requires statistics and / or grammar rules. In other words, a language model. This is a completely different story.

Side effects

In the end, it would be wrong not to mention the possibilities of the described approach that go beyond the scope of our task.

If you cancel the penalties for adding characters to the end of a given "route", then we get a solution to the problem of auto-completion .

If we insert into the program tables of phonetic correspondence of characters, we get a transliteration recognition system of any degree of ambiguity. The algorithm will find the most likely word.
```
ж → j, zh, g, dj
j → й, ж, дж, х
...
```

Transitions on the tree according to the rules specified by the correspondence of numbers and letters will turn the spellchecker into a T9 set system :
```
	2 → абвг
	...
	9 → ьэюя
```

The cost of the error, calculated not on the probabilities of typos, but exclusively at the distances between the keys (or rather, their images on the touch screen) is the basis of the system of "moving" word input - Swype .

The task of fuzzy search is part of speech recognition systems, handwritten texts ... However, enough. Very much has already been written about this.

The full text of the program can be found here . The program is slightly longer than the famous nanospellchecker Peter Norvig, but it significantly surpasses it in its capabilities, not inferior in performance.

Thanks for attention!

Useless links

Source of inspiration: Online Spelling Correction for Query Completion
The world's shortest spellchecker from Peter Norwig
1200 ways to write the word "classmates" - can be used to calculate the error model
Alternative approaches: one , two , three

Mikhail Dolinin,
head of the search query processing group

Tags: