# Processing and classification of requests. Part Three: Correcting Typos

Typos are sometimes helpful in entertaining the reader. Search engines are not yet able to evaluate humor, and words typed with errors lead them into confusion, which upsets the user. To prevent these phenomena, there are automatic “correctors” of typos, they are also spellcheckers.

More than enough has already been written about various approaches to correcting typos, so in this article I will not repeat what is already known, but I will show how to write a spellchecker from scratch - simple, but quite capable. All that is needed is a list of correct words and a bit of C ++.

I took the list of words here , but any other dictionary will do. We will store it in a prefix tree (aka Trie). Here is its somewhat exotic , not quiteoptimal , rather dangerous , but perhaps the most concise implementation:

A structure

Add the weight of the node to the structure, it will come in handy later on:

We will write the words into the tree using this function:

It works like this: we take the first character of the word, write it in the map key, we get a link to the corresponding subtree (if necessary, it is created immediately). Recursively write the rest of the line into this subtree. When the line is exhausted, add to the last node a “stop sign” - the key '

After reading the words from the given file, we will enter them all into the dictionary:

As a result

What can be done with all this? You can print the contents of the tree by sorting through the keys and subtrees of each node with the inherited iterator:

You can determine the presence or absence of a given word in the tree:

In this function, although it works correctly, an error has been made for the sake of brevity. I will not correct it, since such a “clear” search will no longer be useful to us anyway.

If you imagine a tree as a kind of multiple forked road, then the word search function is a traveler walking along this road. The traveler is given a route (the search word), and in accordance with it, he goes from node to node, crossing out the letters of the route passed. If the letters are over, it remains to check whether the "stop sign" is visible.

Now imagine that a traveler, as it were, is simultaneously sent in all directions at once. Each of its clones at each fork behaves in exactly the same way, and so on, until someone comes to each of the final stops (that is, the leaves of the tree). The total number of travelers will be equal to the number of words written in the tree, and each of them will follow its own individual trajectory.

Suppose all of them were initially given the same word-route. At each transition, each traveler, as before, crosses out one letter in his copy of the route. But if the real direction of the transition does not coincide with the crossed out letter, the traveler receives a fine. As a result, at the end of the journey, everyone will have some debt. We arrange all in a row, sorted by increasing this value.

If the search word is present in the dictionary, then one traveler will go all the way without penalties - he will be the first in the sorted list. If the word is not in the dictionary, then the leader will be the one who has come the way with the minimum number of violations.

Actually, this is almost the entire algorithm. The trick is that the path traveled by the leader will be the closest correction of the given word (or the word itself, if there are no typos), and the number of violations (deviations from the given route) - the number of typos.

It would be more correct, of course, to call the road a tree, the forks as nodes, the travelers as hypotheses, and deviations from the route as editing operations. But let them stay as they are, since they’re somehow more alive. We introduce a structure that stores all the information about the traveler:

The following is stored in the fields:

So, the traveler is at the starting point, you can go:

All possible directions can be sorted out using an iterator, already useful when printing a dictionary:

After completing the step, the values of the structure fields will change as follows:

With fare payment, two options are possible. If the selected direction coincides with the "general line", then the transition is free, otherwise +1:

Ok, the first step has been taken. It’s more correct to say that there are a lot of first steps at once - instead of a lone wanderer a whole team appeared:

It remains only to repeat step by step, increasing the team until then, until everyone reaches their point for the purpose ... Stop. There is a nuance, and not one.

The “wrong turn” described above, corresponding to replacing a letter of a word with another letter, is just one of the varieties of typos. There are at least two more: delete and insert.

This situation is “paid, but did not go”: we delete the first letter from

As a result, the letter from is

Here, on the contrary, “running in place”: we move deep into the tree, add the letter to

The letter passed in

Since at each step there is some~~parallelization of the worlds,~~ an increase in the number of objects, we will not modify the parameters of the traveler, but create new ones in his image and likeness. The following function unites everything: it receives a traveler at the input, filling in the output two vectors of travelers in accordance with the given route, the keys of the tree node and the rules described above. Those who still continue to move into the

vector~~bucks~~ marking the end of the word.

And the second nuance: avalanche-like generating at each step of new and new participants in the process, we act irrationally. Crowds idly staggering over the tree are poorly controlled, for the most part they are useless and only devour resources in vain. We introduce restrictions and arrange a competition for them - let them run around.

First, we will assign the maximum allowable amount of fines, and we will throw out everyone who exceeded it, deviating too far from the given path.

Secondly, we will drop from the distance all those lagging behind. True, there is a risk that among the expelled outsiders, someone might sprint leaders overtake the leaders at the very end of the race, but for the sake of optimization they will have to put up with it.

In order to identify those who are lagging behind, we will estimate the chances of everyone to succeed, that is, that the rest of the way will be covered without disqualification for exceeding the maximum amount. The chance will be estimated as follows:

Why is there a logarithm? What kind of bit shifts? Why is the function like this? You can try to explain. Obviously, the greater the amount of fines already received, the less likely it is to get to the end. And the more words in the subtree, the higher the probability of finding at least one suitable word among them. And the length of the already traveled path, as it were, testifies to "endurance."

I must admit, the latter is very cruelly far-fetched. But in any heuristic, the main thing is to work. This one works. And, one way or another, we are nearing completion. Denote the maximum amount of fines as

Put the pioneer at the beginning of all roads:

We take a step, generating the following “generation”:

The

And we repeat all this until there is still someone to move on. We ignore those whose fines have exceeded

By the way,

Leave only unique elements:

That’s almost all.

Add all the above to the function

The limit of 512 candidates and the maximum deviation of half the length of a given word I picked up manually. With these parameters, the algorithm performance is approximately equal to 10 words per second, he copes with the words

Now that's all. Let's get started. And here is an example of the work of our simple but effective mechanism for correcting typos:

It can be seen that not all of his results are perfect, but this is fixable.

I am sure that many people learned in the description of one of the varieties of informed search - the A * algorithm , or rather, its variation - the A * algorithm with iterative deepening and memory limitation. Of course, the described algorithm is only the first, rough approximation to what is actually used in battle.

What else should a real combat spellchecker be able to do?

In the end, it would be wrong not to mention the possibilities of the described approach that go beyond the scope of our task.

The full text of the program can be found here . The program is slightly longer than the famous nanospellchecker Peter Norvig, but it significantly surpasses it in its capabilities, not inferior in performance.

Thanks for attention!

More than enough has already been written about various approaches to correcting typos, so in this article I will not repeat what is already known, but I will show how to write a spellchecker from scratch - simple, but quite capable. All that is needed is a list of correct words and a bit of C ++.

I took the list of words here , but any other dictionary will do. We will store it in a prefix tree (aka Trie). Here is its somewhat exotic , not quiteoptimal , rather dangerous , but perhaps the most concise implementation:

`struct trie_t: map`
{
}

A structure

`trie_t`

inherits all methods `map`

. The keys of this one `map `

contain the letters of the words, and the values are again the structures `trie_t`

that inherit ... and so on. Somewhat meditatively, but for a prototype it will do. Add the weight of the node to the structure, it will come in handy later on:

```
struct trie_t: map< char, trie_t >
{
size_t w;
trie_t() : w(0) {}
}
```

We will write the words into the tree using this function:

```
trie_t & add( trie_t & t, const string & s )
{
t.w++;
return !s.empty() ? add( t[ s[0] ], s.substr(1) )
: t['$'];
}
```

It works like this: we take the first character of the word, write it in the map key, we get a link to the corresponding subtree (if necessary, it is created immediately). Recursively write the rest of the line into this subtree. When the line is exhausted, add to the last node a “stop sign” - the key '

`$`

' with an empty subtree. After reading the words from the given file, we will enter them all into the dictionary:

```
trie_t glossary;
ifstream fi( argv[1] );
string word;
while( getline( fi, word ) )
add( glossary, word );
```

As a result

`w`

, the number of words that passed through this node will be recorded in the field of each node of the tree. What can be done with all this? You can print the contents of the tree by sorting through the keys and subtrees of each node with the inherited iterator:

```
void print( const trie_t & t, const string & prefix = "" )
{
for( map
```::const_iterator
it = t.begin(); it != t.end(); ++it )
if ( it->first == '$' )
cout << prefix << endl;
else
print( it->second, prefix + it->first );
}

You can determine the presence or absence of a given word in the tree:

```
bool is_known( trie_t & t, const string & s )
{
return !s.empty() ? is_known( t[ s[0] ], s.substr(1) )
: t.find('$') != t.end();
}
```

In this function, although it works correctly, an error has been made for the sake of brevity. I will not correct it, since such a “clear” search will no longer be useful to us anyway.

#### Fuzzy search

If you imagine a tree as a kind of multiple forked road, then the word search function is a traveler walking along this road. The traveler is given a route (the search word), and in accordance with it, he goes from node to node, crossing out the letters of the route passed. If the letters are over, it remains to check whether the "stop sign" is visible.

Now imagine that a traveler, as it were, is simultaneously sent in all directions at once. Each of its clones at each fork behaves in exactly the same way, and so on, until someone comes to each of the final stops (that is, the leaves of the tree). The total number of travelers will be equal to the number of words written in the tree, and each of them will follow its own individual trajectory.

Suppose all of them were initially given the same word-route. At each transition, each traveler, as before, crosses out one letter in his copy of the route. But if the real direction of the transition does not coincide with the crossed out letter, the traveler receives a fine. As a result, at the end of the journey, everyone will have some debt. We arrange all in a row, sorted by increasing this value.

If the search word is present in the dictionary, then one traveler will go all the way without penalties - he will be the first in the sorted list. If the word is not in the dictionary, then the leader will be the one who has come the way with the minimum number of violations.

Actually, this is almost the entire algorithm. The trick is that the path traveled by the leader will be the closest correction of the given word (or the word itself, if there are no typos), and the number of violations (deviations from the given route) - the number of typos.

#### On your marks

It would be more correct, of course, to call the road a tree, the forks as nodes, the travelers as hypotheses, and deviations from the route as editing operations. But let them stay as they are, since they’re somehow more alive. We introduce a structure that stores all the information about the traveler:

```
struct rambler_t
{
string todo;
string done;
size_t cost;
const trie_t * road;
rambler_t( const string & t, const string & d,
const size_t c, const trie_t * r )
: todo( t ), done( d ), cost( c ), road( r ) {}
};
```

The following is stored in the fields:

`todo`

- the way to go. Initially, this is the same travel route, that is, the corrected word. At the end there is an empty string.`done`

- traveled part of the path. At the beginning - an empty line, at the end - a list of nodes passed, that is, a corrected word.`cost`

- the amount of fines imposed. At the start - zero, at the finish - how it goes.`road`

- location of the traveler. At the beginning - the root of the tree, at the end - one of the leaves.

So, the traveler is at the starting point, you can go:

```
rambler_t R( word, "", 0, &glossary );
```

#### First step

All possible directions can be sorted out using an iterator, already useful when printing a dictionary:

`for( map`::const_iterator
it = R.road->begin(); it != R.road->end(); ++it )
{
char dest = it->first; // направление движения
const trie_t * road = it->second; // следующее распутье
// Место для шага вперёд
}

After completing the step, the values of the structure fields will change as follows:

```
R.done = R.done + dest;
R.todo = R.todo.substr(1);
R.road = R.road[ dest ];
```

With fare payment, two options are possible. If the selected direction coincides with the "general line", then the transition is free, otherwise +1:

```
R.cost = R.cost + ( dest == todo[0] ? 0 : 1 );
```

Ok, the first step has been taken. It’s more correct to say that there are a lot of first steps at once - instead of a lone wanderer a whole team appeared:

`typedef vector` team_t;

It remains only to repeat step by step, increasing the team until then, until everyone reaches their point for the purpose ... Stop. There is a nuance, and not one.

The “wrong turn” described above, corresponding to replacing a letter of a word with another letter, is just one of the varieties of typos. There are at least two more: delete and insert.

#### Delete

This situation is “paid, but did not go”: we delete the first letter from

`todo`

, we fine ourselves, but we don’t make the transition and add nothing to `done`

.```
R.done = R.done;
R.todo = R.todo.substr(1);
R.road = R.road;
R.cost = R.cost + 1;
```

As a result, the letter from is

`todo `

simply lost along the way, which is equivalent to removing it.#### Insert

Here, on the contrary, “running in place”: we move deep into the tree, add the letter to

`done`

, but do not delete anything from `todo`

.```
R.done = R.done + dest;
R.todo = R.todo;
R.road = R.road[ dest ];
R.cost = R.cost + 1;
```

The letter passed in

`done `

appears outside the plan, out of nowhere, which is equivalent to its insertion.#### Fathers and Sons

Since at each step there is some

vector

`team `

, and those who have `finished `

reached the end, that is, those who have been `todo `

emptied, and the current tree node contains ```
void step_forward( const rambler_t & R, team_t & team, team_t & finished )
{
char next = R.todo[0]; // План
const string & todo = next ? R.todo.substr(1) : "";
for( map
```::const_iterator
it = R.road->begin(); it != R.road->end(); ++it )
{
const trie_t * road = &it->second; // Поддерево
char dest = it->first; // Ключ
if ( next == dest )
team.push_back( rambler_t( todo, // Всё по плану
R.done + dest,
R.cost,
road ));
else {
team.push_back( rambler_t( todo, // Замена
R.done + dest,
R.cost + 1,
road ));
team.push_back( rambler_t( R.todo, // Вставка
R.done + dest,
R.cost + 1,
road ));
if ( !next && dest == '$' )
finished.push_back( R ); // Приехали!
}
}
if ( next )
team.push_back( rambler_t( todo, // Удаление
R.done,
R.cost + 1,
R.road ));
}

#### Natural selection

And the second nuance: avalanche-like generating at each step of new and new participants in the process, we act irrationally. Crowds idly staggering over the tree are poorly controlled, for the most part they are useless and only devour resources in vain. We introduce restrictions and arrange a competition for them - let them run around.

First, we will assign the maximum allowable amount of fines, and we will throw out everyone who exceeded it, deviating too far from the given path.

Secondly, we will drop from the distance all those lagging behind. True, there is a risk that among the expelled outsiders, someone might sprint leaders overtake the leaders at the very end of the race, but for the sake of optimization they will have to put up with it.

In order to identify those who are lagging behind, we will estimate the chances of everyone to succeed, that is, that the rest of the way will be covered without disqualification for exceeding the maximum amount. The chance will be estimated as follows:

```
double ramb_chances( const rambler_t & a )
{
return log( ( a.road->w + 1 ) *
( a.done.size() + 1 ) ) / ( 1 << a.cost );
}
```

Why is there a logarithm? What kind of bit shifts? Why is the function like this? You can try to explain. Obviously, the greater the amount of fines already received, the less likely it is to get to the end. And the more words in the subtree, the higher the probability of finding at least one suitable word among them. And the length of the already traveled path, as it were, testifies to "endurance."

I must admit, the latter is very cruelly far-fetched. But in any heuristic, the main thing is to work. This one works. And, one way or another, we are nearing completion. Denote the maximum amount of fines as

`max_cost`

, the maximum team size as `team_size`

. Put the pioneer at the beginning of all roads:

```
team_t walkers, leads;
walkers.push_back( rambler_t( word, "", 0, &glossary ) );
```

We take a step, generating the following “generation”:

```
team_t next_g;
for( size_t i = 0; i < walkers.size(); ++i )
step_forward( walkers[i], next_g, leads );
```

The

`leads`

recorded have completed your way to `next_g`

get more not reached. We sort of their chances of reaching yet not descending:```
sort( next_g.begin(), next_g.end(), ramb_chance_cmp );
```

And we repeat all this until there is still someone to move on. We ignore those whose fines have exceeded

`max_cost`

, and those who are not included in the `team_size`

best:```
while ( !walkers.empty() )
{
team_t next_g;
for( size_t i = 0; i < min( walkers.size(), team_size ); ++i )
if ( walkers[i].cost < max_cost )
step_forward( walkers[i], next_g, leads );
walkers.swap( next_g );
sort( walkers.begin(), walkers.end(), ramb_chance_cmp );
}
```

By the way,

`leads`

several structures with the same lines may appear in `done`

. Guess where they come from? Leave only unique elements:

```
sort( leads.begin(), leads.end(), ramb_done_cmp ); // сортируем по done
leads.resize( distance( leads.begin(),
unique( leads.begin(), leads.end(),
ramb_uniq ) ) ); // убираем дубли
sort( leads.begin(), leads.end(), ramb_chance_cmp ); // сортируем обратно
```

That’s almost all.

Add all the above to the function

`spellcheck`

, pass our word into it, the maximum penalty and the number of candidates. You can use:```
while( getline( cin, word ) )
{
team_t fixed = spellcheck( word, word.size()/2, 512 );
cout << word << endl;
for( size_t i = 0; i < fixed.size(); ++i )
cout << '\t' << fixed[i].done << ' ' << fixed[i].cost << endl;
cout << endl;
}
```

The limit of 512 candidates and the maximum deviation of half the length of a given word I picked up manually. With these parameters, the algorithm performance is approximately equal to 10 words per second, he copes with the words

`корован`

, `инцыклапедея`

, `печмодан`

and `жаднокластник`

, but does not turn `хлеб`

in `пиво`

. Now that's all. Let's get started. And here is an example of the work of our simple but effective mechanism for correcting typos:

```
нисложый
несложный 2
нно
эфентиыный
эффективный 3
михонезм
механизм 3
спровлени
исправление 3
опечатог
оператор 2
отпечаток 2
опечатка 2
```

It can be seen that not all of his results are perfect, but this is fixable.

#### The results of the journey

I am sure that many people learned in the description of one of the varieties of informed search - the A * algorithm , or rather, its variation - the A * algorithm with iterative deepening and memory limitation. Of course, the described algorithm is only the first, rough approximation to what is actually used in battle.

What else should a real combat spellchecker be able to do?

- In our reference dictionary, only the initial forms of words are present, so that
`мишки гамми`

it will correct for`мишка гамма`

what is wrong. One must have an understanding of morphology, or include all forms of words in the dictionary. The first is better, but harder.

- All words are equal. This leads to inaccuracies in choosing the best hypothesis:

The problem will be solved if we add information about the frequency of use of words.`перат перст 1 пират 1`

- All errors are equal. From here come such uncertain results:

Corrected by gradation of fines in accordance with the probability of each possible error. Replacing`заец заем 1 заяц 1`

`я`

with is`е`

more likely than`м`

with`ц`

, therefore more likely`заяц`

.

- "Punto Switching." Everything is simple here. Adding a table of type
`q-й`

,`w-ц`

,`a-ф`

and one other type of typing errors - Change the layout. Done.

- Gluing and gluing words. This is a bit more complicated. It is necessary to teach the “travelers” to jump from the leaf nodes of the tree to its root and separately handle the gaps in their “routes”.

- Contextual fixes. The same words in different situations should be corrected in different ways. For example,
`пришла веспа`

and`сушите веспа`

,`пиковый балет`

and`проездной балет`

,`футбольный меч`

and`джедайский мяч`

. Correcting such typos requires statistics and / or grammar rules. In other words, a language model. This is a completely different story.

#### Side effects

In the end, it would be wrong not to mention the possibilities of the described approach that go beyond the scope of our task.

- If you cancel the penalties for adding characters to the end of a given "route", then we get a solution to the problem of
**auto-completion**.

- If we insert into the program tables of phonetic correspondence of characters, we get a
**transliteration**recognition system of any degree of ambiguity. The algorithm will find the most likely word.`ж → j, zh, g, dj j → й, ж, дж, х ...`

- Transitions on the tree according to the rules specified by the correspondence of numbers and letters will turn the spellchecker into a
**T9**set system :`2 → абвг ... 9 → ьэюя`

- The cost of the error, calculated not on the probabilities of typos, but exclusively at the distances between the keys (or rather, their images on the touch screen) is the basis of the system of "moving" word input -
**Swype**.

- The task of fuzzy search is part of speech recognition systems, handwritten texts ... However, enough. Very much has already been written about this.

The full text of the program can be found here . The program is slightly longer than the famous nanospellchecker Peter Norvig, but it significantly surpasses it in its capabilities, not inferior in performance.

Thanks for attention!

#### Useless links

- Source of inspiration: Online Spelling Correction for Query Completion
- The world's shortest spellchecker from Peter Norwig
- 1200 ways to write the word "classmates" - can be used to calculate the error model
- Alternative approaches: one , two , three

*Mikhail Dolinin,*

head of the search query processing grouphead of the search query processing group