Automatic Spell Checking, Noisy Channel Model
Good day! Recently I had to implement an algorithm for post-processing the results of optical character recognition. One of the spell-checking models turned out to fit this problem well, although of course it had to be adapted somewhat to the context of the task. This post is devoted to the Noisy Channel model, which allows automatic spell checking: we will study the mathematical model, write some C# code, train the model on Peter Norvig's data set, and in the end test what we get.
Mathematical Model - 1
Let us start with the problem statement. Suppose you want to write some word w consisting of m letters, but in some way unknown to you, a word x consisting of n letters ends up on paper. In this setup you yourself are the noisy channel: a channel for transmitting information with noise, which distorted the correct word w (from the universe of correct words) into the wrong word x (from the set of all written words).

We want to find the word that you most likely meant when you wrote x. Let us write this idea down mathematically; the model is similar in spirit to the naive Bayes classifier, only even simpler:
$$\hat{w} = \arg\max_{w \in V} P(w \mid x)$$
- V is a list of all words in a natural language.
Next, applying Bayes' theorem, we swap cause and effect; the total probability P(x) in the denominator can be dropped because it does not depend on w under the argmax:
$$\hat{w} = \arg\max_{w \in V} \frac{P(x \mid w)\,P(w)}{P(x)} = \arg\max_{w \in V} P(x \mid w)\,P(w)$$
- P(w) is the a priori probability of the word w in the language; this term is a statistical model of the natural language (the language model); we will use the unigram model, although of course you are free to use more complex ones; note also that this term is easy to compute from a corpus of words of the language;
- P(x | w) is the probability that the correct word w was mistakenly written as x; this term is called the channel model; in principle, if we had a large enough database containing every way of misspelling every word of the language, computing this term would pose no difficulty, but unfortunately no such database exists; there are similar databases of smaller size, so we have to estimate this term somehow (for example, here is a database of 333,333 English words, of which only 7,481 come with misspellings). A minimal sketch of the resulting selection rule is given right after this list.
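To make the selection rule concrete, here is a minimal sketch of the argmax over the vocabulary; prior and channel are hypothetical delegates standing in for P(w) and P(x | w), which are discussed below (the article later switches to log space, but the idea is the same):
// A minimal sketch of the selection rule, assuming hypothetical helpers
// prior(w) ~ P(w) and channel(x, w) ~ P(x | w).
public static string Correct(string x, IEnumerable<string> vocabulary,
Func<string, double> prior, Func<string, string, double> channel)
{
string best = null;
double bestScore = double.NegativeInfinity;
foreach (string w in vocabulary)
{
double score = channel(x, w) * prior(w); // P(x | w) * P(w)
if (score > bestScore)
{
bestScore = score;
best = w;
}
}
return best;
}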
Calculation of the probability P(x | w)
Here the Damerau-Levenshtein distance comes to our aid: a measure between two character sequences, defined as the minimum number of insertions, deletions, substitutions and transpositions of adjacent characters needed to transform the source string into the target string. In what follows we will use the plain Levenshtein distance, which differs in that it does not support transposition of adjacent characters and is therefore simpler. Both algorithms are well described, with examples, here, so I will not repeat them and will go straight to the code:
Levenshtein distance
public static int LevenshteinDistance(string s, string t)
{
int[,] d = new int[s.Length + 1, t.Length + 1];
for (int i = 0; i < s.Length + 1; i++)
{
d[i, 0] = i;
}
for (int j = 1; j < t.Length + 1; j++)
{
d[0, j] = j;
}
for (int i = 1; i <= s.Length; i++)
{
for (int j = 1; j <= t.Length; j++)
{
d[i, j] = (new int[]
{
d[i - 1, j] + 1, // del
d[i, j - 1] + 1, // ins
d[i - 1, j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1) // sub
}).Min();
}
}
return d[s.Length, t.Length];
}
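As a quick sanity check, the function can be called like this on the pair of words used in the example below; plain Levenshtein has no transposition operation, hence three edits are needed:
// Usage sketch: "ca" -> "abc" needs three single-character edits,
// e.g. insert 'a', insert 'b', delete the trailing 'a'.
int dist = LevenshteinDistance("ca", "abc");
Console.WriteLine(dist); // prints 3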
This function tells us how many deletions, insertions and substitutions are needed to transform one word into another, but that is not enough: we would also like the list of those operations, which we will obtain with a backtrace. To do so, we modify the code above so that, along with the distance matrix d, it also fills a matrix of operations b. Consider the example of the words ca and abc:
[Figure: the distance matrix d and the operations matrix b for the words ca and abc; in b, 0 means delete from source, 1 means insert from target into source, 2 means replace a character of source with a character of target; the arrows mark the backtrace path.]
As you remember, the value of the cell (i, j) in the distance matrix d is calculated as follows:
d[i, j] = (new int[]
{
d[i - 1, j] + 1, // del - 0
d[i, j - 1] + 1, // ins - 1
d[i - 1, j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1) // sub - 2
}).Min();
All that remains is to store the index of the chosen operation (0 for deletion, 1 for insertion, 2 for substitution) in cell (i, j) of the operations matrix b, so the snippet above turns into the following:
IList<int> vals = new List<int>()
{
d[i - 1, j] + 1, // del
d[i, j - 1] + 1, // ins
d[i - 1, j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1) // sub
};
d[i, j] = vals.Min();
int idx = vals.IndexOf(d[i, j]);
b[i, j] = idx;
Once both matrices are filled, computing the backtrace (the arrows in the pictures above) is straightforward: it is the path from the bottom-right cell of the operations matrix that follows the least-cost path of the distance matrix. The algorithm is as follows:
- take the bottom-right cell as the current one
- depending on the operation stored in the current cell, do one of the following
- if it is a deletion, record the deleted character and move the current cell up (red arrow)
- if it is an insertion, record the inserted character and move the current cell to the left (red arrow)
- if it is a substitution and the characters differ, record the pair of characters and move the current cell up and to the left (red arrow)
- if it is a substitution but the characters are equal, only move the current cell up and to the left (blue arrow)
- if the number of recorded operations is not yet equal to the Levenshtein distance, return to the second step, otherwise stop
As a result, we get the following function that calculates the Levenshtein distance, as well as backtrace:
Levenshtein distance with backtrace
//del - 0, ins - 1, sub - 2
public static Tuple<int, List<Tuple<int, string>>> LevenshteinDistanceWithBacktrace(string s, string t)
{
int[,] d = new int[s.Length + 1, t.Length + 1];
int[,] b = new int[s.Length + 1, t.Length + 1];
for (int i = 0; i < s.Length + 1; i++)
{
d[i, 0] = i;
}
for (int j = 1; j < t.Length + 1; j++)
{
d[0, j] = j;
b[0, j] = 1;
}
for (int i = 1; i <= s.Length; i++)
{
for (int j = 1; j <= t.Length; j++)
{
IList<int> vals = new List<int>()
{
d[i - 1, j] + 1, // del
d[i, j - 1] + 1, // ins
d[i - 1, j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1) // sub
};
d[i, j] = vals.Min();
int idx = vals.IndexOf(d[i, j]);
b[i, j] = idx;
}
}
List<Tuple<int, string>> bt = new List<Tuple<int, string>>();
int x = s.Length;
int y = t.Length;
while (bt.Count != d[s.Length, t.Length])
{
switch (b[x, y])
{
case 0:
x--;
bt.Add(new Tuple<int, string>(0, s[x].ToString()));
break;
case 1:
y--;
bt.Add(new Tuple<int, string>(1, t[y].ToString()));
break;
case 2:
x--;
y--;
if (s[x] != t[y])
{
bt.Add(new Tuple<int, string>(2, s[x] + "" + t[y]));
}
break;
}
}
bt.Reverse();
return new Tuple<int, List<Tuple<int, string>>>(d[s.Length, t.Length], bt);
}
This function returns a tuple: the first element holds the Levenshtein distance, and the second a list of pairs (operation index, characters involved).
P.S.: an attentive reader will notice that there are often several least-cost paths from the bottom-right cell; enumerating all of them is one way to enlarge the candidate set, but for simplicity we omit this as well.
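For illustration, here is one possible result for the ca/abc pair; the exact operation list depends on the tie-breaking above (IndexOf picks the first operation with the minimal cost), so treat the listed trace as one valid answer rather than the only one:
var result = LevenshteinDistanceWithBacktrace("ca", "abc");
Console.WriteLine(result.Item1); // 3
// with the tie-breaking above the list comes out as
// (1, "a"), (1, "b"), (0, "a") - insert 'a', insert 'b', keep 'c', delete 'a'
foreach (Tuple<int, string> op in result.Item2)
{
Console.WriteLine(op.Item1 + " " + op.Item2);
}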
Mathematical Model - 2
Now let us describe the result obtained above in the language of formulas. For two words x and w we can compute the list of operations needed to transform the first word into the second; denote this list by f. Then the probability of writing the word x while meaning w equals the probability of making the entire list of errors f, given that we wrote exactly x and meant w:
$$P(x \mid w) = P(f \mid x, w)$$
Here the simplifications begin, similar to those made in the naive Bayes classifier:
- the order of the error operations does not matter;
- each error does not depend on the word we wrote or on the word we meant.
$$P(f \mid x, w) = \prod_{f_i \in f} P(f_i)$$
Now, to estimate the probabilities of the errors themselves (regardless of which words they were made in), it is enough to have any database of words with their misspellings. Let us write the final formula; to avoid working with numbers close to zero, we will work in log space:
$$\hat{w} = \arg\max_{w \in V}\left[\ln P(w) + \sum_{f_i \in f} \ln P(f_i)\right]$$
So, what do we have? Given a sufficiently large collection of texts, we can compute the a priori probabilities of the words of the language; given a database of words with their misspellings, we can compute the probabilities of the errors. These two databases are enough to implement the model.
Error Probability Smoothing
Running the argmax over the whole word database, we will never come across a word with zero prior probability. However, while computing the edit operations that transform the word x into the word w, we may run into operations that never occur in our error database. In this case additive smoothing, also known as Laplace smoothing, helps us (it was also used in the naive Bayes classifier). Let me recall the formula in the context of the current task. If some correction operation f occurs n times in the database, the total number of errors of that kind in the database is m, and there are t distinct correction types (for substitutions, for example, not how many times "a is replaced by b" occurs, but how many unique pairs "* is replaced by *" exist), then the smoothed probability is as follows, where k is the smoothing coefficient:
$$P(f) = \frac{n + k}{m + k \cdot t}$$
Then the probability of an operation that never occurs in the training database (n = 0) will be:
$$P(f) = \frac{k}{m + k \cdot t}$$
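As a small sketch, this estimate can be wrapped into a helper like the hypothetical one below; the NoisyChannel class further down computes the same expression inline:
// Hypothetical helper: Laplace-smoothed log probability of an edit operation.
// n - how many times the operation occurs in the error database,
// m - total number of errors of this type, t - number of distinct operations
// of this type, k - smoothing coefficient; n = 0 covers unseen operations.
public static double SmoothedLogProb(int n, int m, int t, double k = 1d)
{
return Math.Log((n + k) / (m + k * t));
}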
Speed
A natural question arises: what about the speed of the algorithm? After all, we have to run through the entire word database, which means hundreds of thousands of calls to the Levenshtein-distance-with-backtrace function, plus computing the error probability for every candidate (just a sum of numbers, provided precomputed logarithms are stored in the database). The following statistical facts come to the rescue:
- about 80% of all typing errors are within one edit operation of the intended word, i.e. at a Levenshtein distance of one;
- almost all typing errors are within two edit operations.
With this in mind, various algorithmic tricks are possible. I used an obvious and very simple way to speed up the algorithm: if we only need words within at most t edit operations of the current word, their length can differ from the current word's length by at most t. When the class is initialized, I build a table whose keys are word lengths and whose values are the sets of indices of words of that length, which noticeably shrinks the search space.
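A minimal sketch of that pre-filtering, with an illustrative helper name (BuildLengthIndex is not part of the NoisyChannel class that follows):
// Build the table once: word length -> indices of all words of that length.
public static IDictionary<int, IList<int>> BuildLengthIndex(string[] words)
{
IDictionary<int, IList<int>> index = new Dictionary<int, IList<int>>();
for (int i = 0; i < words.Length; i++)
{
int len = words[i].Length;
if (!index.ContainsKey(len))
{
index.Add(len, new List<int>());
}
index[len].Add(i);
}
return index;
}
// At lookup time, only lengths in [x.Length - maxEdit, x.Length + maxEdit] matter:
// for (int len = Math.Max(0, x.Length - maxEdit); len <= x.Length + maxEdit; len++)
//     if (index.ContainsKey(len)) { /* run the Levenshtein check only on those words */ }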
The code
Here is the code of the NoisyChannel class that I ended up with:
NoisyChannel
public class NoisyChannel
{
#region vars
private string[] _words = null;
private double[] _wLogPriors = null;
private IDictionary<int, IList<int>> _wordLengthDictionary = null; //length of word -> word indices
private IDictionary<int, IDictionary<string, double>> _mistakeLogProbs = null;
private double _lf = 1d;
private IDictionary<int, int> _mNorms = null;
#endregion
#region ctor
public NoisyChannel(string[] words, long[] wordFrequency,
IDictionary<int, IDictionary<string, int>> mistakeFrequency,
int mistakeProbSmoothing = 1)
{
_words = words;
_wLogPriors = new double[_words.Length];
_wordLengthDictionary = new SortedDictionary<int, IList<int>>();
double wNorm = wordFrequency.Sum();
for (int i = 0; i < _words.Length; i++)
{
_wLogPriors[i] = Math.Log((wordFrequency[i] + 0d)/wNorm);
int wl = _words[i].Length;
if (!_wordLengthDictionary.ContainsKey(wl))
{
_wordLengthDictionary.Add(wl, new List<int>());
}
_wordLengthDictionary[wl].Add(i);
}
_lf = mistakeProbSmoothing;
_mistakeLogProbs = new Dictionary<int, IDictionary<string, double>>();
_mNorms = new Dictionary<int, int>();
foreach (int mType in mistakeFrequency.Keys)
{
int mNorm = mistakeFrequency[mType].Sum(m => m.Value);
_mNorms.Add(mType, mNorm);
int mUnique = mistakeFrequency[mType].Count;
_mistakeLogProbs.Add(mType, new Dictionary<string, double>());
foreach (string m in mistakeFrequency[mType].Keys)
{
_mistakeLogProbs[mType].Add(m,
Math.Log((mistakeFrequency[mType][m] + _lf)/
(mNorm + _lf*mUnique))
);
}
}
}
#endregion
#region correction
public IDictionary<string, double> GetCandidates(string s, int maxEditDistance = 2)
{
IDictionary<string, double> candidates = new Dictionary<string, double>();
IList<int> dists = new List<int>();
for (int i = s.Length - maxEditDistance; i <= s.Length + maxEditDistance; i++)
{
if (i >= 0)
{
dists.Add(i);
}
}
foreach (int dist in dists)
{
if (!_wordLengthDictionary.ContainsKey(dist)) //skip lengths absent from the vocabulary
{
continue;
}
foreach (int tIdx in _wordLengthDictionary[dist])
{
string t = _words[tIdx];
Tuple<int, List<Tuple<int, string>>> d = LevenshteinDistanceWithBacktrace(s, t);
if (d.Item1 > maxEditDistance)
{
continue;
}
double p = _wLogPriors[tIdx];
foreach (Tuple<int, string> m in d.Item2)
{
if (!_mistakeLogProbs[m.Item1].ContainsKey(m.Item2))
{
p += Math.Log(_lf/(_mNorms[m.Item1] + _lf*_mistakeLogProbs[m.Item1].Count)); //log of the smoothed probability of an unseen operation
}
else
{
p += _mistakeLogProbs[m.Item1][m.Item2];
}
}
candidates.Add(_words[tIdx], p);
}
}
candidates = candidates.OrderByDescending(c => c.Value).ToDictionary(c => c.Key, c => c.Value);
return candidates;
}
#endregion
#region static helper
//del - 0, ins - 1, sub - 2
public static Tuple<int, List<Tuple<int, string>>> LevenshteinDistanceWithBacktrace(string s, string t)
{
int[,] d = new int[s.Length + 1, t.Length + 1];
int[,] b = new int[s.Length + 1, t.Length + 1];
for (int i = 0; i < s.Length + 1; i++)
{
d[i, 0] = i;
}
for (int j = 1; j < t.Length + 1; j++)
{
d[0, j] = j;
b[0, j] = 1;
}
for (int i = 1; i <= s.Length; i++)
{
for (int j = 1; j <= t.Length; j++)
{
IList<int> vals = new List<int>()
{
d[i - 1, j] + 1, // del
d[i, j - 1] + 1, // ins
d[i - 1, j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1) // sub
};
d[i, j] = vals.Min();
int idx = vals.IndexOf(d[i, j]);
b[i, j] = idx;
}
}
List<Tuple<int, string>> bt = new List<Tuple<int, string>>();
int x = s.Length;
int y = t.Length;
while (bt.Count != d[s.Length, t.Length])
{
switch (b[x, y])
{
case 0:
x--;
bt.Add(new Tuple<int, string>(0, s[x].ToString()));
break;
case 1:
y--;
bt.Add(new Tuple<int, string>(1, t[y].ToString()));
break;
case 2:
x--;
y--;
if (s[x] != t[y])
{
bt.Add(new Tuple<int, string>(2, s[x] + "" + t[y]));
}
break;
}
}
bt.Reverse();
return new Tuple<int, List<Tuple<int, string>>>(d[s.Length, t.Length], bt);
}
public static int LevenshteinDistance(string s, string t)
{
int[,] d = new int[s.Length + 1, t.Length + 1];
for (int i = 0; i < s.Length + 1; i++)
{
d[i, 0] = i;
}
for (int j = 1; j < t.Length + 1; j++)
{
d[0, j] = j;
}
for (int i = 1; i <= s.Length; i++)
{
for (int j = 1; j <= t.Length; j++)
{
d[i, j] = (new int[]
{
d[i - 1, j] + 1, // del
d[i, j - 1] + 1, // ins
d[i - 1, j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1) // sub
}).Min();
}
}
return d[s.Length, t.Length];
}
#endregion
}
The class is initialized with the following parameters:
- string[] words - the list of language words;
- long[] wordFrequency - the word frequencies;
- IDictionary<int, IDictionary<string, int>> mistakeFrequency - the frequencies of edit operations, grouped by operation type (del - 0, ins - 1, sub - 2);
- int mistakeProbSmoothing = 1 - the smoothing coefficient for the error probabilities.
Testing
For testing, we use the Peter Norvig data set, which contains 333,333 words with their frequencies, as well as misspellings for 7,481 words. The following code computes the values needed to initialize the NoisyChannel class:
Reading the data
string[] words = null;
long[] wordFrequency = null;
#region read priors
if (!File.Exists("../../../Data/words.bin") || !File.Exists("../../../Data/wordFrequency.bin"))
{
IDictionary<string, long> wf = new Dictionary<string, long>();
Console.Write("Reading data:");
using (StreamReader sr = new StreamReader("../../../Data/count_1w.txt"))
{
string line = sr.ReadLine();
while (line != null)
{
string[] parts = line.Split('\t');
wf.Add(parts[0].Trim(), Convert.ToInt64(parts[1]));
line = sr.ReadLine();
Console.Write(".");
}
sr.Close();
}
Console.WriteLine("Done!");
words = wf.Keys.ToArray();
wordFrequency = wf.Values.ToArray();
using (FileStream fs = File.Create("../../../Data/words.bin"))
{
BinaryFormatter bf = new BinaryFormatter();
bf.Serialize(fs, words);
fs.Flush();
fs.Close();
}
using (FileStream fs = File.Create("../../../Data/wordFrequency.bin"))
{
BinaryFormatter bf = new BinaryFormatter();
bf.Serialize(fs, wordFrequency);
fs.Flush();
fs.Close();
}
}
else
{
using (FileStream fs = File.OpenRead("../../../Data/words.bin"))
{
BinaryFormatter bf = new BinaryFormatter();
words = bf.Deserialize(fs) as string[];
fs.Close();
}
using (FileStream fs = File.OpenRead("../../../Data/wordFrequency.bin"))
{
BinaryFormatter bf = new BinaryFormatter();
wordFrequency = bf.Deserialize(fs) as long[];
fs.Close();
}
}
#endregion
//del - 0, ins - 1, sub - 2
IDictionary<int, IDictionary<string, int>> mistakeFrequency = new Dictionary<int, IDictionary<string, int>>();
#region read mistakes
IDictionary<string, IList<string>> misspelledWords = new SortedDictionary<string, IList<string>>();
using (StreamReader sr = new StreamReader("../../../Data/spell-errors.txt"))
{
string line = sr.ReadLine();
while (line != null)
{
string[] parts = line.Split(':');
string wt = parts[0].Trim();
misspelledWords.Add(wt, parts[1].Split(',').Select(w => w.Trim()).ToList());
line = sr.ReadLine();
}
sr.Close();
}
mistakeFrequency.Add(0, new Dictionary<string, int>());
mistakeFrequency.Add(1, new Dictionary<string, int>());
mistakeFrequency.Add(2, new Dictionary<string, int>());
foreach (string s in misspelledWords.Keys)
{
foreach (string t in misspelledWords[s])
{
var d = NoisyChannel.LevenshteinDistanceWithBacktrace(s, t);
foreach (Tuple<int, string> ml in d.Item2)
{
if (!mistakeFrequency[ml.Item1].ContainsKey(ml.Item2))
{
mistakeFrequency[ml.Item1].Add(ml.Item2, 0);
}
mistakeFrequency[ml.Item1][ml.Item2]++;
}
}
}
#endregion
The following code initializes the model and searches for corrections of the word "he;;o" (it obviously means hello: the semicolon key sits just to the right of l, so this is an easy typo to make, and the database does not list "he;;o" among the misspellings of hello) at a distance of at most 2, measuring the elapsed time:
NoisyChannel nc = new NoisyChannel(words, wordFrequency, mistakeFrequency, 1);
Stopwatch timer = new Stopwatch();
timer.Start();
IDictionary<string, double> c = nc.GetCandidates("he;;o", 2);
timer.Stop();
TimeSpan ts = timer.Elapsed;
Console.WriteLine(ts.ToString());
On my machine this computation takes a little less than a second on average, although of course optimization does not end with the methods described above; rather, it only starts with them. Take a look at the candidate replacements and their likelihoods:
[Screenshot: the list of candidate corrections for "he;;o" with their log probabilities.]
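GetCandidates sorts the candidates by score before returning them, so a snippet along these lines would print the top suggestions (the Take call assumes a using System.Linq directive, as elsewhere in the code):
// Print the five best candidates with their log scores.
foreach (KeyValuePair<string, double> kv in c.Take(5))
{
Console.WriteLine(kv.Key + "\t" + kv.Value);
}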
References
- Natural Language Processing course on Coursera
- Natural Language Corpus Data: Beautiful Data
- a good description of the Levenshtein and Damerau-Levenshtein algorithms
The archive with the code can be downloaded here.