NLP: Spell Checking - An Inside Look (Part 2)

    (Part 1) Today we will talk about the levels at which our system understands text: which spelling errors are easy to catch, which are harder, and which are extremely difficult.

    To begin with, text can be viewed from two points of view: either as a simple sequence of words, spaces and punctuation marks, or as a network of concepts connected to each other by syntactic-semantic dependencies. Say, in a sentence like "I love big dogs" (at least in a language with relatively free word order) the words can be rearranged, while the structure of the relations between them stays the same:



    Errors can also occur at different levels. You can make a mistake in the linear structure: type the same word twice, forget a period or a closing parenthesis, and the like. That is, the error occurs while stringing words into a chain, and a person is unlikely to make it out of ignorance of a grammatical rule; it is most likely simple carelessness. However, even at the level of chains you can identify some real grammatical errors. Say, an English text should not contain the combination "was been" - it should be either "was" or "has been".

    Still, real grammar begins at the level of analyzing the network of concepts. Say, in our example about dogs, the subject "I" must agree in person and number with the verb "love" regardless of the relative position of these words in the sentence. Of course, this does not eliminate the need to catch the simpler errors detected at the level of the linear structure.

    First levels

    It is wise to start error checking with the simplest things. If something can be caught by a banal substring search, why bring in the heavy artillery? The simplest piece of functionality is the AutoCorrect list that exists in MS Word. We are talking about obvious typos corrected by the system automatically, without even asking the user for confirmation: abbout -> about, amde -> made, compleatly -> completely. AutoCorrect may seem such an obvious feature that there is no point in talking about it, but as an active user of MS Word I find that it helps me often, and I would be glad to see AutoCorrect in other word processors as well. That, after all, is what we are working for.
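    The idea behind such a list is trivially simple. Here is a minimal sketch in Python; the typo dictionary below is illustrative, not Word's actual AutoCorrect data:

# Minimal sketch of an AutoCorrect-style substitution list.
# The entries are illustrative examples, not a real word processor's data.
AUTOCORRECT = {
    "abbout": "about",
    "amde": "made",
    "compleatly": "completely",
}

def autocorrect(word: str) -> str:
    """Return the corrected form of a word if it is a known typo."""
    return AUTOCORRECT.get(word.lower(), word)

print(autocorrect("amde"))  # -> "made"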

    A little more complicated are simple rules based on regular expressions that treat the text as one big string. Today we use regular expressions to catch simple typographic roughness: misplaced spaces around punctuation marks, extra spaces between words, non-standard combinations like "!?" and so on. In principle, even at the level of simple substring search you can find a number of errors that the spell checker misses.
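    To give a feeling for this level, here is a rough sketch of such checks in Python; the patterns and names are simplified examples of mine, not our actual rule set:

import re

# Illustrative patterns for the typographic problems mentioned above;
# a real checker would use a larger, language-aware list.
TYPO_PATTERNS = [
    (re.compile(r"\s+([,.;:!?])"), "space before punctuation"),
    (re.compile(r" {2,}"), "multiple spaces between words"),
    (re.compile(r"!\?"), "non-standard '!?' combination"),
]

def find_typographic_issues(text: str):
    """Yield (position, description) for every match of a known pattern."""
    for pattern, description in TYPO_PATTERNS:
        for match in pattern.finditer(text):
            yield match.start(), description

for pos, issue in find_typographic_issues("Hello , world!?  Nice to meet you."):
    print(pos, issue)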

    Sentences and Tokens

    Now we move on to more interesting things. In order to analyze the text as text, rather than as a string of characters, we need to learn to identify its structural elements. We are still far from the structure of relations between words, so let's start with the simplest thing: recognizing the boundaries of sentences. Why do we need this? Well, first, there are errors characteristic of the beginning and end of a sentence: forgetting to start with a capital letter, forgetting the period at the end (or the ellipsis, question mark or exclamation mark). There are less obvious cases too; for example, it is stylistically unacceptable to start a sentence with a number written in digits. Second, without the sentence structure in place it is impossible to proceed to the next stage of analysis, identifying the relationships between the words of a sentence.

    Like everything in our world, the task of breaking text into sentences only seems simple at first glance. The obvious algorithm is to find a sentence-ending punctuation mark followed by a capital letter. But then we will be wrong every time we meet a person's name: "As M. Ivanov stated ..." Abbreviations ending in a period trip us up in a similar way. In principle, you can add a list of names and abbreviations to the analyzer, but this solution will still have flaws. For example, it will have to be completely rewritten for every new language with its own rules for ending sentences. In addition, there is an obvious problem: we assume that the input text contains errors (the whole point of our module is to correct them), so how do we break the text into sentences given incorrect input? In that case we automatically lose the ability to see an error at the junction of sentences. If we assume that a boundary lies between a period and a capital letter, then an error of the kind "forgot the capital letter" immediately becomes unreachable, because the system simply does not understand that a sentence boundary is in that place.
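    To make the pitfall concrete, here is a sketch of the naive rule in Python; the regular expression is a deliberate simplification for illustration:

import re

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace and a capital letter.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def naive_split(text: str):
    return NAIVE_BOUNDARY.split(text)

# Works on clean text...
print(naive_split("It was late. We went home."))
# ...but breaks on initials and abbreviations:
print(naive_split("As M. Ivanov stated, the report was ready."))
# -> ['As M.', 'Ivanov stated, the report was ready.']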

    Today the most popular approaches to breaking real-world text into sentences use classification algorithms based on machine learning.

    Sentence splitter

    In short, the classification task is to determine the correct class C of an object based on its attributes (A1, ..., An). The learning algorithm is given a large sample of known objects as input, from which it forms an idea of how the attribute values affect an object's membership in a particular class. You can then feed the algorithm the attributes of an unknown object and get its most probable class.

    A textbook example of a classification problem is determining the species of an iris flower based on four parameters: the length and width of the petal and of the sepal. There are three species: Iris setosa, Iris virginica and Iris versicolor. The Iris dataset lists 150 entries (50 per species) of the following form:



    A classification algorithm can study this data and build a model of how the attributes of an iris correspond to one species or another. If an unknown iris grows in my garden, I can measure its parameters and ask the algorithm which species it belongs to.
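    As an illustration, here is roughly what this looks like in Python with scikit-learn (assuming the library is available; the choice of a decision tree here is arbitrary):

# Sketch of the iris classification example using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
model = DecisionTreeClassifier().fit(iris.data, iris.target)

# Measurements of an "unknown" iris from the garden:
# sepal length, sepal width, petal length, petal width (cm)
sample = [[5.1, 3.5, 1.4, 0.2]]
print(iris.target_names[model.predict(sample)[0]])  # most likely 'setosa'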

    A classifier can also be used for other purposes: making a decision based on a table of standard decisions for given circumstances, or revealing hidden patterns. Another textbook example studies the "attributes" of passengers who survived the wreck of the Titanic, in order to understand what the chances were for different groups of people (child / adult, male / female, passenger / crew member, holder of a first-, second- or third-class ticket).

    There are many different classification algorithms: decision trees, nearest neighbors, Bayesian networks, the maximum entropy model, linear regression... Each has its own advantages and disadvantages. Some work better with numerical data, some are easier to program, some are faster, and some express the classification rules they find in a form convenient for analysis.

    Applied to breaking text into sentences, the classifier works as follows. The input is text that has been split into sentences manually. The system studies the "attributes" of each sentence end (in effect, it looks at what stands to the left and to the right of the boundary) and builds a classification model. Now we can feed the algorithm any position in the text and ask whether it is a sentence boundary.
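    For illustration, here is a sketch of what such an attribute extractor might look like; the specific features below are my own simplified examples, not any particular published feature set:

# Hypothetical feature extractor: for each candidate boundary (a '.', '!' or '?'
# in the text) we look at what stands to its left and to its right.
def boundary_features(text: str, i: int) -> dict:
    """Attributes of the character at position i, assumed to be '.', '!' or '?'."""
    left_words = text[:i].split()
    right_words = text[i + 1:].split()
    left = left_words[-1] if left_words else ""
    right = right_words[0] if right_words else ""
    return {
        "punct": text[i],
        "left_token": left,                                  # word before the mark
        "left_is_abbrev_like": len(left.rstrip(".")) <= 2,   # e.g. the initial "M."
        "right_starts_upper": right[:1].isupper(),
        "right_is_digit": right[:1].isdigit(),
    }

text = "As M. Ivanov stated, the report was ready."
print(boundary_features(text, text.index(".")))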

    What subtleties are there? First, we have a not-quite-standard formulation of the classification problem. As input we get only contexts of sentence ends:

    (A1, ..., An) -> (end of sentence)

    In the classical setting, the knowledge base should include all the options, i.e.:

    (A1, ..., An) -> (end of sentence)
    (A1, ..., An) -> (not end of sentence)

    In our case, cramming every example of a non-end of a sentence into the table is too costly: the knowledge base would grow enormously. Apparently for this reason A. Ratnaparkhi, the most frequently cited author of a machine learning scheme for our problem, used the principle of maximum entropy. This model lets you simply ask for the probability that an object belongs to a given class, without regard to the other possible classes. In other words, we ask the model how likely this context is to be the end of a sentence. If the algorithm answers that the probability is higher than 1/2, we can mark the context as a sentence boundary.
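    To give a feeling for the mechanics, here is a toy sketch in which logistic regression stands in for the maximum entropy model (the two are closely related); the training examples and features are made up purely for illustration:

# Candidate boundaries are described by feature dictionaries (as in the
# extractor above) and labelled 1 = end of sentence, 0 = not an end.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_features = [
    {"punct": ".", "left_is_abbrev_like": False, "right_starts_upper": True},   # real boundary
    {"punct": ".", "left_is_abbrev_like": True,  "right_starts_upper": True},   # initial "M."
    {"punct": "?", "left_is_abbrev_like": False, "right_starts_upper": True},   # real boundary
    {"punct": ".", "left_is_abbrev_like": False, "right_starts_upper": False},  # mid-sentence dot
]
labels = [1, 0, 1, 0]

model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit(train_features, labels)

candidate = {"punct": ".", "left_is_abbrev_like": False, "right_starts_upper": True}
prob = model.predict_proba([candidate])[0][1]
print(f"P(end of sentence) = {prob:.2f}")  # mark a boundary if this exceeds 1/2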

    I think it makes sense to try other classification algorithms. As far as I know this has not been done; if I get around to it, I will. Ratnaparkhi's experiments show the accuracy of his algorithm to be around 98%, that is, out of a hundred sentence ends it guesses 98 correctly.

    Unfortunately, in a spell checker we again run into the fact that the input text may contain errors. If you train the model on correct texts split into sentences, the computer will rightly decide that a sentence always starts with a capital letter. If we drop capitalization from the attributes considered, the model's accuracy will fall. Alternatively, you can manually introduce a few mistakes into the reference texts (in some places "forget" the period, in others replace a capital letter with a lowercase one). Furthermore, in Ratnaparkhi's system we must first find a potential sentence boundary and then ask the system for its opinion of that place. He does it simply: find a period, an exclamation mark or a question mark, and ask what it is. In our case the user may have forgotten the period - what then?

    Today, in addition to punctuation marks, I also check line breaks (from personal experience, if I forget to put a period somewhere, it is usually at the end of a paragraph). You could try examining every space between words, but I am afraid accuracy would fall. In general, there is something to think about :)

    Okay, that's enough for today, we will continue next time.
