Pseudo-lemmatization, composites and other strange words

Contents of the series of articles on morphology
• Morphology and computational linguistics for beginners
• The role of morphology in computational linguistics
• Morphology: tasks and approaches to solving them
• Pseudo-lemmatization, composites and other strange words
We did not manage to cover all the tasks in the previous post, so we will continue here.
It often happens that a neologism appears on the Internet. For example, the verb «троллить» ("to troll"). The noun «тролль» is in the dictionary, but «троллить» is not, and, as we found out earlier, affixes are not simply split off from the root during parsing, so we have no idea what kind of word «троллить» is or how it inflects. To analyze this word, we have to use pseudo-lemmatization. To do this, we again use the so-called reverse tree of endings (endings written right to left).
We immediately find the empty ending, so we could assume that «троллить» is a noun with a zero ending. Reading further from the right we reach the soft sign «ь», and nothing ends in a bare soft sign. But «-ть» is a typical ending for verbs. Thus, we can assume that the word «троллить» has the stem «тролли-» and the ending «-ть». Now we can produce other forms: if we discard «-ть» and substitute the past-tense masculine inflection «-л», we get the word «троллил».
In addition, we now understand that «троллить» is a verb infinitive, which means that when we translate a sentence containing «троллить» into another language, we will know that this word expresses some action in the infinitive. We can therefore render it by transliteration, for example "trolling" or "to troll", or express it in some other way. This is precisely the task of pseudo-lemmatization: to parse unknown words, even without understanding their semantics.
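The reverse-ending lookup described above can be sketched in a few lines. This is a toy illustration, not the real system: the ending inventory is reduced to three entries, and the right-to-left ending tree is replaced by a simple longest-match scan.

```python
# Toy ending inventory: a real system stores thousands of endings in a
# tree keyed right to left; here a longest-match scan stands in for it.
ENDINGS = {
    "ть": "verb infinitive",
    "л": "verb past masculine",
    "": "noun with zero ending",
}

def guess_analyses(word):
    """Propose (stem, ending, tag) splits, trying longer endings first."""
    candidates = []
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending):
            stem = word[: len(word) - len(ending)]
            candidates.append((stem, ending, ENDINGS[ending]))
    return candidates

def past_masculine(word):
    """If the word parses as an infinitive, build its past-tense form."""
    for stem, ending, tag in guess_analyses(word):
        if tag == "verb infinitive":
            return stem + "л"  # drop -ть, attach past-tense masculine -л
    return None

print(guess_analyses("троллить"))
print(past_masculine("троллить"))  # троллил
```

A real pseudo-lemmatizer would also check that the hypothesized stem behaves consistently across a whole paradigm before trusting the guess.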
The tip of the iceberg
We have examined the basic problems that morphology faces in computational linguistics. It is important to understand that this is only a fraction of what it does. Here is a partial list of the other issues we work on.
Composites
Take the Russian word «тепловозостроение» ("diesel-locomotive building"). What тепло- ("heat"), воз ("carriage"), строение ("building") and similar roots mean separately is clear. Problems begin when we start combining these roots, and they can be combined almost endlessly. Composite rules are harmful and dangerous because of combinatorial explosion. When we analyze a word that needs to be split according to a composite rule, a priori we must consider a split after every letter, and only where the resulting parts are found in dictionaries should we actually assume a boundary. This is the first place where an explosion could theoretically occur, because one-letter words are common in a language: conjunctions, prepositions. Because of this, the number of possible split points grows many times over.
In addition, not all parts can be glued together in any order. Consider «тепловозостроение» again: we can first restore the connection тепловоз + строение (giving the actual meaning, "building diesel locomotives"). Another order is also conceivable: first воз + строение into «возостроение», and then тепло + «возостроение». But the word «возостроение» does not exist, while «тепловоз» is a perfectly ordinary dictionary word. It turns out that the order in which the pieces of a composite are glued matters. A native speaker quickly understands how to recover the meaning of the word correctly, but an algorithm for analyzing composites has to sort through a factorial number of different orderings.
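The splitting procedure described above can be sketched as a recursive search. Everything here is an assumption for illustration: the four-word dictionary and the linking-vowel rule are toys, not a real morphological resource.

```python
# Toy composite splitter: try a cut after every letter and keep only
# splits whose parts are found in the dictionary, allowing a linking
# vowel (like the "о" in тепловоз-о-строение) between parts.
DICTIONARY = {"тепловоз", "строение", "тепло", "воз"}
LINKING_VOWELS = {"о", "е"}

def split_composite(word):
    """Return all ways to read `word` as dictionary parts plus connectors."""
    if word in DICTIONARY:
        return [[word]]
    results = []
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head not in DICTIONARY:
            continue
        # a linking vowel may sit between the two parts
        if tail and tail[0] in LINKING_VOWELS:
            for rest in split_composite(tail[1:]):
                results.append([head] + rest)
        for rest in split_composite(tail):
            results.append([head] + rest)
    return results

print(split_composite("тепловозостроение"))
```

Even this tiny dictionary already yields two readings, тепловоз + строение and тепло + воз + строение; ranking or vetoing the spurious ones is exactly where the combinatorial pain lives.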
Collocations
Suppose we are trying to analyze a ticket that says: "Los Angeles-San Francisco".
If we split on spaces, we get the word "Angeles-San" and two separate words, "Los" and "Francisco". What is "Angeles-San"? A respectful address to a Japanese person named Angeles? Our system must understand that "Los Angeles" is one phrase, "San Francisco" is another, and that there are no such phrases as "Los Francisco" or "San Angeles".
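One common fix is a gazetteer of known multiword names that is consulted before trusting naive whitespace splitting. A minimal sketch, assuming a two-entry phrase list:

```python
# Toy gazetteer of known multiword names (an assumption for illustration).
PHRASES = {("los", "angeles"), ("san", "francisco")}
MAX_LEN = 2  # longest phrase in the gazetteer, in words

def tokenize(text):
    """Greedy longest-match tokenization against the phrase gazetteer."""
    # split on spaces and hyphens alike, since "Los Angeles-San Francisco"
    # hides a phrase boundary inside the hyphenated token "Angeles-San"
    words = text.replace("-", " ").split()
    out, i = [], 0
    while i < len(words):
        for n in range(MAX_LEN, 1, -1):
            chunk = tuple(w.lower() for w in words[i:i + n])
            if chunk in PHRASES:
                out.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return out

print(tokenize("Los Angeles-San Francisco"))  # ['Los Angeles', 'San Francisco']
```

Because "Los Francisco" and "San Angeles" are absent from the gazetteer, the greedy match never produces them.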
Input Error Correction Algorithms
Here we face two tasks at once. First, before doing anything, it is important to determine whether the user made a mistake when entering a word or wrote it that way intentionally. Second, if it really was a mistake, we need to understand where in the word the error is and what the user actually wanted to write.
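The second task, figuring out which word the user meant, is classically attacked with edit distance. A minimal sketch using plain Levenshtein distance over a tiny assumed dictionary (real spellers also weigh keyboard geometry, phonetics and word frequency):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming, O(len(a)*len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(word, dictionary, max_dist=2):
    """Return the nearest dictionary word, or the input if nothing is close."""
    best = min(dictionary, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else word

words = {"morphology", "linguistics", "lemma"}
print(correct("morfology", words))  # morphology
```

The `max_dist` cutoff is a crude stand-in for the first task: a word far from everything in the dictionary is more likely intentional (a name, a neologism) than a typo.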
Statistical processing
In Russian, as in any other language, different words are used with different frequencies. In many tasks, knowing these frequencies is invaluable, for example in the same pseudo-lemmatization. The system finds two options, and it must decide which one to choose. If there is context, it can yield information that helps determine the correct option. If there is no context, all the options found must be displayed, and in that case it is better to rank them: those that are statistically more common come first, the less common ones last.
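The ranking step can be sketched directly: sort the candidate analyses by corpus frequency, most frequent first. The frequency table below is invented for illustration:

```python
# Made-up corpus frequencies standing in for real statistics.
FREQUENCIES = {"steel": 120_000, "steal": 45_000, "stele": 300}

def rank(candidates):
    """Most frequent first; words unseen in the corpus get 0 and sink last."""
    return sorted(candidates, key=lambda w: FREQUENCIES.get(w, 0), reverse=True)

print(rank(["stele", "steal", "steel"]))  # ['steel', 'steal', 'stele']
```

With context available, these raw frequencies would be replaced (or reweighted) by context-conditioned probabilities.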
Discuss!
We examined in detail the main tasks that natural language processing assigns to computational morphology. Of course, not all of the problems have been solved, but we are working on them. If this topic interests you, if you would like to learn more or perhaps share your own ideas, I will be glad to chat with you in the comments.