ABBYY linguistic technologies: from complex to perfect

Humanity has been struggling for decades with the automatic processing of natural language and with getting a machine to understand the meaning of a text. The Russian company ABBYY has made some progress in this area: it developed Compreno, a universal linguistic platform designed to handle many applied tasks at a qualitatively different level.

The idea of tackling one of the key problems in the theory of artificial intelligence, the understanding of human language by a computer, arose among ABBYY experts fifteen years ago. At the suggestion of the company's founder, David Yang, research began, followed by experimental design and engineering work, on a new-generation machine translation system. It later grew into a separate project, Compreno (formerly called Natural Language Compiler), aimed at solving many tasks related to natural language processing.

The seriousness of ABBYY's intention to revolutionize computational linguistics is evidenced not only by the many years of work of more than three hundred employees, but also by the interest in the platform from the Development Fund of the Center for the Development and Commercialization of New Technologies (the Skolkovo Foundation), which selects the most promising projects and supports them. The financial side is no less convincing: the Skolkovo Foundation's total investment in Compreno is 475 million rubles, which is half of the project's financing. The other half (475 million rubles) is contributed by ABBYY itself. These are impressive figures that underline the scale and scope of the project.

The Sum of Technologies

To understand the mechanisms underlying Compreno and the logic of how they work, you need to understand the fundamental concept of the project. No matter what language people speak, the concepts they denote with words have far more in common than they have differences. We all live in houses, use furniture and telephones, drive cars, go to work in offices, fly on airplanes, and so on. These concepts are shared, and the way we imagine them does not depend on the language. Having caught this connecting thread, ABBYY built a universal semantic hierarchy of concepts that is independent of any particular language.
The semantic hierarchy of concepts is a tree that is universal for all languages. Its thick branches are more general concepts (for example, “movement”), and its thin branches are more specific meanings, structured from the general to the particular (“crawl”, “fly”, “walk”, “run”, and so on). If we are talking about the head of an organization, the concept “leader” sits at the top of this lexical class, and its subclasses hold more specific concepts such as “boss”, “chief”, “manager” and other words and phrases, which are the leaves on the tree of concepts.

Such a tree structure ensures that properties are inherited from ancestors to descendants, and it helps to avoid ambiguity when translating sentences from one language into another. The developers explain this with the Russian word rendered here as “management”, which corresponds to several concepts on different branches of the universal semantic tree: it can be interpreted as a department, or, for example, as an action. Because the semantic class “management” in the sense of an organization sits on one branch of the tree and “management” in the sense of an action sits on another, the system automatically selects the correct word when translating into English, choosing “department” or “management” depending on the context of the phrase.
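The idea of inheritance and sense selection described above can be illustrated with a minimal sketch. It is not ABBYY's implementation: the class `SemanticClass`, the node names, the `LEXICON` table and the key `"ru:management"` (a placeholder for the Russian word) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class SemanticClass:
    """A node of a language-independent concept tree (hypothetical model)."""
    name: str
    parent: Optional["SemanticClass"] = None
    properties: dict = field(default_factory=dict)

    def ancestors(self):
        """Yield the parent, grandparent, and so on up to the root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

    def get(self, key):
        """Look a property up on this node first, then on its ancestors."""
        node = self
        while node is not None:
            if key in node.properties:
                return node.properties[key]
            node = node.parent
        return None

    def is_a(self, other):
        """True if this node is `other` or one of its descendants."""
        return other is self or any(a is other for a in self.ancestors())


# A tiny invented fragment of the tree with two separate branches.
root = SemanticClass("CONCEPT")
organization = SemanticClass("ORGANIZATION", root, {"can_employ": True})
department = SemanticClass("DEPARTMENT", organization, {"en": "department"})
action = SemanticClass("ACTION", root)
managing = SemanticClass("TO_MANAGE", action, {"en": "management"})

# One ambiguous source word maps to senses on different branches.
LEXICON = {"ru:management": [department, managing]}


def translate(word, expected):
    """Pick the sense that fits the branch required by the context."""
    for sense in LEXICON[word]:
        if sense.is_a(expected):
            return sense.get("en")
    raise LookupError(f"no sense of {word!r} fits {expected.name}")
```

Here `translate("ru:management", organization)` picks the sense on the organization branch and `translate("ru:management", action)` picks the one on the action branch, while `department.get("can_employ")` shows a property inherited from an ancestor.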

The second major block of the Compreno platform is syntax. Syntax describes how concepts are related to each other within one or more sentences. To encode these relationships, languages use parts of the sentence, agreement, word order, cases, function words, conjunctions, prepositions and much more. Figuratively speaking, syntax is a large construction set made of these elements.

Different languages may use different elements of this construction set. In English, for example, word order is an important part of the syntax. Interrogative sentences are formed in one way and declarative sentences in another, with no alternatives. Some optional adverbials of time and place can be put at the beginning of the sentence, but usually the subject comes first, the predicate second, and everything else follows. In Russian the situation is different. Word order is not fixed, but agreement between words matters, and it is perhaps the biggest stumbling block for people learning Russian.

Another important thing to consider when parsing text is the omissions and links between words that arise when we leave a word out but understand that it is implied. A vivid example is the phrase “The boy loves red apples, and the girl green ones.” It is clear that the girl's part is also about apples (and about the fact that she loves them), and we understand this perfectly even though a few words are missing from the text. There are other, more complex syntactic links that Compreno parses successfully. For example: “Although the boy wanted to play, he understood that he did not have much time.” Here the word “boy” is replaced by a pronoun twice, and it is important for the machine to understand that this is the same object and to restore the missing nodes.

The Compreno syntax block determines the roles of the various concepts in a sentence and links them together. The system analyzes the text and builds a tree of links whose root is usually some kind of action. From it branch the subject, the object and other attributes, attached either to the object or to the subject, which convey the meaning of the particular sentence. To make the parsing as accurate as possible, Compreno uses semantic analysis based on the universal hierarchy of concepts described above. Taken together, this gives the machine a new level of freedom in processing text: it can “understand” the meaning of the original sentence and then synthesize that meaning in another language.
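A tree of links with an action at the root, and the restoration of omitted nodes, might be sketched as follows. This is an illustration only: the `Node` class, the role labels and the `restore_ellipsis` helper are hypothetical and say nothing about how Compreno represents its parse trees internally.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One concept in a sentence tree; the root is the action (predicate)."""
    concept: str
    role: str = "predicate"
    children: list = field(default_factory=list)

    def add(self, concept, role):
        """Attach a dependent concept and return it for further nesting."""
        child = Node(concept, role)
        self.children.append(child)
        return child

    def triples(self):
        """Flatten the tree into (head, role, dependent) triples."""
        for child in self.children:
            yield (self.concept, child.role, child.concept)
            yield from child.triples()


def restore_ellipsis(full, subject, attribute):
    """Rebuild an elliptical clause by copying the predicate and the
    object head from the preceding full clause (a simplifying assumption)."""
    restored = Node(full.concept)
    restored.add(subject, "subject")
    obj = next(c for c in full.children if c.role == "object")
    restored.add(obj.concept, "object").add(attribute, "attribute")
    return restored


# "The boy loves red apples ..."
first = Node("love")
first.add("boy", "subject")
first.add("apple", "object").add("red", "attribute")

# "... and the girl green ones": predicate and object are omitted.
second = restore_ellipsis(first, "girl", "green")
```

After restoration, `second` carries the same predicate and object as the first clause, so both clauses can be handed to later stages in the same form.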

Finally, the third important component of the ABBYY linguistic platform is statistics, which allows the system to combine words into phrases correctly and to deal better with homonymy, when the same word can mean different things (a typical example is the Russian word that can mean either “castle” or “lock”). Statistical information is just as important for the correct analysis of sentences that allow more than one interpretation. In the Russian original, the phrase translated here as “These types of steel are in our workshop” is ambiguous: the word rendered as “steel” can also be read as a verb form, and “types” can also mean shady characters. It can be analyzed correctly only by using data on the frequency of relationships between concepts, that is, by understanding the context or, in other words, the subject of discussion. If the text is about metallurgy, the phrase is about steel; if it is about people's behavior, the logical choice is the reading about some rather unpleasant characters.
The Compreno statistical model is based on an impressive collection of texts on various subjects and in various genres, which the system processes almost daily. Moreover, these are not arbitrary texts: they were written or translated from one language into another by a person. This approach reduces the likelihood of errors in decision making and of distortions when semantic structures are synthesized.
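The role of frequency data in choosing between readings can be shown with a toy example. The counts below are invented for illustration and are not real corpus statistics; the scoring rule (a plain sum of co-occurrence counts) is also an assumption, chosen only because it is the simplest thing that works.

```python
from collections import Counter

# Invented co-occurrence counts for two senses of one ambiguous word.
COOCCURRENCE = {
    "castle": Counter({"king": 40, "tower": 35, "siege": 20, "door": 2}),
    "lock": Counter({"door": 50, "key": 45, "open": 30, "tower": 1}),
}


def pick_sense(context_words, senses=COOCCURRENCE):
    """Return the sense whose known neighbours best match the context."""
    def score(sense):
        # Counter returns 0 for words it has never seen with this sense.
        return sum(senses[sense][word] for word in context_words)
    return max(senses, key=score)
```

With these counts, a context such as `["key", "door"]` favours the “lock” reading and `["king", "tower"]` favours “castle”. A real system would draw such counts from a large corpus of human-written and human-translated text.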

What is the result? ABBYY specialists combined knowledge, imagination, ideas and experience, and on “three pillars” (the semantic hierarchy of concepts, syntax and statistics) they built a language-independent model of data about how the world is structured, together with a model for accessing this data. This brought them as close as possible to having a computer understand the meaning of a text, and it made it possible to solve a wide range of linguistic problems. Which ones?

Mind Games

When talking about the practical significance of the ABBYY Compreno platform, the developers focus primarily on two key tasks: automatic translation of texts for many language pairs, and intelligent information retrieval.
The first task, text translation, is extremely important in the digital age, which erases formal borders and barriers between countries. With volumes of multilingual information constantly growing and modern projects involving more and more participants from different parts of the world, not only the speed of translation but also the quality of the resulting text becomes critical. In terms of quality, existing machine translation systems are not nearly as good as they might seem at first glance. The reason is the numerous fundamental limitations of the scientific approaches on which many existing machine translators are based. These limitations include the inability to handle exceptions correctly, the objective complexity of language structures, the disregard for semantics, the inability to capture the real links within a sentence, and other problems. Compreno technology is an engineering embodiment of the fundamental linguistic research of many scientists around the world, accumulating approximately 50 years of experience. Thanks to this, Compreno can overcome these difficulties and synthesize text with the same meaning as in the original language, or as close to it as possible. To assess the capabilities of the system, a fragment of the article “Google's ‘Babel fish’ heralds future of translation” is presented below as translated by a statistical translator and by the ABBYY platform. Comments, as they say, are unnecessary.

Source:
If we tried manually to give the system those languages, it would be a hopeless task. The only possible way we could do this is to harness the power of machine computation. We build statistical models that are automatically training themselves and learning all the time.

ABBYY Compreno:
If we tried to manually give the system those languages, this would be a hopeless task. The only possible way we could do this is to use the power of machine calculation. We create statistical models that automatically learn and learn all the time.

Statistical Translator:
If we tried manually to give the system these languages, it would be a hopeless task. The only possible way we could do this is to use the capabilities of the computing machine. We build statistical models that automatically educate ourselves and learn all the time.

The importance of the second task, intelligent search, follows from the enormous amount of information generated by humanity, which grows exponentially and requires different approaches to analyzing and finding the necessary data. Today search works mainly with keywords: when looking for a document, we first think of the words it should contain, then enter key phrases, get the data that meets the search criteria, and then manually pick out the information we are interested in. This familiar kind of search has several major drawbacks. First, it is far from always possible to formulate a query that accurately describes the information that needs to be found. Second, by coming up with qualifying words, we narrow the selection and limit the search. Finally, going through all the keyword combinations is sometimes extremely tiring, if not impossible. ABBYY Compreno technologies cope with all of these shortcomings: they allow for a search by meaning, using the concepts and relationships that the machine extracts from a query formulated in ordinary language.
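The difference between matching words and matching concepts can be sketched in a few lines. The `WORD_TO_CONCEPT` table, the concept labels and both search functions are hypothetical and deliberately naive (no morphology, no relationships between concepts); they only illustrate why a search over concepts finds documents that a keyword search misses.

```python
# Invented mapping from surface words to language-independent concepts.
WORD_TO_CONCEPT = {
    "boss": "LEADER", "chief": "LEADER", "manager": "LEADER",
    "car": "VEHICLE", "automobile": "VEHICLE",
    "buys": "PURCHASE", "purchases": "PURCHASE",
}


def concepts(text):
    """Map each word to its concept; unknown words stand for themselves."""
    return {WORD_TO_CONCEPT.get(word, word) for word in text.lower().split()}


def keyword_search(query, docs):
    """Return documents that contain every query word literally."""
    words = set(query.lower().split())
    return [doc for doc in docs if words <= set(doc.lower().split())]


def concept_search(query, docs):
    """Return documents that contain every concept named in the query."""
    wanted = concepts(query)
    return [doc for doc in docs if wanted <= concepts(doc)]


DOCS = ["chief buys automobile", "manager sells car"]
```

For the query `"boss purchases car"`, `keyword_search` returns nothing because no document contains those exact words, while `concept_search` returns the first document, which expresses the same concepts with different words.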
The “erudition” of the platform and the enormous amount of knowledge concentrated in it make it possible to use Compreno for many other applied tasks. On its basis, companies can create qualitatively new solutions for multilingual search and data classification, fact extraction and the identification of links between objects, monitoring, protection against unauthorized use of information, automatic summarization and annotation of documents, speech recognition and many other tasks.
An equally promising and interesting area of application for Compreno is text visualization. A striking example is the creation of animated videos and films from text scripts. This is the direction pursued by the company Bazelevs Innovations, which also takes an active part in the Skolkovo project and has already achieved certain results in creating a software package for interactive three-dimensional visualization of texts. ABBYY proudly declares that no other universal platform in the world today can solve so many applied problems that require high-quality linguistic analysis of texts.

Huge plans

Today, as mentioned above, more than 300 specialists take part in the project. Young staff, students of the ABBYY department at MIPT, and graduates of the country's leading universities (Moscow State University, Russian State University for the Humanities, Moscow State Linguistic University, St. Petersburg State University and many others) are actively involved. The roots of the work lie in serious studies in Russian and world linguistics, and ABBYY specialists draw on this scientific heritage. The company plans to attract leading world experts in linguistics to the project and to give the project international status.

ABBYY is currently running pilot projects to deploy software solutions based on Compreno. So far, the project initiators do not disclose details about the products being developed, but they assure that everyone will ultimately benefit from their deployment and widespread adoption, both software manufacturers and consumers, that is, you and me.

It is too early to say how much the ambitious ABBYY Compreno project will change people's lives. However, it is safe to say that in the near future computational linguistics will make significant progress in language modeling and will move to a completely new technological base, the foundation of which is being laid now.
