Yandex.Translation offline. How computers learned to translate well

    Today, the updated Yandex.Translation application for iOS has been released on the App Store. It now supports full-text offline translation. Machine translation has come a long way: from mainframes occupying entire rooms and floors to mobile devices that fit in your pocket. Full-text statistical machine translation, which once required enormous resources, is now available to any user of a mobile device, even without a network connection. People have long dreamed of a "Babel fish" - a universal compact translator you can always take with you. And it seems this dream is gradually starting to come true. We decided to take the opportunity to prepare a short excursion into the history of machine translation and to talk about how this fascinating field developed at the junction of linguistics, mathematics and computer science.

    "The machine does it all," "The electronic brain translates from Russian into English," "Robot bilingual" - such were the newspaper headlines readers of the jubilant press saw on January 8, 1954. The day before, on January 7, the IBM 701 scientific computer had taken part in the famous Georgetown experiment, translating about sixty Russian phrases into English. The 701 used a dictionary of 250 words and six syntax rules - and, of course, a very carefully selected set of sentences on which the testing was conducted. The result looked so convincing that enthusiastic journalists, citing scientists, claimed that within a few years machine translation would almost completely replace the classic "manual" kind.

    The Georgetown experiment was one of the first steps in the development of machine translation (and one of the first applications of computers to natural language). Many of the problems that lay ahead were not yet obvious. The main one, ironically, was there from the very beginning: the hardest task for the computer was handling ambiguous words. On more or less natural sentences, the system almost completely failed to cope. The complex multi-component structure of such systems also created problems: parsing did not always work correctly, so a compound term like guitar pick could be rendered word by word and lose its meaning. Ambiguous words whose meaning depends on context were also translated poorly. For instance, the text "Little John was looking for his toy box. Finally he found it. The box was in the pen" caused (and continues to cause) a lot of difficulties: "toy box" can be parsed as a box that is a toy rather than a box for toys, and "the pen" gets translated as a writing pen rather than a playpen. The difficulties were enormous, and over the next twelve years almost no progress was made. In 1966, a devastating report by ALPAC (Automatic Language Processing Advisory Committee) put an end to machine translation research for the next ten years.

    Meanwhile, while the mood after the Georgetown experiment was still very upbeat and a great future was being predicted for machine translation, the Americans began to think seriously about using the new technology for strategic purposes. This was well understood in the USSR. At the beginning of 1955, the Academy of Sciences of the USSR created two research groups: one at the V. A. Steklov Mathematical Institute (led by A. A. Lyapunov) and one at the Institute of Precision Mechanics and Computer Engineering (led by D. Yu. Panov). Both groups began with a detailed study of the Georgetown experiment, and in 1956 Panov published a brochure describing the results of the first machine translation experiments conducted on the BESM computer. In the same year, a publication on similar research at the Steklov Institute, authored by Olga Kulagina and Igor Mel'chuk, appeared in the September issue of the journal "Questions of Linguistics". The publication was accompanied by various introductory articles, and here something interesting came to light: it turned out that back in 1933 a certain Pyotr Petrovich Troyansky, an Esperantist and one of the co-authors of the Great Soviet Encyclopedia, had approached the Academy of Sciences of the USSR with a machine translation project and a request to discuss it with the Academy's linguists. The scientists were skeptical of the idea: discussions around the project dragged on for eleven years, after which contact with Troyansky was suddenly lost, and he himself reportedly left Moscow.

    This historical find surprised the researchers, and a search began. Troyansky's author's certificate for a "mechanized dictionary", a device for quickly translating texts into several languages simultaneously, was found. After the next plenary meeting, at which Lyapunov read a report on this invention, the Academy of Sciences created a special committee to study Troyansky's contribution. Several years later, in 1959, I. K. Belsky and D. Yu. Panov published the article "P. P. Troyansky's Translation Machine: A Collection of Materials on a Machine for Translation from One Language into Others, Suggested by P. P. Troyansky in 1933". Soon the author's certificate itself was published, revealing a very original technical design.



    In Troyansky's project, the machine was a table with an inclined surface, in front of which a camera was mounted, combined with a typewriter. The typewriter's keyboard consisted of ordinary keys that made it possible to encode morphological and grammatical information. The typewriter's ribbon and the camera's film were linked together and fed synchronously. On the surface of the table lay the so-called "glossary field": a freely moving plate with words printed on it. Each word was accompanied by translations into three, four or more languages. All words were given in their initial form and arranged so that the most frequently used ones were closer to the center, like letters on a keyboard. The machine operator was to shift the glossary field, photograph a word together with its translations, and at the same time type on the typewriter the grammatical and morphological information related to that word. The result was two tapes: one with words in several languages at once, and the other with grammatical annotations for them. When the entire source text had been typed in this way, the material went to native speakers, the "auditors", who had to compare the two tapes and compose texts in their languages. The materials were then handed over to editors who knew both languages; their task was to bring the text into literary form.



    The main idea of the invention is the division of the translation process into three main stages (incidentally, in modern terminology the first and last would be called "pre-editing" and "post-editing"). Interestingly, the most labor-intensive stages - encoding the source text and synthesizing texts in other languages from that information - required the operators to know only their native language.

    Thus, translation proceeded first from the natural language into its logical form, then between the logical forms of the two languages, and finally the text in the logical form of the target language was verified and turned back into natural form. Troyansky, as a historian of science, undoubtedly knew of Leibniz's and Descartes's ideas about a universal language and translation through an interlingua, and the influence of these ideas can be traced in the technology he proposed. Moreover, as an Esperantist, Troyansky built his system for encoding grammatical information on the basis of Esperanto grammar (which he was later forced to abandon for political reasons).

    Especially interesting is that as early as the 1940s Troyansky was considering the prospect of a "powerful translation device based on modern communication technologies". During his lifetime, however, the inventor's ideas were met with great skepticism by the academic community and were subsequently forgotten. Troyansky died in 1950, just a few years before work on machine translation began in the Soviet Union. The English machine translation historian John Hutchins believes that if Troyansky's contribution had not been forgotten, the principles of his translation machine would have formed the basis of the first experiments on BESM, which would have placed the inventor among the "fathers" of machine translation alongside Warren Weaver. But, unfortunately, history has no subjunctive mood.

    Let's fast-forward to the eighties. After ALPAC, no one but the most desperate enthusiasts had any serious desire to work on machine translation. However, as often happens, business became the engine of progress. By the late sixties the trend toward globalization was already obvious: international companies faced an urgent need to maintain close trade contacts in several countries at once. In the 1980s, business demand for quick translation of documents and news grew, and machine translation was taken off the shelf. The European Economic Community, the future European Union, did not lag behind: in 1976 it began actively using SYSTRAN, the first commercial machine translation system in history. The system later became an almost obligatory purchase for any self-respecting international company: General Motors, Dornier and Aerospatiale among them. Japan did not stand aside either: ever-growing volumes of business with the West forced large Japanese corporations to pursue their own developments in this area. In most cases, though, these systems (like SYSTRAN) were variations on the rule-based approach, with its well-known congenital defects: the inability to handle ambiguous words, homonyms and idiomatic expressions correctly. Such systems were also very expensive, since building the dictionaries required a large staff of professional linguists, and inflexible: adapting them to a new subject area was quite costly, to say nothing of adding a new language. Researchers nevertheless continued to focus on systems that used rules along with semantic, syntactic and morphological analysis.

    A truly new era of machine translation began in the 1990s. Researchers had realized that natural language is very hard to describe formally, and applying formal descriptions to living text is harder still. It was too difficult and resource-intensive a task, and other ways had to be found.

    As usual, when a problem seems almost insoluble, it helps to change the perspective. IBM reappeared on the scene: one of its research groups developed a statistical machine translation system called Candide. Its specialists approached machine translation from the standpoint of information theory. The key idea was the concept of the noisy channel. The noisy-channel model treats a text in language A as an encrypted text in some other language B, and the translator's task is to decrypt it.

    Let's resort to an amusing illustration. Imagine an Englishman who is studying French and has come to France to practice it. His train has arrived in Paris, and our hero needs to find the left-luggage office at the Gare du Nord. After an unsuccessful search he finally turns to a random passerby and, having composed the phrase in English in advance, asks him in French whether he knows where the left-luggage office is. The English phrase he conceived is, as it were, "distorted" into a French one. As luck would have it, the passerby is also an Englishman, and he knows French rather poorly. He reconstructs the meaning of the phrase using his knowledge of French and a rough idea of what his interlocutor most likely meant - that is, put simply, he tries to guess which English phrase the speaker had in mind.

    IBM's employees worked with precisely this pair, French and English: the research team had at its disposal a huge number of parallel documents from the records of the Canadian government. They built their translation models as follows: they collected counts for all word combinations of a certain length in the two languages, and from them estimated the probability that each combination corresponds to a given combination in the other language.
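    To make this concrete, here is a minimal sketch (not IBM's actual code; the toy corpus and its alignments are invented for illustration) of how conditional phrase-translation probabilities can be estimated by relative frequency from counts of aligned phrase pairs:

```python
from collections import Counter, defaultdict

# A toy "parallel corpus": pairs of aligned English/French phrases.
# In a real system these pairs come from automatic alignment over
# millions of sentence pairs.
aligned_phrase_pairs = [
    ("the house", "la maison"),
    ("the house", "la maison"),
    ("the house", "la demeure"),
    ("blue house", "maison bleue"),
]

# Count how often each (english, french) pair was aligned,
# and how often each English phrase occurred overall.
pair_counts = Counter(aligned_phrase_pairs)
english_counts = Counter(e for e, _ in aligned_phrase_pairs)

# Relative-frequency estimate of the translation model P(f | e).
translation_prob = defaultdict(dict)
for (e, f), count in pair_counts.items():
    translation_prob[e][f] = count / english_counts[e]

print(translation_prob["the house"])
# "la maison" gets probability 2/3, "la demeure" gets 1/3
```

    Real systems smooth these estimates and store them in compact phrase tables, but the core idea is exactly this kind of counting.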

    Then the most probable translation e, say into English, of a French phrase f can be defined as follows:

    e* = argmax_{e ∈ E} P(e | f) = argmax_{e ∈ E} P(f | e) · P(e)

    where E is the set of all English phrases in the model. Just as the Englishman tried to guess his compatriot's thoughts, the algorithm tries to find the most frequent English phrase that has at least something to do with what might have been meant when the French phrase was uttered.

    This simple approach proved surprisingly effective. IBM applied no linguistic rules, and in fact hardly anyone in the group knew French at all. Despite this, Candide worked, and worked rather well! The results of the study and the overall success of the system were a real breakthrough in machine translation. Most importantly, Candide's experience proved that an expensive staff of first-class linguists drawing up translation rules is not necessary. The development of the Internet provided access to the huge amounts of data needed to build large translation and language models, and researchers concentrated their efforts on developing translation algorithms, collecting corpora of parallel texts, and aligning sentences and words across languages.

    While statistical machine translation was still undergoing industrial development and slowly making its way to Internet users, rule-based systems dominated the online translation market. It should be noted that rule-based translation appeared long before the Internet and first reached the masses as programs for desktop computers and, a little later, for portable (palm-size and handheld) devices. Online versions appeared only in the mid-90s, and the most widely used was the familiar SYSTRAN: in 1996 it became available to Internet users, allowing small texts to be translated online. Soon afterwards the AltaVista search engine adopted SYSTRAN to launch the BabelFish service, which successfully survived as part of Yahoo until 2012.

    Google, the pioneer of statistical online translation, launched the first version of its Translate service only in 2007, but it very quickly gained worldwide popularity. The service now offers not only translation between more than 70 languages but also many useful tools such as error correction and spoken pronunciation. It is followed by the less popular but quite powerful and actively developing online translator from Microsoft, which offers translation for more than 50 languages. In 2011, Yandex.Translation appeared; it now supports more than 40 languages and offers a variety of tools to simplify typing and improve translation quality.

    The history of Yandex.Translation began in the summer of 2009, when Yandex started research in statistical machine translation. It all began with experiments with open statistical translation systems, the development of technologies for finding parallel documents, and the creation of systems for testing and evaluating translation quality. In 2010, work began on highly efficient translation algorithms and programs for building translation models. On March 16, 2011, a public beta version of the Yandex.Translation service was launched with two language pairs: English-Russian and Ukrainian-Russian. In December 2012, a mobile application for the iPhone appeared, followed six months later by a version for Android, and six months after that by a version for Windows Phone.

    And here we return to the starting point of the story: the advent of offline translation. Recall that statistical machine translation was originally developed to run on powerful server platforms with practically unlimited RAM. Not long ago, however, a movement in the opposite direction began: turning heavyweight server applications into compact applications for smartphones. Two years ago the Bing Translator application for Windows Phone learned to work without an Internet connection, and in 2013 Google launched full-text offline translation on Android. Yandex worked in this direction too, and the Yandex.Translation mobile app for iOS gained first an offline dictionary and now full-text offline translation. What once required a mainframe, and later a powerful server with dozens of gigabytes of RAM, today fits in your pocket or purse and works autonomously, without contacting a remote server. Such a translator will work where there is no Internet yet: high above the clouds, twenty thousand leagues under the sea, and even in space.

    Summing up, we can say that tremendous progress has been made in machine translation over the past decades. And although we are still very far from instant, invisible translation from any language in the galaxy, the huge leap made in this field is beyond doubt, and we would like to hope that new generations of machine translation systems will keep striving toward that goal.
