The landscape of cloud machine translation services. Lecture in Yandex

Published on November 18, 2018

    This is the last talk from the sixth Hyperbaton that we will publish on Habr. Grigory Sapunov from Intento shared his approach to assessing the quality of cloud machine translation services, presented the results of that assessment, and described the main differences between the available services.


    - My name is Grigory Sapunov, and I will be talking about the landscape of cloud machine translation services. We have been measuring this landscape for over a year now; it is very dynamic and interesting.



    I will explain what this landscape is and why it is useful to understand what is happening in it. I will cover the available solutions, of which there are quite a lot; compare stock models, that is, pre-trained machine translation models; discuss customized models, which have been appearing actively over the last year; and give my recommendations on choosing between them.

    Machine translation has become a very useful tool that helps automate many different tasks. It replaces a human only in certain domains, but even there it can greatly reduce costs. If you need to translate a large volume of product descriptions or reviews on a big web service, a human simply cannot keep up with the flow, and machine translation handles it well. There are already many ready-made solutions on the market: pre-trained models, often called stock models, and models with domain adaptation, an area that has developed strongly lately.

    At the same time, building your own machine translation solution is quite difficult and expensive. Modern machine translation technology, neural machine translation, needs a lot to get off the ground: the talent to build it, a lot of data to train it, and time. In addition, neural machine translation requires far more computing resources than earlier approaches such as SMT or rule-based systems.

    Meanwhile, the machine translation available in the cloud varies widely, and the right choice can greatly simplify your life, save time and money, and determine whether you solve your problem at all. The variation in quality, by the reference-based metrics we measure, can be as much as fourfold.



    Prices, meanwhile, can differ by a factor of 200. This is a completely abnormal situation: services of roughly the same quality can differ in price by 200 times. It is an easy way to save, or to waste, a lot of money.

    Services also differ significantly in product characteristics: format and file support, the presence or absence of a batch mode, the maximum amount of text the service can translate in one request, and much more. All of this needs to be understood when choosing a service. If you choose the wrong one, you will either have to redo the work or you will not get the quality you wanted. In the end it comes down to whether you bring something to market quickly, save money, and give your product the best quality, or you do not.



    Comparing these services to understand exactly which one suits you is long and expensive. If you do it yourself, you have to integrate with every cloud machine translation service: write the integrations, sign the contracts, set up separate billing for each. Then you have to run some of your data through all of these services and evaluate the results. It is prohibitively expensive; the budget of such a project can exceed the budget of the main project you are doing it for.

    So this is an important topic, but it is difficult to study on your own, and this is where we are good at helping you figure out what is what.



    There is a range of technologies on the market. Almost all services have moved to neural machine translation or some kind of hybrid, though a number of statistical machine translation systems remain.



    Each has its own characteristics. NMT looks like the more modern, better technology, but there are subtleties.

    In general, neural machine translation works better than previous models, but you also need to keep an eye on it: it can produce completely unexpected results. Like Yoda, it can fall silent, returning an empty answer for some input string, and you need to be able to catch that and understand how it behaves on your data. Or take a great example from e-commerce: a long product description was sent to machine translation, and it simply replied that this was a backpack, and nothing else. That was the stable behavior of a machine translation service which is good and works fine on general data, such as news, but works poorly in this particular area, e-commerce. You need to understand this and run all of these services on your own data in order to choose the one that fits your data best: not the one that works best on news or anything else, but the one that works best on your particular case. This has to be checked every time.



    There are many levels of customization. Level zero is its absence: pre-trained stock models, which is everything the various providers now deploy in the cloud. At the other end there are models fully customized on your corpus: you essentially place an order with a company that does machine translation, and it builds a model for you from your data from scratch. But that is slow, expensive, and requires large corpora. One large provider will charge you around $5,000 for such an experiment, numbers of that order, so it is expensive even to try. And it guarantees you nothing: you can train a model that ends up worse than what is already available on the market, and the money is thrown to the wind. Those are the two extreme options: either a stock model, or one customized on your corpus.

    There are intermediate cases. There are glossaries, a very good mechanism that helps improve current machine translation models. And there is domain adaptation, now actively developing, some form of transfer learning, whatever hides behind those words, which lets you take a general or even a specialized model and continue training it on your data, so that its quality ends up better than the general model alone. This is a good technology that works and is in a stage of active development. Keep an eye on it; I will come back to it later.



    There is another important dimension: deploy on-premises or use the cloud. There is a popular misconception here: people still think that if you use cloud machine translation services, they will take your data and train their models on it. This has not been true for the last year or two. All the major services have abandoned this practice and explicitly state in their terms of service that they do not use your data to train their models. This is important: it removes a whole set of barriers to adopting cloud machine translation. You can now safely use these services, confident that the service will not train its models on your data and will not become your competitor over time. It is safe.

    This is the first advantage of the cloud compared to two years ago.

    The second advantage: if you deploy neural machine translation in-house, you need to stand up fairly heavy infrastructure with GPU accelerators to train all these neural networks. And even after training, you still need high-performance GPUs for inference to make it work. It turns out to be expensive; the cost of owning such a solution is genuinely large. A company that is not planning to professionally offer an API to the market does not need to do this; it should take a ready-made cloud service and use it. Here you save money and time, and you have a guarantee that your data will not be used for the service's own needs.

    About the comparison.



    We have been working on this topic for a long time and have been measuring quality regularly for a year and a half. We chose automatic reference-based metrics: they let you do this at scale and obtain confidence intervals. We know more or less at what amount of data the quality metrics stabilize, and we can make an adequate choice between different services. But remember that automatic and human metrics complement each other. Automatic metrics are good for a preliminary analysis and for choosing the places people should pay particular attention to; then linguists or domain experts look at those translations and choose what suits you.
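    One common way to get a confidence interval for an automatic corpus score is a percentile bootstrap over sentence-level scores. The sketch below uses only the standard library and entirely hypothetical scores; it is an illustration of the idea, not our actual measurement pipeline.

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean corpus score."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical sentence-level scores for one engine on a small test corpus.
scores = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.64, 0.53, 0.69, 0.58]
low, high = bootstrap_ci(scores)
print(f"mean={sum(scores) / len(scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

    When the intervals of two services overlap heavily, the test set is too small to rank them; this is exactly why the size at which metrics stabilize matters.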



    I will talk about the systems on the market, how we analyzed them and how they compare on price, and then about the results of our analysis: what matters for quality, and what matters beyond quality when choosing a service.



    First of all, there is already a large number of cloud machine translation services. We considered only those that offer pre-trained models you can take and start using right away, and that have a public API.

    There are also services that have no public API or are deployed on-premises; we do not consider them in our study. But even so there are many left: we measure and evaluate 19 such services. Practice shows that the average person knows a few market leaders and does not know about the rest, yet they exist, and some of them are good.



    We took the popularity of languages on the web and divided them into four groups: the most popular (more than 2% of sites), less popular, and so on down. We analyze by these four groups, focusing mainly on the first group, the most popular languages, and a little on the second.



    Support within the first three groups is almost 100%. If you need a language that is not super exotic, you will find it in the cloud. If you need an exotic pair, it may turn out that one of the languages is not supported by any cloud machine translation service. But even with all the restrictions, about half of all possible pairs are supported, which is not bad.



    Out of all this we tested 48 pairs, arranged in a matrix: primarily English paired with all the languages of the first group, some pairs within the first group, and a little of English with the languages of the second group. This more or less covers typical usage scenarios, though many other interesting pairs remain outside it. We evaluated and measured these pairs, and here is what is happening. The full report is available at the link; it is free, we update it regularly, and I encourage you to use it.



    No numbers or axes are visible on this chart, but it is about the support of different languages by different machine translation systems. On the X axis are the systems; on the Y axis, in logarithmic scale, the number of supported pairs, total and unique. In this picture red is unique, blue is total. You can see that if you have a very exotic combination of languages, you may need to use seven different providers because of this uniqueness: only one of them supports the very specific pair you need.



    To assess quality, we chose news corpora, a general-domain corpus. This does not guarantee that the situation on your specific data from another area will be the same; most likely it will not be. But it is a good demonstration of how to approach this research in general and how to choose the right service for you. I will show it on the example of the news domain; it transfers easily to any other domain of yours.



    We chose the hLEPOR metric, which is roughly comparable to BLEU, but in our intuitive judgment gives a better impression of how the services relate to each other. For simplicity, assume the metric runs from 0 to 1: 1 is full agreement with a reference translation, 0 is complete disagreement. hLEPOR gives a better intuitive feel than BLEU for what a difference of 10 points means. You can read about the metric separately; everything is described in the research methodology. It is a normal proxy metric, not perfect, but it conveys the essence well.
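    To make the "0 to 1 against a reference" idea concrete, here is a deliberately simplified stand-in metric: unigram F1 overlap with a single reference. This is not hLEPOR (which also accounts for word order, n-grams, and more), just a minimal sketch of how reference-based scoring bounds a translation between full match and no match.

```python
from collections import Counter

def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy [0, 1] translation metric: unigram F1 against one reference.
    A simplified stand-in for hLEPOR/BLEU-style reference-based scoring."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    if not hyp or not ref:
        return 0.0
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("the cat sat on the mat", "the cat sat on the mat"))  # 1.0, full match
print(unigram_f1("a dog runs", "the cat sat on the mat"))              # 0.0, no overlap
```

    A partially correct hypothesis lands strictly between 0 and 1, which is the behavior any such proxy metric shares.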



    The difference in prices is enormous. We made a matrix of what a translation of 1 million characters costs. You can download it and see that the spread is colossal, from $5 to $1,000 per million characters. Choosing the wrong service can raise your costs tremendously; choosing the right one can save a lot. The market is opaque, and you need to understand what costs what and at what quality. Keep this matrix in mind. Comparing all the services on price is difficult: pricing is often not very transparent, policies are unclear, and there are various tiers. It is all complicated, and this table helps you make a decision.
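    The arithmetic behind that spread is simple but worth writing down. With hypothetical per-million-character prices (the real matrix is in the report), a fixed monthly volume turns the 200x price gap directly into a 200x cost gap:

```python
# Hypothetical per-million-character prices in USD; real prices vary widely.
price_per_mchar = {"provider_a": 5.0, "provider_b": 20.0, "provider_c": 1000.0}

def monthly_cost(chars_per_month: int, price_per_million: float) -> float:
    """Translation spend for a given monthly character volume."""
    return chars_per_month / 1_000_000 * price_per_million

volume = 50_000_000  # e.g. 50M characters of product descriptions per month
for name, price in sorted(price_per_mchar.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${monthly_cost(volume, price):,.2f}/month")
```

    At 50M characters a month, the cheapest hypothetical provider costs $250 and the most expensive $50,000 for roughly comparable output.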



    We have condensed the results of our analysis into pictures like this one. It shows the maximum available quality for the pairs we measured (the greener, the higher the available quality) and the competition within each pair, that is, whether there is anything to choose from: for some pairs roughly 8 providers deliver the maximum available quality, for others only 2. The dollar icon shows the price at which you get that maximum quality. The spread is large: in some places you can get acceptable quality cheaply, in others it is both mediocre and expensive; various combinations occur. The landscape is complex; there is no single super-player who is better everywhere, cheaper, and so on. There is a choice everywhere, and everywhere it has to be made rationally.



    Here we have drawn the best systems for these language pairs. You can see there is no single best system: different services win on different pairs in this particular domain, news, and in other domains the situation will change. In some places Google is good; in others DeepL, a recent European translator that few people know about, a small company that successfully competes with Google and beats it, with genuinely good quality. On the Russian-English pair Yandex is consistently good. Amazon appeared recently, added Russian among other languages, and is also not bad. This is a fresh change: a year ago much of this did not exist, and there were fewer leaders. The situation is very dynamic.



    Knowing the best system is not always what matters; often you want the optimal system. If you take the top 5% of systems by quality and pick the cheapest among them, you get good quality at the best price, and the picture changes significantly: Google drops out of the comparison, Microsoft rises sharply, Yandex grows, Amazon comes up even more, and more exotic providers appear. The situation becomes different.



    If you look at all machine translation providers, with providers on the horizontal axis and how often each appears in one of these tops on the vertical, then almost every one of them lands in the top 5% sooner or later. Across the pairs we measured, 7 providers are best for at least one pair, and 7 are optimal for at least one. This means that if you have a set of target languages and want maximum or optimal quality, one provider is not enough: you need to assemble a portfolio of providers, and then you get maximum quality and maximum efficiency for the money. No single player is the best. If you have complex tasks and need many different pairs, you are headed straight for using multiple providers, and that is better than relying on any single one.
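    The portfolio idea above can be sketched in a few lines: for each pair, take the cheapest provider within 5% of the best score for that pair, then integrate the union of the selected providers. All scores, prices, and provider names here are hypothetical.

```python
# Hypothetical quality scores (0-1 metric) per pair and prices per provider.
quality = {
    ("en", "ru"): {"p1": 0.52, "p2": 0.50, "p3": 0.40},
    ("en", "de"): {"p1": 0.45, "p2": 0.55, "p3": 0.54},
    ("en", "zh"): {"p1": 0.30, "p3": 0.44, "p4": 0.43},
}
price = {"p1": 20.0, "p2": 15.0, "p3": 5.0, "p4": 10.0}  # USD per M chars

def optimal_provider(scores: dict, tolerance: float = 0.05) -> str:
    """Cheapest provider within `tolerance` (5%) of the pair's best score."""
    best = max(scores.values())
    candidates = [p for p, s in scores.items() if s >= best * (1 - tolerance)]
    return min(candidates, key=lambda p: price[p])

portfolio = {pair: optimal_provider(scores) for pair, scores in quality.items()}
print(portfolio)                        # optimal pick per language pair
print(sorted(set(portfolio.values())))  # providers you actually need to integrate
```

    Note how the portfolio shrinks: three pairs may need only two integrations, because one cheap provider is near-best on several pairs.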



    The market is very dynamic, and the number of offerings is growing rapidly. We began measuring at the beginning of 2017; a fresh benchmark was published in July. The number of available services keeps growing; some are still in preview without public pricing, in some kind of alpha or beta you can use, though the terms are not very clear.



    Quality grows more slowly, but it grows. The most interesting changes happen within specific language pairs.



    For example, the situation within the English-Russian pair is very dynamic. Yandex has greatly improved its quality over the past six months. Amazon appeared, shown on the right with a single dot, and follows close behind Yandex. The provider GTCom, which almost no one knows, has improved a lot: it is a Chinese provider that translates well from Chinese into English and Russian, and handles English-Russian well too.

    A more or less similar picture occurs across all language pairs. Everywhere something is changing: new players keep appearing, quality shifts, models get retrained. There are stable providers whose quality does not change, but here the stable ones are rather the dead ones, because the others, the unstable ones, are mostly improving. That is the good story: they keep getting better.



    If you compute a more complex price-quality metric, there is steady improvement. The cost of high-quality machine translation is constantly decreasing: every month, every year, you get better machine translation for less money. That is good.


    Link from the slide

    Beyond price and quality, there is a huge layer of issues that also matter when choosing a provider: product features such as HTML and XML support, support for tricky and not-so-tricky formats, bulk mode, language auto-detection (a popular topic), glossary support, customization, and service reliability. And what we call developer happiness; you can read what we mean by that at the reference.



    By DX, developer experience, we mean a huge number of different aspects: the availability of good documentation, clear error codes and messages, compliance with HTTP standards, a playground where you can experiment with the API dynamically, convenient billing, and much more. All of this strongly affects the decision to adopt a particular service or not. If the developer is cursing while connecting a new API, that is a bad signal; the developer may say "we do not need this," and indeed some APIs are simply hard to use for specific tasks because they lack something you need. This is an important aspect.

    This is an example of such a chart for one real service, which is relatively good compared to the others. For many other services this diagram collapses toward zero: often there is no decent documentation, no SDK, it is unclear how to work with billing, it is impossible to export usage data, and much more. Support is inadequate. It is a complicated topic.

    We recently came across a remarkable service that appears to be public, yet its API documentation is available only after signing an NDA. There are many strange cases like this. It really is a factor in decision making; know about it, because it may surface at some point.

    That was the part about the stock models on the market. I hope I conveyed the general feeling that the market is dynamic, there are many players, and there is no single super-leader. Each is better at something, and you will most likely have to build a portfolio of providers if you want to translate into many different languages.

    The second interesting topic is customized models, which began to appear relatively recently. We have begun measuring them and will soon release a report; for now I will share the preliminary results.



    Many services now support some kind of customization: glossaries, or additional training on your data. There are many such providers: first of all the top-tier ones such as Google, Microsoft, and IBM, plus some more exotic ones that few people know about, but which also allow it.



    How do we compare here? We chose one special domain, biomed, for which there are not many stock models and which has special terminology. We chose the English-German pair simply because it was easier for us to assemble corpora for it. We tried training models on samples ranging from 10 thousand to 1 million sentences, and built a test dataset of 2 thousand sentences: by our measurements the metric stabilizes at 2 thousand sentences, which makes it possible to compare services adequately. 50 sentences is not enough.

    We chose the hLEPOR metric, trained models at all these providers on our datasets, measured quality on our test dataset, and at the same time measured the quality of the stock models on it, to establish a baseline, a reference point. I will show how quality changes during training. An important aspect here is the cost of owning these models; we will cover it separately in the report when we put it all together. The situation is more complicated here: you have the cost of training models, some time and money, not always transparent; the cost of hosting the model; and the cost of using it, which varies from service to service. The cost of ownership consists of these three components, and it must be calculated before switching to a custom engine.
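    Those three components can be put into one formula. The sketch below uses entirely hypothetical numbers; each real service prices training, hosting, and usage differently, and some components may be zero or bundled.

```python
def total_cost_of_ownership(training_usd: float,
                            hosting_usd_per_month: float,
                            usage_usd_per_mchar: float,
                            mchars_per_month: float,
                            months: int) -> float:
    """Custom-model TCO = one-off training + hosting + per-character usage."""
    return (training_usd
            + hosting_usd_per_month * months
            + usage_usd_per_mchar * mchars_per_month * months)

# Hypothetical comparison over one year at 10M characters/month:
custom = total_cost_of_ownership(2000, 100, 40, 10, 12)  # train, host, and use
stock = total_cost_of_ownership(0, 0, 20, 10, 12)        # pay per use only
print(f"custom: ${custom:,.0f}, stock: ${stock:,.0f}")
```

    Under these made-up numbers the custom model costs several times more per year, so the quality gain has to justify the difference, which is exactly the calculation to do before switching.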



    Preliminary results show that it really works. Here is an example with Microsoft, across 3 versions of its API. On the biomed domain the stock model works rather badly, but that is a normal story: you cannot conclude that Microsoft is the worst. It works well on generic domains; it apparently was not trained on this one. The normal story is to recognize in time that the stock model does not work for your domain. Just 10 thousand sentences turn out to be enough, and Microsoft starts working well on your specific domain, and by consistently increasing the dataset you keep raising the quality. This is a good outcome: it adapted quickly and can be used.



    For IBM, the stock model works well out of the box, but you can also raise the quality with additional training. The quality grows steadily; an improvement of even 2% is a good improvement.



    Google AutoML, launched recently, also works quite well: the stock model itself was of good quality on this particular dataset, and training on 10 or 100 thousand sentences improves the quality further.



    If you draw all of this in one picture, with Microsoft, Google, and a number of stock models (Yandex, DeepL, Amazon, Google stock, Microsoft stock), you see that this particular case is an interesting one. How do you make decisions in such a situation? You need to understand that on your data domain some stock models are bad, but some may be good. Yandex, Google, and DeepL, it turns out, work quite well on biomed out of the box and even exceed the quality of some of the trained models. That is interesting: if you discover it at the very start of the study, you can stop there and use the stock model. That is great.

    On the other hand, it gives you a lower bound on quality against which you can evaluate further improvements and understand whether they are worth the money you are going to pay for them. And consistently increasing the size of the training dataset steadily improves these models: you can get higher quality, and in general quality improves with the amount of data you send. And remember, this is only your data; the service will not use it to train its general models. This is a significant difference that has not yet settled into people's heads, but it has happened. You can safely send your data, and the service will not compete with you later.

    How do you consciously approach choosing a cloud translation engine for your specific tasks?



    Prepare a test corpus. Without it, comparison is difficult. You can compare with linguists, but that is expensive work and hard to reproduce.

    Once you have a test corpus, compare the stock models on the market. It may turn out that one of them already suits you; it happens. We found that specific services work well out of the box, for example on legal documents or other domains. They can be used immediately without training special models; you just need to find the engine that was trained on data similar to yours. Either one fits, or they set the quality baseline against which you will compare customized solutions offered to you, or other cloud solutions you will train yourself. Knowing your baseline is a good thing.

    Prepare a glossary and, if you can, some kind of training corpus. If you can spend the effort to collect such datasets, it makes sense to try the adaptable models. They will either fit, or they will set the bar for a contractor who would build a fully custom solution for you. In any case, they are likely to raise your overall quality. Then the choice is yours; it is pure economics: is the quality improvement worth the money you will pay for it, or not.


    Links from the slide: first, second, third, fourth, fifth

    How can we help here? In many ways. We publish reports comparing machine translation systems; the latest is available at the link, along with all the previous ones. We try to do them about once a quarter. They are free, so read them: they contain far more detail than what I have said today.

    Soon we will release a report on customized models, with more detail on how the services compare in quality after training and what it all costs. We have a single API for all machine translation services: one integration is enough to use all the best services available on the market. We have SDKs for NodeJS and .NET, and a CLI. In addition, there will soon be an API for assessing model quality: you upload your data, run it through the selected providers, compute the metrics, and send us the resulting data, and we will pick the best model. The process is well automated, so it will be much cheaper and easier to choose what fits your specific case and start using it, through us or on your own.

    Soon we will also have web tools for translation. Not everyone who uses machine translation wants to write integrations, even against a single API; that is understandable. It will be possible to try different services in the browser, see who is best for your case, and use it.

    The main conclusions: there is no single leader. Do not wait for one super-rating that declares one provider the best; that is not how it works. Often you need to assemble a portfolio of providers to ensure maximum quality. The quality of all stock models is constantly improving, and you have to follow it to notice when a service has become better in quality than yours, or better value for the money. The machine translation market is becoming more fragmented: providers and models trained on special corpora appear that are more effective than the general ones. Remember DeepL? An interesting provider that managed to train on its unique data and beat Google on many language pairs.

    In addition, remember that now, with your own unique data, you can train your own models in cloud services and use them. Their quality is likely to be much better than the default models, and certainly better than the wrong model. Thank you.