What does it take to build a smart product?

Original author: Aria Haghighi
  • Translation
Recently, the phrase "machine learning" (ML) has become incredibly fashionable. As with any fashionable technology, the enthusiasm far outstrips the number of concrete products built on it. One can argue, but few algorithmic technologies since Google's amazing innovations 10-15 years ago have produced products that are genuinely widespread in popular culture. It's not that there have been no breakthroughs in machine learning since then; there simply haven't been any as striking and as firmly rooted in computational algorithms. Netflix may offer clever recommendations, but Netflix would still be Netflix without them. Yet if Brin and Page had not analyzed the graph structure of the web and its hyperlinks for their own selfish purposes, we would not have Google.

Why is that? It's not for lack of trying: many startups have wanted to bring natural language processing to the masses, but one after another they sank into oblivion once people actually tried to use them. The difficulty of building a good product with machine learning lies not in understanding the underlying theory, but in understanding the domain and the task - deeply enough to see intuitively what will work and what will not. Interesting tasks do not have off-the-shelf solutions. Our current level in applied areas, natural language processing included, is driven far more by insights specific to those areas than by new techniques for solving generic machine learning problems. Often this difference is what separates a program people use every day from one that never leaves the lab.

I'm not trying to convince you not to make cool machine learning products. I'm just trying to clarify why this is so difficult.

Progress in machine learning

Machine learning has come a long way in the last ten years. When I entered graduate school, training linear large-margin classifiers (i.e., SVMs, support vector machines) was done with the SMO algorithm. The algorithm required access to all the data at once, and training time grew indecently with the size of the training sample. Simply implementing it required an understanding of non-linear programming, and choosing meaningful constraints and tuning the parameters was black magic. Now we know how to train classifiers of almost the same quality in linear time, online, with relatively simple algorithms. Similar results have appeared in the theory of (probabilistic) graphical models: Markov chain Monte Carlo and variational methods have simplified inference in complex graphical models (and although MCMC had long been used by statisticians, it has only recently been applied to large-scale machine learning). It gets almost comical: compare the top papers in the proceedings of the Association for Computational Linguistics (ACL) and you will see that the machine learning techniques used recently (2011) are far more sophisticated than those used in 2003.
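To make the contrast concrete, here is a minimal sketch of the kind of simple, online, linear-time training the field has moved to: a Pegasos-style stochastic subgradient step on the hinge loss, processing one example at a time. The data and hyperparameters below are invented purely for illustration, not a reference implementation.

```python
import numpy as np

def train_online_svm(stream, dim, lam=0.01, epochs=1):
    """Online training of a linear max-margin classifier using a
    Pegasos-style subgradient step on the hinge loss. Each example
    is touched once per pass, so cost grows linearly with the data."""
    w = np.zeros(dim)
    t = 0
    for _ in range(epochs):
        for x, y in stream:           # x: feature vector, y in {-1, +1}
            t += 1
            eta = 1.0 / (lam * t)     # decaying learning rate
            if y * np.dot(w, x) < 1:  # margin violated: move toward the example
                w = (1 - eta * lam) * w + eta * y * x
            else:                     # margin satisfied: only regularize
                w = (1 - eta * lam) * w
    return w

# toy usage: two linearly separable points
data = [(np.array([1.0, 0.0]), +1), (np.array([0.0, 1.0]), -1)]
w = train_online_svm(data, dim=2, epochs=50)
print(np.sign(data[0][0] @ w), np.sign(data[1][0] @ w))  # 1.0 -1.0
```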

In education, progress has also been tremendous. Studying at Stanford in the first half of the 2000s, I took Andrew Ng's machine learning course and Daphne Koller's course on probabilistic graphical models. Both are among the best courses I took at Stanford; Koller's is perhaps not just the best course at Stanford - it also taught me a great deal about teaching itself. Back then they were available to about a hundred people a year. Now these courses are available to everyone on the Web.

As someone who does machine learning in practice (natural language processing in particular), I can say that all of these advances have made much of my research easier. However, the key decisions I make have little to do with the abstract algorithm, the form of the objective function, or the loss function; they concern the set of features specific to a particular task. And that skill comes only with experience. So while it is great that a wider audience is getting an idea of what machine learning is, this is still not the hardest part of building smart systems.
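To illustrate what "features specific to a particular task" means in practice, here is a hypothetical, hand-rolled feature extractor for review sentiment. Every feature name and heuristic in it is an assumption made up for this sketch; the point is that the interesting decisions (negation handling, emphasis, price mentions) come from knowing the domain, not from the learning algorithm.

```python
import re

# Hypothetical hand-crafted features for review sentiment.
NEGATIONS = {"not", "no", "never", "hardly"}

def review_features(text):
    tokens = re.findall(r"[a-z']+|!+", text.lower())
    feats = {}
    negated = False
    for tok in tokens:
        if tok.startswith("!"):           # punctuation handled separately
            continue
        if tok in NEGATIONS:              # mark the next word as negated
            negated = True
            continue
        key = ("NOT_" if negated else "") + tok
        feats[key] = feats.get(key, 0) + 1
        negated = False
    feats["HAS_EXCLAMATION"] = int(any(t.startswith("!") for t in tokens))
    feats["MENTIONS_PRICE"] = int(bool(re.search(r"\$\d+|price|cheap|expensive",
                                                 text.lower())))
    return feats

print(review_features("The food was not good, but cheap!"))
# {'the': 1, 'food': 1, 'was': 1, 'NOT_good': 1, 'but': 1, 'cheap': 1,
#  'HAS_EXCLAMATION': 1, 'MENTIONS_PRICE': 1}
```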

Ready-made solutions are not suitable for interesting tasks

The real problems you want to solve are much messier than the abstractions machine learning theory offers you. Take machine translation, for example. At first glance it looks like a statistical classification task: you take a sentence in one language and want to predict the corresponding sentence in another. Unfortunately, the number of sentences in any natural language is combinatorially huge, so the problem cannot be solved as a black box. Any good approach relies on decomposing the problem into smaller ones, and the program then learns to solve those smaller tasks. I would argue that progress on complex tasks such as machine translation comes from better partitioning and structuring of the search space, not from cleverer learning algorithms trained over that space.

The quality of machine translation has grown by leaps and bounds over the past ten years. I think this happened mainly thanks to key insights specific to translation, although general algorithmic improvements also played a role. Statistical machine translation in its current form goes back to the remarkable paper "The mathematics of statistical machine translation", which introduced the noisy-channel architecture that later translators would be built on. Roughly speaking, it works like this: for every source word there is a set of possible translations into the target language (including the empty word, in case the source word has no equivalent). Think of it as a probabilistic dictionary. The resulting words are then rearranged to produce a sentence that reads fluently in the target language. Many details are glossed over in this explanation: how to handle candidate sentences and their permutations, how to train models of typical reorderings from a source language into the target, and, finally, how to evaluate the fluency of the result.
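A toy sketch of the noisy-channel idea, under loud assumptions: the probabilistic dictionary and the bigram language model below are entirely invented, and word reordering is ignored. The decoder simply picks the target sentence that best balances translation probabilities against fluency.

```python
from itertools import product
import math

# Toy noisy-channel decoder. All probabilities are invented for illustration;
# reordering of words is ignored in this sketch.
translation_table = {            # P(target word | source word)
    "ya":     {"i": 0.9},
    "videl":  {"saw": 0.7, "seen": 0.3},
    "sobaku": {"dog": 0.6, "a dog": 0.4},
}

bigram_lm = {                    # P(word | previous word), fake values
    ("<s>", "i"): 0.5, ("i", "saw"): 0.4, ("i", "seen"): 0.05,
    ("saw", "dog"): 0.1, ("saw", "a dog"): 0.3,
    ("seen", "dog"): 0.05, ("seen", "a dog"): 0.1,
}

def lm_score(words):
    prev, logp = "<s>", 0.0
    for w in words:
        logp += math.log(bigram_lm.get((prev, w), 1e-6))
        prev = w
    return logp

def decode(source):
    candidates = product(*(translation_table[w].items() for w in source))
    best, best_score = None, float("-inf")
    for cand in candidates:
        words = [w for w, _ in cand]
        logp = sum(math.log(p) for _, p in cand) + lm_score(words)
        if logp > best_score:
            best, best_score = words, logp
    return " ".join(best)

print(decode(["ya", "videl", "sobaku"]))   # -> "i saw a dog"
```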

A key breakthrough in machine translation came precisely from changing this model. Instead of translating individual words, newer models consider translations of whole phrases. For example, a single Russian word, "вечером", roughly corresponds to the English phrase "in the evening". Before phrase-based translation, a word-by-word model could produce only a single word for it (IBM Model 3 allows several target words per source word, but the chance of getting a good translation that way is still small); it is hardly possible to obtain a good English sentence like that. Phrase-based translation yields smoother, livelier text that resembles the speech of a native speaker. Of course, adding phrase fragments brings complications. It is unclear how to score a phrase pair, since we never observe the phrase segmentation itself: no one tells us that "in the evening" is a phrase that must correspond to a phrase in the other language. What is striking is that the difference in translation quality comes not from a clever machine learning technique, but from a model tailored to the specific task. Many have, of course, tried fancier learning algorithms, but the improvement from them was usually not that large.
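For comparison, here is an equally toy sketch of the phrase-based idea: segment the source into phrases found in a tiny, invented phrase table and keep the best-scoring monotone segmentation. Reordering and the language model are left out, so this is only the skeleton of the approach.

```python
import math

# Toy monotone phrase-based scoring. The phrase table and its
# probabilities are invented for illustration.
phrase_table = {
    ("vecherom",):    [("in the evening", 0.8)],
    ("my", "pojdem"): [("we will go", 0.7)],
    ("my",):          [("we", 0.9)],
    ("pojdem",):      [("will go", 0.5), ("go", 0.3)],
}

def best_translation(source, start=0):
    """Return (log-prob, phrases) of the best monotone segmentation."""
    if start == len(source):
        return 0.0, []
    best = (float("-inf"), None)
    for end in range(start + 1, len(source) + 1):
        src_phrase = tuple(source[start:end])
        for tgt, p in phrase_table.get(src_phrase, []):
            rest_logp, rest = best_translation(source, end)
            if rest is None:        # the remainder could not be segmented
                continue
            cand = (math.log(p) + rest_logp, [tgt] + rest)
            if cand[0] > best[0]:
                best = cand
    return best

logp, phrases = best_translation(["vecherom", "my", "pojdem"])
print(" ".join(phrases))   # -> "in the evening we will go"
```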

Franz Och, one of the authors of the phrase-based approach, went to Google and became a key figure in the Translate group. Although the foundation of the service was laid back when Franz was a researcher at the Information Sciences Institute (and, before that, in graduate school), many of the ideas that took it beyond phrase-based translation came from the engineering work of scaling those ideas to the web. That work produced stunning results in large-scale language models and other areas of NLP. It is important to note that Och is not only a first-class researcher but also, by all accounts, an outstanding hacker (in the good sense of the word). It is this rare combination of skills that carried the project all the way from research to what Google Translate is now.

Defining the task

But it seems to me that building a good model is not even the whole problem. For machine translation or speech recognition, the task is clearly stated and the quality criteria are easy to understand. Many of the NLP technologies that will reach applications in the coming decades are far more vaguely defined. What exactly should ideal work on topic modeling of articles, conversation modeling, or review characterization (the third lab assignment at nlp-class.org) produce? How do you turn that into a mass-market product?

Consider automatic summarization. We would like a product that condenses and structures content. However, for a number of reasons this formulation has to be narrowed to something a model can be built for, structured, and ultimately evaluated. In the summarization literature, for example, the task is usually framed as selecting a subset of sentences from a collection of documents and ordering them. Is that really the problem that needs solving? Is it a good way to summarize a text written in long, complex sentences? And even if the text is summarized well, will these Frankenstein-stitched sentences read naturally?
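A minimal sketch of that extractive formulation, with a deliberately naive scoring rule: greedily pick sentences that cover the document's most frequent words. It shows the shape of the task as the literature defines it, not a serious system, and makes the "Frankenstein sentences" concern easy to see.

```python
import re
from collections import Counter

def extractive_summary(text, max_sentences=2):
    """Summarization as sentence selection: greedily choose sentences
    that cover the document's most frequent, not-yet-covered words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokenize = lambda s: re.findall(r"[a-z]+", s.lower())
    word_freq = Counter(w for s in sentences for w in tokenize(s))
    chosen, covered = [], set()
    for _ in range(min(max_sentences, len(sentences))):
        def gain(s):  # total frequency of words this sentence would add
            return sum(word_freq[w] for w in set(tokenize(s)) - covered)
        best = max((s for s in sentences if s not in chosen), key=gain)
        chosen.append(best)
        covered |= set(tokenize(best))
    # keep the original document order
    return " ".join(s for s in sentences if s in chosen)

doc = ("Machine translation has improved a lot. "
       "Phrase-based models translate whole phrases instead of words. "
       "The weather was nice yesterday.")
print(extractive_summary(doc, max_sentences=2))
```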

Or take review analysis. Do people need a black-and-white "good / bad" rating, or a more nuanced picture (say, "great food, terrible atmosphere")? Do users care about the opinion of each individual reviewer, or about an accurate analysis of the reviews as a whole?

Usually such questions are answered by the bosses, and engineers and researchers are left to implement the answer. The problem is that machine learning rather strictly limits the classes of problems that are solvable algorithmically or technically. In my experience, people who understand how to approach such problems and who also deeply understand the problem domain can propose ideas that simply would not occur to specialists without that understanding. A crude analogy with architecture: you cannot just build a bridge any way you like. Physics and the strength of materials impose severe constraints on the design, so it makes no sense to let people without knowledge of these areas design bridges.

To sum up, if you want to build a really cool machine learning product, you need a team of cool engineers, designers, and researchers - covering everything from basic machine learning theory to systems building, domain knowledge, interaction techniques, and graphic design. Ideally, people who are world-class in one of these areas and conversant in the others. Small, talented teams with the full set of these skills will navigate the uncertain world of product design and promotion well; large companies where R&D and marketing sit in different buildings will not. Cool products will be built by teams in which everyone sees the full context and understands how the pieces fit together.
