How Alice works. A Yandex lecture

    This lecture is the first public discussion of the technology behind Alice, Yandex's voice assistant. Boris Yangel (hr0nix), head of the dialogue systems development group, explains how his team teaches Alice to understand what users want, find answers to the most unexpected questions, and behave decently.


    “I'll tell you what's inside Alice.” Alice is big and has many components, so I'll go over it somewhat superficially.

    Alice is the voice assistant that Yandex launched on October 10, 2017. It lives in the Yandex application on iOS and Android, in the mobile browser, and as a separate application for Windows. You can solve everyday tasks and find information in a dialogue format, talking to it by text or voice. There is also a killer feature that made Alice quite famous on the RuNet: we don't rely only on hand-crafted scenarios. Sometimes, when we don't know what to do, we use all the power of deep learning to generate a reply on Alice's behalf. It turns out pretty funny and let us ride the hype train.

    What does Alice look like at a high level?

    The user says: “Alice, what will the weather be like tomorrow?”

    First, we stream the user's speech to a speech recognition server, which turns it into text. The text then reaches the service my team develops, starting with a component called the intent classifier. This is a machine-learned model whose job is to determine what the user meant by their phrase. In this example, the intent classifier might say: okay, the user probably wants the weather.

    Then, for each intent, there is a dedicated model called the semantic tagger. Its job is to extract the useful bits of information from what the user said. A weather tagger could determine that "tomorrow" is the date for which the user wants the weather. We then turn all these analysis results into a structured representation called a frame. It records that this is the weather intent, that the weather is needed for the current day plus one, and that the location is unknown. All this information goes to the dialog manager module, which also knows the current context of the dialogue, i.e. what has happened up to this moment. It receives the parse of the utterance and must decide what to do with it. For example, it can call an API to get tomorrow's weather in Moscow, because the user's geolocation is Moscow even though they didn't mention it. Then it generates a text describing the weather and sends it to the speech synthesis module, which speaks to the user in Alice's beautiful voice.
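
    To make the frame concrete, here is a minimal sketch of what such a structured representation might look like; the field names are illustrative guesses, not Alice's actual internal format.

        # A hypothetical frame for "what will the weather be like tomorrow?"
        # Field names are invented for illustration.
        frame = {
            "intent": "get_weather",
            "slots": {
                "when": {"type": "date", "value": "+1d"},  # "tomorrow" = today + 1 day
                "where": None,  # not said; the dialog manager may fill it from geolocation
            },
        }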

    The dialog manager. There is no machine learning and no reinforcement learning in it; there are only configs, scripts, and rules. It works predictably, and it's clear how to change it when necessary. If a manager comes and asks for a change, we can make it quickly.

    The dialog manager is built around a concept known to people who work on dialogue systems as form filling. The idea is that with each utterance the user fills out a kind of virtual form, and once all of its required fields are filled, their need can be satisfied. The engine is event-driven: every time the user does something, events fire; you can subscribe to them, write handlers for them in Python, and construct the dialogue logic that way.
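
    Here is a minimal, self-contained sketch of the event-driven form-filling idea; this is an illustration of the concept, not Yandex's actual engine or its API.

        # Tiny event-driven form-filling engine, for illustration only.
        handlers = {}

        def on_event(name):
            """Decorator that subscribes a handler to a named event."""
            def register(fn):
                handlers.setdefault(name, []).append(fn)
                return fn
            return register

        def fire(name, **kwargs):
            for fn in handlers.get(name, []):
                fn(**kwargs)

        @on_event("slot_filled")
        def maybe_answer(form):
            # Once every required field is filled, the need can be satisfied.
            if all(form[slot] is not None for slot in ("when", "where")):
                print("ready to call the weather API:", form)

        form = {"intent": "get_weather", "when": "+1d", "where": None}
        form["where"] = "Moscow"        # e.g. taken from the user's geolocation
        fire("slot_filled", form=form)  # the form is now complete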

    When a scenario needs to generate a phrase (for example, we know the user is asking about the weather and we need to answer about the weather), we have a powerful template language for writing these phrases. This is how it looks.

    It is an add-on over the Python template engine Jinja2, extended with various linguistic tools: for example, the ability to inflect words and to make numerals agree with nouns, so you can easily write coherent text, and the ability to randomize pieces of text to make Alice's speech more varied.
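
    The add-on itself is not public, but a toy version of the idea, using plain Jinja2 with a hand-written agreement filter and a randomization helper, might look like this (the filter below is a crude stub, not the real linguistic machinery):

        # Toy illustration: Jinja2 plus a custom filter that agrees a Russian
        # noun with a numeral, and a helper that randomizes phrasing.
        import random
        from jinja2 import Environment

        def plural_ru(n, forms):
            # forms = ("градус", "градуса", "градусов")
            if n % 10 == 1 and n % 100 != 11:
                return forms[0]
            if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
                return forms[1]
            return forms[2]

        env = Environment()
        env.filters["plural_ru"] = plural_ru
        env.globals["choice"] = random.choice  # vary phrasing between replies

        template = env.from_string(
            "{{ choice(['Завтра', 'Завтра в Москве']) }} "
            "{{ t }} {{ t | plural_ru(('градус', 'градуса', 'градусов')) }}"
        )
        print(template.render(t=23))  # e.g. "Завтра 23 градуса"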

    For the intent classifier we tried many different models, from logistic regression to gradient boosting and recurrent networks. In the end we settled on a nearest-neighbors classifier, because it has a number of good properties that the other models lack.

    For example, you often have to deal with intents for which you have only a handful of examples. Training conventional multiclass classifiers in this regime is hopeless. For instance, it may turn out that all five examples happen to contain the particle "a" or the word "how," which did not appear in the other examples, and the classifier latches onto the simplest solution: it decides that whenever the word "how" occurs, it must be this intent. But that is not what you want. You want semantic closeness between what the user said and the phrases in the training set for that intent.

    So we pre-train a metric on a large dataset that indicates how semantically close two phrases are, and then use this metric to search for the nearest neighbors in our training set.
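
    A minimal sketch of this scheme, assuming we already have a sentence encoder pre-trained so that cosine similarity reflects semantic closeness (the embed function below is a random stand-in for it):

        # Embedding-based nearest-neighbor intent classification, sketched.
        import numpy as np

        def embed(phrase):
            # Stand-in for a metric pre-trained on a large paraphrase dataset;
            # a real system would use the learned sentence encoder here.
            rng = np.random.default_rng(abs(hash(phrase)) % (2**32))
            v = rng.standard_normal(64)
            return v / np.linalg.norm(v)

        train = [
            ("what's the weather today", "get_weather"),
            ("tell me the weather", "get_weather"),
            ("find a pharmacy nearby", "find_org"),
        ]
        train_vecs = np.array([embed(p) for p, _ in train])

        def classify(phrase, k=1):
            sims = train_vecs @ embed(phrase)  # cosine similarity to every example
            best = np.argsort(sims)[::-1][:k]
            return [(train[i][1], float(sims[i])) for i in best]

        # Adding an intent with five examples is just appending five rows;
        # nothing is retrained.
        print(classify("weather for tomorrow"))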

    Another good quality of this model is that it can be updated quickly. When you have new phrases and want to see how Alice's behavior changes, all you have to do is add them to the set of examples for the nearest-neighbors classifier; you don't need to retrain the whole model. For our recurrent model, retraining took several hours, and waiting several hours after every change to see the result is not very convenient.

    The semantic tagger. We tried conditional random fields and recurrent networks. Networks, of course, work much better; that's no secret. We don't have any unique architectures there: ordinary bidirectional LSTMs with attention, which is roughly the state of the art for the tagging task. Everyone does it this way, and so do we.
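
    A skeleton of such a tagger in PyTorch, roughly the standard architecture the talk describes; the exact attention mechanism and sizes here are guesses:

        # Standard BiLSTM slot tagger with attention (BIO-style tags);
        # not Alice's exact architecture.
        import torch
        import torch.nn as nn

        class BiLSTMTagger(nn.Module):
            def __init__(self, vocab_size, n_tags, emb=128, hidden=256):
                super().__init__()
                self.emb = nn.Embedding(vocab_size, emb)
                self.lstm = nn.LSTM(emb, hidden, bidirectional=True, batch_first=True)
                self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
                self.out = nn.Linear(2 * hidden, n_tags)  # per-token tag scores

            def forward(self, tokens):              # tokens: (batch, seq)
                x, _ = self.lstm(self.emb(tokens))  # (batch, seq, 2 * hidden)
                x, _ = self.attn(x, x, x)           # self-attention over the sequence
                return self.out(x)                  # (batch, seq, n_tags)

        tagger = BiLSTMTagger(vocab_size=10000, n_tags=5)
        scores = tagger(torch.randint(0, 10000, (1, 6)))  # one 6-token utterance
        print(scores.shape)  # torch.Size([1, 6, 5]); beam-decode these into N-best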

    The one distinctive thing is that we actively use N-best hypotheses: we don't generate only the single most probable hypothesis, because sometimes the most probable one is not what we need. For example, we often re-weight hypotheses depending on the current state of the dialog in the dialog manager.

    If we know that in the previous step we asked a question about something, and there is one hypothesis where the tagger found something and another where it did not, then, all other things being equal, the first is probably more likely. Tricks like this let us improve quality a bit.

    Also, a machine-learned tagger sometimes makes mistakes, and the slot values are not always found in the most plausible hypothesis. In that case we search the N-best list for the hypothesis that is most consistent with what we know about the slot types; this also lets us gain some quality.
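
    A sketch of this kind of re-ranking over the tagger's N-best list; the scoring scheme and weights below are invented for illustration:

        # Re-rank N-best tagger hypotheses using dialog state and slot types.
        def rerank(hypotheses, dialog_state, slot_is_valid):
            def score(h):
                s = h["model_prob"]
                # If we just asked about a slot, prefer hypotheses that fill it.
                asked = dialog_state.get("awaiting_slot")
                if asked and h["slots"].get(asked):
                    s *= 2.0
                # Penalize hypotheses whose slot values fail type checks.
                if not all(slot_is_valid(k, v) for k, v in h["slots"].items()):
                    s *= 0.1
                return s
            return max(hypotheses, key=score)

        hyps = [
            {"model_prob": 0.6, "slots": {}},
            {"model_prob": 0.4, "slots": {"when": "on the weekend"}},
        ]
        best = rerank(hyps, {"awaiting_slot": "when"}, lambda k, v: True)
        print(best)  # the less probable hypothesis wins because it fills the slot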

    Dialogues also exhibit a phenomenon called anaphora. This is when you use a pronoun to refer to an object mentioned earlier in the dialogue. Say you ask about "the height of Everest," and then "what country is it in." We can resolve anaphora, and we have two systems for it.

    One is a general-purpose system that can run on any utterance. It works on top of syntactic parses of all the user's utterances. If we see a pronoun in the current utterance, we look for known noun phrases in what was said earlier, compute a score for each of them estimating whether it can be substituted for this pronoun, and choose the best one if we can.
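
    In pseudocode the general-purpose scheme could look like this; the scorer is a stand-in for the machine-learned model that judges substitutions, and the string handling is deliberately naive:

        # Toy version: gather noun phrases from history, score each as a
        # substitute for the pronoun, keep the best if it is convincing.
        def resolve_pronoun(history_nps, utterance, score_substitution, threshold=0.5):
            if "it" not in utterance.split():
                return utterance
            scored = [(np_, score_substitution(utterance, np_)) for np_ in history_nps]
            best, best_score = max(scored, key=lambda c: c[1])
            if best_score < threshold:
                return utterance  # no convincing antecedent; leave as is
            return utterance.replace("it", best)

        history = ["the height of Everest", "Everest"]
        print(resolve_pronoun(history, "what country is it in",
                              lambda u, np_: 0.9 if np_ == "Everest" else 0.2))
        # -> "what country is Everest in"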

    We also have an anaphora resolution system built on top of form filling. It works roughly like this: if the form of the previous intent contained a geo object, the current intent's form has an unfilled slot for a geo object, and we reached the current intent via a phrase with the pronoun "there," then we can probably import the geo object from the previous form and substitute it here. It is a simple heuristic, but it makes a good impression and works well. For some intents only one of the systems runs, for some both. We look at where each works and where it doesn't and configure this flexibly.
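
    The form-based heuristic is easy to show directly; the slot names here are illustrative:

        # "There" imports a geo object from the previous intent's form.
        def import_geo(prev_form, cur_form, utterance):
            if (
                "there" in utterance.split()
                and prev_form.get("where") is not None
                and cur_form.get("where") is None
            ):
                cur_form["where"] = prev_form["where"]
            return cur_form

        prev = {"intent": "get_weather", "where": "Sochi"}
        cur = {"intent": "find_org", "what": "pharmacy", "where": None}
        print(import_geo(prev, cur, "are any pharmacies open there"))
        # -> {'intent': 'find_org', 'what': 'pharmacy', 'where': 'Sochi'}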

    There is also ellipsis. This is when you omit some words in a dialogue because they are implied by the context. For example, you can say "tell me the weather," and then "and on the weekend?", meaning "tell me the weather on the weekend," but you don't want to repeat those words because it is pointless.

    We can handle ellipses too, roughly as follows: elliptical or clarifying phrases are separate intents.

    If there is a get_weather intent with phrases like "tell me the weather" and "what's the weather today," then it gets a paired get_weather_ellipsis intent containing all kinds of weather follow-ups: "and tomorrow?", "and on the weekend?", "and what about Sochi?" and so on. These elliptical intents compete in the intent classifier on equal terms with their parents. If you say "and in Moscow?", the intent classifier might say that with probability 0.5 this is a clarification of the weather intent and with probability 0.5 a clarification of the organization search intent. Then the dialog engine re-weights the scores assigned by the classifier, taking the current dialogue into account: if, for example, it knows the preceding conversation was about the weather, it was hardly a clarification about organization search; more likely it is about the weather.
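
    A sketch of how such re-weighting could work; the penalty factor and the naming convention are made up for illustration:

        # Down-weight elliptical intents whose parent does not match the
        # previous intent, then renormalize.
        def reweight(scores, prev_intent, penalty=3.0):
            out = {}
            for intent, p in scores.items():
                if intent.endswith("_ellipsis"):
                    parent = intent[: -len("_ellipsis")]
                    if parent != prev_intent:
                        p /= penalty  # clarifying an intent we were not in is unlikely
                out[intent] = p
            total = sum(out.values())
            return {i: p / total for i, p in out.items()}

        scores = {"get_weather_ellipsis": 0.5, "find_org_ellipsis": 0.5}
        print(reweight(scores, prev_intent="get_weather"))
        # -> {'get_weather_ellipsis': 0.75, 'find_org_ellipsis': 0.25}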

    This approach makes it possible to learn and detect ellipses without context. You can simply collect examples of elliptical phrases from somewhere, without whatever preceded them. This is quite convenient when you are building new intents that do not yet appear in your service's logs. You can fantasize and invent something, or try to collect long dialogues on a crowdsourcing platform; or you can easily synthesize such elliptical phrases for the first iteration. They will work somehow, and then you collect the logs.

    And here is the pearl of our collection; we call it the talker. This is the neural network that, in any unclear situation, replies with something on Alice's behalf and lets you have often strange and often funny dialogues with her.

    The talker is actually a fallback. In Alice it works like this: if the intent classifier cannot confidently determine what the user wants, another, binary classifier first tries to decide whether this is a search query for which we could find something useful in web search and send the user there. If that classifier says no, it is not a search query but just chatter, then we fall back to the talker. The talker is a system that receives the current context of the dialogue, and its job is to generate the most appropriate reply. Scripted dialogues can also be part of that context: if you talked about the weather and then said something unintelligible, the talker kicks in.
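
    The whole fallback chain fits in a few lines; every component below is a stand-in for the corresponding real model:

        # Route an utterance: scenarios first, then search, then the talker.
        def route(utterance, intent_clf, is_search_query, talker, threshold=0.7):
            intent, confidence = intent_clf(utterance)
            if confidence >= threshold:
                return ("scenario", intent)       # handled by scripted scenarios
            if is_search_query(utterance):
                return ("search", utterance)      # send the user to web search
            return ("talker", talker(utterance))  # chit-chat fallback

        print(route(
            "tell me something funny",
            intent_clf=lambda u: ("unknown", 0.2),
            is_search_query=lambda u: False,
            talker=lambda u: "Once a neural network walked into a bar...",
        ))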

    This lets us do things like the following: you ask about the weather, and then the talker somehow comments on it. When it works, it looks very cool.

    The talker is a DSSM-like neural network with two encoder towers. One encoder encodes the current dialogue context, the other encodes a candidate reply. You get two embedding vectors, one for the reply and one for the context, and the network is trained so that the cosine similarity between them is higher the more appropriate the reply is in this context, and lower the less appropriate it is. This idea has long been known in the literature.
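
    A minimal two-tower model of this kind in PyTorch; the encoders, sizes, and the in-batch softmax loss are common choices from the literature, not necessarily the ones Alice uses:

        # DSSM-like ranker: separate towers for context and reply, trained so
        # that matching pairs have high cosine similarity.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class Tower(nn.Module):
            def __init__(self, vocab=10000, emb=128, out=256):
                super().__init__()
                self.emb = nn.Embedding(vocab, emb)
                self.rnn = nn.GRU(emb, out, batch_first=True)

            def forward(self, tokens):
                _, h = self.rnn(self.emb(tokens))
                return F.normalize(h[-1], dim=-1)  # unit-length embedding

        context_tower, reply_tower = Tower(), Tower()
        ctx = context_tower(torch.randint(0, 10000, (4, 12)))  # batch of contexts
        rep = reply_tower(torch.randint(0, 10000, (4, 8)))     # their true replies
        sims = ctx @ rep.T  # cosine similarities, since embeddings are unit-length
        # In-batch loss: each context's true reply is on the diagonal; the
        # other replies in the batch act as negatives.
        loss = F.cross_entropy(sims / 0.1, torch.arange(4))
        print(float(loss))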

    Why does everything seem to work well for us, seemingly a little better than in the papers?

    There is no silver bullet. There is no single technology that suddenly produces a cool talking neural network. We managed to achieve good quality because we gained a little bit everywhere. We spent a long time tuning the architectures of these encoder towers so that they work as well as possible. It is very important to choose the right sampling scheme for negative examples during training. When you train on dialogue corpora, you only have positive examples: replies that someone actually said in a given context. There are no negatives; they have to be generated from the corpus somehow. There are many different techniques for this, and some work better than others.
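
    Two of the simplest negative-sampling schemes one can apply to a corpus of (context, reply) pairs are sketched below; real systems combine subtler variants, and the choice noticeably affects quality:

        import random

        def random_negatives(pairs, k=1):
            # Scheme 1: pair each context with k replies from other contexts
            # (a real implementation would exclude the true reply).
            replies = [r for _, r in pairs]
            for ctx, _ in pairs:
                for _ in range(k):
                    yield ctx, random.choice(replies), 0  # label 0 = negative

        def shifted_negatives(pairs):
            # Scheme 2: pair each context with the next context's reply.
            for (ctx, _), (_, wrong) in zip(pairs, pairs[1:] + pairs[:1]):
                yield ctx, wrong, 0

        pairs = [("how are you", "fine"), ("what's up", "not much")]
        print(list(shifted_negatives(pairs)))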

    It also matters how you choose the reply from the top candidates. You can pick the most probable reply offered by the model, but that is not always the best thing to do, because during training the model did not take into account all the properties of a good reply that matter from a product point of view.

    It is also very important which datasets you use and how you filter them.

    To accumulate all this quality bit by bit, you must be able to measure everything you do. And here our pride is that we can measure every aspect of the system's quality on our crowdsourcing platform at the push of a button. When we have a new algorithm for generating replies, in a few clicks we can generate the new model's responses on a special test set and measure every aspect of the resulting model's quality in Toloka. The main metric we use is the logical relevance of the reply in its context: don't talk nonsense that has no connection to the context.

    There are also a number of additional metrics we try to optimize: cases when Alice addresses the user with the informal "you," speaks about herself in the masculine gender, or says all sorts of rude, dirty, or stupid things.

    At a high level, that is everything I wanted to tell you. Thanks.
