Neural conversational models: how to teach a neural network small talk. A lecture at Yandex

    A good virtual assistant should not only solve the user's tasks, but also give a reasonable answer to the question “How are you?”. There are a great many utterances with no explicit goal, and preparing a canned answer to each of them is impractical. Neural Conversational Models are a relatively new way to build dialogue systems for free-form conversation. They are based on networks trained on large corpora of dialogues from the Internet. Boris hr0nix Yangel explains what such models are good at and how to build them.


    Under the cut are a transcript of the talk and the main slides.



    Thank you for coming. My name is Borya Yangel; at Yandex I work on applying deep learning to natural-language text and on dialogue systems. Today I want to tell you about Neural Conversational Models. This is a relatively new research area in deep learning whose goal is to build neural networks that can chat with a person on general topics, that is, conduct what might loosely be called small talk. They say “Hello”, discuss how you are doing, the terrible weather, or the movie you watched recently. Today I want to tell you what has already been done in this area, what can be done in practice using its results, and what problems remain to be solved.

    My talk will be organized roughly as follows. First we will talk a little about why it might be useful to teach neural networks small talk, then about what data and neural network architectures we will need for this, and how we will train them to solve the problem. At the end we will talk a bit about how to evaluate the result, that is, about metrics.

    Why teach networks to talk? Someone might think we are trying to build an artificial intelligence that will someday enslave somebody.

    But we do not set ourselves such ambitious goals. Moreover, I strongly doubt that the methods I will talk about today will bring us much closer to creating real artificial intelligence.

    Instead, we aim to make voice and conversational products more interesting. There is a class of products called, say, voice assistants: applications that help you solve everyday tasks in a dialogue format. For example, find out the current weather, call a taxi, or locate the nearest pharmacy. How such products are built you will learn from the second talk, by my colleague Zhenya Volkov; what interests me here is a different point. I would like that in these products, when the user does not need anything at the moment, they could simply chat with the system about something. The product hypothesis is that if users can occasionally chat with our system, and these dialogues are good, interesting, and not repetitive, they will return to the product more often. That is the kind of product I want to build.

    How can such products be made?

    There is the route taken, for example, by the creators of Siri: you can take utterances users often say and prepare hand-crafted answers to many of them. When you say one of these phrases and get an answer written by the editors, everything is great, it looks wonderful, users like it. The problem is that you only need to take one step off this scenario, and you immediately see that Siri is nothing more than a dumb program that can use a phrase in one reply and, in the next reply, say it does not know the meaning of that phrase, which is strange, to say the least.

    Here is an example of a similarly structured dialogue with a bot I built using the methods I will tell you about today. Perhaps it never answers as interestingly and ornately as Siri, but it also never comes across as a really dumb program, and it seems that in some products this may be better. And if you combine this with the approach Siri takes, responding with editorial answers when possible and falling back on such a model otherwise, it seems the result will be even better. Our goal is to build such systems.

    What data do we need? Let me get slightly ahead of myself and first state the problem setting we will work with, because it matters for the discussion. Given the utterances in the dialogue up to the current moment, and possibly some other contextual information about the dialogue (for example, where and when it takes place), we want to predict what the next utterance should be. That is, to predict the response.

    To solve this problem with deep learning, it would be good to have a corpus of dialogues. The corpus had better be big, because you probably know yourselves how well deep learning works on small text corpora. It would be good if the dialogues were on the topics we need: if we want a bot that discusses your feelings with you or talks about the weather, such dialogues must be present in the corpus. Therefore a corpus of dialogues with an Internet provider's support service is hardly suitable for our problem.

    It would also be good to know the author of each utterance in the corpus, at least as a unique identifier. This helps us model the fact that different speakers use different vocabulary and even have different properties: they have different names, live in different places, and answer the same questions differently. Accordingly, any metadata about the speakers (gender, age, place of residence and so on) will help us model their features even better.

    Finally, some metadata about the dialogues themselves, such as time or place if these are real-world dialogues, is also useful. The point is that two people can conduct completely different dialogues depending on the spatio-temporal context.

    In the literature, that is, in papers about Neural Conversational Models, two datasets are particularly well loved.

    The first is OpenSubtitles: simply subtitles from a huge number of American films and TV series. What are the advantages of this dataset? It has a lot of everyday dialogues, exactly the ones we need, because in films and series people often say to each other “Hello! How are you?” and discuss everyday matters. But since these are films and series, the drawback of the dataset lies here too. There is a lot of fiction and fantasy that has to be carefully cleaned out, and a lot of rather peculiar dialogue. I remember the first model we trained on OpenSubtitles kept talking about vampires for some reason, whether the occasion called for it or not. To the question “Where are you from?” it would sometimes answer “I'm from the FBI, dammit.” It seems not everyone would want their conversational product to behave this way.

    This is not the only problem with the subtitles dataset. How was it built? I hope many of you know what srt files are. In fact, the authors of the dataset simply took the srt files of these films and series and dumped all the utterances from them into one huge text file. Generally speaking, an srt file says nothing about who utters which line or where one dialogue ends and another begins. You can use various heuristics: for example, assume that two consecutive utterances are always said by different speakers, or that if more than 10 seconds pass between utterances, they belong to different dialogues. But such assumptions hold in about 70% of cases, and this introduces a lot of noise into the dataset.

    There are works whose authors try, for example by relying on the speakers' vocabulary, to segment all the utterances in the subtitles: who says what, and where one dialogue ends and another begins. But no really good results have been achieved so far. It seems that using additional information, such as the video or the audio track, could do better, but I don't know of any such work yet.

    What's the moral? You need to be careful with subtitles. You can probably pre-train models on them, but given all these drawbacks I don't advise training on them to the very end.

    The next dataset, very popular in the scientific literature, is Twitter. On Twitter, for every tweet it is known whether it is a root tweet or a reply to some other tweet (root in the sense that it was not written as a reply). This gives us an exact breakdown into dialogues. The tweets form a tree in which every path from the root, that is, from the root tweet to a leaf, is a dialogue, often quite a meaningful one. The author and time of each utterance is known on Twitter, and you can get additional information about the users: some things are written right in the Twitter user profile, and you can match a profile with profiles on other social networks and learn something more.

    What are the downsides of Twitter? First of all, it is obviously biased toward posting and discussing links. It turns out, though, that if you remove all the dialogues whose root tweet contains a link, the rest often (not always, but often) resembles the very small talk we are trying to model. However, it also turns out that small-talk dialogues, at least on Russian Twitter (I can't vouch for the English one), are conducted mainly by schoolchildren.

    We found this out as follows. We trained our first model on Twitter and asked it some simple questions like “Where are you?” and “How old are you?”.

    In general, the only printable answer to “Where are you?” was “At school”, and the rest differed from one another only in punctuation. And the answer to “How old are you?” finally put everything in its place. So what's the moral here? If you want to train dialogue systems on this dataset, the schoolchildren problem has to be solved somehow. For example, you need to filter the dataset. Your model will talk like some subset of the speakers, so you should keep only the subset you want, or use one of the speaker clustering methods that I will talk about a bit later.

    These two datasets are the favorites of the scientific literature. And if you are going to do something in practice, you are limited mostly by your imagination and by the name of the company you work for. For example, if you are Facebook, you are lucky to have your own messenger with a huge number of dialogues on exactly the topics that interest us. If you are not Facebook, you still have options. For example, you can pull data from public chats in Telegram, in Slack, in some IRC channels, you can parse forums, scrape comments on social networks. You can download movie scripts, which actually follow a format that can in principle be parsed automatically, so you can even tell where one scene ends and another begins, and who the author of each line is.

    We have talked about the data. Now let's move on to the most important part: what neural networks do we need to train on this data so that we get something that can talk? Let me restate the problem. From the utterances said in the dialogue so far, we want to predict what the next utterance should be. All approaches to this problem can be roughly divided into two kinds; I call them “generative” and “ranking”. In the generative approach we model the conditional distribution over responses given a fixed context. Having such a distribution, we answer by taking its mode, say, or by simply sampling from it. The ranking approach is to train some relevance function of a response given a context, which does not necessarily have a probabilistic nature (though in principle the conditional distribution from the generative approach can serve as this relevance function too). We then take some pool of candidate responses and use our relevance function to pick the best response for the given context.

    Let's first talk about the first approach, the generative one.

    Here we need to know what recurrent networks are. I honestly hope that if you came to a talk with neural networks in the title, you know what recurrent networks are, because you are unlikely to understand them from my hurried one-minute explanation. But the rules are such that I have to talk about them.

    So, a recurrent network is a neural network architecture for working with sequences of arbitrary length. It works as follows.

    A recurrent network has an internal state, which it updates as it passes over all the elements of the sequence; conventionally, we can assume it goes from left to right. Optionally, at each step the recurrent network can also produce an output, which goes further into your multilayer neural network. In the classic networks called vanilla RNNs, the state update function is simply a nonlinearity applied to a linear transformation of the input and the previous state, and the output is likewise a nonlinearity applied to a linear transformation of the internal state. Such networks are usually drawn either rolled up like this, or unrolled along the sequence; we will use the second notation from here on.
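    To make the update concrete, here is a minimal NumPy sketch of the vanilla RNN step just described; the sizes and random initialization are purely illustrative.

```python
import numpy as np

def vanilla_rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One vanilla RNN step: state and output are nonlinearities over
    linear transforms of the input and the previous state."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)  # update internal state
    y_t = np.tanh(W_hy @ h_t + b_y)                  # optional per-step output
    return h_t, y_t

# Unrolled over a toy sequence ("passing from left to right"):
input_dim, hidden_dim, output_dim = 8, 16, 4
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
W_hy = rng.normal(size=(output_dim, hidden_dim)) * 0.1
b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # a sequence of length 5
    h, y = vanilla_rnn_step(x_t, h, W_xh, W_hh, W_hy, b_h, b_y)
```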

    In practice, nobody uses these update formulas, because training such networks runs into a lot of unpleasant problems. More advanced architectures are used instead, for example LSTM (Long short-term memory) and GRU (Gated recurrent units). From now on, when we say “recurrent network”, we will mean something more advanced than the vanilla kind.

    The generative approach. Our task of generating a response given the dialogue context can be viewed as a string-to-string generation task. That is, imagine that we take the whole context, all the previously said utterances, and simply concatenate them, separating the utterances of different speakers with a special symbol. We get a string-to-string generation problem, and such problems are well studied in machine learning, in particular in machine translation. The standard architecture in machine translation is the so-called sequence-to-sequence, and the state of the art in machine translation is still a modification of the sequence-to-sequence approach. It was proposed by Sutskever in 2014, and later adapted by his co-authors to our task, Neural Conversational Models.

    What is sequence-to-sequence? It is a recurrent encoder-decoder architecture, that is, two recurrent networks: an encoder and a decoder. The encoder reads the input string and produces some condensed representation of it. This condensed representation is fed to the decoder, which must then either generate the output string or, for any given output string, say what its probability is under the conditional distribution we are trying to model.

    It looks as follows. The yellow network is the encoder. Suppose we have a dialogue of two utterances by two speakers, “Hello” and “Zdarov”, for which we want to generate a response. We separate the speakers' utterances with a special end-of-sentence symbol, eos (in fact it does not always separate sentences, but historically that is what it is called). First we embed each word into some vector space, doing what is called word embedding. Then we feed the vector of each word to the encoder network, and the encoder's last state, after it has processed the final word, becomes our condensed representation of the context, which we pass to the decoder. We can, for example, initialize the decoder's first hidden state with this vector or, alternatively, feed it in at every timestep along with the words. At each step the decoder network generates the next word of the response and receives as input the previous word it generated. This genuinely allows it to model the conditional distribution better. Why? I don't want to go into the details now.
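    To make the scheme concrete, here is a minimal sequence-to-sequence sketch in PyTorch. The vocabulary, layer sizes, and token ids are invented for illustration; this is a toy under those assumptions, not the model from the talk.

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID, EOS = 1000, 64, 128, 0  # eos also separates dialogue turns

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)  # per-step distribution over words

    def forward(self, context_ids, reply_ids):
        # The encoder reads the whole context ("Hello <eos> Zdarov <eos>")...
        _, h = self.encoder(self.embed(context_ids))
        # ...and its last state initializes the decoder's hidden state.
        dec_out, _ = self.decoder(self.embed(reply_ids), h)
        return self.out(dec_out)

model = Seq2Seq()
context = torch.tensor([[5, 9, EOS, 7, 3, EOS]])  # two turns, eos-separated
reply_in = torch.tensor([[EOS, 42, 17]])          # decoder input starts with eos
reply_target = torch.tensor([[42, 17, EOS]])      # shifted by one; ends with eos
logits = model(context, reply_in)
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB),
                                   reply_target.reshape(-1))
```

    Note that the decoder input is the reply shifted right and starting with eos, exactly as described above.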

    The decoder keeps generating until it produces an end-of-sentence token, which means “that's enough”. And at the first step the decoder, as a rule, also receives the end-of-sentence token as input, since it is not obvious what else to feed it there.

    Typically, such architectures are trained by maximum likelihood. That is, we take the conditional probabilities of the known answers given the known contexts in the training set and make those answers as likely as possible: we maximize, say, the logarithm of this probability with respect to the parameters of the neural network. When we then need to generate an utterance, the parameters are already known, because we have trained the model and frozen them, and we simply maximize the conditional distribution over responses, or sample from it. In fact, exact maximization is intractable, so approximate methods have to be used. For example, there is a method for approximate stochastic search of the maximum in such encoder-decoder architectures, called beam search. What it is, I won't have time to explain either, but it is easy to look up.
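    For the curious, here is one way beam search can be sketched in plain Python. The `step_logprobs` callback standing in for your trained decoder is an assumed interface, invented for this illustration.

```python
def beam_search(step_logprobs, beam_size=4, eos=0, max_len=20):
    """Approximate argmax decoding for an encoder-decoder model.
    `step_logprobs(prefix)` must return (token, logprob) pairs for the
    next token given the current prefix -- an assumption about your
    model's interface, not a fixed API."""
    beams = [([], 0.0)]  # (token prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, lp in step_logprobs(prefix):
                candidates.append((prefix + [token], score + lp))
        # Keep only the best `beam_size` partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])
```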

    All the modifications of this architecture invented for machine translation can be tried for Neural Conversational Models. For example, the encoder and decoder are usually multilayer; they work better than a single-layer architecture. As I said, they are most likely LSTM or GRU networks rather than vanilla RNNs.

    The encoder is usually bidirectional. That is, in fact it is two recurrent networks that traverse the sequence from left to right and from right to left. Practice shows that if you go in one direction only, by the time the network reaches the end it has already forgotten what came first. If you go from both sides, at every moment you have information from the left and from the right, and it works better.

    Then there is a trick from machine translation, a technique called attention. The idea is roughly as follows. Every time your decoder generates the next word, it can additionally look at all the words, or rather the hidden representations at every timestep of the encoder, and weigh them according to what it currently needs. For example, to generate the next word you may need to find a certain preposition in the input sequence, or figure out what named entity was mentioned there. The attention mechanism helps with this, and it helps somewhat in Neural Conversational Models too, but actually much less than in machine translation. It seems this is because in machine translation, to translate the next word you usually need to look at one word in the source sequence, whereas when generating a response you need to look at many words. Perhaps techniques like those used in memory networks, something like multi-hop attention, would work better here.
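    As an illustration, here is one common flavor of attention (additive, Bahdanau-style) sketched in PyTorch. The exact scoring function varies between papers, so treat the parameter names and shapes as assumptions.

```python
import torch
import torch.nn.functional as F

def additive_attention(decoder_state, encoder_states, W_d, W_e, v):
    """decoder_state:  (hid,)         current decoder hidden state
    encoder_states: (src_len, hid)    per-timestep encoder representations"""
    # Score each encoder timestep against the current decoder state.
    scores = torch.tanh(encoder_states @ W_e + decoder_state @ W_d) @ v
    weights = F.softmax(scores, dim=0)   # how much to "look at" each word
    return weights @ encoder_states      # weighted summary fed to the decoder

hid, att, src_len = 128, 64, 6
enc, dec = torch.randn(src_len, hid), torch.randn(hid)
W_d, W_e, v = torch.randn(hid, att), torch.randn(hid, att), torch.randn(att)
summary = additive_attention(dec, enc, W_d, W_e, v)  # shape: (hid,)
```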

    In fact, what I have told you so far is already enough to build some kind of Neural Conversational Model, provided you have data. And it will talk, after a fashion. I can't say it will be outright terrible, but if you chat with it, you will inevitably run into a number of problems.

    The first problem you will see is the problem of overly “generic” responses. This is a known issue of encoder-decoder sequence-to-sequence models: they tend to generate very generic short phrases that fit a large number of contexts, such as “I don't know”, “Okay”, “I can't say”, and so on. Why does this happen? You can, for example, read a paper where the authors tried to formalize this phenomenon and showed that it inevitably arises in such architectures.

    In the Neural Conversational Models literature a number of solutions, or I would rather say “crutches”, have been proposed for this problem. All of them keep training the model in the likelihood-maximization mode, but at generation time maximize not the likelihood itself but some modified objective that contains it.

    The first idea to appear in the literature was to maximize, instead of the likelihood, the mutual information between the response and the context.

    What does this mean in practice? On the slide was the expression we maximized over the response. Now let's subtract from it another term: a coefficient multiplied by the log prior probability of the response. This is in fact a kind of generalization of the mutual information between the response and the context. When the coefficient equals one, we get exactly mutual information; when it is zero, we recover the original objective; and intermediate values let you tune something in your method.
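    As a sketch in standard notation (my reconstruction, not the exact slide): with $C$ the context, $R$ the response, and $\lambda$ the coefficient, the modified objective is

$$\hat{R} = \arg\max_R \left[ \log p(R \mid C) - \lambda \log p(R) \right],$$

    so $\lambda = 1$ gives exactly the pointwise mutual information between response and context, and $\lambda = 0$ recovers plain likelihood.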

    What is the meaning of this expression? Maximizing it, we are now looking not just for an appropriate response given the context, but also penalizing responses with a high prior probability. Roughly speaking, we penalize responses that occur frequently in the training corpus and can be said apropos of anything: all those “Hello”, “How are you” and so on.

    To use this method, you now need not only to train the sequence-to-sequence model, which gives the first probability, but also to train some language model on all the responses to obtain the second one. So there is a minus: you need two models.

    There is an alternative: rewrite this objective, or rather write another one that equals it up to a constant but factorizes a little differently. It still contains our conditional probability of the response given the context, and in addition the probability of the context given the response. This can be interpreted as follows. We want not only responses relevant to the given context, but also responses from which the original context is easy to restore. That is, if the answer is too generic, “Okay” or “I don't know”, it is entirely unclear in what context it was said, and we want to penalize such answers. To use this technique, you need both a sequence-to-sequence model that generates the response given the context and a sequence-to-sequence model that generates the context given the response.
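    A hedged reconstruction of this second objective (it matches what Li et al. call MMI-bidi, and is equal to the previous one up to a constant after reparametrizing $\lambda$):

$$\hat{R} = \arg\max_R \left[ (1 - \lambda) \log p(R \mid C) + \lambda \log p(C \mid R) \right].$$

    The equivalence follows from Bayes' rule: $\log p(R) = \log p(R \mid C) + \log p(C) - \log p(C \mid R)$, and $\log p(C)$ does not depend on $R$, which is where the “up to a constant” comes from.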

    Quite recently, in a paper submitted to ICLR, a method was proposed that needs only one model. The idea is this. When generating an utterance, we randomly select some contexts from our pool, for example from the training set. Then the objective changes as follows: we subtract from it the normalized probability of the response given a random context. The idea is about the same as on the previous slide: if our answer fits a significant number of random contexts, that is bad; it means it is too generic. And in fact, looked at formally, this is simply a Monte Carlo estimate of the MMI objective written on the previous slide. But its charm is that you don't need an extra model, and empirically it somehow works even better than honest MMI.
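    A sketch of how such rescoring might look in code, under the assumption that your model exposes a `logprob(reply, context)` scoring function; that interface, like the parameter names, is invented for illustration.

```python
import random

def rescore(candidates, context, logprob, train_contexts, lam=0.5, k=16):
    """Rerank beam-search candidates, penalizing replies that fit many
    random contexts (a sketch of the Monte Carlo MMI idea above)."""
    random_contexts = random.sample(train_contexts, k)

    def score(reply):
        # Average log-probability under random contexts estimates log p(R).
        penalty = sum(logprob(reply, c) for c in random_contexts) / k
        return logprob(reply, context) - lam * penalty

    return max(candidates, key=score)
```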

    For example, honest MMI has the unpleasant property that this term penalizes not only overly generic answers but also grammatically correct ones, because grammatically correct answers are more probable than ungrammatical ones. As a result, if you are careless when tuning the coefficient, the network starts to talk completely incoherently. This is bad.

    The next problem you will encounter is the problem of response consistency. It consists in the following: the network gives different answers to the same question phrased differently or asked in different contexts. Why? Because the network was trained on the entire dataset in the likelihood-maximization mode, that is, it learned to answer correctly on average over the dataset. If some answer occurs often in the dataset, it is acceptable to give it. The network has no notion of its own personality or of the need for all its answers to be coherent.

    If your entire dataset consisted of answers from a single speaker, this would cause no problems, but you are unlikely to have a dataset with millions or tens of millions of such answers. Therefore the problem has to be solved somehow.

    Here is one of the solutions proposed in the literature, in the paper “A Persona-Based Neural Conversation Model”: let us additionally associate with each speaker a vector in some latent speaker space. Just as we embed words into a latent space, we feed this vector to the decoder in the hope that during training it will accumulate the information needed to generate answers on behalf of this speaker. Roughly speaking, it will record their gender, age, some lexical quirks, and so on. Then we have a tool to control the model's behavior: we can tweak the components of this vector and perhaps obtain the desired behavior from the network.

    In the sequence-to-sequence architecture it looks something like this: everything is as before, except one more vector is added and fed to the decoder at every timestep, for example concatenated with the word's embedding vector.
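    A minimal PyTorch sketch of this idea, with invented sizes: the trainable speaker vector is simply concatenated with the word embedding at every decoder timestep.

```python
import torch
import torch.nn as nn

VOCAB, EMB, PERSONA, HID, N_SPEAKERS = 1000, 64, 32, 128, 500

word_emb = nn.Embedding(VOCAB, EMB)
speaker_emb = nn.Embedding(N_SPEAKERS, PERSONA)  # the latent speaker space
decoder = nn.GRU(EMB + PERSONA, HID, batch_first=True)

reply_ids = torch.tensor([[0, 42, 17]])          # decoder input tokens
speaker_id = torch.tensor([3])                   # who is "speaking"

words = word_emb(reply_ids)                                 # (1, T, EMB)
persona = speaker_emb(speaker_id)[:, None, :].expand(-1, words.size(1), -1)
dec_out, _ = decoder(torch.cat([words, persona], dim=-1))   # (1, T, HID)
```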

    Models with latent variables usually have a problem: the fact that we want certain information about the speaker to end up in this vector does not mean it actually will during training. In general, the neural network is free to use the vector however it likes; its ideas do not have to coincide with ours. But if you train such a model and then plot, say, this speaker space on a plane using the t-SNE algorithm or something similar and look for structure in it, it turns out there is some.

    For example, you can plot this space and mark the speakers' ages on it. Here the bright points are, roughly speaking, schoolchildren, and the red points are people over 30, if I'm not mistaken. You can see the space is layered, with schoolchildren mostly at the top. Then come the students, then young professionals and, finally, people who are 30 or older. In other words, there is some structure. Good.

    You can go further. For a certain number of Twitter users, I checked whether or not they follow some accounts of liberal politicians, and colored this in the same space. Those who follow ended up mostly in the lower right corner of the space. This is more evidence that some structure is present there.

    In the paper itself the authors show a table illustrating that their network has learned to answer questions consistently. It is asked a series of questions about its hometown, where it comes from, from which country, what it did in college, and so on, and it seems to answer consistently. They also give a comparison of log-likelihood for models with and without speaker information, arguing that log-likelihood is better for the models that know about the speaker.

    But they go on to say that their goal was not fully achieved, because it is just as easy to find a dialogue where this property fails: the model seems confident but periodically slips and responds with the dataset average. So the problem is not completely solved and more work is needed. That is all I wanted to say about generative models. Now let's talk a little about ranking.

    Here the idea is this: instead of generating the response from some probability distribution, we rank responses from a certain pool using a relevance function of the response given the context, which we train. What are the advantages of this approach? You have full control over the response pool. You can exclude grammatically incorrect answers or answers with obscene vocabulary, for example; then you will never produce them, taking less risk than with the generative model I spoke about earlier. Training such architectures is orders of magnitude faster, and the problem of generic answers shows up less, since it is largely specific to sequence-to-sequence encoder-decoder architectures.

    And the minus, obviously, is this: the set of utterances you can say is limited, and most likely it will not contain an utterance for every situation. As soon as you need something not entirely trivial, it probably won't be in your pool.

    How are ranking models usually built? Approximately as follows. There are two networks, both of which are here called encoders. One network's task is to produce a condensed vector representation of the context, the other's a vector representation of the response. Then the relevance of the response given the context is computed using some function that compares the two vectors, and in the end we get a number expressing the relevance. This architecture became popular after a 2013 Microsoft Research paper about DSSM, Deep Structured Semantic Models, and has since been adapted more than once for Neural Conversational Models in many different papers.

    The encoder networks can in principle be anything that produces a vector from a set of words. For example, they can be recurrent networks, as in sequence-to-sequence architectures. Or you can go the simpler route: a fully connected network on top of the averaged word embeddings. That also works surprisingly well.

    As a rule, the relevance function of the response given the context is something simple, since we just need to compare two vectors: a dot product or cosine distance, for example.
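    Putting the last few paragraphs together, here is a minimal DSSM-like sketch in PyTorch, using the simple variant (a fully connected layer over averaged word embeddings) and cosine similarity. All sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EMB, HID = 1000, 64, 128

class BagEncoder(nn.Module):
    """Averages word embeddings and maps them to a condensed vector."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.fc = nn.Linear(EMB, HID)

    def forward(self, token_ids):  # (batch, seq_len)
        return torch.tanh(self.fc(self.embed(token_ids).mean(dim=1)))

context_enc, reply_enc = BagEncoder(), BagEncoder()  # two separate encoders

context = torch.randint(0, VOCAB, (2, 6))  # a batch of two contexts
reply = torch.randint(0, VOCAB, (2, 4))    # one candidate reply each
relevance = F.cosine_similarity(context_enc(context), reply_enc(reply))
```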

    How are such models trained? Since they are not generative, positive examples alone are not enough; negatives are needed. If you want something to rank high under your function, you must also say what should rank low.

    Where do negative examples come from? The classical approach is random sampling: you simply take random utterances from your dataset, declare that with high probability they are irrelevant, and rely on this assumption. There is a slightly less trivial approach called hard negative mining. The idea there is this: you sample random utterances, but then choose from among them those on which the model currently makes the biggest mistakes.
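    A sketch of hard negative mining under an assumed `sim(context, reply)` relevance function; the interface and names are invented for illustration.

```python
import random

def hard_negatives(context, replies, sim, n_candidates=100, n_hard=10):
    """Pick random replies, then keep those the current model scores
    highest for this context, i.e. where it is most mistaken."""
    pool = random.sample(replies, n_candidates)  # classic random sampling
    pool.sort(key=lambda r: sim(context, r), reverse=True)
    return pool[:n_hard]                         # the "hardest" negatives
```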

    Not long ago, the Palekh algorithm appeared in Yandex web ranking. It relies on a similar architecture in many ways, and an article on Habrahabr describes how this hard negative mining can work.

    Now you have positive and negative examples. What do you do with all this? You need some loss function. As a rule, it is done very simply: take the outputs of the Sim function, the dot product or cosine distance, run them through a softmax, and obtain a probability distribution over your positive example and the negative examples you generated. Then, as in generative models, simply use the cross-entropy loss, that is, push up the probability of the correct answer relative to the probabilities of the incorrect ones. There are also all kinds of modifications based on triplet loss. These are max-margin-style approaches: you want the relevance of the true answer given the context to exceed the relevance of a random answer given the context by some margin, as in SVMs.
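    Here is what the softmax-plus-cross-entropy variant might look like in PyTorch, assuming the similarity scores have already been computed; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def ranking_loss(sim_positive, sim_negatives):
    """Cross-entropy over a softmax of similarity scores: the correct
    reply (index 0) should outscore the sampled negatives.
    sim_positive:  (batch,)    Sim(context, true reply)
    sim_negatives: (batch, k)  Sim(context, each negative reply)"""
    logits = torch.cat([sim_positive[:, None], sim_negatives], dim=1)
    target = torch.zeros(len(logits), dtype=torch.long)  # positive = column 0
    return F.cross_entropy(logits, target)
```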

    How do we find out which model is better? How is this question usually resolved in machine learning? You have a test set and you compute some metric on it. The problem is that this won't work here, because if your model's answer is not similar to the answer in the test set, that means nothing at all. In other words, even to a trivial “Hello” one can come up with dozens of answers that are appropriate yet share only a couple of letters with the test-set answer, nothing more. So initially everyone tried to apply metrics from machine translation, which somehow compare your answer with what is written in the test set, but all these attempts failed. There is even a paper that measures the correlation between machine-translation metrics, computed on a test set, and the relevance of answers as perceived by people. It turns out there is practically no correlation. So this method is better avoided.

    What should we use then? The state-of-the-art approach, if I may call it that, is crowdsourcing: take, say, Mechanical Turk and ask the turkers “Is this answer appropriate in this context? Rate it on a scale of 0 to 5”, or “Which of these answers is more appropriate in this context?”. If you look at the literature, ultimately everyone has to compare models that way.

    So which is better, generative or ranking models? We took a sequence-to-sequence model trained on Twitter and a ranking DSSM-like model, and on our crowdsourcing platform asked workers to rate the relevance of each answer in its context with one of three labels: bad, neutral, or good. Bad means the answer is syntactically incorrect, completely inappropriate, or contains obscene language. Neutral means it is appropriate but generic and uninteresting. And good means it is a syntactically correct and relevant answer. We also asked people to produce some answers themselves, so that we had some kind of baseline to strive for. Here are the numbers.

    Interestingly, people get 10% bad answers. Why does this happen? It turns out that in most cases people were trying to joke, and the workers on the crowdsourcing platform did not get the joke. I recall there was a question in the pool like “What is the main answer to everything?”. The answer was “42”, and apparently nobody understood what it meant: 9 out of 10 rated it bad.

    What can we see here? Obviously, the models are still far from human level. Ranking models work better, if only because their pool contains many more interesting answers and it is easier to produce a good answer with such a model. Sequence-to-sequence models work worse, but not that much worse. And, as you remember, they can produce an answer in any situation, so perhaps sequence-to-sequence models should be used, or sequence-to-sequence and ranking models should be combined in some kind of ensemble.

    In conclusion, let me repeat the main points of my talk. In the past couple of years, Neural Conversational Models has become a genuinely hot research area in deep learning. Many people work on it, including large companies: Facebook, Google. A lot of interesting things are happening there. In principle, some of its fruits can be used already, but you should not expect to get artificial intelligence right away: there are very, very many problems left to solve. And if you have substantial experience with natural-language text, with dialogue systems, or with deep learning in this area, we most likely have something to offer you.

    If you are interested, you can, for example, write to me. That's all. Thanks.
