Developing a chatbot with a given personality. A lecture at Yandex

    An important feature of machine learning tasks is that equally good results can be achieved by different methods. This is what makes ML competitions exciting: even if your competencies differ from those of an obviously strong opponent, you can still win. The Tensorborne and Neurobotics teams had almost equal chances of winning the DeepHack hackathon and ended up taking the first two places. At a Yandex training session, representatives of both teams gave one joint, in-depth talk. The transcript below contains a detailed analysis of their solutions and advice for novice contestants.


    Vitaly Davydov:
    - Hello, everyone! We were supposed to give two talks, but we decided to combine them into one big one, since we are talking about first and second place in the DeepHack competition. We represent two teams: our team Tensorborne took second place, and Grigory's team Neurobotics took first.

    The talk will consist of three main parts. In the intro I will cover the background of DeepHack: what it is, what the metrics were, and so on. Then the guys will talk about the solutions, the problems they ran into, examples, etc.

    Before talking about DeepHack, it should be noted that it is a small subset of another very large global competition, ConvAI2, which Facebook launched last year. This year is its second iteration. At some point Facebook partnered with the Moscow Institute of Physics and Technology (MIPT), and the DeepHack competition was created on the basis of an MIPT laboratory.

    A bit more about ConvAI itself. What problem is it trying to solve? It focuses on conversational systems. The problem with dialogue systems is that there is no single evaluation tool for measuring the quality of dialogues. Evaluation is highly subjective: one person may like a conversation, another may not. ConvAI's overall global goal is to come up with a single common metric for evaluating dialogues, which does not yet exist. The prize is 20 thousand dollars in AWS Mechanical Turk credits. These are not general Amazon credits, only Mechanical Turk credits. Mechanical Turk is essentially an analogue of Yandex.Toloka, a crowdsourcing service for data labeling.

    The task posed in ConvAI is to build a chit-chat bot capable of holding some kind of dialogue. Three metrics were chosen: perplexity, Hits@1 and F1. Later I will show the leaderboard as it stood at the time of our submission.
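As a rough illustration of the last two metrics (this is not the official ParlAI evaluation code), Hits@1 checks whether the model ranks the gold answer first among candidate replies, and F1 measures word-level overlap between a generated reply and the reference:

```python
def hits_at_1(ranked_candidates, gold):
    # 1 if the model's top-ranked candidate is the gold answer, else 0
    return 1.0 if ranked_candidates[0] == gold else 0.0

def f1_score(prediction, reference):
    # Word-level F1 between a generated reply and the reference reply
    pred, ref = prediction.split(), reference.split()
    common = 0
    ref_left = list(ref)
    for w in pred:
        if w in ref_left:
            common += 1
            ref_left.remove(w)
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(hits_at_1(["i love dogs", "hello"], "i love dogs"))  # 1.0
print(round(f1_score("i like cats", "i like dogs"), 2))    # 0.67
```

Perplexity, the third metric, is a property of the language model's probabilities rather than of a single reply, so it is not shown here.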

    The evaluation was carried out in three stages: first automatic metrics, then assessment on AWS Mechanical Turk, and finally live chats with volunteers.

    Since ConvAI is sponsored by Facebook, Facebook is actively promoting its ParlAI library for building dialogue systems. It is rather complicated, but I think all the participants used it. It took us quite a while to figure out; it is not compatible with Python 3.6, for example, and there are a number of other problems with it.

    In these few lines you can see what positions we held at the time of submission. In general, ConvAI is oddly organized in the sense that there are three metrics and it is not very clear how teams are ranked in this table. You can see that some teams are higher on some metrics and lower on others. The organization of ConvAI as a whole was a bit strange.

    But there are three basic baselines. To be selected for DeepHack, you had to beat these baselines, and the top 10 teams would reach the final. I'll let you in on a secret: only 8 teams submitted solutions, and all of them reached the final. It was not very difficult.

    The DeepHack task was a bit clearer and more straightforward. We again had to build a chatterbot, but one that would emulate a given personality. That is, the bot received a description of a person as input, and during a conversation it had to reveal that personality. The prize was quite interesting: a fully sponsored trip to NIPS this fall.

    The metric, unlike in ConvAI, was different. There were two metrics, and the total metric is a weighted combination of the two. The first metric is overall quality: an assessment of how adequately the bot responded, how interesting it was to talk to, whether it wrote nonsense, and so on. The second metric is role-playing, either 0 or 1: whether the bot matched the description it was given. The person chatting with the bot does not see that description. Evaluation took place in Telegram: there was a single Telegram bot, and when a user started chatting with it, they were connected to a random bot from all the submissions, to keep things fair. Yandex and MIPT apparently directed some traffic there, and there were around 10 thousand conversations, as far as I remember.

    I have already covered the qualifying round. The final was held on-site: seven days of work at MIPT. We were given a cluster and a workspace, and we sat there and worked. Evaluation effectively happened every day, and the bot's final score was calculated as follows. The competition started on Monday, the first submission was on Tuesday, and evaluation took place the next day. A solution sent on Tuesday was evaluated on Wednesday with a weight of 1.5, a solution sent on Wednesday with a weight of 1.4, and so on.
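A minimal sketch of this discounted scoring scheme (the daily scores below are made-up numbers for illustration, not actual hackathon data):

```python
def final_score(daily_scores, start_weight=1.5, step=0.1):
    """Weighted sum of daily evaluations: the first evaluated day gets
    start_weight, each following day 0.1 less (1.5, 1.4, 1.3, ...)."""
    total = 0.0
    for day, score in enumerate(daily_scores):
        total += (start_weight - step * day) * score
    return total

# Hypothetical daily scores for four evaluation days
print(round(final_score([0.5, 0.6, 0.7, 0.8]), 2))  # 3.46
```

Early submissions carry the largest weight, which is why, as described below, every team had to send at least something working from day one.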

    About the dataset that Facebook provided for training. It is called Persona-Chat and consists of descriptions of two personalities and a dialogue between them. There is a description of the first person and of the second, and in the course of the dialogue they try to reveal each other's personality. That is all that was given. However, as usual, using other third-party datasets was not forbidden in the competition.

    An example of our team's dialogue. If you read it carefully, you can see that the resulting bot works quite adequately and responds fairly correctly.

    Grigory will talk about the first place.

    Grigory Rashkov:
    - I would like to tell you about our experience of participating in the competition, our strategy and our decision.

    Firstly, a peculiarity of this competition was its long duration: we had not two days, as in a regular hackathon, but five, during which we could try many solutions.

    Secondly, the assessments were very subjective, because the evaluators were completely different people, each with their own criteria. In particular, the hackathon organizer Mikhail Burtsev said that even if he had guessed which profile the bot was playing, but the bot contradicted that profile at some point, answering a question differently from what the profile said, he would pick a different profile anyway.

    And the third feature was the lack of validation: participants could not make a small change and immediately get feedback.

    As in all horror films, our team decided to split up at the very beginning. The first group worked on our main solution based on Wasserstein GAN; the second worked on the bot and its admin side based on the baseline, because we had to submit something on the first and second days.


    Briefly about the baseline: Seq2Seq plus attention, slightly adapted for this particular task. How exactly? A phrase comes in, embeddings are taken from GloVe, and then the representation of each phrase is computed as a weighted sum of embeddings. The weights are chosen based on inverse word frequency: the rarer a word, the more weight it contributes.


    This is meant to reflect the uniqueness of the profile characteristics. All of this was collected into a set, a matrix; a mask was built from this set and the hidden state; the mask was then applied to the set to obtain a context vector, which was concatenated and, through a non-linearity, fed to the decoder input.
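A rough numpy sketch of such a frequency-weighted phrase embedding (the exact weighting formula, the toy vectors and the word counts are assumptions for illustration; the original used GloVe vectors):

```python
import numpy as np

def phrase_embedding(words, vectors, counts, total, a=1e-3):
    """Weighted average of word vectors; rarer words get larger weight.
    `vectors` maps word -> np.array, `counts` maps word -> corpus count."""
    vecs, weights = [], []
    for w in words:
        if w not in vectors:
            continue
        freq = counts.get(w, 1) / total      # relative word frequency
        weights.append(a / (a + freq))       # inverse-frequency weight
        vecs.append(vectors[w])
    if not vecs:
        return np.zeros(next(iter(vectors.values())).shape)
    return np.average(np.stack(vecs), axis=0, weights=weights)

# Toy 3-d "GloVe" vectors and corpus counts
vectors = {"i": np.array([1.0, 0.0, 0.0]),
           "love": np.array([0.0, 1.0, 0.0]),
           "poetry": np.array([0.0, 0.0, 1.0])}
counts = {"i": 900, "love": 90, "poetry": 10}
emb = phrase_embedding(["i", "love", "poetry"], vectors, counts, total=1000)
print(emb)  # the rare word "poetry" dominates the average
```

The effect is the one described above: frequent filler words contribute little, while rare, profile-specific words dominate the phrase representation.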

    We didn't have our own solution ready on the first day, but we had to send something, so we wrote a baseline agent and set ourselves the task of somehow standing out from the gray mass of agents. We used a simple heuristic: our bot started the dialogue first, and used an emoticon in that first phrase. And it worked.

    Naturally, by the next day all the bots started the dialogue first, and all of them had smileys. On the second day the residents of Villabajo continued working on the GAN, while the residents of Villarriba tried other heuristics.

    As a result, we slightly improved our dialogue-quality score, but were overtaken on the persona score. These are the results of the third day; only two days remained. We understood that we would not have enough time to write a GAN and test it properly: it trains long and hard, and many hyperparameters have to be tuned. So we decided to switch to the baseline, since it worked so well.

    Our task was to improve recognition of the user profile. We proposed the following heuristic. What was the problem? The user happily chatted with the bot, asked about its job, its hobby, what car it drove; the bot responded well to all of this, because the bot responded well in general. But at the end of the dialogue the user saw two profiles, neither of which related to what was in the dialogue, simply because the profiles listed things other than what the user had asked about. So we decided that the bot needed to volunteer information from its profile somehow.

    What is the most natural way to do this? If a person has certain interests, they will probably bring them up themselves and look for common interests. So we decided that the bot would, at certain moments, ask questions of its own based on its profile. An interesting effect emerged: the generator G, written with simple linguistic rules, uses some fact A from the profile, so G(A) enters the dialogue; all of this is fed back to the bot, and the next time the model generates a reply it draws on both the profile and this dialogue, that is, it is more likely to say something related to the profile.

    What did it look like in reality? The bot's profile says it is delighted with poetry; during the conversation it asks whether I like poetry. I say yes, and then its model, not the rule-based generator we built, says that it likes to write poetry. Thus the bot stayed focused on its profile, and it worked.
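A minimal sketch of such a rule-based generator G (the template set and the profile fact format here are invented for illustration, not the team's actual rules):

```python
import random

# Hypothetical templates keyed by the type of profile fact
TEMPLATES = {
    "hobby": "Do you like {}? I really enjoy it.",
    "job": "By the way, I work as a {}. What do you do?",
    "pet": "Do you have pets? I have a {}.",
}

def profile_question(profile, rng=random):
    """Pick a fact A from the profile and render G(A) from a template."""
    key = rng.choice(sorted(profile))
    return TEMPLATES[key].format(profile[key])

profile = {"hobby": "poetry", "job": "teacher", "pet": "dog"}
print(profile_question(profile, random.Random(0)))
```

Once such a question lands in the dialogue history, the generative model conditions on it, which is the feedback effect described in the example above.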

    We were back in first place. One last day remained. We noticed that we were nevertheless losing on dialogue quality.

    We applied several more solutions. First, we used paraphrasing: we analyzed what other people said, because the organizers released this data, and noticed that many users do not communicate with the bot entirely politely.


    An interesting local minimum arises in the bot: it responds agreeably to any insults. To fix this, we took the data from the Kaggle toxic comments competition and wrote a very simple classifier, also an RNN with attention. That data contained the following overlapping classes: insults, threats and so on. We decided not to train a separate model for replying, because although the problem occurred, it was not very frequent. So we just wrote some canned responses for the bot, and everyone was happy.
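A toy sketch of such a gate. The keyword check below is a crude stand-in for the RNN-with-attention classifier the team trained, and the word list and canned replies are invented:

```python
TOXIC_WORDS = {"stupid", "idiot", "hate"}      # stand-in for a trained classifier
CANNED_REPLIES = ["Let's keep it friendly!", "I'd rather talk about hobbies."]

def is_toxic(message):
    # Flag a message if it contains any listed word
    return any(w in message.lower().split() for w in TOXIC_WORDS)

def respond(message, model_reply):
    # Route toxic inputs to a canned reply instead of the generative model,
    # so the bot no longer agrees with insults
    if is_toxic(message):
        return CANNED_REPLIES[0]
    return model_reply

print(respond("you are stupid", "i agree"))            # canned reply, not agreement
print(respond("what is your hobby?", "i love poetry")) # model reply passes through
```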

    In addition, we used paraphrasing to enrich our bot's speech. This too was not very complicated: we replaced words in a phrase with synonyms, checked that the resulting n-grams did not differ much from the original ones, and then chose the highest-probability combination suitable for the phrase.

    As an example: here the bot says it likes to listen to music; its profile says "enjoy", which we replaced with "like to". We are not sure whether the model generated this itself or our paraphraser did, but it passed. Another note: it was impossible to send text straight from the profile. The organizers compared 5-grams: if a 5-gram of your line coincided with your profile, the phrase simply did not pass. Among other things, we also added a dictionary of smileys.
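The anti-copying check could look roughly like this 5-gram overlap filter (a sketch; the organizers' actual implementation is not known):

```python
def ngrams(text, n=5):
    # Set of word n-grams in a sentence
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def leaks_profile(reply, profile_lines, n=5):
    """True if the reply shares any word 5-gram with a profile line,
    i.e. the bot is quoting its profile verbatim."""
    reply_grams = ngrams(reply, n)
    return any(reply_grams & ngrams(line, n) for line in profile_lines)

profile = ["i enjoy listening to classical music every day"]
print(leaks_profile("i enjoy listening to classical music too", profile))  # True
print(leaks_profile("i like to listen to music", profile))                 # False
```

Replacing a single word ("enjoy" with "like to") breaks every shared 5-gram, which is exactly why the paraphrased line passed the filter.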

    The second example: we have lots of smileys. Then there were heuristics where the bot reacted to your behavior, for example if you had not written to it for a long time. The paraphraser worked here too, and it gave a good result.

    Our dialogue quality score was the best, and so was the role-playing score.

    We tried to make the model generate a set of answer options and compare them with the profile. But it seemed to me that the bot worked worse this way, and we could not validate it; we only had subjective assessments from two or three conversations. So we decided not to deploy it, because the profile was already recognized well.

    Then we wrote a solution to the inverse problem: a second model that picked the right profile given a dialogue. We initially planned to use it for training, to take a loss function from it and backpropagate it through the network. But this could have worsened the chatbot itself, so we decided not to deploy it that way. We also thought of using it to drive the bot's behavior, but did not have time to test everything and dropped the idea. In addition, we wanted to choose smileys based on the emotional coloring of a phrase and wrote a model for that, but did not find a suitable dataset, and it did not add much.

    Our team.

    Even if you cannot get your main model, the one you are counting on, written in time, or it gives a bad result, you should not give up right away; it is quite natural to try simpler things first. And second: sometimes it is worth looking at what exactly your model lacks, thinking about the specific problems, decomposing them and solving the specific problem areas, which is what we did. Thanks for your attention.

    Sergey Kolesnikov:
    - My name is Sergey Kolesnikov, and I will present Tensorborne's solution.

    We came up with a beautiful name, went to the competition, and came up with a lot of different ideas so that we could publish two articles afterwards, but we did not win the hackathon. So this talk will be called "How not to win the hackathon but still publish the damn two articles." Academics, sir.

    The features of the competition shaped our motivation. Because the assessment was made daily, submissions had to be made daily as well, and the final reward was determined, as we like it in RL, by discounted summation. All this meant that we had to send at least something every day so that it worked and we received at least some score. In the end it turned into: whether you want to or not, you have to row.

    What did we have? A preview of the whole week.

    Although the hackathon was said to last a week, everything was decided in four days, which seems too little for this task.

    Initially there were five of us, all good academic graduates, more or less, of MIPT, so on Monday we came and made many suggestions about ideas and deep learning models we could try. We did not experiment with GANs, because we had already tried them for text and they do not work, so we took something simpler; besides, there had been very similar contests and we had pretrained models. On Tuesday we even managed to launch something in deep learning; ML was everywhere it could be, and we built great Docker images with GPU support and everything else for TensorFlow and Keras. We should be given a separate medal for that, since it is not as trivial as one would like.

    Tuesday's results were promising, and we decided to slightly improve our ML with small heuristics and the like, and slipped to seventh place. But thanks to our teammates, someone found ElasticSearch and tried it. There was a very awkward moment when ElasticSearch worked great, while the DL and ML models were a little less robust. The end of the competition was nearing, and, as the previous speaker noted, we decided to dig in the direction that works. We took ElasticSearch plus small heuristics, thought it was good enough, and it really was good enough, given that we took second place.

    In more detail: there were actually several DL solutions. The first one was pretty simple. Those who remember: a year or two ago there was the Quora paraphrase detection contest, and this year there was a Yandex competition for Alice about continuing a dialogue and so on. You may notice that the tasks are very close. In the first you had to say whether two phrases were paraphrases; in the second you had to continue a dialogue. We thought: since we are building dialogue systems, let's just continue the dialogue well. And it worked fine; the paraphrase setup from Quora transferred well to our task.

    Basically it all looked like this: we had some kind of encoder; usually we all train our usual RNNs, preferably an LSTM with attention and so on. Then we standardly used either CosineEmbeddingLoss, shown below on the slide, or another embedding loss such as triplet loss, which pulls embeddings of paraphrases, or of answers matching a given dialogue, closer together and pushes non-paraphrases apart. This was the first solution; it was in TensorFlow and Keras, it was ready, we tried it, and it was pretty good.

    Another solution was born during two hackathon evenings. There is a wonderful guy, Jeremy Howard; he promotes DL and ML for everyone, he has two wonderful courses that bring you up to speed on all of this, and he wrote his FastAI library for those courses. It all works on PyTorch and in many respects even rewrites PyTorch, which is one of its downsides. On the plus side: Jeremy had done little NLP before, but this March he and a co-author released an article where they trained an LSTM with all the best practices in the wonderful FastAI, with many of the tricks he promotes in his course, and got SOTA on almost everything.

    Since I am a bit of a PyTorch evangelist, I still managed to pry this model out of FastAI, stuff it into my own PyTorch framework, and even train the whole thing for this task. Basically we had some conversational context; even if it was several sentences, you just concatenate them into one huge sentence. And then we had an answer, some candidate sentence. All of this is fed into the same encoder from FastAI, called the Universal Language Model Encoder, awkward as the name is.

    After that we get the context, which, say, consists of time steps 1 to T, as we like in seq2seq. The encoder then produces T such representations: it encodes each sentence, translating each word into a vector, and then applies a special pooling proposed in FastAI, concat pooling. What is the point? We take the last representation of the sentence, as in ordinary seq2seq without attention. Then we take a max-pool and a mean-pool over the whole sequence of these vectors and get a new vector of three times the dimension, which is the encoding of our whole sentence.
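A numpy sketch of this concat pooling over a sequence of encoder states (the shapes are toy assumptions; FastAI's version concatenates the last state with a max-pool and a mean-pool over time):

```python
import numpy as np

def concat_pooling(hidden_states):
    """hidden_states: (T, d) array of per-timestep encoder outputs.
    Returns a (3*d,) vector: [last state, max over time, mean over time]."""
    last = hidden_states[-1]
    max_pool = hidden_states.max(axis=0)
    mean_pool = hidden_states.mean(axis=0)
    return np.concatenate([last, max_pool, mean_pool])

h = np.array([[0.0, 1.0],
              [2.0, 0.0],
              [1.0, 3.0]])   # T=3 time steps, d=2
print(concat_pooling(h))     # a 6-dimensional sentence encoding
```

The max-pool picks up salient features from anywhere in the sequence, the mean-pool summarizes the whole of it, and the last state keeps the usual seq2seq summary, which is why the concatenation is more robust than any one of them alone.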

    In fact, even setting this hackathon aside, this wonderful pooling works just fine; even in image contests, combining max-pooling and mean-pooling works well. After that, we encode the context and our candidate answer into representations Hc and Ha. These vectors are passed through additional linear models, ordinary feedforward networks, and we obtain embeddings. Once we have the embeddings, we can train any metric-learning objective. In our case we used the simplest option, CosineEmbeddingLoss, which is available in PyTorch.

    Toward the end of the contest I ran a few more small experiments and found that a slightly different loss works even better: contrastive loss, if anyone is interested. It is outside this presentation; we did not have time to use it. In the end we are left with two vectors, they are normalized, and the cosine distance between them is computed.
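To make the objective concrete, here is a small numpy sketch following the definition of PyTorch's CosineEmbeddingLoss (the margin value and the toy vectors are assumptions):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity of two vectors
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_embedding_loss(u, v, y, margin=0.0):
    """y = 1 for a matching (context, answer) pair, y = -1 otherwise.
    Matching pairs are pulled together (loss = 1 - cos),
    non-matching pairs pushed apart (loss = max(0, cos - margin))."""
    c = cosine(u, v)
    return 1.0 - c if y == 1 else max(0.0, c - margin)

ctx = np.array([1.0, 0.0])
good = np.array([1.0, 0.1])
bad = np.array([0.0, 1.0])
print(cosine_embedding_loss(ctx, good, 1))   # small: vectors nearly aligned
print(cosine_embedding_loss(ctx, bad, -1))   # 0.0: already orthogonal
```

Minimizing this over (context, answer) pairs is exactly the "pull paraphrases together, push non-paraphrases apart" behavior described earlier.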

    That was our DL solution. Now for ElasticSearch, heuristics and the rest.

    What turned out to be more robust? It was very disappointing when at some point ElasticSearch worked better. Around Wednesday or Thursday we were still using the Persona dataset; the data format Facebook provides is, let's say, not very convenient, but it can still be parsed, and we translated it into something closer and more understandable to us. We had the conversational context as arrays of sentences, the persona context, again an array of sentences, and the correct answer, because Persona-Chat contains exactly these dialogues, in which wonderful people on Amazon Mechanical Turk held long discussions and tried to figure out who is who.

    All of this can be perfectly shoved into ElasticSearch, which, given the dialogue context and the persona context, tries to return the correct answer, and it did. Since Persona-Chat itself contains about 10 thousand dialogues, if I'm not mistaken, this was basically enough for the hackathon. We thought about how to improve it, of course, but we did not scrape any additional data.
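To illustrate the retrieval idea, here is a toy scorer; plain word overlap stands in for ElasticSearch's full-text ranking, and the data triples are invented:

```python
def score(query_words, doc_words):
    # Crude stand-in for a full-text relevance score: count shared words
    return len(set(query_words) & set(doc_words))

def retrieve_answer(context, persona, dialogues):
    """dialogues: list of (context, persona, answer) training triples.
    Return the answer whose stored context+persona best overlaps the query."""
    query = set(context.split()) | set(persona.split())
    best = max(dialogues,
               key=lambda d: score(query, (d[0] + " " + d[1]).split()))
    return best[2]

data = [
    ("what music do you like", "i enjoy rock", "i listen to rock a lot"),
    ("do you have pets", "i have a dog", "yes a big friendly dog"),
]
print(retrieve_answer("what music do you enjoy", "i love rock", data))
```

The real system indexed the Persona-Chat triples in ElasticSearch and queried it with the current dialogue and persona, but the principle is the same: return the stored human answer from the most similar situation.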

    However, we wanted something more. We had good performance on dialogue quality, but with the persona score things were not as smooth, because our training sample did not, in fact, contain the personas on which the evaluation was conducted. The time for great heuristic solutions had come.

    The first thing we came up with was asking questions, so that our bot was not passive but asked people what they like and drew them into a dialogue. Then people responded more and in effect did more of the work; we had to think less and just respond well occasionally, which it seemed we could do. And it worked great.

    Then we came up with a small, neat heuristic: dialogues following a script. In fact we came up with only one scenario, because actually writing heuristics for dialogues means a lot of if-elses, and nobody wanted to write if-elses; we are academics, after all.

    We also had a small dirty hack. When the last day arrived, we decided to put everything on the line and use what was allowed: revealing our persona in snippets shorter than 5-grams. That way our persona could be guessed easily. Again, we had to use a lot of heuristics so as not to state it outright.

    What are the results? The unpleasant one, for me personally, is that sometimes a non-DL solution works much more robustly, more stably and better than the wonderful DL we tried initially. The thing is, you need time to do many things, to test, and so on, and in our case we had no validation. We usually sent something around two in the morning and hoped it would fly. Unfortunately, there was one slide where our team was at the bottom: that was when it did not fly. We sent it, it crashed, and we lost a day. After that we began to look at DL with a bit of reproach, since ElasticSearch objectively works.

    You can see how the heuristics increased the personality score. A little magic, a few if-elses, and you get an improvement of almost 0.25-0.3, which gives a big boost to the final score. To our sorrow, this worsens the dialogue score, because if-elses always worsen your solution.

    Examples. Regarding the first heuristic: at the end of the first line we blatantly and terribly exploit the organizers' rule, but damn it, it's Friday night, and you sit there packing ElasticSearch, which you first met this week, into Docker. We did everything we could. The next heuristic: everyone remembers the scripted dialogue. We led up to the "Also, try to guess..." line for a long time. It asks, and you say: no, funny you. And remarkably, this combination is so generic that it worked. We felt great, especially at two in the morning when we sent it.

    Also, taking this opportunity, I want to say an on-camera thank-you to Valentin Malykh. One wonderful Thursday at two in the morning he helped us a lot. Seriously, if only all organizers approached competitions so responsibly: it is two in the morning, and the person is not asleep, helping the guys deploy a solution and consulting them. Because everything happened in Docker, there were many nuances when submitting a solution.

    What remained behind the scenes, and what should be done? ConvAI is not over yet, it runs until NIPS, and we can still try to do something more decent.

    First, ElasticSearch works really well. But, like any model, it has hyperparameters. And it seems to me that if we tune them properly, ElasticSearch will be nearly divine. Hopefully still not better than DL.

    Secondly, we should properly try all our DL solutions. Agree: when you have four days to bring something into production, you are not much inclined to tune hyperparameters and so on. You simply take what works, cross your fingers, and send it off.

    And I don't know why we didn't take the baseline. It is a historical injustice, since we got into the competition thanks to the baseline. None of us knows why we did not try it further. We just went: yes, we have pretrained models (indistinct - ed.), let's use them. Many questions.

    Of course, among our proposals, I personally said that I would build RL bandits and an ensemble in two days. Better not to commit to that; it will not work out. All of this can be improved, and the bot really will get much better. And perhaps we should have improved the heuristics: it seems that correctly added heuristics, even the same wonderful smileys and the like, and perhaps correct reactions to all kinds of toxic comments, give another small boost. That seems logical.

    You should go to a hackathon with one goal: to win. In reality our team also included Denis Antyukhov, with whom I participated in DeepHack two years ago. We know NLP, and we came to this DeepHack with the thought, I quote, "to crush the leaderboard." We did not crush it, unfortunately. But the goal seems very right. If you want to participate in contests, hackathons, Kaggle, anything, then go with the aim not just of participating, but of winning and beating everyone.

    The right team is truly the main key to success. Our team of five was the most distributed team of the entire hackathon: we had never met before. Today is the first time, and I finally got my DeepHack T-shirt. Nevertheless, we got a good result. Take a good team. A strong team, where you know each other, can give a huge boost, since you will not spend two days just starting to understand each other.

    Start with baselines! Seriously. You don't even need your pretrained models. Starting with baselines is probably better.

    And of course, take a vacation for the hackathon. Participating in a week-long hackathon while also working is bad. You come at 7 in the evening, work a little, sit down and compile Docker with TensorFlow and Keras so that it all runs on some remote servers you don't even have access to. Somewhere around the second night you reach catharsis, and it works, without Docker, without everything, because you realize it can be done that way too.

    It seems that if you participate in a big competition, you should still allocate a bit more time for it than you can stay awake in a week, and participate. Go and win. Thank you!
