How I helped Alice stop responding to other names. An internship at Yandex

    Hi, Habr. My name is Alexey Rak, and I am a developer of the voice assistant Alice in the Minsk office of Yandex. I got this position after completing a three-month internship on this same team last year. That internship is what I am going to tell you about. If you want to try it yourself, here is a link to the 2019 internship.

    How I got in

    I am a fourth-year student at BSU, and in 2018 I graduated from the School of Data Analysis (ShAD). I have lived in Minsk all along and still do.

    First, like other ShAD graduates, I received a link to the 2018 internship. Within a week of submitting the application form, I had to set aside six hours in a row for an online contest. It contained problems on probability theory, coding, and algorithm design. You could write code in any language you knew. I solved some problems in C++ and some in Python, choosing the language by how convenient it was for the particular problem.

    When you submit a solution, you immediately get a verdict, after which you can rework the problem and resubmit to get a more correct answer. All the problems took me a couple of hours out of the six. Some of them I did not solve on the first attempt.

    A few days later, recruiters contacted me and invited me to the first in-person interview in the Minsk office. It was with Alexei Kolesov, the head of the acoustic models and biometrics team, where I was going to work. The interview consisted of solving problems on paper or on a whiteboard and answering questions on probability theory, algorithms, and machine learning. I think my background in competitive programming would have let me cope with the online contest even if I had not studied at ShAD, but at the interview the ShAD experience really came in handy.

    A few days later the second meeting took place, where I was given two more algorithm problems: a warm-up one and a main one. Each problem went like this: I proposed a solution, answered a few questions about it, and then wrote the code on a piece of paper.

    A few days later I was told that I had been accepted for the internship. It was supposed to last three full months (and in the end it did). A transition to a permanent position was not promised, but I was told that it was possible.

    Beginning of work

    On the first day, after sorting out the organizational matters and getting a laptop, I went to lunch with my colleagues. We talked, then I built the team repository and took on my first task: a simple Python script that runs a ready-made program in several threads and thereby speeds up its execution. While writing the script, I got acquainted with the code review system, in which other people on the team check your code. Knowing that your closest colleagues will have to deal with it first, and other developers later, you try to write more clearly. In competitive programming things are somewhat different: what matters is how fast you write, and you will most likely never need to look at the code again. At Yandex, on the other hand, the code will still have to be read.
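
    A script of that kind might look roughly like the sketch below. It is only an illustration: the names `run_one` and `run_parallel` are made up, and a tiny Python one-liner stands in for the ready-made program, which the article does not name. Threads are enough here because each of them just waits for an external process to finish.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(command):
    """Run one command and return (return code, captured stdout)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout


def run_parallel(commands, max_workers=4):
    """Run the commands concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


# Hypothetical stand-in for the ready-made program: print the square of i.
commands = [[sys.executable, "-c", f"print({i} * {i})"] for i in range(4)]
results = run_parallel(commands)
outputs = [stdout.strip() for _, stdout in results]
print(outputs)
```

    In a real script the command list would come from the command line or from a list of input files, and `max_workers` would be tuned to the machine.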

    During the internship I solved several tasks similar to this script, but most of my time went to a much larger project: a new decoder for Alice’s spotter.

    For Yandex devices and applications where the assistant can be invoked by voice to work the way users expect, we need a good spotter, that is, a voice activation mechanism. Most often, the activation phrase (the one you say to launch Alice) contains the word “Alice” itself.

    The spotter consists of feature extraction (preparing the input features for machine learning), a neural network, and a decoder.

    Previous decoder

    The previous version of the decoder worked by processing probability vectors. There is an acoustic model, a neural network that, for each frame (a fragment of speech 10–20 milliseconds long), returns the probabilities of which sound has just been spoken. Frames can overlap. The decoder kept a matrix of probabilities for the last 100 frames “heard” by the device. Each sound of the word has its own vector of probabilities in this matrix. The algorithm found the element with the highest probability in the vector for the sound A, after which it considered only the part of the matrix to the right of that element. Then the operation was repeated for the sounds L, I, S and A, and each time the matrix was “cut off” at the element found. The A sounds at the beginning and at the end of the word are actually different: the second one is called a schwa, and it resembles A, E and O at the same time.

    If the final probability was greater than a threshold value, the algorithm decided that the word had indeed been spoken and activated Alice for the user.

    Because of this scheme, the assistant sometimes turned on spontaneously, not only when people said “Alice” but also when it heard other words, for example “Alexander”. The sounds in the first part of this word (“Alex”) come in the same order and mostly coincide with the sounds in the word “Alice”. The difference is only in the letters E and K, but E sounds very close to I, and the algorithm ignored the presence of K.
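
    To make the greedy scheme and its blind spot concrete, here is a minimal sketch in Python. The function name `greedy_decode`, the toy probability values, and the choice to combine the per-sound peaks by multiplying them are assumptions made for illustration; the article does not say how the production decoder combined them.

```python
def frame(dominant, high=0.9, low=0.02, num_sounds=5):
    """Toy frame: probabilities of the sounds A, L, I, S, schwa (columns 0..4).

    dominant=None models a sound outside the word, such as K.
    """
    return [high if s == dominant else low for s in range(num_sounds)]


def greedy_decode(frames, num_sounds=5):
    """Pick the best frame for each sound in turn, looking only to the right.

    Returns (score, peaks); the score is the product of the peak
    probabilities, which is an assumption of this sketch.
    """
    start, score, peaks = 0, 1.0, []
    for sound in range(num_sounds):
        if start >= len(frames):
            return 0.0, peaks  # ran out of frames before the word ended
        best = max(range(start, len(frames)), key=lambda t: frames[t][sound])
        score *= frames[best][sound]
        peaks.append(best)
        start = best + 1  # "cut off" the matrix at the element found
    return score, peaks


alice = [frame(s) for s in (0, 1, 2, 3, 4)]
# "Alexander": same sounds in the same order, with a K frame in the middle.
alexander = [frame(s) for s in (0, 1, 2, None, 3, 4)]

alice_score, alice_peaks = greedy_decode(alice)
alexander_score, alexander_peaks = greedy_decode(alexander)
print(alice_score, alexander_score)
```

    Both inputs get the same score: the extra frame is simply skipped, because nothing in the scheme penalizes frames that lie between the peaks.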

    In theory, you could search the speech not only for the word “Alice” but also for similar-sounding words. There are not that many of them: “Alexander”, “Alex”, “Aristarkh”, and the Russian words for “arrested” and “ladder”. If the algorithm decided that the user had most likely said one of them, activation could be blocked regardless of the result of the main decoder.

    However, voice activation must work even without the Internet, so the decoder is a local mechanism. It relies on a neural network that runs directly on the user's device (for example, a phone) every time, without contacting Yandex servers. And since everything happens locally, the performance (of that phone compared with a whole data center) leaves much to be desired. Recognizing more than just the word “Alice” would significantly complicate the work of this small neural network and push it beyond the performance limits. Activation would become slower, and the assistant would respond with a long delay.

    We needed a completely different decoder. My colleagues suggested that I implement one based on a hidden Markov model (HMM): by the time my internship began, this approach was already well described by the community and was used in Amazon's Alexa assistant.

    New HMM Decoder

    The HMM decoder builds a graph of 6 vertices: one for each sound in the word “Alice”, plus one more for all other sounds (other speech or noise). The transition probabilities between vertices are estimated on a sample of recorded and annotated speech. For each sound heard, there are 6 probabilities: of a match with each of the five sounds of the word and with the sixth vertex (that is, with any sound other than those in the word “Alice”). If the user says “Alexander”, the decoder will stumble on the K: the probability that the spoken sound is not part of the activation phrase will be too high, and the assistant will not activate.
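
    The idea can be sketched with a small Viterbi pass over the six vertices. Everything specific here is an assumption for illustration: the transition values are made up rather than estimated from annotated speech, the frames are toy data, and scoring by the per-frame log probability of the best path that ends in the last sound is just one plausible way to turn the graph into a decision.

```python
import math

SOUNDS = ["A", "L", "I", "S", "schwa", "other"]
OTHER, LAST = 5, 4


def build_transitions(stay=0.6, advance=0.3, abort=0.1, enter=0.1):
    """Hypothetical transition matrix; every row sums to 1."""
    n = len(SOUNDS)
    trans = [[0.0] * n for _ in range(n)]
    for k in range(OTHER):
        trans[k][k] = stay
        if k < LAST:
            trans[k][k + 1] = advance  # move on to the next sound
            trans[k][OTHER] = abort    # the word was interrupted
        else:
            trans[k][OTHER] = advance + abort
    trans[OTHER][OTHER] = 1.0 - enter
    trans[OTHER][0] = enter            # the word can only start from A
    return trans


def safe_log(p):
    return math.log(p) if p > 0.0 else float("-inf")


def keyword_score(frames, trans):
    """Per-frame log score of the best path that ends in the last sound."""
    n = len(SOUNDS)
    # A path may start in "other" or directly in the first sound.
    score = [float("-inf")] * n
    for state in (OTHER, 0):
        score[state] = safe_log(0.5) + safe_log(frames[0][state])
    for current in frames[1:]:
        score = [
            max(score[prev] + safe_log(trans[prev][cur]) for prev in range(n))
            + safe_log(current[cur])
            for cur in range(n)
        ]
    return score[LAST] / len(frames)


def frame(dominant, high=0.9):
    """Toy frame: six probabilities that sum to 1."""
    low = (1.0 - high) / (len(SOUNDS) - 1)
    return [high if s == dominant else low for s in range(len(SOUNDS))]


THRESHOLD = -1.5  # hypothetical value
trans = build_transitions()
alice = [frame(s) for s in (0, 1, 2, 3, 4)]
alexander = [frame(s) for s in (0, 1, 2, OTHER, 3, 4)]  # K lands on "other"

alice_score = keyword_score(alice, trans)
alexander_score = keyword_score(alexander, trans)
print(alice_score > THRESHOLD, alexander_score > THRESHOLD)
```

    Unlike a scheme that only looks at the peaks, every frame has to be explained by some vertex here, so the K frame drags the score of “Alexander” down.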

    In the near future, these changes will be available to all users of Alice and the SpeechKit library.

    Completion of the internship and the transition to a permanent job

    Of the three months of the internship, I spent a month and a half writing the HMM decoder. At the end of that month and a half, my manager told me that a move to a permanent position with an open-ended contract would be possible (although not guaranteed) if I kept working as productively. At about the same time, I took a two-week vacation to go to a competitive programming training camp. When I returned, I started a new task: training spotters for various devices, such as Yandex.Phone, the Yandex.Auto on-board computer, and others.

    A couple of weeks later, about a month before the end of the internship, I had my first interview for the permanent position, and a few days later the second and final one. I spoke with the leads of related teams. At the first interview I was asked theoretical questions about machine learning, neural networks, logistic regression, and optimization methods. They also asked about regularization, that is, about reducing how much a given algorithm overfits, and about which regularization methods apply to which algorithms. The second interview was practical: I talked over Skype with a colleague from Moscow and typed code in a simple online editor as we went.

    At my own request, I took a ¾ position rather than a full-time one, because my studies at BSU are not over yet. In the permanent position I work, among other things, on the automatic selection of threshold values and other hyperparameters. At each moment the system gets the probability that the keyword “Alice” has been spoken. The final classifier compares this probability with a threshold value and, if the threshold is exceeded, activates Alice. Previously the threshold was chosen by the developers; the current task is to learn to do it automatically.
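
    The article does not describe how the automatic selection works, so the sketch below shows only one simple possibility, with hypothetical names and made-up data: take labeled validation scores and pick the lowest threshold whose false-activation rate stays within a given limit.

```python
def pick_threshold(scores, labels, max_false_rate=0.01):
    """Return the lowest threshold that keeps false activations within the limit.

    scores: keyword probabilities produced by the system.
    labels: True where "Alice" was really spoken.
    Activation happens when a score exceeds the threshold.
    """
    negatives = [s for s, is_keyword in zip(scores, labels) if not is_keyword]
    if not negatives:
        raise ValueError("need at least one example without the keyword")
    candidates = sorted(set(scores))
    for threshold in candidates:
        false_alarms = sum(1 for s in negatives if s > threshold)
        if false_alarms / len(negatives) <= max_false_rate:
            return threshold
    return candidates[-1]


# Made-up validation data.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.1]
labels = [True, True, True, False, True, False, False, False]

lenient = pick_threshold(scores, labels, max_false_rate=0.25)
strict = pick_threshold(scores, labels, max_false_rate=0.0)
print(lenient, strict)
```

    Choosing the lowest acceptable threshold keeps as many real activations as possible for a given limit on false ones.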

    That is how I joined Yandex and kept my place on the Alice team.
