How to assemble a voice bot: speech recognition, speech synthesis and NLP in a few lines of code

We regularly write about voice bots and the automation of incoming and outgoing calls: confirming deliveries, taking orders, greeting a client and answering questions while the call is being connected to the company, that sort of thing. In the comments, people reasonably pointed out that I talk about bots a lot but show little. That is easy to fix! The S7 hackathon in Gorky Park is in full swing, 50 teams are hacking on prototypes of all sorts of interesting things, and I have a chance to try to fit a voice bot into as few lines of code as possible. Minimalism in examples is cool.

How it will work


For the demo I'll build the simplest possible case: a bot that talks about the weather, using the well-known api.ai NLP engine recently acquired by Google. Given an arbitrary phrase, this service returns JSON with the result of "understanding" that phrase. If it looks like a weather question, openweathermap can supply a text description such as "cloudy". Which is exactly how it looks outside the coworking windows right now. Though I hope it will clear up by midday!

The Voximplant platform will provide the rest: phone number rental, answering the call, recognizing the caller's speech, and synthesizing the response. One of our key features is JavaScript that runs in our cloud alongside the call. And not just runs, but runs in real time. Moreover, that JavaScript can make HTTP requests to other services, so we don't need a backend as such: everything happens in the same cloud that handles the call, keeping the delay between the user's speech and the bot's answer as short as possible. We're building a bot, not a turn-based strategy game with Asterisk, after all.
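For instance, here is a minimal sketch of calling an external HTTP API straight from a scenario (the URL and key are placeholders; Net.httpRequest and Logger.write are the platform's own primitives):

```javascript
// Call an external HTTP API right from the cloud scenario, no backend needed
Net.httpRequest("https://api.openweathermap.org/data/2.5/weather?q=Moscow&appid=YOUR_KEY",
  function (e) {
    // e.text holds the response body; write it to the scenario log
    Logger.write(e.text);
  });
```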

Step one: get a phone number and answer an incoming call


I have a good introduction to Voximplant, but it is in English. Great for our clients around the world, but not so great for a tutorial article on Habr, so allow me a brief retelling. After registering, go to the scripts section of the admin panel and create a new scenario: the very JavaScript code that will be executed in the cloud. The simplest scenario answers the call, synthesizes "hello, habrauser", and hangs up. Here is its code:


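A minimal sketch of that scenario (assuming the standard VoxEngine events and the old-style Language TTS constant; check the current docs for exact names):

```javascript
// Answer the incoming call, greet the caller, and hang up
VoxEngine.addEventListener(AppEvents.CallAlerting, function (e) {
  var call = e.call;
  call.addEventListener(CallEvents.Connected, function () {
    // Synthesize the greeting into the call
    call.say("Hello, habrauser!", Language.US_ENGLISH_FEMALE);
  });
  call.addEventListener(CallEvents.PlaybackFinished, function () {
    call.hangup();
  });
  call.addEventListener(CallEvents.Disconnected, function () {
    VoxEngine.terminate();
  });
  call.answer();
});
```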
To organize code and tell the cloud when to run what, we have "applications" and "rules". Go to the applications section, create a new one, and add a rule with the default dot-asterisk mask, which means "for calls to any number". In our example we use a rented number, so calls will obviously arrive "to that number"; in the general case, though, a call can come from other telephony or from the Web SDK, and rules help route such calls without extra ifs in the scenarios. Finally, assign the JavaScript scenario you just created to this rule.

What else do you need to receive a call? Right, a number. Numbers are rented in the "buy numbers" section. Important: "Real numbers" is a toggle. If you click it, the interface switches to the technical mode of virtual numbers used for debugging. A number in Gotham City can be rented for 1 cent; calls to such numbers go through a single access number plus an extension.

Having rented a number, go to the "my phone numbers" section in the top menu and attach the created application to the number. That's it: you can call and check. By the way, if your starting balance runs out during testing, write me a PM and I'll top it up. Habr is first and foremost a community, and we should support our own.

Step two: trying to understand the caller


A little later I'll show how to start recognition from Voximplant JavaScript and get text instead of voice. For now, let's pretend that part already works and we need to "understand" what the user said. To do that, register at api.ai, connect a Google account, go to the "prebuilt agents" section, and add a brain that can talk about the weather to the project. Well, "talk": answer simple questions. Then select the created project in the left menu and click the gear icon there. In the project settings window that opens, we are interested in the "Client access token": it lets us send requests. For example, this is how a question about the weather in Moscow is recognized:


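Since the whole bot lives in a VoxEngine scenario, here is a sketch of that request in the scenario's JavaScript (the token is a placeholder; the v1 query endpoint and Bearer header follow api.ai's docs):

```javascript
// Ask api.ai to "understand" a phrase; the token below is a placeholder
var APIAI_TOKEN = "YOUR_CLIENT_ACCESS_TOKEN";

function recognizeIntent(phrase, onResult) {
  var url = "https://api.api.ai/v1/query?v=20150910&lang=en&sessionId=demo-session" +
    "&query=" + encodeURIComponent(phrase);
  var opts = new Net.HttpRequestOptions();
  opts.headers = ["Authorization: Bearer " + APIAI_TOKEN];
  Net.httpRequest(url, function (res) {
    // res.text is the JSON answer described below
    onResult(JSON.parse(res.text));
  }, opts);
}
```

Calling recognizeIntent("weather in Moscow", ...) yields the JSON described next.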
In response you get a rather big JSON. The most valuable part is under the result key: action lets you check the topic, while address and city tell you where the caller wants to know the weather. Note that this is a very simple demo, so for the question "what is the weather outside the window" you will get the address "outside the window".

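An abridged sketch of such a response (the structure follows api.ai's v1 API; the values here are illustrative):

```json
{
  "result": {
    "resolvedQuery": "weather in Moscow",
    "action": "weather",
    "parameters": {
      "address": {
        "city": "Moscow"
      }
    }
  },
  "status": {
    "code": 200
  }
}
```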



Step three: find out the weather on Mars


Having received the city for which the caller wants to know the weather (or learned that the caller isn't talking about the weather at all), we can look up the weather itself. There are a million services for this; for the demo I'll take the first one I found, openweathermap.org, where you can register and get an API key. Note that the key does not start working immediately. An example of a request that returns the weather in Moscow:


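A sketch of that request from the scenario (the endpoint is OpenWeatherMap's current-weather API; the key is a placeholder):

```javascript
// Fetch the current weather for a city; the API key below is a placeholder
var OWM_KEY = "YOUR_OPENWEATHERMAP_KEY";

function fetchWeather(city, onResult) {
  var url = "https://api.openweathermap.org/data/2.5/weather?q=" +
    encodeURIComponent(city) + "&appid=" + OWM_KEY;
  Net.httpRequest(url, function (res) {
    onResult(JSON.parse(res.text));
  });
}
```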
The response is likewise JSON, and it contains a description field ready to be pronounced. In Moscow it is currently overcast:

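An abridged sketch of the response (real answers carry many more fields; values here are illustrative):

```json
{
  "weather": [
    { "main": "Clouds", "description": "overcast clouds" }
  ],
  "main": { "temp": 284.07 },
  "name": "Moscow"
}
```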



Last step: putting it all together


All that remains is to enable streaming recognition in the Voximplant JavaScript scenario (we've already written about it), wait for a question from the user, send it to the NLP service, extract the city name, query the weather service, get the weather description, and synthesize it into the call. For the user all of this takes less than a second, and all of it is handled by the following code:


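Below is a sketch of what the full scenario can look like, assembled from the pieces above. The ASR names (VoxEngine.createASR, ASRLanguage, ASREvents.Result) follow the VoxEngine docs of that time and both tokens are placeholders; treat it as a sketch under those assumptions rather than a definitive implementation:

```javascript
// Sketch: answer the call, recognize the question, ask api.ai for the intent,
// fetch the weather from OpenWeatherMap, and synthesize the answer into the call.
var APIAI_TOKEN = "YOUR_CLIENT_ACCESS_TOKEN";   // placeholder
var OWM_KEY = "YOUR_OPENWEATHERMAP_KEY";        // placeholder

VoxEngine.addEventListener(AppEvents.CallAlerting, function (e) {
  var call = e.call;
  call.addEventListener(CallEvents.Connected, function () {
    call.say("Hi! Ask me about the weather.", Language.US_ENGLISH_FEMALE);
  });
  // After each synthesized phrase, listen for the next question.
  // A production bot would reuse one recognizer instead of creating a new one each time.
  call.addEventListener(CallEvents.PlaybackFinished, function () {
    var asr = VoxEngine.createASR(ASRLanguage.ENGLISH_US);
    asr.addEventListener(ASREvents.Result, function (ev) {
      handlePhrase(call, ev.text);
    });
    call.sendMediaTo(asr);
  });
  call.addEventListener(CallEvents.Disconnected, VoxEngine.terminate);
  call.answer();
});

// Send the recognized phrase to api.ai and route by the detected intent
function handlePhrase(call, phrase) {
  var url = "https://api.api.ai/v1/query?v=20150910&lang=en&sessionId=" + call.id() +
    "&query=" + encodeURIComponent(phrase);
  var opts = new Net.HttpRequestOptions();
  opts.headers = ["Authorization: Bearer " + APIAI_TOKEN];
  Net.httpRequest(url, function (res) {
    var result = JSON.parse(res.text).result;
    if (result.action === "weather" && result.parameters.address &&
        result.parameters.address.city) {
      sayWeather(call, result.parameters.address.city);
    } else {
      call.say("Sorry, I can only talk about the weather.", Language.US_ENGLISH_FEMALE);
    }
  }, opts);
}

// Fetch the weather and pronounce the description field
function sayWeather(call, city) {
  var url = "https://api.openweathermap.org/data/2.5/weather?q=" +
    encodeURIComponent(city) + "&appid=" + OWM_KEY;
  Net.httpRequest(url, function (res) {
    var weather = JSON.parse(res.text);
    call.say("It is " + weather.weather[0].description + " in " + city + ". Anything else?",
      Language.US_ENGLISH_FEMALE);
  });
}
```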