SpeechMarkup API - Turning Speech into Data


    This article is about getting real, structured data out of a natural-language request so that your application can work with it. Specifically, it is about the SpeechMarkup REST API service, which converts an ordinary string of text into JSON listing every semantic entity it finds, each with its own specific data.

    Yes, this is the very technology that underlies every voice assistant and is used in search engines.
    It makes it possible to interpret a query unambiguously and then return the result to your application as a plain data set.

    In this article I will explain why you might want to use this API and walk through a small working example application.



    Why do we need this?


    User interfaces today keep getting simpler and more minimalistic, and for good reason: the simpler the interface, the faster and more comfortable it is to use your service or application.
    Instead of offering the user complex forms where they have to switch between fields, type here and pick something there, it is easier and more convenient to let them enter a few words into a single field.

    Moreover, on Android you can tap the microphone at any moment and simply say the data you don't feel like typing out. On iOS the situation with voice input has also improved now that dictation supports Russian. Nothing today stops you from adding voice input to your application, putting robots into a call center, or even building your own voice assistant for a smart home.


    But even leaving speech recognition aside (the situation there is far from ideal, though it improves from year to year), in many cases replacing forms with a single field that accepts plain text makes a service more convenient and understandable.
    The user writes or says, for example, "Two tickets to St. Petersburg tomorrow morning", and your service immediately shows suitable flights. Or "Saturday at 6 p.m. football", and the event is saved to the calendar. "Mikhalych, come to work early tomorrow morning", and an SMS goes to the right contact, or a task is created in the task tracker (or better yet, both).

    But it's not that simple...


    So we got the text from the user (or from some speech recognition system). What next? Right: we just need to pull out of it the data our service needs, and that's it. For example, the flight date and time and the cities of departure and arrival. Or a date-time and the reminder text.

    If only it were that simple... It turns out to be genuinely hard.
    Because this is natural language, with all its inherent features such as morphology, arbitrary word order, recognition errors and so on, correctly interpreting even a short sentence of 5-10 words becomes a real challenge.

    A date, say, can be given in absolute or relative form: "the day after tomorrow", "in two days", "December 2" or "Saturday". The same goes for time. Numbers can be written with digits or spelled out in words. Cities have synonyms (Saint Petersburg, Piter, Leningrad) and can be written with or without a hyphen (New York). Recognizing that a substring is a person's full name, or that two adjacent surnames belong to different people, is harder still.

    Do you want to solve this with regular expressions? Or dive into the subtleties of NLP, computational linguistics, AI theory and so on? I don't. I just want to pull out of the string the couple of values my application logic actually needs.
    So what do we do?



    This is exactly the problem an API like SpeechMarkup solves.
    It does not perform speech recognition itself. It takes an ordinary string as input and turns it into JSON in which every entity it found is listed in a convenient format. For instance, "in five minutes" becomes "18:15", "Saturday" becomes "15.11.2014", and so on.

    More precisely, here is an example of a response (for the Russian phrase "через неделю васе пупкину из питера исполняется пятьдесят два года", i.e. "in a week Vasya Pupkin from Piter turns fifty-two"):
    {
      "string": "через неделю васе пупкину из питера исполняется пятьдесят два года",
      "tokens": [
        { 
          "type": "Date",
          "substring": "через неделю",
          "formatted": "17.11.2014",
          "value": {"day": 17, "month": 10, "year": 2014}
        },
        {
          "type": "Person",
          "substring": "васе пупкину",
          "formatted": "Пупкин Вася",
          "value": {"firstName": "Вася", "surName": "Пупкин"}
        },
        { "type": "Text", "substring": "из", "value": "из" },
        {
          "type": "City",
          "substring": "питера",
          "value": [{"lat": 59.93863, "lon": 30.31413, "population": 5028000, "countryCode": "RU", "timezone": "Europe/Moscow", 
          "id": "498817", "name": "Санкт-Петербург"}]
        },
        { "type": "Text", "substring": "исполняется", "value": "исполняется" },
        {
          "type": "Number",
          "substring": "пятьдесят два",
          "value": 52
        },
        { "type": "Text", "substring": "года", "value": "года" }
      ]
    }
    


    As you can see, SpeechMarkup "marks up" the source text with whatever data it can find and returns the entities in the same order in which they appear in the text.

    In other words, our application can send a string and get back ordinary JSON in which every entity has its own type and a well-defined format, independent of the language of the original request. As the SpeechMarkup REST API documentation states, entities such as dates, times, numbers, cities and personal names are currently supported; everything else is marked up as plain text.

    Custom entities
    The service appeared only recently, but it plans to let its users define their own entities along with the logic for converting them into data of the desired format.


    It is important to note that SpeechMarkup does not work with the context of the request. Interpreting the data extracted from the text is the job of the specific service that consumes it. If your service is not interested in, say, people's names, it can simply ignore that markup and treat those tokens as ordinary text where needed; a tiny sketch of this follows, and the sample application below shows it in more detail.
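
    As a rough illustration (this helper is made up for the article and is not part of the demo project or the API), a service that only cares about dates could keep the Date tokens and glue everything else back into plain text via the substring field:
      // A minimal sketch: keep only Date tokens, treat everything else as plain text.
      // `resp` is assumed to be the parsed SpeechMarkup response shown above.
      function splitDatesFromText(resp) {
        var dates = [];
        var textParts = [];
        resp.tokens.forEach(function (token) {
          if (token.type === 'Date') {
            dates.push(token.value);          // e.g. {day: 17, month: 10, year: 2014}
          } else {
            textParts.push(token.substring);  // ignore the markup, keep the original words
          }
        });
        return {dates: dates, text: textParts.join(' ')};
      }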

    Simple sample application



    As an example of using the API, let's take a demo project that implements a simple reminder service. Of course, any application on any platform, written in any programming language, can use the REST API; all it takes is sending an HTTP request with the text and a few parameters and getting JSON back. In this example we use JavaScript.

    So what does our test reminder service do? It saves reminders. All the user has to do is enter some text; the text is then interpreted, and if it contains all the necessary data, it turns into a reminder. If somebody's name is present in the text, it is additionally highlighted in the list item. You can also try clicking on the examples.

    Let's look at the part of the JavaScript code that sends the request text and receives the response, from which it builds a list item with the date, time and reminder text.

    Sending text with parameters
      // Send the entered text to SpeechMarkup when the form is submitted
      $('#form').bind('submit', function(event) {
        event.preventDefault();
        var val = $.trim(text.val());
        if (val) {
          var date = new Date();
          $.ajax({
            url: 'http://markup.dusi.mobi/api/text',
            type: 'GET',
            // timestamp and offset let the server resolve relative dates and times correctly
            data: {text: val, timestamp: date.getTime(), offset: date.getTimezoneOffset()},
            success: onResult
          });
        }
        return false;
      });
    


    Everything here is simple. When the user submits the form, we take the value of the text field and send it with a GET request to
    http://markup.dusi.mobi/api/text

    Two additional parameters are needed so that the SpeechMarkup server can convert dates and times from the text correctly: timestamp, the client's current date-time in milliseconds, and offset, the client's UTC offset in minutes. It is important to pass them, because otherwise the server has no way of knowing what the client means by, for example, "in 5 minutes".
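
    Assembled, the GET request might look roughly like this (the values are made up for illustration, the Cyrillic text is shown unencoded for readability, and an offset of -180 corresponds to UTC+3):
      http://markup.dusi.mobi/api/text?text=через пять минут позвонить маме&timestamp=1415900000000&offset=-180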

    And here is the code that processes the response
      // Build a reminder from the SpeechMarkup response and add it to the list
      function onResult(data) {
        var resp = JSON.parse(data);
        var item = createItem(resp);
        if (!item.text) {
          warning$.text('А что напомнить?');        // "And what should I remind you about?"
        } else if (!item.time) {
          warning$.text('А во сколько напомнить?'); // "And at what time?"
        } else {
          warning$.empty();
          if (!item.date) {
            // No date given: assume today; if the hour has already passed,
            // either treat it as PM or move the reminder to tomorrow
            item.datetime = moment();
            if (item.time.value.hour < item.datetime.hour()) {
              if (!item.time.value.part && item.time.value.hour < 12 && item.time.value.hour + 12 > item.datetime.hour()) {
                item.time.value.hour += 12;
              } else {
                item.datetime.add(1, 'd');
              }
            }
            item.datetime.hour(item.time.value.hour).minute(item.time.value.minute);
          } else {
            item.datetime = moment([item.date.value.year, item.date.value.month, item.date.value.day, 
                                                         item.time.value.hour, item.time.value.minute]);
          }
          items.push(item);
          appendItem(item, items.length - 1);
          text.val('');
        }
      }
    


    Since we work with dates and times, it is convenient to use the Moment.js library.
    There is a bit more code here, but it is also simple, and most importantly, it does not operate on the text and does not parse it; it works with ready-made data produced by SpeechMarkup.

    In this code we try to construct a reminder from the available data: if there is no text or no time, we say so; and if everything is there except the date, we infer the date from the specified time. For example, if the user says "at 6" and it is already 2 p.m., the reminder is set for 6 p.m. today; if that hour has passed as well, it moves to tomorrow.

    At the beginning of that function you saw a call to createItem, which assembles an object to work with from the response. Here is its code:
      // Collect date, time and reminder text from the response tokens
      function createItem(resp) {
        var tokens = resp.tokens;
        var item = {text: tokens.length > 0 ? '' : resp.string};
        for (var i = 0; i < tokens.length; i++) {
          var token = tokens[i];
          switch (token.type) {
            // Names go into the text wrapped in a tag so they can be highlighted in the list item
            // (the original tags were lost in the published snippet; <b> is used here as an illustration)
            case 'Person': item.text = $.trim(item.text + ' ' + '<b>' + token.substring + '</b>');
              break;
            // The first Date/Time token becomes the reminder's date/time; any further ones go back into the text
            case 'Date': item.date ? item.text = $.trim(item.text + ' ' + token.substring) : item.date = token;
              break;
            case 'Time': item.time ? item.text = $.trim(item.text + ' ' + token.substring) : item.time = token;
              break;
            default: item.text = $.trim(item.text + ' ' + token.substring);
          }
        }
        return item;
      }
    


    This is the part that actually walks through the JSON response from the server and adds each entity either to the reminder text or to the date or time; a small illustration follows. After that, to fully understand what a token or a substring is, let's go over the SpeechMarkup API in a bit more detail.
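
    For instance, for a hypothetical input such as "завтра в 19:30 купить хлеб" ("tomorrow at 19:30 buy bread"), createItem would end up with an object along these lines (the exact token boundaries and values depend on the service; the ones below are illustrative, not an actual server response):
      var item = {
        text: 'купить хлеб',                               // plain-text tokens glued back together
        date: {type: 'Date', substring: 'завтра',
               value: {day: 16, month: 10, year: 2014}},   // first Date token becomes item.date
        time: {type: 'Time', substring: 'в 19:30',
               value: {hour: 19, minute: 30, second: 0}}   // first Time token becomes item.time
      };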

    SpeechMarkup API


    As we have already seen, SpeechMarkup accepts a string plus a few additional parameters as input and returns JSON containing the original string (the string field) and an array of the entities it found (the tokens field). If the array is empty, no specific entities were found and the whole input is plain text (remember that SpeechMarkup works with a fixed set of entities, which you will soon be able to extend with your own).

    Each token (entity) is an object that specifies the entity type (the type field), the part of the string it refers to (substring) and the converted, language-independent final value (value). For the Text type this field simply contains the substring itself.
    An optional formatted field may also be present with a compact representation of the data. For example, a date is written as "DD.MM.YYYY", a time as "HH:mm:ss", and a Person as "Surname FirstName Patronymic".

    Each entity type has its own value format in the value field. For dates it is an object with the fields day, month and year; for times, hour, minute and second.
    For cities it is not an object but an array (because many cities share the same name). Each city entry has coordinates, population, country code and a canonical name.
    An entity of type Person (a name) has the fields firstName, surName and patrName, some of which may be absent if the user gave, say, only a first name.
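
    Since a City value is an array of candidates, it is up to the consuming service to decide which one is meant. One simple heuristic (purely as an illustration; this helper is not part of the API or the demo) is to take the most populous match:
      // Pick the most populous candidate from a City token (illustrative heuristic)
      function pickCity(token) {
        return token.value.slice().sort(function (a, b) {
          return b.population - a.population;
        })[0];
      }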

    With this data you can walk through all the tokens in order (they come in exactly the order in which they appear in the original text) and, depending on the entity type and its value, apply whatever logic you need.
    In our demo, if a time occurs several times in the text, everything except the first occurrence is added back to the text; the same goes for dates. If a name occurs in the text, it is additionally highlighted.

    In conclusion


    SpeechMarkup offers a free API for marking up entities in natural-language queries, which lets your application interpret speech as easily as ordinary text input. Over time, API users will also be able to create their own entities and the logic for converting them into data, making it possible to handle more specialized requests.

    Here are a few links to help you learn more about the project and keep up with new features:
    SpeechMarkup project site
    GitHub
    Documentation
    Google+ developer community
