Developing a Russian-speaking Siri "analogue" in 7 days

Published on March 18, 2012

    After the iPhone 4S shipped with Siri "on board", owners of the rest of Apple's gadgets felt a little left out; Apple didn't even include Siri in its new iPad. Developers around the world have tried to port Siri to other devices or to write similar apps. Only the Russian-language App Store stayed silent. All the developers must be very busy, I thought, and decided to fix this annoying oversight...

    DISCLAIMER:


    1. The word "analogue" is in quotation marks for a reason. My application is not remotely an analogue of Siri; it's an amateur hobby project. I understand that to create something genuinely comparable to Siri, you need enormous resources and a lot of money.
    2. Yes, I know Apple claims it doesn't support Siri on other iPhones because of a special noise-cancellation chip built into the 4S. I don't really believe that; more likely their servers can barely handle the load from the 4S alone, and if every Apple gadget were connected to Siri, the servers would simply collapse.
    3. The application was created just for fun and pursued no practical goals. Besides, I had my day job to do as well.

    Why in 7 days?


    From the start I decided not to spend much time on this project, for several reasons. First, I had read plenty of articles saying that Apple doesn't let Siri-like programs into the App Store, and even tries to remove existing ones, such as Evi. So there was a high probability my program wouldn't get through. That's what happened, by the way, with the rutracker.org client I wrote: I submitted it for review four times and fixed everything the censors pointed out, but it never made it into the App Store (I eventually gave up and posted a stripped-down version on w3bsit3-dns.com so the work wouldn't be wasted). Second, I simply don't have the resources to write a full-fledged program.

    Day 1. Design


    First I thought through the application logic itself. Naturally, all text-to-speech and speech-to-text conversion should happen on the server, with the application itself acting as a thin interface. That way the solution works even on the weakest devices, and it's cross-platform: to port it to Android or Windows Phone, you only need to write the interface for those platforms.

    Thus, the application logic came out as follows (a stubbed sketch follows the list):
    A) record the user's speech and send it to the server for recognition;
    B) receive the recognized string from the server and do light preliminary processing on the phone: answers to the most common questions, filtering out swearing and insults, and intercepting keywords for Yandex search and weather queries. Other commands, such as sending SMS or checking mail, I decided not to build in yet, for fear of not passing the review;
    C) send the filtered string to my own server for processing and receive a string with the answer in response;
    D) send the answer to the server for conversion to speech, get a link to an mp3 stream, and play the response.
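
    A minimal sketch of this pipeline, with every step stubbed out. The article shows no client code (the real app is an iOS app), so all helper names below are invented for illustration:

        <?php
        // Steps A-D as one pipeline. Every function is a made-up stub
        // that only shows the flow, not a real API.

        function recognizeSpeech(string $wavFile): string      // A) speech -> text
        {
            return 'what is the weather in moscow';            // stub: would call the engine
        }

        function localPreprocess(string $text): ?string        // B) canned answers, filters
        {
            return null;                                       // stub: nothing handled locally
        }

        function askKnowledgeServer(string $text): string      // C) our own answer server
        {
            return 'Sunny, no precipitation.';                 // stub: would POST to the server
        }

        function synthesizeSpeech(string $reply): string       // D) text -> mp3 stream URL
        {
            return 'http://example.com/reply.mp3';             // stub
        }

        $text  = recognizeSpeech('speech.wav');
        $reply = localPreprocess($text) ?? askKnowledgeServer($text);
        echo synthesizeSpeech($reply), "\n";                   // the app would play this stream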

    Yes, it comes out slow, but so far I don't see another option, short of putting everything on one server of my own. And that is a completely different order of cost: a powerful dedicated server (most likely more than one), buying and licensing engines for speech recognition and text-to-speech conversion, and so on. So let's settle on this logic for now.

    Day 2. Finding an engine


    I go looking for engines. This turned out to be no small problem. First, most of them are paid, at no less than $50 per 1000 words; second, very few recognize Russian speech; and third, the quality of those that do recognize Russian is simply terrible.

    I settled on the ispeech.org engine. First, it handles both conversions at once: speech-to-text and text-to-speech. Second, it has an iPhone SDK, and when you use that SDK the key and recognition are free. Naturally, for the sake of a freebie I had to sacrifice something: it recognizes Russian city names disgustingly badly, so getting a weather forecast for some hard-to-pronounce city is unrealistic. For Moscow, no problem.

    I study its API and settle on the JSON format. I send the server my key, the language to recognize, service fields such as the audio file format, and the speech itself: a .wav file encoded with base64. The answer comes back as JSON too: an error field if something went wrong, or a line of text plus the recognition confidence on success.
    The inverse transformation works the same way: I send the server a line to speak, the language, and the service fields, and get back an mp3 stream, which I play.
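
    A rough sketch of the speech-to-text call. The endpoint URL and field names here are placeholders of my own, not the engine's documented API; only the request/response shape follows the description above:

        <?php
        // Build the JSON request: key, language, service fields,
        // and the base64-encoded .wav file.
        $request = [
            'apikey' => 'YOUR_KEY',                            // key issued with the SDK
            'locale' => 'ru-RU',                               // language to recognize
            'format' => 'wav',                                 // service field: audio format
            'audio'  => base64_encode(file_get_contents('speech.wav')),
        ];

        $ch = curl_init('https://recognition.example.com/api'); // placeholder URL
        curl_setopt_array($ch, [
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => json_encode($request),
            CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
            CURLOPT_RETURNTRANSFER => true,
        ]);
        $response = json_decode(curl_exec($ch), true);
        curl_close($ch);

        if (isset($response['error'])) {
            echo "Error: {$response['error']}\n";
        } else {
            // recognized text plus a confidence score, as described above
            echo "{$response['text']} (confidence: {$response['confidence']})\n";
        }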

    Day 3. I start writing the application. Design


    I try to make something resembling Siri without copying it exactly, otherwise the censors will reject it.
    Here's what came out.

    Well, I'm no designer at all. A day wasted.

    Day 4. Writing the application logic


    Nothing complicated: ordinary HTTP POST requests. I wire in the API. First test. Hooray!!! It works, though not very fast. Over Wi-Fi it's fine, if slower than the real Siri. Over 3G it drags. Over GPRS it's pure torture; you can wait for an answer forever. I figured out the reason quickly: the wave file sent to the server is encoded with the µ-law codec at a 44 kHz sampling rate. The file comes out gigantic; for voice it should be compressed to 8 kHz. That didn't work out, so I note that the problem exists, shelve it, and move on. I filter out swearing and insults (one way to do this is sketched below).
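
    A filter like that can be as simple as a blacklist and one regular expression. The word list here is a placeholder; the article doesn't show the actual implementation:

        <?php
        // Replace blacklisted words with "***"; the list is illustrative.
        function filterCurses(string $text): string
        {
            $blacklist = ['badword1', 'badword2'];             // real list omitted
            $quoted  = array_map('preg_quote', $blacklist);
            $pattern = '/\b(' . implode('|', $quoted) . ')\b/iu';
            return preg_replace($pattern, '***', $text);
        }

        echo filterCurses('oh badword1, it crashed again');    // "oh ***, it crashed again"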


    Day 5. Integrating Yandex search and weather. Submitting to the App Store


    I pick out trigger words such as "search", "look for", "find", "weather", and so on (a sketch of the matching is below). For reliability, the app asks again what exactly we are looking for, or which city the forecast is needed for. It seems to work, but it turns out that city names are recognized poorly. So much labor down the drain; still, I decided not to throw the feature away, in case the engine eventually learns to understand cities better. I test again, and again, and again. Satisfied with the result, I submit the application to the App Store; let it wait for review while I write my server.
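
    The keyword interception boils down to scanning the recognized string for trigger words. The lists below are illustrative English stand-ins for the Russian originals:

        <?php
        // Decide whether the phrase is a search command, a weather
        // request, or ordinary chat for the knowledge base.
        function detectIntent(string $text): string
        {
            $searchWords  = ['search', 'look for', 'find'];
            $weatherWords = ['weather', 'forecast'];

            $lower = mb_strtolower($text);
            foreach ($weatherWords as $w) {
                if (mb_strpos($lower, $w) !== false) {
                    return 'weather';                          // then ask: for which city?
                }
            }
            foreach ($searchWords as $w) {
                if (mb_strpos($lower, $w) !== false) {
                    return 'search';                           // then ask: find what, exactly?
                }
            }
            return 'chat';                                     // goes to the knowledge base
        }

        echo detectIntent('find me a pizza recipe');           // prints "search"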


    Day 6. Linguistics and speech analysis. Writing the server


    I study literature on artificial intelligence and speech analysis, quietly losing my mind. I master the basics and decide not to bother with real artificial intelligence for now, but simply to split the utterance into phrases, do the simplest analysis, pick out keywords, and search for them in a database.
    I sketch out the general direction: I build a knowledge base, match the keywords extracted from the sentence against it, and return the record that best fits the question.
    In open sources I find dictionaries for chatterbot programs. Their quality is of course lacking and they will need polishing, but they'll do for a start.

    On my server I write a simple PHP program that looks up answers. So that outsiders can't reach the server and bring it down, the phone sends a token that is hard-coded into the application; I decided not to bother with real authorization for now (a sketch of the check is below).
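
    A minimal server-side version of that check, as I understand the description: reject any request whose token doesn't match the one baked into the app. The token value is a placeholder, and this is obscurity rather than real authorization, which matches the article:

        <?php
        define('APP_TOKEN', 'replace-with-the-apps-token');    // placeholder value

        // Reject requests that don't carry the app's hard-coded token.
        if (!isset($_POST['token']) || !hash_equals(APP_TOKEN, (string) $_POST['token'])) {
            http_response_code(403);
            exit('Forbidden');
        }
        // ... otherwise continue to the knowledge-base lookup ...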
    I also decided not to send the phone's GPS coordinates to the server yet, although I like the idea. Knowing the phone's coordinates, you could use a weather service's API to return a forecast, or find the nearest bars, cafés, and shops. But that requires a resource with a proper API, one I could send a request and coordinates to and get a clear answer back. I wrote the idea down and postponed it for a future version of the application.

    All the questions asked, and the answers to them, are written to the database, along with the phone's UDID, by the way. Yes, yes, Big Brother is watching you (just kidding). In reality this is needed to develop the program: knowing what people actually ask, I can quickly grow the knowledge base and catch the program's glitches. The UDID is for future development: I plan for the program to remember previous questions, so I use the UDID to identify the phone. Knowing the previous questions, you can make the application's behavior even more intelligent. I wonder: does Siri take previous questions into account when building a dialogue?
    When searching the knowledge base for answers, MySQL's MATCH ... AGAINST full-text search is used. Regular SQL queries, nothing special; a sketch follows.
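
    A minimal version of that lookup. The table and column names (kb, question, answer) are my guesses, and the question column needs a FULLTEXT index for MATCH to work:

        <?php
        $pdo = new PDO('mysql:host=localhost;dbname=assistant;charset=utf8', 'user', 'pass');

        $keywords = 'weather moscow';                          // extracted from the utterance

        // Return the single best-scoring answer for the keywords.
        $stmt = $pdo->prepare(
            'SELECT answer, MATCH(question) AGAINST(:q1) AS score
               FROM kb
              WHERE MATCH(question) AGAINST(:q2)
              ORDER BY score DESC
              LIMIT 1'
        );
        $stmt->execute([':q1' => $keywords, ':q2' => $keywords]);
        $row = $stmt->fetch(PDO::FETCH_ASSOC);

        echo $row ? $row['answer'] : "I don't know that one yet.";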

    Day 7. Today


    I tested how the knowledge-base search works and was satisfied. I sat down to write this article for Habr, and my 12-year-old son volunteered to teach the knowledge base.
    He found on the Internet which questions people most often ask Siri, and I laughed for a long time. As I write this article, he is putting his understanding of the world "into the head" of the machine: that VKontakte is better than Odnoklassniki, and more. Of course, I will then verify everything he has put into the base.


    Summary


    What came of it.
    In seven days it's quite possible to write a simple virtual interlocutor that can hold a conversation and answer some questions. It is, of course, as far from Siri as the moon, but as a bit of entertainment it will do just fine. If the censors let the application through, it will land in the "Entertainment" category.
    It can easily be ported to Android and Windows Phone.

    The program's shortcomings.
    1. Sending speech to the server takes a long time because of the wave format.
    I plan to reduce the sampling rate to 8 kHz, but I don't yet know how (one possible route is sketched below).
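
    One possible route, which the article doesn't confirm was ever taken: resample with the sox command-line tool. On the phone itself this would need a native resampler before upload; the sketch below only illustrates the transformation on a machine where sox is installed:

        <?php
        $in  = escapeshellarg('speech_44k.wav');
        $out = escapeshellarg('speech_8k.wav');

        // -r 8000 sets the output sampling rate, -c 1 mixes down to mono
        exec("sox $in -r 8000 -c 1 $out", $lines, $status);

        if ($status !== 0) {
            fwrite(STDERR, "sox failed with status $status\n");
        }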

    2. Speech recognition is not very good, especially for Russian city names.
    Maybe I'll switch to the Google engine, which recognizes speech better, but it requires transcoding the speech to the FLAC format, which I also don't know how to do yet; I need to find a suitable library. And, of course, the question of how licence-clean that path would be remains open.

    3. It runs slower than Siri.
    The only cure is buying a speech recognition engine and installing it on a dedicated server. I'm not sure I'll go that way; it's very expensive.

    4. It can't do much of what Siri can.
    That is solved by releasing updates and growing the knowledge base; it's only a matter of time and the money allocated to it.

    If I've missed anything, I'm ready to answer in the comments.

    UPD: At the request of Habr readers, and to leave no room for doubt, I've added a video.
    www.youtube.com/watch?v=UzFGgH741Cw

    UPD2: Added another video.
    www.youtube.com/watch?v=LVlllVSyln8

    UPD3: Pre-release version here
    www.youtube.com/watch?v=JlkJva-TGfY