VoiceFabric: technology for speech synthesis from the cloud



    Today we’ll talk about the prospects and capabilities of the VoiceFabric cloud service for developers and users. The service reads any text aloud in a synthesized voice in real time. Below the cut we cover our synthesis in detail: how it is used (in standard and less standard scenarios), how to connect it to your projects, and what makes it unique.

    Why might you need speech synthesis?
    Over the lifetime of the service, customers have come to us with hundreds of ideas for applying this technology. Sometimes the task is to adapt services and websites for people with visual impairments, but many people use synthesis simply for their own convenience (for example, to listen to books in the car). Speech synthesis can also be very effective for solving business problems, both for large companies and for startups.


    If you classify all the requests, the list turns out to be fairly short:
    1. Voicing books and articles for personal use. You can also produce audiobooks and offer them to others.
    2. Voice-over for videos on YouTube and other video platforms. Usually these are educational videos or lectures, or foreign videos and interviews with Russian subtitles.
    3. Creating audio podcasts from RSS and news feeds.
    4. Voicing website content (for example, through a button in the site header).
    5. Voicing any dynamic information in call-center IVR menus (telephony). Static messages can be voiced too. To hear it in action, call the contact centers of Russian Railways, Megafon, Russian Agricultural Bank, and others.
    6. Social networks. For example, we have a joint project with VKontakte.
    7. Mobile applications.
    8. Information messages over public address systems: announcements at stations and in transport, various automatic announcers, auto dialers, etc.
    9. Voices for robots and virtual assistants, where the texts change constantly and recording every variant with human announcers is slow and inconvenient.

    What kind of speech synthesis do we have?
    At the moment there are 9 different voices:
    - 7 in Russian (2 male and 5 female);
    - 1 in American English: Carol;
    - 1 in Kazakh: Asel. (As far as we know, this is the only Kazakh synthesis in the world that is ready for production use; at least, we have not found any analogues. If you know of one, tell us in the comments.)

    Samples of all the voices can be heard here.
    Each voice is available at a sampling rate of 8000 Hz (for telephony) and 22050 Hz.

    Our Russian synthesis was developed by Russian scientists and developers. It incorporates the rules, grammars, conventions, and abbreviations typical of Russian speech. When creating the foreign voices, we brought in native speakers to account for the features and nuances of their languages.

    To understand how our Russian synthesis differs from foreign analogues, test it on large volumes of unprepared informational text: natural, conversational text that was originally written for people to read. Such texts usually contain many abbreviations and shorthand that a person understands immediately, but nobody writing them expected that a machine would ever read them.
    Try voicing a phrase such as “University named after prof. Bonch-Bruevich is located in St. Petersburg, pr. Bolshevikov, d.22”, or something similar, in Google TTS, for example, and then compare the result with our synthesis. In large deployments we run into such texts constantly. A striking example is a call-center knowledge base that was once filled in for human operators: converting the entire knowledge base into a machine-digestible form is a long and expensive task.
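
    To make the difficulty concrete, here is a deliberately naive sketch of abbreviation expansion applied to the sample phrase. The lookup table, the English expansions, and the function name are assumptions made for this illustration; this is not how VoiceFabric normalizes text.

```python
import re

# Hypothetical lookup table: the expansions are this sketch's reading of the
# abbreviations in the sample phrase, not data from VoiceFabric.
ABBREVIATIONS = {
    "prof": "professor",
    "pr": "avenue",
    "d": "building",
}

# Match a known abbreviation at a word boundary, its trailing period,
# and any whitespace that follows it.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\.\s*"
)


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its full form plus one space."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)] + " ", text)


phrase = (
    "University named after prof. Bonch-Bruevich is located in "
    "St. Petersburg, pr. Bolshevikov, d.22"
)
print(expand_abbreviations(phrase))
```

    A flat table like this copes with the sample phrase, but it leaves "St." untouched and cannot tell when "pr." or "d." means something else in another sentence. Decisions of that kind are what context-aware rules and grammars are for.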

    We also support lip-sync technology, in which animated lips move in time with the spoken words. You can create virtual characters whose lips move correctly when they speak.

    And, of course, we support SSML (Speech Synthesis Markup Language).
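
    As a rough illustration of what SSML input can look like, the sketch below wraps plain sentences in standard W3C SSML elements (speak, s, break). Which elements and attributes VoiceFabric actually accepts is not stated here, so the tag set, the en-US language code, and the build_ssml helper are all assumptions.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # W3C SSML namespace


def build_ssml(sentences, pause_ms=400, lang="en-US"):
    """Wrap sentences in <s> elements separated by fixed-length pauses."""
    pause = f'<break time="{int(pause_ms)}ms"/>'
    # escape() protects &, < and > so the markup stays well-formed XML.
    body = pause.join(f"<s>{escape(s)}</s>" for s in sentences)
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"{body}</speak>"
    )


ssml = build_ssml(["The train arrives at platform one.", "Please stand back."])
print(ssml)
```

    Escaping the text before wrapping it matters in practice: unprepared input often contains characters such as & that would otherwise break the markup.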

    We also create unique custom voices. We have even created a synthesized voice of a person who has long been “no longer with us”. The synthesis was trained on old recordings (some of them from records), so it sounds accordingly. Nevertheless, it is a real synthesized voice, and it can read any modern text. You can listen to the result here.

    A few words about how to embed synthesis into your project
    We offer two ways to use TTS VoiceFabric:

    1) An API key that is embedded in a web request.
    The VoiceFabric API service communicates with your application over HTTPS. Text of up to 4096 characters can be sent for synthesis with a GET request; text of up to 10 MB can be sent with a POST request.
    The output audio is raw PCM: codec = pcm, bit = 16, rate = 8000.
    All requests must be well-formed HTTP requests: query string parameters are URL-encoded, separated by &, and so on.
    All the details are in the integration documentation.

    2) A web service where you can paste any text (Ctrl+C / Ctrl+V), select a voice, and get the voiced text as an audio file.
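
    For the first option (an API key in a web request), the size limits and output format described above can be sketched roughly as follows. The endpoint URL, the parameter names (apikey, voice, text), the mono channel count, and the little-endian byte order are assumptions for illustration only; the real values are in the integration documentation. The sketch builds the request without sending it.

```python
import io
import wave
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: a placeholder, not the real VoiceFabric URL.
API_URL = "https://api.example.com/synthesize"
GET_LIMIT_CHARS = 4096               # GET limit stated in the article
POST_LIMIT_BYTES = 10 * 1024 * 1024  # POST limit stated in the article


def build_synthesis_request(text, api_key, voice):
    """Pick GET or POST by text size and URL-encode the parameters."""
    # Parameter names are assumptions; urlencode joins them with "&".
    params = {"apikey": api_key, "voice": voice, "text": text}
    if len(text) <= GET_LIMIT_CHARS:
        return Request(f"{API_URL}?{urlencode(params)}", method="GET")
    # Assumption: the 10 MB limit is measured on the UTF-8 encoded text.
    if len(text.encode("utf-8")) > POST_LIMIT_BYTES:
        raise ValueError("text exceeds the 10 MB POST limit")
    return Request(
        API_URL,
        data=urlencode(params).encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def pcm_to_wav(pcm_bytes, sample_rate=8000, channels=1):
    """Add a WAV header to raw 16-bit PCM so ordinary players can open it."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)   # assumption: mono output
        wav_file.setsampwidth(2)          # 16 bits = 2 bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()


request = build_synthesis_request("Hello, world", "YOUR_API_KEY", "Carol")
print(request.get_method(), request.full_url)
```

    To actually fetch audio, the request could be passed to urllib.request.urlopen and the response body handed to pcm_to_wav before saving it to a .wav file.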

    Try it, take a look, and leave comments. Your feedback is very important to us.

    P.S. A personal note.
    I have worked on speech synthesis for a long time, and I listen to many Habr articles rather than reading them on the site. I simply don’t have time to read, whereas I can listen to interesting articles while doing other things; sometimes I just turn an article into an MP3 and head outside.

    So I would like to run a small poll: would this be convenient for you? We are ready to offer Habr our synthesis service for free.

    • 29.6% (54 votes): I would listen to articles on Habr in a synthesized voice, at least sometimes.
    • 17.5% (32 votes): I am not interested in this.
    • 22.5% (41 votes): I would listen, but on a mobile device, without constant internet access.
    • 30.2% (55 votes): I would listen both on the Habr website and on mobile, at least sometimes.
