How we chose a TTS engine for voicing examples in the Dictionary

    The Puzzle English Dictionary helps users learn vocabulary from audio and video puzzles, podcasts, movies, TV shows and songs. In the Dictionary, translations are accompanied by audio examples of words and expressions. For the voiceover we use recordings of live speakers and TTS (text-to-speech) systems, which synthesize speech from text. Today we’ll tell you how we chose the Vocalware TTS engine, why we now want to connect Amazon Polly instead, and which tasks a human handles better than a robot.


    In the Dictionary, we use more than 20 voices with different accents, timbres and pronunciation variants. Male and female voices are available at different speech rates. The “announcers” have names and countries of origin: the United States, the United Kingdom or Australia. The pronunciation options help users learn to speak and understand a foreign language. This is the pronunciation switch for a single word:


    How to find a suitable TTS

    Given what the Dictionary does, we needed a TTS engine that supports at least three accents: American (General American), British (Received Pronunciation) and Australian. Male and female voices were required, and transcription support was desirable.

    We were looking for a TTS engine that synthesizes speech close to a natural voice, produces clear sound and is not too demanding on the quality of the user's Internet connection. Puzzle English students live in different regions of Russia and use the service from mobile phones over 2G and 3G. We also wanted the engine to synthesize not only individual words but also to read phrases expressively.

    We took up this problem in 2015 and found that it was almost impossible to find a TTS engine that met these requirements. There were several engines on the market:

    Acapela - recognizes and voices texts in 34 languages. More than 100 synthesized voices of different ages, emotions and accents. It produces high-quality sound.

    Vocalizer - the voice sounds natural and the speech is clear. Various dictionaries can be installed, and volume, speed and stress can be adjusted.

    eSpeak - supports over 50 languages. The synthesized speech is not perfect but intelligible, with average sound quality. The drawback is that eSpeak saves synthesized speech as .wav files, which take up a lot of space.

    RSynth - no documentation, speech quality is mediocre.

    Festival - a multilingual speech synthesis system that does not always run stably.

    Vocalware - more than 100 synthesized voices in 20 languages.
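    To make the remark about eSpeak's .wav output concrete, here is a rough size estimate for uncompressed versus compressed audio. The sample rate, bit depth, MP3 bitrate and clip length are illustrative assumptions, not measurements from any of the engines listed.

```python
def pcm_wav_bytes(seconds, sample_rate=22050, bits_per_sample=16, channels=1):
    """Approximate size of uncompressed PCM audio (ignores the small WAV header)."""
    return int(seconds * sample_rate * (bits_per_sample // 8) * channels)


def cbr_bytes(seconds, kbit_per_s=48):
    """Approximate size of a constant-bitrate compressed file such as MP3."""
    return int(seconds * kbit_per_s * 1000 / 8)


if __name__ == "__main__":
    clip_seconds = 2.0  # assumed length of one dictionary example
    wav = pcm_wav_bytes(clip_seconds)
    mp3 = cbr_bytes(clip_seconds)
    print(f"WAV ~{wav} bytes, MP3 ~{mp3} bytes, ratio ~{wav / mp3:.1f}x")
```

    Under these assumptions the uncompressed file is several times larger, which matters for users on 2G and 3G connections.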

    Acapela and Vocalizer worked only on Android and did not support other systems. They were also unstable, like Festival. The eSpeak and RSynth engines did not fit, because the Dictionary needs speech synthesis of the highest quality.

    Of these options, we chose the Vocalware engine, which met our criteria: accents, voices of different “speakers”, transcriptions. At the time, this engine offered some of the best synthesis quality for arbitrary text. With it, we have created more than a third of our voiceovers. Vocalware copes well with voicing individual words, but not whole phrases. Those are voiced for Puzzle English by live speakers.

    Why we want to connect Amazon Polly

    Unfortunately, Vocalware has not kept up with the times.

    • The quality of speech synthesis in this TTS is no longer the best on the market. We let users choose among pronunciation options, and the better the voiceover, the more useful those options are for the student.
    • We occasionally run into Vocalware outages. Sometimes the service is unavailable for up to two days in a row. This is unacceptable.
    • This TTS does not support SSML, the markup language for speech synthesis applications. SSML lets you adjust intonational emphasis, pause length and other parameters.
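    As a rough illustration of what SSML control looks like, the sketch below builds a small SSML string with a pause and an emphasized fragment. The function name and the sample phrase are hypothetical. The tags (speak, prosody, break, emphasis) come from the SSML standard, but which of them a given engine honours varies, so check the engine's documentation.

```python
from xml.sax.saxutils import escape


def emphasized_phrase(before, stressed, after, pause_ms=300, rate="medium"):
    """Build an SSML string that pauses, then stresses one fragment.

    Hypothetical helper for illustration; text is XML-escaped so that
    characters such as '&' do not break the markup.
    """
    return (
        "<speak>"
        f'<prosody rate="{rate}">'
        f"{escape(before)}"
        f'<break time="{pause_ms}ms"/>'
        f'<emphasis level="strong">{escape(stressed)}</emphasis>'
        f"{escape(after)}"
        "</prosody>"
        "</speak>"
    )


if __name__ == "__main__":
    print(emphasized_phrase("I said ", "tomorrow", ", not today."))
```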

    The system with the best synthesis quality has appeared at Amazon: it is called Amazon Polly. Another one, Cloud Text-to-Speech, is being developed by Google.

    Amazon Polly is better than Vocalware in all respects: it offers dozens of languages and male and female voices that sound more natural. The engine supports lexicons and SSML tags that let you control pronunciation, volume, pitch and speed. Polly also works faster.
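    A minimal sketch of how such a request to Polly might be assembled, assuming the boto3 SDK and configured AWS credentials. The helper names and the lexicon name "puzzleTerms" are hypothetical; this is not a description of our production code. The first function only builds the parameters, so it runs without network access.

```python
from xml.sax.saxutils import escape


def polly_request(text, voice_id="Brian", rate="medium", lexicons=()):
    """Build keyword arguments for Polly's SynthesizeSpeech operation."""
    ssml = f'<speak><prosody rate="{rate}">{escape(text)}</prosody></speak>'
    params = {
        "Text": ssml,
        "TextType": "ssml",
        "OutputFormat": "mp3",
        "VoiceId": voice_id,
    }
    if lexicons:
        # Lexicons must already be uploaded to the AWS account (assumption).
        params["LexiconNames"] = list(lexicons)
    return params


def synthesize_to_file(path, **params):
    """Call Polly and save the MP3. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so polly_request works without it

    response = boto3.client("polly").synthesize_speech(**params)
    with open(path, "wb") as out:
        out.write(response["AudioStream"].read())


if __name__ == "__main__":
    print(polly_request("playing at home", rate="slow", lexicons=["puzzleTerms"]))
```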

    Google Cloud Text-to-Speech has not yet been released to production and is in beta testing. At the heart of the engine is WaveNet, the technology that powers Google Translate and other Google services. It uses neural networks to make words and phrases sound natural. The service offers a choice of 30 voices with sound options. The pitch of each voice can be adjusted by up to 20 semitones above or below the original.
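    To give a feel for how wide a range of 20 semitones is, the sketch below converts a semitone shift into a frequency ratio. It assumes equal temperament, where each semitone multiplies the frequency by the twelfth root of two. The 120 Hz base pitch is an illustrative value, not a property of any particular voice.

```python
def semitone_ratio(semitones):
    """Frequency ratio for a pitch shift of the given number of semitones."""
    return 2.0 ** (semitones / 12.0)


def shifted_pitch(base_hz, semitones):
    """Fundamental frequency after shifting base_hz by the given semitones."""
    return base_hz * semitone_ratio(semitones)


if __name__ == "__main__":
    base = 120.0  # assumed fundamental frequency of a voice, in Hz
    for shift in (-20, -12, 0, 12, 20):
        print(f"{shift:+d} semitones: {shifted_pitch(base, shift):.1f} Hz")
```

    Twelve semitones is one octave, so a shift of 20 semitones in either direction moves the pitch by more than an octave and a half.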

    We tested both systems and concluded that the small companies that used to make up the TTS market missed their chance and have been left behind. They are unlikely to build a better product than the giants, Google and Amazon. These corporations use huge amounts of data and computing power for their voice models and are gradually taking over the market.

    Now we are planning to switch to the Amazon solution, because the quality of speech synthesis in Polly is comparable to what WaveNet provides. Our favorite is the British English “announcer” named Brian, who sounds the most natural.

    Polly, unlike WaveNet, synthesizes Russian speech. It also has English voices with Irish and Indian accents. These are useful for the English version of the site, which will be used by Indians who want to learn English. On top of that, the system is cheaper.

    Having analyzed these TTS engines, we plan to add more voices from Polly in the near future. The old “announcers” will remain: the point of the Dictionary is that the user can hear different pronunciation variants. Voicing complex phrases with robots alone is not yet possible. Many phrases on the service are created with TTS, but we cannot give up live announcers entirely.

    Why the robot is inferior to a human at voicing phrases

    In Puzzle English, phrases are voiced by live announcers. The machine manages to voice simple sentences: declarative, interrogative or negative, without emotional coloring. It cannot cope with more complex text and makes several typical mistakes.


    Here each word is voiced separately. Such voiceovers do not even remotely resemble speech: they contain no intonation, phrasing, articulation or meaning, because every word is pronounced with stress.

    Here is how the same phrase is read by the TTS in Google Translate and by a live announcer.

    The robot makes small pauses between words, as if “stamping out” each one.

    The announcer uses phrasal stress and divides a long sentence by meaning. The phrase is easier to understand by ear.


    The machine usually cannot reproduce the desired intonation. This aspect of pronouncing phrases matters to many English learners. Students often think that it is enough to get the sounds right and their speech will sound like an Englishman's. This is not true. The wrong intonation gives a foreigner away. A living person can emphasize the necessary parts of a sentence if the context requires it. The robot will not do this. Listen again to the example phrases above and you will hear what we mean.

    Direct speech

    The machine does not recognize direct speech marked by punctuation. It keeps reading the text with the same overall intonation pattern.

    This is how a native speaker reads the text:

    And this is how the robot reads it:

    Emotions in conversation

    The robot does not recognize the fragments that a native speaker stresses to emphasize certain words, for example when the phrase has an ironic tone. The robot generally keeps a neutral intonation.

    This can also be heard in the previous examples.

    Wrong pronunciation speed

    A common mistake for the robot is drawing words out, which makes the speech sound sluggish. Conversely, pronouncing a word or phrase too fast makes the text sound “mumbled”.

    Unnatural stress

    The robot stresses every word, which is unnatural for live speech.

    In this example, the robot stresses the preposition “at”.

    The announcer does not stress the preposition; in live speech, “at” merges with “playing” and is itself unstressed.

    The Google and Amazon engines read phrases better than the other TTS engines we tested. According to our analysis, both solutions from the large corporations failed on six phrases with complex intonation and handled only five well. Of Google's standard “announcers”, two read poorly and two satisfactorily; of Amazon's, two read poorly and only one satisfactorily.

    Google's result is slightly better overall, but some of the Amazon Polly voices seemed more interesting, since the voice and intonation sounded more natural. In general, phrases can already be entrusted to TTS, but not in all cases and not in a product for foreign language learners. For them, the quality and nuances of pronunciation matter, and the robot cannot always convey these.


    With TTS, you can voice individual words in different languages for your services. New solutions from Amazon and Google handle this task better than the earlier engines from small companies. But phrases, especially complex sentences with several commas, sound unnatural when these engines read them. The robot cannot distinguish direct speech, convey irony, place semantic stress or choose the correct intonation for a tag question at the end of a sentence. For our purposes this is unacceptable, so we ask live announcers to voice such material and continue to test new offerings on this market.

    If you want to improve your English, come to us.

    Blog readers get a coupon for 700 rubles toward the purchase of "Tasks".
