Do-it-yourself neural network speech synthesis

From the sandbox

Speech synthesis is currently used in various fields. These are voice assistants, and IVR systems, and smart homes, and much more. The task itself, for my taste, is very clear and understandable: the written text should be pronounced as a person would do.

Some time ago, machine learning came into the field of speech synthesis, as in many other areas. It turned out that a number of components of the entire system can be replaced with neural networks, which will allow not only to approach the quality of existing algorithms, but even significantly surpass them.

I decided to try to make a completely neural network synthesis with my own hands, and at the same time share my experience with the community. What came of this, you can find out by looking under the cat.

Speech synthesis

To build a speech synthesis system, you need a whole team of specialists from different fields. For each of them there is a whole host of algorithms and approaches. Doctoral dissertations and thick books describing fundamental approaches have been written. Let’s start with a superficial understanding of each of them.

Linguistics

Normalization of the text . First, we need to expand all abbreviations, numbers, and dates into text. The 50s of the 20th century should turn into the fifties of the twentieth century , and the city of St. Petersburg, Bolshoi pr. P.S. to the city of St. Petersburg, Bolshoi Prospect of the Petrograd Side . This should happen as naturally as if a person were asked to read what was written.
Preparation of stress dictionary . The accents can be arranged according to the rules of the language. In English, emphasis is often placed on the first syllable, and in Spanish - on the penultimate. Moreover, from these rules there are a whole host of exceptions that are not amenable to any general rule. They must be taken into account. For the Russian language in the general sense, the rules for the placement of stress do not exist at all, so without a dictionary with the stresses placed, there is absolutely no way to go.
Removal of homography . Homographs are words that coincide in spelling but differ in pronunciation. A native speaker can easily place stress: a door lock and a castle on a mountain . But the key to the lock is a more difficult task. Completely remove the homography without taking into account the context is impossible.

Prosodica

Syntagma highlighting and pause . A syntagma represents a relatively finished segment of speech in meaning. When a person speaks, he usually inserts pauses between sentences. We need to learn how to divide the text into such syntagmas.
Determination of the type of intonation . The expression of completeness, question and exclamation are the simplest intonations. But expressing irony, doubt or enthusiasm is a much more difficult task.

Phonetics

Getting transcription . Since in the end we work with pronunciation, and not with writing, it is obvious that instead of letters (graphemes), it is logical to use sounds (phonemes). Converting a grapheme recording into a phoneme is a separate task, consisting of many rules and exceptions.
Calculation of intonation parameters . At this point, you need to decide how the pitch and the pronunciation speed will change depending on the pauses placed, the selected phoneme sequence and the type of intonation expressed. In addition to the basic tone and speed, there are other parameters with which you can experiment for a long time.

Acoustics

Selection of sound elements . Synthesis systems operate with the so-called allophones - realizations of the phoneme, depending on the environment. Records from the training data are cut into pieces by phoneme marking, which form an allophone base. Each allophone is characterized by a set of parameters, such as context (phonemes neighbors), pitch, duration, and others. The synthesis process itself is the selection of the correct sequence of allophones, the most suitable in the current conditions.
Modification and sound effects . For the resulting recordings, sometimes post-processing is needed, some special filters that make the synthesized speech a little closer to human speech or correct some kind of defects.

If all of a sudden it seemed to you that all this could be simplified, figured out in your head or quickly picked up some heuristics for individual modules, then just imagine that you need to make a synthesis in Hindi. If you don’t know the language, then you won’t even be able to evaluate the quality of your synthesis without attracting someone who knows the language at the right level. My native language is Russian, and I hear when the synthesis is mistaken in stress or speaks with the wrong tone. But at the same time, all synthesized English sounds about the same to me, not to mention more exotic languages.

Implementations

We will try to find the End-2-End (E2E) implementation of the synthesis, which would take on all the difficulties associated with the subtleties of the language. In other words, we want to build a system based on neural networks that would receive text as input and produce synthesized speech as output. Is it possible to train such a network that would replace a whole team of specialists from narrow areas with a team (possibly even from one person) specializing in machine learning?

On the end2end tts request, Google produces a whole host of results. At the head is the implementation of Tacotron from Google itself. It seemed easiest to me to go from specific people on Github who are engaged in research in this area and share their implementations of various architectures.

I would single out three:

Look at them in the repository, there is a whole storehouse of information. There are a lot of architectures and approaches to the problem of E2E synthesis. Among the main ones:

Tacotron (version 1, 2).
DeepVoice (versions 1, 2, 3).
Char2Wav.
DCTTS.
WaveNet

We need to choose one. I chose Kyubyong Park's Deep Convolutional Text-To-Speech (DCTTS) as the basis for future experiments. The original article can be viewed here . Let's take a closer look at the implementation.

The author laid out the results of the synthesis on three different bases and at different stages of training. For my taste, if not a native speaker, they sound pretty decent. The last of the English databases (Kate Winslet's Audiobook) contains only 5 hours of speech, which is also a big advantage for me, since my database contains an approximately comparable amount of data.

Some time after I trained my system, information appeared in the repository that the author had successfully trained the model for the Korean language. This is also quite important, since languages can vary greatly and robustness in relation to the language is a nice addition. It can be expected that during the training process a special approach to each set of training data will not be required: language, voice, or some other characteristics.

Another important point for this kind of systems is the training time. Tacotron on that iron which I have, according to my estimates, would study about 2 weeks. For prototyping at the initial level, it seemed to me too resource-intensive. Of course, pedals would not have to be twisted, but it would take a lot of calendar time to create some basic prototype. DCTTS in the final version learns in a couple of days.

Each researcher has a set of tools that he uses in his work. Everyone picks them to their liking. I really like PyTorch. Unfortunately, I did not find an implementation of DCTTS on it, and I had to use TensorFlow. Perhaps at some point I will post my implementation on PyTorch.

Training Data

A good basis for the implementation of synthesis is the main guarantee of success. The preparation of a new voice is very thoroughly approached. A professional announcer pronounces pre-prepared phrases for many hours. For each utterance, you need to withstand all the pauses, speak without jerks and slowdowns, reproduce the correct outline of the fundamental tone and all this together with the correct intonation. Among other things, not all voices sound equally pleasant.

I had on my hands a base of about 8 hours, recorded by a professional announcer. My colleagues and I are now discussing the possibility of making this voice freely available for non-commercial use. If everything works out, then a distribution with a voice, in addition to the recordings themselves, will include the exact text for each of them.

Let's start

We want to create a network that receives text as input, and produces synthesized sound as output. The abundance of implementations shows that this is possible, but of course there are a number of reservations.

The main system parameters are usually called hyperparameters and are taken out in a separate file, which is called accordingly: hparams.py or hyperparams.py , as in our case. Everything that can be twisted without touching the main code is taken out in hyperparameters. Starting from directories for logs, ending with the size of hidden layers. After that, the hyperparameters in the code are used like this:

from hyperparams import Hyperparams as hp
batch_size = hp.B  # размер батча берем из гиперпараметров

Further on, all variables with the hp prefix . taken from the hyperparameters file. It is understood that these parameters do not change during the training process, so be careful when restarting something with new parameters.

Text

For processing text, the so-called embedding layer, which is placed first, is usually used. Its essence is simple - it’s just a plate that associates a character vector with a character vector. In the learning process, we select the optimal values for these vectors, and when we synthesize according to the finished model, we simply take the values from this very plate. This approach is used in the already widely known Word2Vec, where a vector representation for words is built.

For example, take the simple alphabet:

['a', 'b', 'c']

In the learning process, we found out that the optimal values of each of their symbols are as follows:

{
    'a': [0, 1],
    'b': [2, 3],
    'c': [4, 5]
}

Then for the aabbcc line after passing through the embedding layer, we get the following matrix:

[[0, 1], [0, 1], [2, 3], [2, 3], [4, 5], [4, 5]]

This matrix is then fed to other layers that no longer operate on the concept of a symbol.

At this moment, we see the first restriction that appears in our country: the set of characters that we can send for synthesis is limited. For each character, there must be some non-zero number of examples in the training data, preferably with a different context. This means that we need to be careful in choosing the alphabet.

In my experiments, I settled on the option:

# Алфавит задается в файле с гиперпараметрами
vocab = "E абвгдеёжзийклмнопрстуфхцчшщъыьэюя-"

This is the alphabet of the Russian language, a hyphen, a space and the designation of the end of the line. There are several important points and assumptions:

I did not add punctuation marks to the alphabet. On the one hand, we really do not pronounce them. On the other hand, according to punctuation marks, we divide the phrase into parts (syntagmas), dividing them with pauses. How does the system pronounce executions cannot be pardoned ?
There are no numbers in the alphabet. We expect that they will be expanded into numerals before applying for synthesis, that is, normalized. In general, all the E2E architectures that I saw require exactly normalized text.
There are no Latin characters in the alphabet. The English system will not be able to pronounce. You can try transliteration and get a strong Russian accent - the notorious years mi speck from my hart .
There is a letter e in the alphabet . In the data on which I trained the system, it stood where it was needed, and I decided not to change this alignment. However, at the moment when I was evaluating the results, it turned out that now, before applying for the synthesis, this letter also needs to be set correctly, otherwise the system pronounces exactly e , not e .

In future versions, you can pay more attention to each of the items, but for now let us leave it in such a slightly simplified form.

Sound

Almost all systems operate not with the signal itself, but with various kinds of spectra obtained on windows with a certain step. I will not go into details; there are quite a lot of different kinds of literature on this topic. Focus on implementation and use. Two types of spectra are used in the DCTTS implementation: amplitude spectrum and chalk spectrum.

They are considered as follows (the code from this listing and all subsequent ones is taken from the DCTTS implementation, but modified for clarity):

# Получаем сигнал фиксированной частоты дискретизации
y, sr = librosa.load(wavename, sr=hp.sr)
# Обрезаем тишину по краям
y, _ = librosa.effects.trim(y)
# Pre-emphasis фильтр
y = np.append(y[0], y[1:] - hp.preemphasis * y[:-1])
# Оконное преобразование Фурье
linear = librosa.stft(y=y,
                      n_fft=hp.n_fft,
                      hop_length=hp.hop_length,
                      win_length=hp.win_length)
# Амплитудный спектр
mag = np.abs(linear)
# Мел-спектр
mel_basis = librosa.filters.mel(hp.sr, hp.n_fft, hp.n_mels)
mel = np.dot(mel_basis, mag)
# Переводим в децибелы
mel = 20 * np.log10(np.maximum(1e-5, mel))
mag = 20 * np.log10(np.maximum(1e-5, mag))
# Нормализуем
mel = np.clip((mel - hp.ref_db + hp.max_db) / hp.max_db, 1e-8, 1)
mag = np.clip((mag - hp.ref_db + hp.max_db) / hp.max_db, 1e-8, 1)
# Транспонируем и приводим к нужным типам
mel = mel.T.astype(np.float32)
mag = mag.T.astype(np.float32)
# Добиваем нулями до правильных размерностей
t = mel.shape[0]
num_paddings = hp.r - (t % hp.r) if t % hp.r != 0else0
mel = np.pad(mel, [[0, num_paddings], [0, 0]], mode="constant")
mag = np.pad(mag, [[0, num_paddings], [0, 0]], mode="constant")
# Понижаем частоту дискретизации для мел-спектра
mel = mel[::hp.r, :]

For calculations, almost all E2E synthesis projects use the LibROSA library ( https://librosa.github.io/librosa/ ). It contains a lot of useful things, I recommend that you look into the documentation and see what is in it.

Now let's see how the magnitude spectrum looks on one of the files from the database that I used:

This option for representing window spectors is called a spectrogram. The time in seconds is located on the abscissa, and the frequency in hertz is on the ordinate. The amplitude of the spectrum is highlighted in color. The brighter the point, the greater the amplitude.

The chalk spectrum is the amplitude spectrum, but taken on the chalk scale with a specific step and window. We set the number of steps in advance; in most implementations, the value 80 is used for synthesis (set by the hp.n_mels parameter ). The transition to the chalk spectrum can greatly reduce the amount of data, but at the same time preserve the characteristics important for the speech signal. The chalk spectrogram for the same file is as follows:

Note the thinning of the chalk spectra over time on the last line of the listing. We take only every 4 vectors ( hp.r == 4 ), respectively, thereby reducing the sampling frequency. Speech synthesis comes down to predicting chalk spectra from a sequence of characters. The idea is simple: the smaller the network has to predict, the better it will cope.

Well, we can get a spectrogram by sound, but we can't listen to it. Accordingly, we need to be able to restore the signal back. For these purposes, systems often use the Griffin-Lim algorithm and its more modern interpretations (for example, RTISILA, link ). The algorithm allows you to restore the signal from its amplitude spectra. The implementation I used:

defgriffin_lim(spectrogram, n_iter=hp.n_iter):
    x_best = copy.deepcopy(spectrogram)
    for i in range(n_iter):
        x_t = librosa.istft(x_best,
                            hp.hop_length,
                            win_length=hp.win_length,
                            window="hann")
        est = librosa.stft(x_t,
                           hp.n_fft,
                           hp.hop_length,
                           win_length=hp.win_length)
        phase = est / np.maximum(1e-8, np.abs(est))
        x_best = spectrogram * phase
    x_t = librosa.istft(x_best,
                        hp.hop_length,
                        win_length=hp.win_length,
                        window="hann")
    y = np.real(x_t)
    return y

And the signal from the amplitude spectrogram can be restored like this (steps opposite to obtaining the spectrum):

# Транспонируем
mag = mag.T
# Денормализуем
mag = (np.clip(mag, 0, 1) * hp.max_db) - hp.max_db + hp.ref_db
# Возвращаемся от децибел к аплитудам
mag = np.power(10.0, mag * 0.05)
# Восстанавливаем сигнал
wav = griffin_lim(mag**hp.power)
# De-pre-emphasis фильтр
wav = signal.lfilter([1], [1, -hp.preemphasis], wav)

Let's try to get the amplitude spectrum, restore it back, and then listen.

Original:

Recovered signal:

For my taste, the result has become worse. The authors of Tacotron (the first version also uses this algorithm) noted that they used the Griffin-Lim algorithm as a temporary solution to demonstrate the capabilities of the architecture. WaveNet and similar architectures allow you to synthesize speech of better quality. But they are more heavyweight and require some effort for training.

Training

The DCTTS we selected consists of two virtually independent neural networks: Text2Mel and the Spectrogram Super-resolution Network (SSRN).

Text2Mel predicts the chalk spectrum in the text using the Attention mechanism, which links two encoders (TextEnc, AudioEnc) and one decoder (AudioDec). Please note that Text2Mel restores precisely the sparse chalk spectrum.

SSRN restores the full amplitude spectrum from the chalk spectrum, taking into account frame omissions and restoring the sampling frequency.

The sequence of calculations is described in some detail in the original article. In addition, there is the source code for the implementation, so you can always debug and delve into the subtleties. Please note that the author of the implementation moved away from the article in some places. I would highlight two points:

There were additional layers for normalization (normalization layers), without which, according to the author, nothing worked.
The implementation uses a dropout mechanism for better regularization. This is not in the article.

I took a voice that included 8 hours of recordings (several thousand files). Left only records that:

The text contains only letters, spaces and hyphens.
The length of the text does not exceed hp.max_N .
The length of the chalk spectra after dilution does not exceed hp.max_T .

I got a little more than 5 hours. I calculated the necessary spectra for all the records and, in turn, started training Text2Mel and SSRN. All this is done quite ingenuously:

$ python prepro.py
$ python train.py 1
$ python train.py 2

Note that in the original repository, prepro.py is referred to as prepo.py . My inner perfectionist could not tolerate this, so I renamed it.

DCTTS contains only convolutional layers, and unlike RNN implementations like Tacotron, it learns much faster.

On my machine with Intel Core i5-4670, 16 Gb RAM and GeForce 1080 on board, 50 thousand steps for Text2Mel learns in 15 hours, and 75 thousand steps for SSRN in 5 hours. The time required for a thousand steps in the learning process, I almost did not change, so you can easily figure out how much time it takes to learn with a lot of steps.

The size of the batch can be adjusted with hp.B. From time to time, the learning process fell with out-of-memory, so I just divided the patch size into 2 and restarted learning from scratch. I believe that the problem lies somewhere in the bowels of TensorFlow (I did not use the latest) and the intricacies of the implementation of batching. I did not deal with this, since at a value of 8 everything stopped falling.

Result

After the models have trained, you can finally start the synthesis. To do this, fill out the file with phrases and run:

$ python synthesize.py

I adjusted the implementation a bit to generate phrases from the desired file.

The results in the form of wave files will be saved in the samples directory . Here are examples of the synthesis system that I got:

Conclusions and remarks

The result exceeded my personal expectations for quality. The system places stress, the speech is legible, and the voice is recognizable. In general, it turned out not bad for the first version, especially considering that only 5 hours of training data were used for training.

There remain questions about the controllability of such a synthesis. It is not even possible to correct the stress in a word if it is incorrect. We are strictly tied to the maximum phrase length and the size of the chalk spectrogram. There is no way to control intonation and playback speed.

I did not post my changes in the code of the original implementation. They concerned only the loading of training data and phrases for synthesis according to the finished system, as well as the values of hyperparameters: alphabet ( hp.vocab ) and the size of the batch ( hp.B) The rest of the implementation remained original.

As part of the story, I did not touch at all on the topic of production of the implementation of such systems, it is still very far from completely E2E speech synthesis systems. I used a GPU with CUDA, but even then everything is slower than real time. Everything just works indecently slowly on the CPU.

All these issues will be addressed in the coming years by large companies and scientific communities. I am sure that it will be very interesting.

Tags: