Random forest vs neural networks: which will cope better with recognizing the speaker's gender from speech (Part 1)

    Historically, deep learning has had its greatest successes in image tasks - classification, segmentation and image generation. However, data science does not live by convolutional networks alone.

    We have tried to put together a guide to solving speech-processing problems. The most popular and in-demand of these is probably speech recognition - figuring out what exactly is being said and analyzing it at the semantic level - but we will turn to a simpler task: determining the speaker's gender. The toolkit, however, is almost the same in both cases. / Photo justin lincoln / CC-BY

    What our algorithm “hears”

    Characteristics of the voice that we will use

    The first step is to understand the physics of the process - how the male voice differs from the female one. The structure of the human vocal tract is covered in reviews and specialized literature, but the basic back-of-the-envelope explanation is quite transparent: the vocal folds, whose vibrations produce the sound wave before it is modulated by the other organs of speech, differ in thickness and tension between men and women, which leads to a different fundamental frequency (also known as pitch). In men it usually lies in the range of 65-260 Hz, and in women 100-525 Hz. In other words, the male voice most often sounds lower than the female one.

    It would be naive to suppose that pitch alone is enough. As the numbers show, these intervals overlap quite strongly for the two sexes. Moreover, during speech the fundamental frequency is a variable parameter - it changes, for example, when conveying intonation - it cannot be determined for many consonants at all, and the algorithms for estimating it are not perfect.
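    To get a feel for how much the overlap costs, here is a toy sketch (our own illustration, not part of the original experiment): model the two pitch ranges above as uniform distributions and look for the best single classification threshold. Real pitch distributions are of course not uniform, so the number below is only indicative.

```python
import numpy as np

# typical fundamental-frequency ranges (Hz), modeled as uniform for illustration
m_lo, m_hi = 65.0, 260.0
f_lo, f_hi = 100.0, 525.0

def acc(t):
    # balanced accuracy of the rule "male if pitch < t"
    p_male = np.clip((t - m_lo) / (m_hi - m_lo), 0.0, 1.0)
    p_female = 1.0 - np.clip((t - f_lo) / (f_hi - f_lo), 0.0, 1.0)
    return (p_male + p_female) / 2.0

thresholds = np.linspace(m_lo, f_hi, 10000)
best = max(acc(t) for t in thresholds)
print(round(best, 3))  # about 0.812: pitch alone cannot do better under this model
```

    Even with a perfectly chosen threshold, this crude model caps out at roughly 81% - one more reason to bring in the rest of the spectrum.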

    In human perception, the individuality of a voice lies not only in its pitch but also in its timbre - the totality of all the frequencies present in the voice. To a first approximation it can be described by the spectrum, and here mathematics comes to the rescue.

    Sound is a non-stationary signal, so its spectrum averaged over time is unlikely to tell us anything meaningful; it is more reasonable to consider a spectrogram - the spectrum at each moment of time - along with its statistics. The signal is divided into overlapping 25-50 millisecond segments called frames; for each of them the spectrum is computed with the fast Fourier transform, and then the moments of that spectrum are extracted. Most often these are the centroid, entropy, variance, skewness and kurtosis - the usual quantities one computes for random variables and time series.
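    The computation for a single frame can be sketched in a few lines of numpy (a minimal illustration with our own function name and parameter choices; a real pipeline such as openSMILE does this more carefully): treat the normalized magnitude spectrum as a distribution over frequency and take its moments.

```python
import numpy as np

def frame_spectrum_stats(frame, sr=16000):
    # magnitude spectrum of one windowed 25-50 ms frame via the FFT
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    p = spectrum / np.sum(spectrum)            # treat as a distribution over frequency
    centroid = np.sum(freqs * p)               # "center of mass" of the spectrum
    variance = np.sum((freqs - centroid) ** 2 * p)
    sd = np.sqrt(variance)
    skewness = np.sum(((freqs - centroid) / sd) ** 3 * p)
    kurtosis = np.sum(((freqs - centroid) / sd) ** 4 * p)
    entropy = -np.sum(p * np.log2(p + 1e-12))  # flat spectrum -> high entropy
    return centroid, variance, skewness, kurtosis, entropy

# a 30 ms frame of a 200 Hz tone sampled at 16 kHz
t = np.arange(int(0.03 * 16000)) / 16000.0
stats = frame_spectrum_stats(np.sin(2 * np.pi * 200 * t))
```

    For a pure 200 Hz tone the centroid lands near 200 Hz; for real voiced speech the moments summarize where the energy of the timbre sits and how it is shaped.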

    Mel-frequency cepstral coefficients (MFCC) are also used; you can read about them, for example, here. They address two problems. Firstly, human perception of sound is nonlinear in both frequency and amplitude, so some (logarithmic) scaling is required. Secondly, the spectrum of a speech signal varies quite smoothly with frequency, so its description can be compressed to a handful of numbers without much loss of accuracy. As a rule, 12 mel-cepstral coefficients are used, obtained from the logarithms of the spectral energies within frequency bands whose width grows with frequency.
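    A back-of-the-envelope numpy sketch of the idea (function names and parameter values are our own illustration, not openSMILE's implementation): warp the frequency axis to the mel scale, sum the power spectrum through triangular bands, take logs, and decorrelate with a DCT.

```python
import numpy as np

def hz_to_mel(f):
    # mel scale: roughly linear below 1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=16000, n_filters=26, n_coeffs=12):
    spectrum = np.abs(np.fft.rfft(frame)) ** 2          # power spectrum of one frame
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # triangular filters equally spaced on the mel scale (wider at high frequencies)
    hz_points = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_filters + 2))
    energies = np.zeros(n_filters)
    for i in range(n_filters):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0, 1)
        down = np.clip((hi - freqs) / (hi - mid), 0, 1)
        energies[i] = np.sum(spectrum * np.minimum(up, down))
    log_e = np.log(energies + 1e-10)                    # logarithmic loudness perception
    # DCT-II decorrelates the log energies; keep the first n_coeffs coefficients
    n = np.arange(n_filters)
    return np.array([np.sum(log_e * np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters)))
                     for k in range(1, n_coeffs + 1)])

t = np.arange(480) / 16000.0
coeffs = mfcc(np.sin(2 * np.pi * 200 * t))  # 12 coefficients for a 30 ms frame
```

    In practice we do not hand-roll this - openSMILE computes MFCCs for us - but the sketch shows what the 12 numbers per frame actually are.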

    It is this set of features (pitch, spectrogram statistics, MFCC) that we will use for classification.

    / Photo of Daniel Oines / CC-BY

    We solve the classification problem

    Machine learning starts with data. Unfortunately, there are no open, widely used datasets for gender identification of the kind that ImageNet is for image classification or IMDB is for sentiment analysis of texts. One could take the well-known TIMIT speech corpus, but it is paid (which imposes some restrictions on its public use), so we will use VCTK - a 7 GB database in the public domain. It is intended for speech synthesis, but it suits us in every respect: there is audio and metadata for 109 speakers. For each of them we take 4 random utterances lasting 1-5 seconds and try to determine the gender of the speaker.

    In a computer, sound is represented as a sequence of numbers - the deviations of the microphone membrane from its equilibrium position. The sampling rate is most often chosen from the range of 8 to 96 kHz, so one second of single-channel sound is represented by at least 8 thousand numbers, each encoding the membrane's deviation at one of eight thousand moments per second. For those who have heard of WaveNet - a neural network architecture for synthesizing audio - this may not seem like a problem, but in our case such an approach is overkill. The logical step at the preprocessing stage is to compute features, which dramatically reduces the number of parameters describing the sound. Here we turned to openSMILE - a convenient package that can compute almost everything related to sound.
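    To make the representation concrete, here is a minimal sketch (our own illustration): one second of mono audio at 8 kHz really is just 8000 floating-point numbers.

```python
import numpy as np

sr = 8000                                  # sampling rate, Hz
t = np.arange(sr) / sr                     # one second of time stamps
x = 0.5 * np.sin(2 * np.pi * 120 * t)      # a 120 Hz tone, roughly a male pitch

# one second of mono audio at 8 kHz is 8000 samples in [-1, 1]
print(len(x))  # 8000
```

    Feeding all 8000 raw numbers per second into a classifier is what makes feature extraction attractive: openSMILE reduces each frame to a few dozen informative values instead.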

    The code is written in Python, and the Random Forest implementation, which handled the classification best, is taken from the sklearn library. It will also be curious to see how neural networks cope with this task, but we will devote a separate post to them.

    To solve the classification problem means to build, from the training data, a function that maps a set of features to a class label, and does so accurately enough. In our case the classifier, given the feature set of an arbitrary audio file, must answer whose speech is recorded in it: a man's or a woman's.

    An audio file consists of many frames, and usually there are far more frames than training examples. If we train on the raw collection of frames, we are unlikely to get anything worthwhile - it is reasonable to reduce the number of parameters. In principle each frame can be classified individually, but because of outliers the final result will not be encouraging either. The golden mean is to compute characteristic statistics over all the frames of an audio file.

    In addition, we need a procedure for validating the classifier - to make sure it really does everything correctly. In speech processing it is considered that a model generalizes poorly if it works well only for the speakers it was trained on. Otherwise the model is called speaker-independent, which in itself is a good sign. To verify this property it is enough to divide the speakers into groups: train on some and measure accuracy on the others.
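    The essence of a speaker-independent split can be shown in a few lines (a hand-rolled illustration with hypothetical names; in the actual code we rely on sklearn's GroupKFold, which does the same thing):

```python
import numpy as np

def grouped_split(groups, n_folds):
    # assign each speaker wholly to one fold, so no speaker
    # ever appears in both the training and the test part
    uniq = np.unique(groups)
    fold_of = {g: i % n_folds for i, g in enumerate(uniq)}
    folds = np.array([fold_of[g] for g in groups])
    for f in range(n_folds):
        yield np.where(folds != f)[0], np.where(folds == f)[0]

speakers = np.array([1, 1, 2, 2, 3, 3, 4, 4])  # two utterances per speaker
for train, test in grouped_split(speakers, 2):
    # the same speaker never ends up on both sides of the split
    assert set(speakers[train]).isdisjoint(speakers[test])
```

    A plain random split would let utterances of the same speaker leak into both sides and inflate the measured accuracy.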

    That is exactly what we will do.

    The table with the data is stored in the file data.csv; the column names are given in the first row, and if desired it can be printed out or inspected manually.

    We import the necessary libraries and read the data:

    import csv
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier as RFC
    from sklearn.model_selection import GroupKFold
    # read data
    with open('data.csv', 'r') as c:
        r = csv.reader(c, delimiter=',')
        header = next(r)
        data = [row for row in r]
    data = np.array(data)
    # preprocess
    genders = data[:, 0].astype(int)      # class labels
    speakers = data[:, 1].astype(int)     # speaker IDs for grouped cross-validation
    filenames = data[:, 2]
    times = data[:, 3].astype(float)
    pitch = data[:, 4:5].astype(float)    # fundamental frequency only
    features = data[:, 4:].astype(float)  # pitch + spectral statistics + MFCC

    Now we need to organize cross-validation over speakers. The GroupKFold iterator built into sklearn works as follows: each point in the sample belongs to a group - in our case, to one of the speakers. The set of speakers is divided into equal parts; each part is excluded in turn, the classifier is trained on the remaining speakers, and the accuracy on the held-out part is recorded. The overall accuracy of the classifier is taken as the average accuracy over all parts.

    def subject_cross_validation(clf, x, y, subj, folds):
        # group-aware CV: test speakers never appear in the training folds
        gkf = GroupKFold(n_splits=folds)
        scores = []
        for train, test in gkf.split(x, y, groups=subj):
            clf.fit(x[train], y[train])
            scores.append(clf.score(x[test], y[test]))
        return np.mean(scores)

    When everything is ready, we can experiment. First let us try to classify individual frames: the classifier receives a frame's feature vector as input, and the target label is the label of the file the frame was taken from. We compare classification by pitch alone with classification by all features (pitch + spectral statistics + MFCC):

    # classify frames separately
    score_frames_pitch = subject_cross_validation(RFC(n_estimators=100), pitch, genders, speakers, 5)
    print('Frames classification on pitch, accuracy:', score_frames_pitch)
    score_frames_features = subject_cross_validation(RFC(n_estimators=100), features, genders, speakers, 5)
    print('Frames classification on all features, accuracy:', score_frames_features)

    As expected, the accuracy is low: 66% and 73% of frames classified correctly. Not much - only slightly better than a random classifier, which would give about 50%. Such low accuracy is primarily due to garbage in the sample: for 64% of the frames the fundamental frequency could not be computed at all. There are two possible reasons: the frame either contained no speech (silence, sighs) or was part of a consonant. The former can be discarded with a clear conscience, the latter only with reservations: we assume that the voiced frames alone will let us separate male and female speech correctly.

    In fact, we want to classify not frames but entire audio files. We can compute various statistics over the time series of frame features and then classify those:

    def make_sample(x, y, subj, names, statistics=(np.mean, np.std, np.median, np.min, np.max)):
        avx = []
        avy = []
        avs = []
        keys = np.unique(names)
        for k in keys:
            idx = names == k  # frames belonging to this audio file
            v = []
            for stat in statistics:
                v += stat(x[idx], axis=0).tolist()
            avx.append(v)
            avy.append(y[idx][0])     # all frames of a file share one gender label
            avs.append(subj[idx][0])  # ...and one speaker
        return np.array(avx), np.array(avy).astype(int), np.array(avs).astype(int)
    # aggregate frame features into one vector per audio file
    average_features, average_genders, average_speakers = make_sample(features, genders, speakers, filenames)
    average_pitch, average_genders, average_speakers = make_sample(pitch, genders, speakers, filenames)

    Now each audio file is represented by a single vector: we take the mean, standard deviation, median, minimum and maximum of each feature and classify those:

    # train models on pitch and on all features
    score_pitch = subject_cross_validation(RFC(n_estimators=100), average_pitch, average_genders, average_speakers, 5)
    print('Utterance classification on pitch, accuracy:', score_pitch)
    score_features = subject_cross_validation(RFC(n_estimators=100), average_features, average_genders, average_speakers, 5)
    print('Utterance classification on features, accuracy:', score_features)

    97.2% is a completely different matter - everything looks great. It remains to discard the garbage frames, recalculate the statistics and enjoy the result:

    # skip all frames without pitch
    filter_idx = pitch[:, 0] > 1
    filtered_average_features, filtered_average_genders, filtered_average_speakers = make_sample(features[filter_idx], genders[filter_idx], speakers[filter_idx], filenames[filter_idx])
    score_filtered = subject_cross_validation(RFC(n_estimators=100), filtered_average_features, filtered_average_genders, filtered_average_speakers, 5)
    print('Utterance classification on averaged features over filtered frames, accuracy:', score_filtered)

    Hooray, the 98.4% mark is reached. By tuning the model's parameters (and the choice of model itself) this number could probably be pushed higher, but we would gain no qualitatively new knowledge.


    Machine learning in speech processing is objectively hard. A head-on solution in most cases falls short of what is desired, and one often has to squeeze out an extra 1-2% of accuracy by changing something seemingly insignificant but justified by the physics or the mathematics. Strictly speaking, this process can be continued indefinitely, but...

    In the next and final part of our introductory guide we will examine in detail whether neural networks cope with this task better, and look at various experimental setups, network architectures and related issues.

    The material was prepared by:

    • Grigory Sterling, mathematician, lead expert in machine learning and data analysis at Neurodata Lab
    • Eva Kazimirova, biologist and physiologist, Neurodata Lab expert in acoustics and the analysis of voice and speech signals

    Stay with us.
