Huge open dataset of Russian speech


    Speech recognition specialists have long lacked a large open corpus of spoken Russian, so only large companies could afford to work on this task, and they were in no hurry to share their results.

    We set out to fix this long-standing problem.

    So, we bring to your attention a dataset of 4,000 hours of annotated spoken speech, collected from various Internet sources.

    Details under the cut.

    Here is the data for the current version 0.3:

    Data type                      Annotation             Quality          Phrases  Hours  GB
    Books                          alignment              95% / clean      1.1M     1,511  166
    Calls                          ASR                    70% / noisy      837K     812    89
    Generated (Russian addresses)  TTS                    100% / 4 voices  1.7M     754    81
    Speech from YouTube videos     subtitles              95% / noisy      786K     724    78
    Books                          ASR                    70% / noisy      124K     116    13
    Other datasets                 reading and alignment  99% / clean      17K      43     5
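    As a quick sanity check, the hours column above sums to just under the stated 4,000 hours. A throwaway sketch (values copied from the table):

```python
# Hours per source, copied from the v0.3 table above.
hours = {
    'books_aligned': 1511,
    'calls_asr': 812,
    'generated_addresses': 754,
    'youtube_subtitles': 724,
    'books_asr': 116,
    'other_datasets': 43,
}
total_hours = sum(hours.values())
print(total_hours)  # just under the stated ~4,000 hours
```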

    And here is a link to our project site .

    Will we develop the project further?


    Our work on this is not finished: we want to reach at least 10,000 hours of annotated speech.

    Then we plan to build open and commercial speech recognition models using this dataset. And we invite you to join: help us improve the dataset, and use it in your own tasks.

    Why is our goal 10 thousand hours?


    There are various studies of how neural networks generalize in speech recognition, but it is known that good generalization is not achievable on datasets of less than 1,000 hours. A figure on the order of 10,000 hours is considered acceptable in most cases; beyond that, it depends on the specific task.

    What else can be done to improve the quality of recognition if the data is still not enough?


    Often you can adapt the neural network to your speakers by having them read prepared texts.
    You can also adapt the neural network to the vocabulary of your subject area (a language model).
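    To illustrate the language-model idea: a model of in-domain vocabulary can rescore ASR hypotheses so that domain wording wins over acoustically similar alternatives. A minimal sketch with a smoothed unigram model; the tiny corpus, function name, and scoring scheme are hypothetical illustrations, not part of this dataset's tooling:

```python
import math
from collections import Counter

# Hypothetical in-domain text (banking phrases); a real corpus
# would be much larger.
domain_corpus = "перевод средств перевод на карту баланс карты"
counts = Counter(domain_corpus.split())
total = sum(counts.values())

def unigram_logprob(sentence, alpha=1.0, vocab_size=10_000):
    """Score a sentence under an additively smoothed unigram model,
    so unseen words do not zero out the score."""
    return sum(
        math.log((counts[w] + alpha) / (total + alpha * vocab_size))
        for w in sentence.split()
    )

# Rescoring two ASR hypotheses: the in-domain wording gets
# the higher score.
hypotheses = ["перевод на карту", "привод на карту"]
best = max(hypotheses, key=unigram_logprob)
```

    In practice one would use an n-gram or neural language model, but the rescoring principle is the same.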

    How did we make this dataset?


    • Found YouTube channels with high-quality subtitles, and downloaded the audio and subtitles
    • Fed the audio to other speech recognition systems
    • Read out addresses with robotic voices
    • Found audiobooks and the corresponding book texts on the Internet, split them into fragments by pauses, and matched one to the other (the so-called "alignment" task)
    • Found small Russian datasets on the Internet and added them
    • After that, converted all files into a single format (16-bit wav, 16 kHz, mono, hierarchical file layout on disk)
    • Stored the metadata in a separate manifest.csv file
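    The normalization step can be sketched as follows; the function name and paths are placeholders, not the actual pipeline code, and the input is assumed to be 16-bit PCM wav:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

def to_dataset_format(in_path, out_path, target_sr=16000):
    """Bring an arbitrary 16-bit PCM wav to 16 kHz mono 16-bit."""
    sr, data = wavfile.read(in_path)          # assumes int16 PCM input
    data = data.astype(np.float32)
    if data.ndim > 1:                         # downmix stereo to mono
        data = data.mean(axis=1)
    if sr != target_sr:                       # polyphase resampling
        data = resample_poly(data, target_sr, sr)
    # Clip to guard against resampling overshoot, then write int16.
    pcm16 = np.clip(data, -32768, 32767).astype(np.int16)
    wavfile.write(out_path, target_sr, pcm16)
```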

    How to use it:


    File storage


    The location of files is determined by their hashes, like this:

    import hashlib
    from pathlib import Path

    # `wav` is a numpy int16 array with the audio samples
    target_format = 'wav'
    wavb = wav.tobytes()
    f_hash = hashlib.sha1(wavb).hexdigest()
    store_path = Path(root_folder,
                      f_hash[0],
                      f_hash[1:3],
                      f_hash[3:15] + '.' + target_format)
    

    Reading files


    from utils.open_stt_utils import read_manifest
    from scipy.io import wavfile
    from pathlib import Path
    manifest_df = read_manifest('path/to/manifest.csv')
    for info in manifest_df.itertuples():
        sample_rate, sound = wavfile.read(info.wav_path)
        text = Path(info.text_path).read_text()
        duration = info.duration
    

    The manifest files contain triples: the name of the audio file, the name of the file with the text transcript, and the duration of the phrase in seconds.
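    Producing such a manifest yourself is straightforward with the standard csv module; the column order and file names below are assumptions for illustration:

```python
import csv

# Triples in the manifest layout described above:
# (audio path, transcript path, duration in seconds).
rows = [
    ('a/bc/0123456789ab.wav', 'a/bc/0123456789ab.txt', 3.2),
    ('f/ed/fedcba987654.wav', 'f/ed/fedcba987654.txt', 1.7),
]
with open('manifest.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)
```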

    Filtering files by duration


    from utils.open_stt_utils import (plain_merge_manifests,
                                      check_files,
                                      save_manifest)
    train_manifests = [
     'path/to/manifest1.csv',
     'path/to/manifest2.csv',
    ]
    train_manifest = plain_merge_manifests(train_manifests,
                                            MIN_DURATION=0.1,
                                            MAX_DURATION=100)
    check_files(train_manifest)
    save_manifest(train_manifest,
                 'my_manifest.csv')
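    If you would rather not depend on open_stt_utils, an equivalent duration filter in plain pandas might look like this (assuming the three-column manifest layout described above):

```python
import pandas as pd

def filter_by_duration(paths, min_dur=0.1, max_dur=100.0):
    """Merge several manifests and keep phrases within the
    given duration range (in seconds)."""
    cols = ['wav_path', 'text_path', 'duration']
    df = pd.concat(
        (pd.read_csv(p, names=cols) for p in paths),
        ignore_index=True,
    )
    return df[df.duration.between(min_dur, max_dur)]
```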
    

    What can you read or watch in Russian to get better acquainted with speech recognition?


    Recently, as part of the "Deep Learning on fingers" course, we recorded a lecture on the problem of speech recognition (and a little about synthesis). Perhaps it will be useful to you!


    Licensing issues


    • We release the dataset under a dual license: for non-commercial purposes we offer a cc-by-nc 4.0 license ; for commercial purposes, use by agreement with us.
    • As usual in such cases, all rights to the data included in the dataset remain with their owners. Our rights apply to the dataset itself. Separate rules apply for scientific and educational purposes; see the legislation of your country.

    Once again, the project site, for those who did not see the link above .
