Recognition and conversion of subtitles from VOB to SRT format

In this article I want to cover a problem familiar to anyone who prefers watching films and other video in the original language: how to copy the subtitles off the original DVD. You will probably agree that even the best translation usually loses to the original soundtrack.

As you know, subtitles on a DVD are stored in a pre-rendered format, which makes it impossible to edit or translate them directly. The existing automated conversion utilities are aimed mainly at an English-speaking audience and, on top of that, do their job rather poorly: the recognized text contains a lot of errors. Faced with this problem, I spent one evening developing and testing a simple technique and a Perl script, which I present here.

We will need the following programs: FineReader, SubRip, and a Perl interpreter to run the script that assembles the subtitles from the text files recognized by FineReader. Yandex or Google will tell you where to get them; all of these programs are widely known.

So, let's get started.
1. Run the SubRip utility and open the VOB file of the desired video sequence whose name ends in *_0.VOB. Select the subtitle track you need, as shown in the screenshot below. Select the “save subpictures as BMP” option.



2. Click the Start button. Select the directory where SubRip will save the extracted subtitle pictures in BMP format, then specify the file prefix; SubRip adds the number and the BMP extension automatically. Then select a subtitle rendering style as shown below. From my own experience, I recommend choosing a custom black-and-white scheme, resetting the values of the Color1 and Color3 parameters, and setting the minimum values for the Color2 and Color4 parameters.



3. Wait for SubRip to extract the images from the VOB files and create the BMP format images in the directory you previously selected. The saving process will be shown in a new window opened by the application.



4. When the process is complete, save the timing file generated by SubRip; it is shown in the new window that opens during BMP image generation. Select ASCII format.



5. That's it: we now have the subtitle images and the timing file on hand. Time to open FineReader. Launch FineReader, select the recognition languages present in the subtitles (there may be more than one), choose the “open PDF or images” option, and press CTRL-A in the dialog to select all the images in our directory. Before opening the images, set the recognition options. The configuration is shown in the two screenshots below.



To simplify the process, you can use only the built-in templates, but if you want to control the recognition process with your own templates, select the second option.



6. After recognizing the text and checking it, you need to save the result. FineReader does not always correctly recognize the end of a paragraph in subtitles, so after some experiments I chose to save each page to a separate file.
The type of file to save (a text file) is shown in the screenshot below:



When saving, select a directory, specify a prefix for text files, select "create a separate file for each page" from the drop-down menu, then click on the "options" button


and specify the save options as shown below.



7. As a result of all the steps above, we have a directory with many text files in UTF-8 encoding. Now we need to assemble them into subtitles. For this I wrote a small script that combines the timing file saved in step 4 with these text files. Save the Perl script shown below, or download the executable of the compiled version of the script, and run it with two parameters:
--subtitles full path and name of the directory with the text files
--timing full path and name of the timing file.

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
use File::Glob;
use utf8;
#perl2exe_include "unicore/Heavy.pl"
#perl2exe_include "overloading.pm"
#perl2exe_include "File/Glob.pm"
#------------------------------------------------------------------
my ($arg_subtitles,$arg_timing);
GetOptions("subtitles=s"=> \$arg_subtitles,
            "timing=s"=> \$arg_timing);
usage() if (!$arg_subtitles || !$arg_timing);
$arg_subtitles =~ s#[/\\]#\\\\#g;
$arg_timing =~ s#[/\\]#\\\\#g;
my $buf = "";
my @subs_array;
while (<$arg_subtitles/*.txt>){
 my $fname = $_;
 my ($sub_number) = $fname =~ /^.*?0{0,5}(\d{1,5})\.txt$/ or next;  # skip files without a number in the name
 local $/;  
 open (sFILE,$fname) or die "Can't read file $fname [$!]\n";  
 $buf = <sFILE>; $buf =~ s/\xEF\xBB\xBF//;
 close (sFILE);
 $subs_array[$sub_number]=$buf;
}
open(tFILE, "<".$arg_timing) or die "Can't read file $arg_timing [$!]\n";
print "\xEF\xBB\xBF";
while (<tFILE>) {
if (m/(\d{2,2}:\d{2,2}:\d{2,2}):(\d{2,2}) (\d{2,2}:\d{2,2}:\d{2,2}):(\d{2,2}) \S+(\d{5,5})\.\w{3,3}/) {
   my $start_hms= $1; my $start_mls= $2; my $end_hms=$3; my $end_mls=$4; my $sub_number = $5;
   $sub_number =~ s/^0{0,4}//;
   print "$sub_number\n$start_hms,$start_mls"."0"." --> $end_hms,$end_mls"."0"."\n".$subs_array[$sub_number]."\n\n";
 } 
}
close (tFILE);  
sub usage
{
    die <<"EOT";
Usage: $0 --subtitles path_to_the_subs_folder --timing path_to_the_timing_file
path_to_the_subs_folder is the name of the folder where recognised subtitles are stored
while saving recognised subtitles from BMP images, choose text format and "store one file per page" options
EOT
}

The script prints the resulting subtitles in UTF-8 to the console, so you can redirect the output to a file of your choice.
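To see what the script's two regular expressions do, you can try them in isolation. The sketch below is only an illustration: the helper names timing_to_srt and number_from_filename are hypothetical, and the sample timing line and the "sub" file prefix are assumptions inferred from the patterns in the script, not taken from SubRip's documentation. The script pads the last two-digit field of each time with a zero to get the three-digit millisecond field used in SRT.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn one line of the timing file into the subtitle
# number and the SRT time line. Returns an empty list if the line does not match.
sub timing_to_srt {
    my ($line) = @_;
    if ($line =~ m/(\d{2}:\d{2}:\d{2}):(\d{2}) (\d{2}:\d{2}:\d{2}):(\d{2}) \S+(\d{5})\.\w{3}/) {
        my ($start_hms, $start_frac, $end_hms, $end_frac, $number) = ($1, $2, $3, $4, $5);
        $number =~ s/^0{0,4}//;    # "00012" -> "12", same as in the script
        # Append "0" to the two-digit fraction to get three-digit milliseconds
        return ($number, "$start_hms,${start_frac}0 --> $end_hms,${end_frac}0");
    }
    return;
}

# Hypothetical helper: extract the subtitle number from a recognized text
# file name, using the same pattern as the script. Returns undef on no match.
sub number_from_filename {
    my ($fname) = @_;
    return $fname =~ /^.*?0{0,5}(\d{1,5})\.txt$/ ? $1 : undef;
}

# Assumed sample line: start time, end time, BMP file name with prefix "sub"
my ($number, $times) = timing_to_srt("00:01:15:20 00:01:18:04 sub00012.bmp");
print "$number\n$times\n";
# prints:
# 12
# 00:01:15,200 --> 00:01:18,040

print number_from_filename("sub00012.txt"), "\n";   # prints 12
```

Both helpers return the number without leading zeros. That is how the script matches a line of the timing file to the text file recognized from the same image.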
That's all, thank you for your attention.
