
Build Tesseract OCR under MinGW

In one of my projects I needed character recognition, and I chose Tesseract OCR. There was already a similar article on Habr, but it is now out of date: I could not reproduce the author's instructions exactly during installation. This article describes how to build and install Tesseract OCR under MinGW.
Tesseract OCR is currently developed by Google, which suggests the library will keep evolving in the near future. I will try to describe the installation process in as much detail as possible, starting with the installation of MinGW.
MinGW Installation
First we need to install MinGW, which can be downloaded from the official website of the project. During installation, select the following options:
- C++ Compiler
- MSYS Basic System
Step 1. Installing MinGW:


Do not forget to tick the options we need: C++ Compiler and MSYS Basic System.

Next, open the Installation menu and select Apply Changes.

Then click Apply to install the selected packages.

Wait until the MinGW packages finish installing and click Close.

After completing these steps, the installer can be closed.
Before installing additional packages, we need to add the MinGW directory to PATH.
To do this, open the system properties.

Next, go to Advanced system settings -> Environment variables -> New.
For the variable name, enter PATH. For the variable value, enter the path to the MinGW bin folder, in our case C:\MinGW\bin. If a PATH variable already exists, edit it and append this folder after a semicolon instead of replacing the value.
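To confirm that the change took effect, the check can be done from a shell. This is a minimal sketch, not part of the original instructions: the helper name path_contains is hypothetical, and it assumes the shell shows C:\MinGW\bin as /c/MinGW/bin, which may differ in case or form on your machine.

```shell
#!/bin/sh
# path_contains DIR [PATH_STRING]
# Succeeds if DIR is one of the colon-separated entries of PATH_STRING
# (defaults to the current $PATH).
path_contains() {
    dir=$1
    list=${2-$PATH}
    case ":$list:" in
        *":$dir:"*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Assumption: MSYS presents C:\MinGW\bin as /c/MinGW/bin.
if path_contains /c/MinGW/bin; then
    echo "MinGW bin directory is on PATH"
else
    echo "MinGW bin directory is NOT on PATH"
fi
```

The comparison is an exact string match per entry, so a trailing slash or different letter case in PATH will be reported as missing.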

Now we need to install several packages that are required to build the Tesseract OCR library. We will do this from MinGW Shell.
To open MinGW Shell, go to the C:\MinGW\msys\1.0 folder and run the msys.bat batch file.

In the console that opens, enter the following command:
mingw-get install mingw32-automake mingw32-autoconf mingw32-autotools mingw32-libz

In my case mingw-get reported that these packages were already installed. That is fine.
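Whether or not mingw-get had anything to install, it is worth checking that the build tools are actually reachable from the shell. A small sketch follows; the helper name missing_tools and the list of tool names are my assumptions about what the later builds need, not something mingw-get reports.

```shell
#!/bin/sh
# missing_tools NAME...
# Prints each NAME that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed tool list: compiler, make and the autotools used by ./configure builds.
missing=$(missing_tools gcc g++ make autoconf automake)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing
else
    echo "All build tools found"
fi
```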
An important point! We need to mount our MinGW folder in MSYS.
To do this, create the file C:\MinGW\msys\1.0\etc\fstab, which mounts the C:\MinGW directory at the /mingw mount point:
# Win32_Path Mount_Point
c:/MinGW /mingw

After creating fstab, restart MinGW Shell: close it and run msys.bat again.
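The fstab file can also be written from the shell instead of a text editor. The sketch below is only an illustration; write_fstab is a hypothetical helper, and it writes to whatever file you name, so nothing is touched until you call it.

```shell
#!/bin/sh
# write_fstab FILE WIN32_PATH MOUNT_POINT
# Writes a two-line fstab: a header comment and one mount entry,
# with the columns separated by a tab.
write_fstab() {
    printf '# Win32_Path\tMount_Point\n%s\t%s\n' "$2" "$3" > "$1"
}

# Usage inside MinGW Shell, where /etc/fstab is C:\MinGW\msys\1.0\etc\fstab:
#   write_fstab /etc/fstab c:/MinGW /mingw
# After restarting the shell, "ls /mingw/bin" should list the MinGW binaries.
```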
Installing the Leptonica Library
After setting up MinGW, we need to install the Leptonica library, which Tesseract OCR uses to work with images. Before building Leptonica itself, we need to install several auxiliary libraries:
- libJpeg
- libPng
- libTiff
Install LibJpeg
First, create a directory in which we will store our libraries, for example C:\libs\. In this directory, create a subfolder libjpeg for this library. Now that the workspace is ready, download libJpeg from the official site and unpack it into C:\libs\libjpeg.
After unpacking, I got the following path to the libjpeg sources: C:\libs\libjpeg\jpegsrc.v8c.tar\jpegsrc.v8c\jpeg-8c.

Now we need to build and install the library. To do this, go to MinGW Shell and enter the following commands:
cd /C/libs/libjpeg/jpegsrc.v8c.tar/jpegsrc.v8c/jpeg-8c/
./configure CFLAGS='-O2' CXXFLAGS='-O2' --prefix=/mingw
make
make install




Excellent. At this point, the libJpeg installation is complete.
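The same configure, make, make install sequence is repeated for every library in this article, so it can be wrapped in a small helper. This is a sketch under my own assumptions: build_lib is a hypothetical name, and the optimization flags and the /mingw prefix are simply the ones used in the commands above.

```shell
#!/bin/sh
# build_lib SRC_DIR [EXTRA_CONFIGURE_ARGS...]
# Runs configure, make and make install inside SRC_DIR and stops at the
# first step that fails. The subshell keeps the caller's working directory.
build_lib() {
    src=$1
    shift
    (
        cd "$src" &&
        ./configure CFLAGS='-O2' CXXFLAGS='-O2' --prefix=/mingw "$@" &&
        make &&
        make install
    )
}

# Usage inside MinGW Shell (path taken from the libJpeg step):
#   build_lib /C/libs/libjpeg/jpegsrc.v8c.tar/jpegsrc.v8c/jpeg-8c
```

Chaining the steps with && means a failed configure never reaches make install, which is easy to miss when the commands are pasted one after another.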
Install libPng
Download the source code archive from the official site of the project and unpack it into the C:\libs\libpng directory. Then return to MinGW Shell; the procedure for installing this library is identical to installing libJpeg. After unpacking, I got the following directory: C:\libs\libpng\libpng-1.5.4.tar\libpng-1.5.4
cd /C/libs/libpng/libpng-1.5.4.tar/libpng-1.5.4/
./configure CFLAGS='-O2' CXXFLAGS='-O2' --prefix=/mingw
make
make install
Install libTiff
The source code archive can be downloaded from the project's FTP server. Unpack the archive into C:\libs\libtiff. This library is built the same way as the previous two.
After unpacking, I got the following path: C:\libs\libtiff\tiff-3.9.5.tar\tiff-3.9.5.
cd /C/libs/libtiff/tiff-3.9.5.tar/tiff-3.9.5/
./configure CFLAGS='-O2' CXXFLAGS='-O2' --prefix=/mingw
make
make install
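Each step in this article quotes a Windows path and then uses its MSYS form in the cd command. The conversion is mechanical and can be sketched as below; win_to_msys is a hypothetical helper that handles only plain drive-letter paths like the ones here.

```shell
#!/bin/sh
# win_to_msys 'C:\some\dir'
# Turns backslashes into forward slashes and the leading "C:" into "/C".
win_to_msys() {
    printf '%s\n' "$1" | sed -e 's|\\|/|g' -e 's|^\([A-Za-z]\):|/\1|'
}

win_to_msys 'C:\libs\libtiff\tiff-3.9.5.tar\tiff-3.9.5'
# prints /C/libs/libtiff/tiff-3.9.5.tar/tiff-3.9.5
```

The single quotes around the argument matter: without them the shell would consume the backslashes before the function sees them.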
Build Leptonica
After installing all the auxiliary libraries, we can build Leptonica. We need Leptonica 1.71 specifically, and the version matters: in my tests, Tesseract OCR itself failed to build against any higher or lower version. This version does have one bug that we will have to fix. To get started, download the archive with the source files from the official site and unpack it into the C:\libs\leptonica\ folder. After unpacking, I got the following path: C:\libs\leptonica\leptonica-1.71.tar\leptonica-1.71.
The next step is to build the library, which is no different from building the previous libraries:
cd /C/libs/leptonica/leptonica-1.71.tar/leptonica-1.71/
./configure CFLAGS='-O2' CXXFLAGS='-O2' --prefix=/mingw
make
make install
Excellent. Now we move on to building Tesseract OCR.
Tesseract OCR Build
After successfully building Leptonica, we can move on to Tesseract OCR. Download the source code archive from the official site and unpack it into the folder C:\libs\tesseract. After unpacking, I got the following path: C:\libs\tesseract\tesseract-ocr-3.02.02.tar\tesseract-ocr-3.02.02\tesseract-ocr.
Build Tesseract OCR:
cd /C/libs/tesseract/tesseract-ocr-3.02.02.tar/tesseract-ocr-3.02.02/tesseract-ocr
./configure CFLAGS='-D__MSW32__ -O2' CXXFLAGS='-D__MSW32__ -O2' LIBS='-lws2_32' LIBLEPT_HEADERSDIR='/mingw/include' --prefix=/mingw
make
make install
Building Tesseract takes a long time, so this is a good moment to go for a cup of tea.
The Tesseract OCR header files will be in C:\MinGW\include\tesseract, the Leptonica header files in C:\MinGW\include\leptonica, and all libraries in C:\MinGW\lib.
For programs built against the library to work, you should also download and install the Tesseract OCR SDK from the official website.
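A quick way to confirm that make install put the headers where expected is to test for a couple of files under the install prefix. This is a sketch, not an official check: check_install is a hypothetical helper, and the two file names are my assumption about which headers a program will include (baseapi.h for Tesseract, allheaders.h for Leptonica).

```shell
#!/bin/sh
# check_install PREFIX
# Reports whether the expected header files exist under PREFIX.
# Returns 0 only if every file is present.
check_install() {
    prefix=$1
    status=0
    for f in include/tesseract/baseapi.h include/leptonica/allheaders.h; do
        if [ -f "$prefix/$f" ]; then
            echo "found    $prefix/$f"
        else
            echo "missing  $prefix/$f"
            status=1
        fi
    done
    return $status
}

# Usage inside MinGW Shell, where /mingw is the prefix passed to ./configure:
#   check_install /mingw
```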
It remains to fix the small bugs that appear after installing Tesseract OCR.
To do this, go to the folder with the Tesseract OCR header files, C:\MinGW\include\tesseract.
In platform.h, comment out the repeated declaration of the BLOB type. You should get something like the following:
/*
typedef struct _BLOB {
    unsigned int cbSize;
    char *pBlobData;
} BLOB, *LPBLOB;
*/
Then comment out the declaration of the PBLOB class in baseapi.h.
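Since these are hand edits to installed headers, it is sensible to keep a copy of each file and to find the exact lines first. The helper below is a hypothetical convenience of mine, not part of Tesseract: it copies the file to a .orig backup once and prints the numbered lines that match a pattern.

```shell
#!/bin/sh
# backup_and_locate FILE PATTERN
# Saves FILE as FILE.orig (only if no backup exists yet), then prints
# every line of FILE that matches PATTERN, prefixed with its line number.
backup_and_locate() {
    [ -f "$1.orig" ] || cp "$1" "$1.orig"
    grep -n -- "$2" "$1"
}

# Usage inside MinGW Shell (header names taken from the steps above):
#   cd /mingw/include/tesseract
#   backup_and_locate platform.h '_BLOB'
#   backup_and_locate baseapi.h 'PBLOB'
```

If an edit goes wrong, copying the .orig file back restores the header as installed.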

Tesseract OCR Testing Application
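After installing Tesseract OCR, we can test it by writing a simple C++ application.
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <iostream>
#include <string>

int main(int argc, char* argv[])
{
    tesseract::TessBaseAPI ocr;
    // Init returns 0 on success; "eng" needs the English language data
    if (ocr.Init(NULL, "eng") != 0) {
        std::cerr << "Could not initialize Tesseract" << std::endl;
        return 1;
    }
    if (argc > 1) {
        PIX* pix = pixRead(argv[1]);
        if (pix == NULL) {
            std::cerr << "Could not read image: " << argv[1] << std::endl;
            return 1;
        }
        ocr.SetImage(pix);
        // GetUTF8Text returns a buffer that the caller must free with delete[]
        char* text = ocr.GetUTF8Text();
        std::string result = text ? text : "";
        delete[] text;
        pixDestroy(&pix);
        std::cout << "Recognition text:" << result << std::endl;
    } else {
        std::cout << "Drag and drop image file to program for recognition" << std::endl;
    }
    return 0;
}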
You can build the application from the command line:
g++ -O2 main.cpp -o ocr.exe -ltesseract -llept -lws2_32
