Yanovets November 2, 2016 at 15:40

How to convert pdf (images) to a text txt file

From the sandbox

You will say that the easiest way is to select all the text in pdf, copy it to the clipboard and paste it from the clipboard into a text file. And you will be right. But this is not our case. The pdf file is the result of scanning a multi-page document. Those. pdf contents are text images.

The proposed solution is implemented under Windows-8, but with minor corrections, I think it can be used for Linux and OS X.
Abbyy FineReader, MS Word, MS OneNote cope with the task of converting images to text. There are also sites where the image can be converted to online: http://www.ocrconvert.com The
proposed solution uses free utilities. The priority was also the work on the command line.

Convert all pdf pages to image files

If there were 2-3 pages, then you could use the PrintScreen function. In Windows, there is a separate button on the keyboard for this. And in Mac OS X - a tricky key combination: you need to press the three keys Shift + Command + 4, select the desired area of the screen with the mouse, and look for the resulting file on the desktop. But if there are many pages, then you need to look for another way.

Fortunately, there is a StduViewer program that allows you to do this. Go to File → Export → As Image. In the window that appears, select the PNG type, resolution 300 dpi, set the path where to put the resulting image files. In the saved file name template, it is worth changing% PN% to% 0PN% for the case if the pages are more than 10.

kolgrim99 offered a utility from the package to convert a pdf-document to jpg-filesxpdf , which can be used on the command line. Here is his suggestion:
<< If the task is simply to gut the large PDF file with scans (or any other pictures), then you can use the utility from the xpdf set, there is a lot of everything, but pdfimages.exe is needed for the pictures. The syntax is something like this:

pdfimages.exe -j some_file.pdf C: \ images \

and in the last argument at the end of the path must be set to '\', otherwise it will not be accepted. >>

Convert page image files to text

HP developed and Google opened the source code for tesseract libraries that convert images to text ( OCR ). Install the tesseract-ocr program .
To recognize the Russian language during installation, in the "Additional language data" check the box for Russian.

At the command prompt, execute commands like:

tesseract.exe image_01.png res_01.txt -l rus

We receive text files. You can run the command for each page manually. It is easier to execute a script in python:

import os, sys
import io
sPathIn = "D:/Pictures/pict"
sPathOut = "D:/Pictures/txt"
sCmd = "\"C:/Program Files (x86)/Tesseract-OCR/tesseract.exe\" {} {} -l rus"
os.system("cd \"C:/Program Files (x86)/Tesseract-OCR\"")
dirs = os.listdir( sPathIn )
for file in dirs:
 filename, file_ext = os.path.splitext(file)
 sCmdRes = sCmd.format(sPathIn + '/' + file, sPathOut + '/' + filename + ".txt")
 print ("run> " + sCmdRes)
 os.system(sCmdRes)

It turned out a bunch of text files, which remained to be combined into one. This can be done with pens. But it was easier to write a script in python:

import os, sys
import io
sPathIn = "D:/Pictures/txt"
sFileOut = "D:/Pictures/res.txt"
dirs = os.listdir( sPathIn )
for file in dirs:
 filename, file_ext = os.path.splitext(file)
 if (file_ext == ".txt"):
  fOut = open(sFileOut, "ab")
  f = open(sPathIn + "/" + file, "rb")
  data = f.read()
  fOut.write(data)
  f.close()
  fOut.close()

This could be completed, because basically, the text turned out to be quite readable, but in places in the text a mass of rims formed.
For example, a picture with text was

transformed into something like this:

management of the modeling process, including by means of a temporary interruption, intermediate saving and restarting of the modeling process from a suspended state, setting various initial conditions, entering failures of on-board systems, weather conditions, time of day, various disturbing factors (wind, turbulence, etc.);

Therefore, the next stage appeared.

Correction of errors in the text

We use the program LanguageTool . We are interested in working on the command line, so download the "independent version" . LanguageTool requires Java to work.

I started it from the native directory (for some reason it didn’t want to work on Windows 8.1 if the current directory was foreign) and indicated the full file names (with the directory). If you run a command at a command prompt, for example, this:

java -Dfile.encoding=UTF-8 -jar languagetool-commandline.jar --help

... then an additional console will start, where he honestly writes help and safely closes within a second. To see what it writes to the console, you need to run a command bat-file with this line inside. Perhaps java has some other command line parameter, so that extra does not start. console, but this is unknown to me.

The error correction command in the text file was as follows:

java -Dfile.encoding=UTF-8 -jar languagetool-commandline.jar -a -l ru original.txt > corrected.txt

To turn off the correction of small letters to large letters, additional parameters appeared at the beginning of the line --disablecategories CASING, and instead of the file name -% 1, to pass the name inside the bat-file as an argument. Total, the line in the bat-file turned out like this:

java -Dfile.encoding=UTF-8 -jar languagetool-commandline.jar -a -u --disablecategories CASING -l ru %1 > %1-res.txt

With the -u argument, the string “Unknown words:” is added to the end of the corrected text file with a comma-separated list of all words that LanguageTool does not know. Thus, you can improve the text by correcting the wrong words from this list.

Python 3.5 and PyCharm were used .
Thanks for attention!

Tags:

How to convert pdf (images) to a text txt file

Convert all pdf pages to image files

Convert page image files to text

Correction of errors in the text

Also popular now: