Toward professional use of modern OCR: understanding FineReader

    I develop technologies used in ABBYY's text recognition products. The best-known product (or rather, product family) built on these technologies is FineReader.

    What do I mean by “technology”?
    Sometimes all the technological modules (the parts of the program that the user does not see) are collectively called the “recognition engine”. That is not entirely accurate: they perform not only character recognition but also a number of other operations, described below.


    What does the FineReader program do?


    Today any desktop edition of FineReader can do everything on its own, from acquiring an image from a scanner, a camera, or an existing file to delivering the result to a file or to a specified application, so the person stays behind the scenes. The program itself “recognizes” everything that is needed. The quotation marks are there because in this case the program locates the text, tables, and pictures, runs OCR on the detected text areas, and assembles a document, which it saves in the desired format with the specified settings.




    What does the user do?



    Usually almost nothing: the user first orders the work and then accepts it. Sometimes the user is unhappy with the result of automatic processing, but in such cases a typical user just humbly thinks “No luck...”

    Unfortunately, not everyone knows that besides the “Task” window, which is shown at startup, there are other ways to control the program. They let human intelligence overcome the shortcomings and limitations (sometimes fundamental) of the program's artificial intelligence.

    How do you learn to do that? There are several ways, which can be combined as needed:
    • read the “Quick Guide”, the “Complete User Guide”, and the program's online Help. There are a lot of words there, of course, but almost all of them are to the point;
    • read this article to the end. It has far fewer words, and the author promises to rid the reader of any fear of the program and to spark an interest in experimenting;
    • experiment with the program (the one item you cannot do without). Even the demo version lets you try everything you need for real use.


    Where to begin?


    Start by getting into the habit of saving the result not only as a document in the target format but also as a FineReader document, which contains the results of the work done so far. This lets you work on a large document not for several hours in one sitting but whenever it is convenient, and return to the recognized and proofread document as many times as necessary to experiment with save settings and so on. All operations on the FineReader document are collected in the File menu.



    There is nothing more practical than a good theory, or what “recognition” consists of.


    Looking at the concise task names, for example “Scan to PDF”, it is hard to imagine how much happens between “Scan” and “PDF” (that is, behind the single word “to”). Let's see how much.
    The task of “converting documents from a raster representation to an editable one” (not just “recognition”) includes the following main steps:

    1. Acquiring the source single-page or multi-page image (from a scanner, a camera, or a file) and converting it into a special internal representation (to simplify and speed up further operations). In every case this is done by the image processing subsystem, which understands many external formats for both reading and writing.

    2. Image preparation (correcting various kinds of distortion, splitting book spreads into separate pages; all of this can be switched on and off in the settings). This is also performed by the image processing subsystem. You can learn more about some elements of this process in this post.

    3. Segmentation, or “page layout analysis”, which decides where and what does and does not need to be recognized. It is performed by the Analysis subsystem.

    4. Recognition (finally), performed by (surprise!) the Recognizer subsystem. It produces lines made up of fragments (future words), which in turn are made up of characters without formatting (at this point there is not even a division into paragraphs, only lines). My colleague has already written on Habr about some details of the recognizer. If you are really interested in technical details, it is worth mentioning that the recognizer uses, among other things, a morphology subsystem. In this post you can learn how to use that morphology subsystem correctly, along with the “Recognition with learning” mechanism, which allows better recognition of decorative fonts or characters that FineReader knows nothing about (it happens sometimes).

    5. Document synthesis. It has two stages: page synthesis, invoked right after a single page is recognized, and document synthesis, which runs after all pages have been processed. This is where the structure and all characteristics of the recognized text other than the character codes are determined, producing a complete document. It is performed by the Synthesis subsystem. In this post you can try to appreciate the hard lot of those who write the hundreds upon hundreds of heuristics that make the recognized document look as much like the original as possible.

    6. Viewing and editing page images, the structure of areas, and recognition results is handled by the program shell and the Editor subsystem it contains (the FineReader.exe executable is the shell). In the shell you can see and edit a significant part of the information generated during processing (starting with the structure of the blocks). Of course, not all of the information the subsystems operate on is available for editing, primarily because displaying every entity found by the automation, along with its properties and relationships, would make the user interface insanely complicated.

    7. Saving the finished document to numerous external formats is performed by the Export subsystem (which my colleagues and I develop).
      Subsystems that run before export do not know the output format or the save options. Therefore, document synthesis creates several representations of the document at once, covering whatever any export format or option may require, and the shell can show them the way the export results will look in the target applications. This creates many difficulties in development, because such tight coupling between these subsystems blurs the division of responsibilities in the “border territories”, where a bug or feature lies somewhere between subsystems. But we are coping so far :)
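
    The seven steps above can be pictured as a chain of stages in which each stage consumes what the previous one produced and only the last step knows the target format. The sketch below is a toy illustration of that idea, not FineReader's actual code: every name in it (Page, prepare, analyze, recognize, synthesize_page, export, FULL_PIPELINE, LIGHT_PIPELINE) is hypothetical, and each stage body is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    """One page travelling through the hypothetical pipeline."""
    image: str                                      # stand-in for the internal raster
    areas: List[str] = field(default_factory=list)  # set by layout analysis
    lines: List[str] = field(default_factory=list)  # set by recognition
    log: List[str] = field(default_factory=list)    # names of stages that have run


Stage = Callable[[Page], Page]


def prepare(page: Page) -> Page:
    # Placeholder for distortion correction, splitting spreads, and so on.
    page.log.append("prepare")
    return page


def analyze(page: Page) -> Page:
    # Placeholder layout analysis: pretend exactly one text area was found.
    page.areas = ["text"]
    page.log.append("analyze")
    return page


def recognize(page: Page) -> Page:
    # Recognition depends on the output of the previous stage.
    if not page.areas:
        raise ValueError("recognition needs areas from layout analysis")
    page.lines = [f"line in {kind} area" for kind in page.areas]
    page.log.append("recognize")
    return page


def synthesize_page(page: Page) -> Page:
    # Placeholder for page-level synthesis (structure, formatting).
    page.log.append("synthesize_page")
    return page


def run_pipeline(page: Page, stages: List[Stage]) -> Page:
    # The output of each stage becomes the input of the next one.
    for stage in stages:
        page = stage(page)
    return page


def export(pages: List[Page], fmt: str) -> str:
    # Only this final step knows the target format.
    text = " / ".join(line for p in pages for line in p.lines)
    return f"{fmt}: {text}"


FULL_PIPELINE: List[Stage] = [prepare, analyze, recognize, synthesize_page]
# A product that skips image preparation simply uses a shorter chain.
LIGHT_PIPELINE: List[Stage] = [analyze, recognize, synthesize_page]
```

    Calling run_pipeline(Page(image="scan-001"), FULL_PIPELINE) and then export([page], "PDF") walks one page through the whole chain. Swapping "PDF" for another format changes nothing in the earlier stages, which is the property described above.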


    Why are there so many modules (subsystems)?


    First, note that only the main subsystems are listed, not all of them. The scanning subsystem, for example, was written not in a day or two but over many months, possibly even years. But let us return to the question posed above.

    Firstly, the “Recognition Technologies” project and the many complex products built on it have been developed for many decades by large teams. Their work simply has to be divided, organizationally and technologically, into parts that can each be developed more or less independently. This of course requires describing in detail the interfaces and the rules of interaction between modules, so that the output of each module in the chain fits the input of the next.

    Secondly, some products use only some of the listed processing stages (and the subsystems that implement them), not all of them. For example, the Recognizer module has its own submodules for printed and handwritten text, and its “printed” submodule in turn has submodules for languages with complex scripts. The situation is similar with the barcode recognition module and the codecs for some image formats: some products do without them.

    What is the result and why is the user needed?


    If you do not ask yourself this question in time, you can end up dissatisfied even with an OCR result that is completely correct in the narrow sense: all the letters seem to have been found and recognized correctly, yet the overall result is somehow disappointing.
    I will list some popular scenarios for using FineReader, along with the specifics of each.

    Converting an archive of document images into electronic form, preserving the appearance of the pages as much as possible while adding the ability to search and copy small fragments of text.

    This scenario usually saves the processed document to PDF with a visible page image (not always in its original form, but as close to it as possible) plus “invisible” recognized text that can be searched, selected, and copied in PDF viewers. In our jargon this PDF save mode is called “Text under the image”. It is the most popular one, but it is only one of the 4 PDF save modes (I will cover the rest in the article about saving). Fans of the DjVu format can use a similar save mode.

    An important advantage of the “Text under the image” mode is that it requires minimal knowledge of the structure of the saved text: characters are tied to the right places on the resulting page simply by their coordinates in the original image. So it does not matter if tables in the original were not detected correctly (and fell apart into a bunch of text areas) or if the text was split among text areas a little illogically. The resulting PDF contains everything, or almost everything, as long as the characters were recognized correctly and gathered into words.
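
    To make “tying characters to places by coordinates” concrete, here is a minimal sketch of mapping a recognized word's bounding box from image pixels to PDF page coordinates. It relies on the default PDF convention of 72 units per inch with the origin at the bottom-left corner of the page, and it assumes the scan resolution is known. The Word class and the function names are hypothetical and are not taken from FineReader.

```python
from dataclasses import dataclass
from typing import Tuple

POINTS_PER_INCH = 72.0  # default PDF user space unit is 1/72 inch


@dataclass(frozen=True)
class Word:
    """A recognized word with its pixel box (image origin is top-left)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def px_to_pt(value_px: float, dpi: float) -> float:
    # Convert a length in pixels to PDF points at the given scan resolution.
    return value_px * POINTS_PER_INCH / dpi


def to_pdf_box(word: Word, dpi: float, page_height_px: int) -> Tuple[float, float, float, float]:
    """Return (x0, y0, x1, y1) in points, with the origin at the bottom-left."""
    x0 = px_to_pt(word.left, dpi)
    x1 = px_to_pt(word.right, dpi)
    # The y axis is flipped: image rows grow downward, PDF y grows upward.
    y0 = px_to_pt(page_height_px - word.bottom, dpi)
    y1 = px_to_pt(page_height_px - word.top, dpi)
    return (x0, y0, x1, y1)
```

    Note that nothing here depends on paragraphs, tables, or reading order: each word needs only its own box, which is why layout mistakes do little harm in this mode.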

    Creating a document in the format of one of the popular text editors (Microsoft Word or OpenOffice / LibreOffice Writer), more or less similar to the original, for subsequent editing and/or reuse of significant fragments in new documents.

    When saving to RTF and DOCX (for Word) and ODT (for Writer), 4 save modes are supported, which differ in the balance between “preserving the exact appearance” and “easy editing and copying of content”. I will write about their differences in more detail later, but the general prerequisite for a reasonable-looking result is reasonable markup of all the document elements in FineReader: the areas and their properties.

    Creating an e-book from a scanned paper book.

    In many ways this scenario is similar to the previous one, but because of the simplified document model of e-book formats and the limited tools for editing and displaying them after FineReader, it sometimes requires more attention to certain small details.

    And why do I need to know all this now?


    As you have probably guessed, understanding these logical but still non-obvious points lets users bring FineReader's result to an ideal state (from the user's point of view) with minimal effort. In the next part of the post I will give specific recommendations for solving typical user problems, but for now, let's get back to work.
