Tesseract. Recognizing errors in a recognition system

    Tesseract is a free text recognition program developed by Google. The project description says: "Tesseract is probably the most accurate open source OCR engine available." Let's see whether the PVS-Studio static analyzer can recognize any errors in this project.


    Tesseract is a free text recognition program that Hewlett-Packard developed from the mid-1980s to the mid-1990s and that then sat on the shelf for 10 years. In August 2006, Google bought it and released the source code under the Apache 2.0 license so that development could continue. The program currently works with UTF-8, and language support (including Russian) is provided through additional modules. [Description taken from Wikipedia]

    The source code of the project is available on the Google Code website: https://code.google.com/p/tesseract-ocr/

    The size of the source code is about 16 megabytes.

    Check results

    Here are some code fragments that caught my eye while I was looking through the PVS-Studio report. I may have missed something, so the Tesseract developers would do well to run the check themselves. The trial version is active for 7 days, which is more than enough for such a small project. After that, they can decide whether they want to use the tool regularly to find typos.

    As always, let me remind you that the point of static analysis is not one-time checks but regular use.

    Bad division

    void LanguageModel::FillConsistencyInfo(....)
      float gap_ratio = expected_gap / actual_gap;
      if (gap_ratio < 1/2 || gap_ratio > 2) {

    PVS-Studio warning: V636 The '1/2' expression was implicitly casted from 'int' type to 'float' type. Consider utilizing an explicit type cast to avoid the loss of a fractional part. An example: double A = (double)(X) / Y;. language_model.cpp 1163

    The intent is to compare the variable 'gap_ratio' with the value 0.5. Unfortunately, a poor way of writing 0.5 was chosen. The division 1/2 is an integer division, and its result is 0.

    The correct code should look like this:
    if (gap_ratio < 1.0f/2 || gap_ratio > 2) {

    Or like this:
    if (gap_ratio < 0.5f || gap_ratio > 2) {
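
    To make the effect visible, here is a minimal sketch that compares the two forms of the check. The function names are hypothetical and exist only for this illustration; they are not part of Tesseract.

```cpp
// Hypothetical helper that mirrors the original condition.
// '1 / 2' is integer division and yields 0, so the lower bound is 0, not 0.5.
bool IsGapInconsistentBuggy(float gap_ratio) {
  return gap_ratio < 1 / 2 || gap_ratio > 2;
}

// Hypothetical helper with the corrected condition.
// '0.5f' is a floating-point literal, so the lower bound is really 0.5.
bool IsGapInconsistentFixed(float gap_ratio) {
  return gap_ratio < 0.5f || gap_ratio > 2;
}
```

    With a ratio of 0.25, the first version does not report an inconsistency, while the second one does.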

    There are other places where suspicious integer division is performed. Some of them may well turn out to be real, unpleasant bugs.

    Code snippets worth checking out:
    • baselinedetect.cpp 110
    • bmp_8.cpp 983
    • cjkpitch.cpp 553
    • cjkpitch.cpp 564
    • mfoutline.cpp 392
    • mfoutline.cpp 393
    • normalis.cpp 454

    Typo in comparison

    uintmax_t streamtoumax(FILE* s, int base) {
      int d, c = 0;
      c = fgetc(s);
      if (c == 'x' && c == 'X') c = fgetc(s);

    PVS-Studio warning: V547 Expression 'c == 'x' && c == 'X'' is always false. Probably the '||' operator should be used here. scanutils.cpp 135

    The correct check:
    if (c == 'x' || c == 'X') c = fgetc(s);
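
    The intent of the check can be sketched on a plain string instead of a FILE stream. The helper below is hypothetical and much simpler than the real 'streamtoumax'; it only shows that the prefix marker must be matched with '||'.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the index of the first character after an
// optional "0x" / "0X" prefix, or 0 when the text has no such prefix.
std::size_t SkipHexPrefix(const std::string& text) {
  if (text.size() >= 2 && text[0] == '0' &&
      (text[1] == 'x' || text[1] == 'X')) {  // '||': either spelling is fine
    return 2;
  }
  return 0;
}
```

    If '&&' were used in the inner condition, the prefix would never be skipped, because no character equals both 'x' and 'X'.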

    Undefined behavior

    I came across an interesting construct that I had not seen before:
    void TabVector::Evaluate(....) {
      int num_deleted_boxes = 0;
      ++num_deleted_boxes = true;

    PVS-Studio warning: V567 Undefined behavior. The 'num_deleted_boxes' variable is modified while being used twice between sequence points. tabvector.cpp 735

    It is not clear what the author meant by this code. Most likely it is the result of a typo.

    The result of the expression cannot be predicted. The variable 'num_deleted_boxes' may be incremented either before or after the assignment, because it is modified twice between two sequence points.

    Other errors leading to undefined behavior involve shifts. Consider an example:
    void Dawg::init(....)
      letter_mask_ = ~(~0 << flag_start_bit_);

    PVS-Studio warning: V610 Undefined behavior. Check the shift operator '<<'. The left operand '~0' is negative. dawg.cpp 187

    The expression '~0' has the type 'int' and equals -1. Shifting a negative value to the left results in undefined behavior. If the program works correctly, that is luck and nothing more. You can fix the flaw by making '0' unsigned:
    letter_mask_ = ~(~0u << flag_start_bit_);

    However, that is not all. The analyzer generates another warning on this line:

    V629 Consider inspecting the '~0 << flag_start_bit_' expression. Bit shifting of the 32-bit value with a subsequent expansion to the 64-bit type. dawg.cpp 187

    The point is that the variable 'letter_mask_' has the type 'uinT64'. As I understand it, ones may need to be written to the high 32 bits. In that case the expression is incorrect, because it only works with the low 32 bits.

    The '0' must be made a 64-bit value. Corrected version:
    letter_mask_ = ~(~0ull << flag_start_bit_);
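
    Here is a minimal sketch of the 64-bit variant as a standalone function. The name 'LowBitsMask64' is hypothetical, and 'std::uint64_t' stands in for the project's 'uinT64' type.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns a mask with the lowest 'start_bit' bits set.
// '~0ull' is a 64-bit unsigned value, so the shift is well defined for
// every position from 0 to 63, including positions in the high 32 bits.
std::uint64_t LowBitsMask64(int start_bit) {
  assert(start_bit >= 0 && start_bit < 64);  // shifting by 64 or more is UB
  return ~(~0ull << start_bit);
}
```

    For a start bit of 40, this gives a mask with forty ones. The '~0u' version cannot produce that mask, because shifting a 32-bit value by 32 or more positions is itself undefined.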

    Here is a list of other code fragments where negative numbers are shifted:
    • dawg.cpp 188
    • intmatcher.cpp 172
    • intmatcher.cpp 174
    • intmatcher.cpp 176
    • intmatcher.cpp 178
    • intmatcher.cpp 180
    • intmatcher.cpp 182
    • intmatcher.cpp 184
    • intmatcher.cpp 186
    • intmatcher.cpp 188
    • intmatcher.cpp 190
    • intmatcher.cpp 192
    • intmatcher.cpp 194
    • intmatcher.cpp 196
    • intmatcher.cpp 198
    • intmatcher.cpp 200
    • intmatcher.cpp 202
    • intmatcher.cpp 323
    • intmatcher.cpp 347
    • intmatcher.cpp 366

    Suspicious Double Assignment

    TESSLINE* ApproximateOutline(....) {
      EDGEPT *edgept;
      edgept = edgesteps_to_edgepts(c_outline, edgepts);
      fix2(edgepts, area);
      edgept = poly2 (edgepts, area);  // 2nd approximation.

    PVS-Studio warning: V519 The 'edgept' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 76, 78. polyaprx.cpp 78

    Another similar case:
    inT32 row_words2(....)
      this_valid = blob_box.width () >= min_width;
      this_valid = TRUE;

    PVS-Studio warning: V519 The 'this_valid' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 396, 397. wordseg.cpp 397

    Wrong class member initialization order

    First, consider the 'MasterTrainer' class. Note that the class member 'samples_' is declared before the member 'fontinfo_table_':
    class MasterTrainer {
      TrainingSampleSet samples_;
      FontInfoTable fontinfo_table_;

    According to the standard, class members are initialized in the order in which they are declared in the class, regardless of the order in the constructor's initializer list. This means that 'samples_' will be initialized BEFORE 'fontinfo_table_'.

    Now consider the constructor:
    MasterTrainer::MasterTrainer(NormalizationMode norm_mode,
                                 bool shape_analysis,
                                 bool replicate_samples,
                                 int debug_level)
      : norm_mode_(norm_mode), samples_(fontinfo_table_),
        fragments_(NULL), prev_unichar_id_(-1),

    The trouble is that 'fontinfo_table_', which is not yet initialized at that point, is used to initialize 'samples_'.

    The same class has a similar problem with the initialization of the fields 'junk_samples_' and 'verify_samples_'.

    I do not presume to say what is best to do with this class. It might be enough to move the declaration of 'fontinfo_table_' to the very beginning of the class.
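
    The rule about declaration order is easy to check with a small experiment. In the sketch below, all names ('Tracer', 'Widget', 'InitLog') are hypothetical and unrelated to Tesseract. Each member records its name when it is constructed.

```cpp
#include <string>
#include <vector>

// Hypothetical log that records the order in which members are constructed.
std::vector<std::string>& InitLog() {
  static std::vector<std::string> log;
  return log;
}

// Hypothetical member type: appends its name to the log on construction.
struct Tracer {
  explicit Tracer(const char* name) { InitLog().push_back(name); }
};

class Widget {
 public:
  // The initializer list names 'second_' first, but that changes nothing:
  // members are always constructed in the order of their declaration.
  Widget() : second_("second"), first_("first") {}

 private:
  Tracer first_;   // declared first, so constructed first
  Tracer second_;  // declared second, so constructed second
};
```

    After a 'Widget' object is created, the log contains "first" followed by "second". Many compilers can warn about such a mismatch between the initializer list and the declaration order.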

    A typo in the condition

    This typo is not easy to notice, but the analyzer never gets tired.
    class ScriptDetector {
      int korean_id_;
      int japanese_id_;
      int katakana_id_;
      int hiragana_id_;
      int han_id_;
      int hangul_id_;
      int latin_id_;
      int fraktur_id_;
    void ScriptDetector::detect_blob(BLOB_CHOICE_LIST* scores) {
      if (prev_id == katakana_id_)
        osr_->scripts_na[i][japanese_id_] += 1.0;
      if (prev_id == hiragana_id_)
        osr_->scripts_na[i][japanese_id_] += 1.0;
      if (prev_id == hangul_id_)
        osr_->scripts_na[i][korean_id_] += 1.0;
      if (prev_id == han_id_)
        osr_->scripts_na[i][korean_id_] += kHanRatioInKorean;
      if (prev_id == han_id_)             // <=== same condition as above
        osr_->scripts_na[i][japanese_id_] += kHanRatioInJapanese;

    PVS-Studio Warning: V581 The conditional expressions of the 'if' operators located alongside each other are identical. Check lines: 551, 553. osdetect.cpp 553

    Most likely, the last comparison should look like this:
    if (prev_id == japanese_id_)

    Unnecessary checks

    There is no need to check what the 'new' operator returns. If memory allocation fails, an exception is thrown. Of course, you can write a special 'new' operator that returns null pointers, but that is a separate case.

    As a result, this function can be simplified:
    void SetLabel(char_32 label) {
      if (label32_ != NULL) {
        delete []label32_;
      }
      label32_ = new char_32[2];
      if (label32_ != NULL) {
        label32_[0] = label;
        label32_[1] = 0;
      }
    }

    PVS-Studio Warning: V668 There is no sense in testing the 'label32_' pointer against null, as the memory was allocated using the 'new' operator. The exception will be generated in the case of memory allocation error. char_samp.h 73
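
    One possible simplified version is sketched below. The wrapper class 'LabelHolder' and the alias for 'char_32' are assumptions made to keep the example self-contained; only 'SetLabel' and 'label32_' come from the original code. The sketch uses C++11 features ('nullptr', deleted copy operations).

```cpp
// Assumption: 'char_32' is an alias for a 32-bit integer type.
typedef int char_32;

// Hypothetical stand-in for the class that owns 'label32_'.
class LabelHolder {
 public:
  LabelHolder() : label32_(nullptr) {}
  ~LabelHolder() { delete[] label32_; }
  LabelHolder(const LabelHolder&) = delete;
  LabelHolder& operator=(const LabelHolder&) = delete;

  void SetLabel(char_32 label) {
    // Plain 'new' throws std::bad_alloc on failure, so no null check is
    // needed. Allocating first keeps the old buffer valid if it throws.
    char_32* fresh = new char_32[2];
    fresh[0] = label;
    fresh[1] = 0;
    delete[] label32_;  // deleting a null pointer is a no-op
    label32_ = fresh;
  }

  const char_32* label() const { return label32_; }

 private:
  char_32* label32_;
};
```

    Both null checks are gone: the one before 'delete[]' is redundant, and the one after 'new' can never fail.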

    There are 101 more places where the pointer returned by the 'new' operator is checked. I see no point in listing them all in the article. It is easier to run PVS-Studio.


    Use static analysis regularly, and you will save a lot of time solving more useful tasks than catching silly errors and typos.

    And don't forget to follow me on Twitter: @Code_Analysis. I regularly publish links to interesting C++ related articles.

    Have you read the article and have a question?
    Readers often ask the same questions about our articles. We have collected the answers here: Answers to questions from readers of articles about PVS-Studio and CppCat, version 2014. Please see the list.
