Andrey2008 May 22, 2014 at 12:22

Tesseract. Recognizing errors in a recognition system

Tesseract is a free text recognition program developed by Google. The project description says: "Tesseract is probably the most accurate open source OCR engine available." Let's see whether the PVS-Studio static analyzer can recognize any errors in this project.

Tesseract

Tesseract is a free text recognition program. Hewlett-Packard developed it from the mid-1980s to the mid-1990s, after which it “lay on the shelf” for 10 years. In August 2006, Google bought it and opened the source code under the Apache 2.0 license to continue development. The program now works with UTF-8, and language support (including Russian) is provided through additional modules. [Description taken from Wikipedia]

The source code of the project is available on the Google Code website: https://code.google.com/p/tesseract-ocr/

The size of the source code is about 16 megabytes.

Check results

Here are some code fragments that caught my attention while I was looking through the PVS-Studio report. I may have missed something, so it would be a good idea for the Tesseract developers to run the check themselves. The trial version is active for 7 days, which is more than enough for such a small project. After that, they can decide whether they want to use the tool regularly to find typos.

As always, a reminder: the point of static analysis is not one-time checks but regular use.

Bad division

void LanguageModel::FillConsistencyInfo(....)
{
  ....
  float gap_ratio = expected_gap / actual_gap;
  if (gap_ratio < 1/2 || gap_ratio > 2) {
    consistency_info->num_inconsistent_spaces++;
  ....
}

PVS-Studio warning: V636 The '1/2' expression was implicitly casted from 'int' type to 'float' type. Consider utilizing an explicit type cast to avoid the loss of a fractional part. An example: double A = (double)(X) / Y;. language_model.cpp 1163

The author wanted to compare the variable 'gap_ratio' with 0.5 but chose an unfortunate way to write 0.5. Both operands of 1/2 are integers, so the division is an integer division and yields 0.

The correct code should look like this:

if (gap_ratio < 1.0f/2 || gap_ratio > 2) {

Or like this:

if (gap_ratio < 0.5f || gap_ratio > 2) {
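To make the difference concrete, here is a minimal sketch. The two helper functions and the self-check are hypothetical and are not part of Tesseract; they differ only in how 0.5 is written.

```cpp
#include <cassert>

// Hypothetical helpers for illustration; neither exists in Tesseract.

// Mirrors the original condition: 1/2 is an integer division and yields 0,
// so the first comparison is effectively "gap_ratio < 0".
inline bool is_inconsistent_buggy(float gap_ratio) {
  return gap_ratio < 1/2 || gap_ratio > 2;
}

// Mirrors the corrected condition, written with a floating-point literal.
inline bool is_inconsistent_fixed(float gap_ratio) {
  return gap_ratio < 0.5f || gap_ratio > 2;
}

// Small self-check: a ratio of 0.25 is caught only by the fixed version.
inline void check_gap_ratio_examples() {
  assert(!is_inconsistent_buggy(0.25f));
  assert(is_inconsistent_fixed(0.25f));
}
```

With these definitions, a gap that is four times larger than expected (ratio 0.25) slips past the original check, while ratios above 2 are caught by both versions.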

There are other places where a suspicious integer division is performed. Some of them may hide genuinely nasty errors.

Code fragments worth checking:

baselinedetect.cpp 110
bmp_8.cpp 983
cjkpitch.cpp 553
cjkpitch.cpp 564
mfoutline.cpp 392
mfoutline.cpp 393
normalis.cpp 454

Typo in comparison

uintmax_t streamtoumax(FILE* s, int base) {
  int d, c = 0;
  ....
  c = fgetc(s);
  if (c == 'x' && c == 'X') c = fgetc(s);
  ....
}

PVS-Studio warning: V547 Expression 'c == 'x' && c == 'X'' is always false. Probably the '||' operator should be used here. scanutils.cpp 135

The correct check:

if (c == 'x' || c == 'X') c = fgetc(s);
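The same point as a small sketch: a single character can never equal two different values at once. The helper names below are hypothetical and do not come from scanutils.cpp.

```cpp
#include <cassert>

// Hypothetical helpers for illustration; not taken from scanutils.cpp.

// Original logic: 'c' cannot be 'x' and 'X' at the same time,
// so this function returns false for every input.
inline bool is_x_marker_buggy(int c) { return c == 'x' && c == 'X'; }

// Intended logic: accept either the lowercase or the uppercase letter.
inline bool is_x_marker_fixed(int c) { return c == 'x' || c == 'X'; }

// Small self-check covering both spellings.
inline void check_x_marker_examples() {
  assert(!is_x_marker_buggy('x') && !is_x_marker_buggy('X'));
  assert(is_x_marker_fixed('x') && is_x_marker_fixed('X'));
}
```

Because the original condition is never true, the extra call to fgetc() in streamtoumax() is never executed.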

Undefined behavior

I came across an interesting construct that I had never seen before:

void TabVector::Evaluate(....) {
  ....
  int num_deleted_boxes = 0;
  ....
  ++num_deleted_boxes = true;
  ....
}

PVS-Studio warning: V567 Undefined behavior. The 'num_deleted_boxes' variable is modified while being used twice between sequence points. tabvector.cpp 735

It is not clear what the author meant by this code. Most likely it is the result of a typo.

The result of the expression cannot be predicted: the variable 'num_deleted_boxes' may be incremented either before or after the assignment. The reason is that the variable is modified twice between two sequence points.

Other errors leading to undefined behavior are related to shift operations. Consider an example:

void Dawg::init(....)
{
  ....
  letter_mask_ = ~(~0 << flag_start_bit_);
  ....
}

PVS-Studio warning: V610 Undefined behavior. Check the shift operator '<<'. The left operand '~0' is negative. dawg.cpp 187

The expression '~0' has type 'int' and equals -1. Shifting a negative value to the left results in undefined behavior. If the program works correctly, that is pure luck and nothing more. You can fix the flaw by making '0' unsigned:

letter_mask_ = ~(~0u << flag_start_bit_);

However, that is not all. The analyzer generates another warning on this line:

V629 Consider inspecting the '~0 << flag_start_bit_' expression. Bit shifting of the 32-bit value with a subsequent expansion to the 64-bit type. dawg.cpp 187

The point is that the variable 'letter_mask_' has type 'uinT64'. As I understand it, it may be necessary to set bits in the upper 32 bits. In that case, the expression is incorrect: it works only with the lower 32 bits.

The '0' literal needs to have a 64-bit type. Corrected version:

letter_mask_ = ~(~0ull << flag_start_bit_);
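As a sketch of the corrected idea, the hypothetical helper below (not part of Tesseract) builds a mask of the lowest 'bits' bits entirely in 64-bit unsigned arithmetic. It also guards against a shift by 64, which would itself be undefined; that guard is my addition and is not something the original code does.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of Tesseract: returns a 64-bit value whose
// lowest 'bits' bits are set. 'bits' is expected to be in the range [0, 64].
inline std::uint64_t low_bits_mask(unsigned bits) {
  assert(bits <= 64);
  // Shifting a 64-bit value by 64 or more is undefined, so handle it apart.
  if (bits >= 64) return ~std::uint64_t(0);
  // The operand is unsigned and 64 bits wide, so the shift is well defined
  // and can reach the upper 32 bits.
  return ~(~std::uint64_t(0) << bits);
}
```

For example, low_bits_mask(40) yields 0xFFFFFFFFFF, a value that a 32-bit shift cannot produce in a well-defined way.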

Here is a list of other code fragments where negative numbers are shifted:

dawg.cpp 188
intmatcher.cpp 172
intmatcher.cpp 174
intmatcher.cpp 176
intmatcher.cpp 178
intmatcher.cpp 180
intmatcher.cpp 182
intmatcher.cpp 184
intmatcher.cpp 186
intmatcher.cpp 188
intmatcher.cpp 190
intmatcher.cpp 192
intmatcher.cpp 194
intmatcher.cpp 196
intmatcher.cpp 198
intmatcher.cpp 200
intmatcher.cpp 202
intmatcher.cpp 323
intmatcher.cpp 347
intmatcher.cpp 366

Suspicious double assignment

TESSLINE* ApproximateOutline(....) {
  EDGEPT *edgept;
  ....
  edgept = edgesteps_to_edgepts(c_outline, edgepts);
  fix2(edgepts, area);
  edgept = poly2 (edgepts, area);  // 2nd approximation.
  ....
}

PVS-Studio Warning: V519 The 'edgept' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 76, 78. polyaprx.cpp 78

Another similar case:

inT32 row_words2(....)
{
  ....
  this_valid = blob_box.width () >= min_width;
  this_valid = TRUE;
  ....
}

PVS-Studio Warning: V519 The 'this_valid' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 396, 397. wordseg.cpp 397

Invalid class member initialization sequence

First, consider the 'MasterTrainer' class. Note that the class member 'samples_' is located before the member 'fontinfo_table_':

class MasterTrainer {
  ....
  TrainingSampleSet samples_;
  ....
  FontInfoTable fontinfo_table_;
  ....
};

According to the standard, class members are initialized in the order in which they are declared in the class, regardless of the order in which they appear in the constructor's initializer list. This means that 'samples_' will be initialized BEFORE 'fontinfo_table_'.

Now consider the constructor:

MasterTrainer::MasterTrainer(NormalizationMode norm_mode,
                             bool shape_analysis,
                             bool replicate_samples,
                             int debug_level)
  : norm_mode_(norm_mode), samples_(fontinfo_table_),
    junk_samples_(fontinfo_table_),
    verify_samples_(fontinfo_table_),
    charsetsize_(0),
    enable_shape_anaylsis_(shape_analysis),
    enable_replication_(replicate_samples),
    fragments_(NULL), prev_unichar_id_(-1),
    debug_level_(debug_level)
{
}

The trouble is that 'fontinfo_table_', which has not been initialized yet, is used to initialize 'samples_'.

The same applies to the initialization of the fields 'junk_samples_' and 'verify_samples_' in this class.

I will not presume to say what the best fix for this class is. It might be enough to move the declaration of 'fontinfo_table_' to the very beginning of the class.
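The rule about declaration order is easy to observe in a small experiment. All names in the sketch below are hypothetical and exist only for this illustration; the initializer list deliberately mentions the members in the "wrong" order (many compilers warn about this).

```cpp
#include <cassert>
#include <string>
#include <vector>

// All names below are hypothetical and exist only for this illustration.

// Records the order in which Tracked objects are constructed.
inline std::vector<std::string>& construction_log() {
  static std::vector<std::string> log;
  return log;
}

struct Tracked {
  explicit Tracked(const char* name) { construction_log().push_back(name); }
};

struct Holder {
  // The initializer list mentions 'table_' first...
  Holder() : table_("table"), samples_("samples") {}

  // ...but 'samples_' is declared first, so it is constructed first.
  Tracked samples_;
  Tracked table_;
};

// Constructs a Holder and returns the observed construction order.
inline std::vector<std::string> observed_order() {
  construction_log().clear();
  Holder holder;
  (void)holder;
  assert(construction_log().size() == 2);
  return construction_log();
}
```

Calling observed_order() reports "samples" before "table", which matches the declaration order rather than the order in the initializer list.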

A typo in the condition

Such a typo is hard to spot, but the analyzer never gets tired.

class ScriptDetector {
  ....
  int korean_id_;
  int japanese_id_;
  int katakana_id_;
  int hiragana_id_;
  int han_id_;
  int hangul_id_;
  int latin_id_;
  int fraktur_id_;
  ....
};
void ScriptDetector::detect_blob(BLOB_CHOICE_LIST* scores) {
  ....
  if (prev_id == katakana_id_)
    osr_->scripts_na[i][japanese_id_] += 1.0;
  if (prev_id == hiragana_id_)
    osr_->scripts_na[i][japanese_id_] += 1.0;
  if (prev_id == hangul_id_)
    osr_->scripts_na[i][korean_id_] += 1.0;
  if (prev_id == han_id_)
    osr_->scripts_na[i][korean_id_] += kHanRatioInKorean;
  if (prev_id == han_id_)             // <=
    osr_->scripts_na[i][japanese_id_] += kHanRatioInJapanese;
  ....
}

PVS-Studio Warning: V581 The conditional expressions of the 'if' operators located alongside each other are identical. Check lines: 551, 553. osdetect.cpp 553

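Most likely, the last comparison should look like this:

if (prev_id == japanese_id_)

That said, the two branches add kHanRatioInKorean and kHanRatioInJapanese, so the repeated 'han_id_' check may also be deliberate. In that case, the two statements could simply share a single 'if'.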

Unnecessary checks

There is no need to check what the 'new' operator returns. If memory allocation fails, an exception is thrown. Of course, you can write a special 'new' operator that returns null pointers, but that is a separate case.

As a result, this function can be simplified:

void SetLabel(char_32 label) {
  if (label32_ != NULL) {
    delete []label32_;
  }
  label32_ = new char_32[2];
  if (label32_ != NULL) {
    label32_[0] = label;
    label32_[1] = 0;
  }
}

PVS-Studio Warning: V668 There is no sense in testing the 'label32_' pointer against null, as the memory was allocated using the 'new' operator. The exception will be generated in the case of memory allocation error. char_samp.h 73

There are 101 more places where the pointer returned by the 'new' operator is checked. I see no point in listing them in the article; it is easier to run PVS-Studio.
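For illustration, a simplified SetLabel() might look like the sketch below. The wrapper class, the 'char_32' typedef and the accessor are assumptions made so that the example is self-contained; the real class in char_samp.h has more members. The sketch also allocates the new buffer before releasing the old one, which is my own choice and not something the original code does.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: char_32 is a 32-bit character type; the real typedef may differ.
typedef std::uint32_t char_32;

// Hypothetical stand-in for the class that owns 'label32_'.
class LabelHolder {
 public:
  LabelHolder() : label32_(nullptr) {}
  ~LabelHolder() { delete[] label32_; }
  LabelHolder(const LabelHolder&) = delete;
  LabelHolder& operator=(const LabelHolder&) = delete;

  void SetLabel(char_32 label) {
    // Plain 'new' throws std::bad_alloc on failure, so no null check follows.
    char_32* fresh = new char_32[2];
    fresh[0] = label;
    fresh[1] = 0;
    // 'delete[]' on a null pointer is a no-op, so no check is needed either.
    delete[] label32_;
    label32_ = fresh;
    assert(label32_[0] == label);
  }

  // Hypothetical accessor added only so that the result can be inspected.
  const char_32* label() const { return label32_; }

 private:
  char_32* label32_;
};
```

Allocating first means that if 'new' throws, 'label32_' still points to the old, valid buffer instead of to freed memory.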

Conclusion

Use static analysis regularly, and you will free up a lot of time for tasks more useful than hunting down silly errors and typos.

And don't forget to follow me on Twitter: @Code_Analysis. I regularly post links to interesting C++ articles.

Have you read the article and have a question?

Readers often ask the same questions about our articles. We have collected the answers here: Answers to questions from readers of articles about PVS-Studio and CppCat, version 2014. Please take a look at the list.
