NCBI Genome Workbench: Scientific Research Under Threat

Modern computer technology, technical and software solutions - all this greatly facilitates and accelerates the conduct of various scientific studies. Often, computer simulation is the only way to test many theories. Scientific software has its own characteristics. For example, such software is often subjected to very thorough testing, but is poorly documented. However, software is written by people, and people make mistakes. Errors in scientific programs can call into question entire studies. This article will cover dozens of problems found in the code of the NCBI Genome Workbench software package.

Introduction

NCBI Genome Workbench offers researchers a wide range of tools for studying and analyzing genetic data. Users can explore and compare data from multiple sources, including NCBI (National Center for Biotechnology Information) databases or their own personal data.

As mentioned earlier, scientific software is usually well covered by unit tests. When checking this project, 85 directories with test files were excluded from the analysis. This is about a thousand files. Probably, this is due to the requirements for testing various complex algorithms that are invented for various studies. But the quality of the rest of the code (not test) is not at such a high level as we would like. However, as in any project that has not yet taken care of the implementation of static code analysis tools :).

Data for the review (or even research) of the code was provided by a static code analyzer for C / C ++ / C # / Java - PVS-Studio .

Just two digits that will ruin your project.

Based on our database of errors, which currently amounts to more than 12 thousand selected examples, we notice and describe special patterns of writing code that lead to numerous errors. For example, we conducted the following studies:

This project initiated the description of the new pattern. We are talking about the numbers 1 and 2 in the names of variables, for example, file1 and file2 , etc. It is very easy to confuse two such variables. This is a special case of typos in the code, but one error leads to the appearance of such errors - the desire to work with variables of the same name, differing only in numbers 1 and 2 at the end of the name.

Looking ahead a bit, I’ll say that all the listed studies were confirmed in the code of this project: D.

Consider the first example from the Genome Workbench project:

V501 There are identical sub-expressions '(! Loc1.IsInt () &&! Loc1.IsWhole ())' || operator. nw_aligner.cpp 480

CRef<CSeq_align> CNWAligner::Run(CScope &scope, const CSeq_loc &loc1,
                                 const CSeq_loc &loc2,
                                 bool trim_end_gaps)
{
  if ((!loc1.IsInt() && !loc1.IsWhole()) ||
      (!loc1.IsInt() && !loc1.IsWhole()))
  {
    NCBI_THROW(CException, eUnknown,
               "Only whole and interval locations supported");
  }
  ....
}

We see two variables, loc1 and loc2 . As well as errors in the code: variable loc2 not used, because instead of it once again used loc1 .

Another example:

V560 A part of conditional expression is always false: s1.IsSet (). valid_biosource.cpp 3073

staticbools_PCRPrimerSetLess(const CPCRPrimerSet& s1, const CPCRPrimerSet& s2){
  if (!s1.IsSet() && s1.IsSet()) {
    returntrue;
  } elseif (s1.IsSet() && !s2.IsSet()) {
    returnfalse;
  } elseif (!s1.IsSet() && !s2.IsSet()) {
    returnfalse;
  } elseif (s1.Get().size() < s2.Get().size()) {
    returntrue;
  } elseif (s1.Get().size() > s2.Get().size()) {
    returnfalse;
  } else {
  .....
}

The first line of the code mixed up the variables s1 and s2 . Based on the name, this is a comparison function. But such an error can be anywhere, because by calling the variables Number 1 and Number 2 , the programmer will almost certainly make a mistake in the future. And the more uses of such names in the function, the higher the probability of making a mistake.

Other typos and copy-paste

V501 There are identical sub-expressions to the left and the right! ”= 'Operator: bd.bit_.bits [i]! = Bd.bit_.bits [i] bm.h 296

boolcompare_state(const iterator_base& ib)const{
  ....
  if (this->block_type_ == 0
  {
    if (bd.bit_.ptr != ib_db.bit_.ptr) returnfalse;
    if (bd.bit_.idx != ib_db.bit_.idx) returnfalse;
    if (bd.bit_.cnt != ib_db.bit_.cnt) returnfalse;
    if (bd.bit_.pos != ib_db.bit_.pos) returnfalse;
    for (unsigned i = 0; i < bd.bit_.cnt; ++i)
    {
      if (bd.bit_.bits[i] != bd.bit_.bits[i]) returnfalse;
    }
  }
  ....
}

I suppose that after all the checks, the sizes of the arrays bits of the bd.bit_ and ib_db.bit_ objects are equal. Therefore, the author of the code wrote one cycle for element-by-element comparison of the bits arrays , but made a typo in the name of one of the compared objects. As a result, the compared objects can be mistakenly considered equal in some situations.

This example is worthy of the article " Evil lives in comparison functions ."

V501 There are identical sub-expressions 'CFieldHandler :: QualifierNamesAreEquivalent (field, kFieldTypeSeqId)' to the left | operator. field_handler.cpp 152

bool CFieldHandlerFactory::s_IsSequenceIDField(conststring& field)
{
  if (   CFieldHandler::QualifierNamesAreEquivalent(field, kFieldTypeSeqId)
      || CFieldHandler::QualifierNamesAreEquivalent(field, kFieldTypeSeqId)) {
    returntrue;
  } else {
    returnfalse;
  }
}

Most likely, one of the checks is superfluous. I did not find in the code variables similar to kFieldTypeSeqId . However, there may be an extra function call due to the operator "||", which degrades performance.

A couple of similar places with a warning analyzer that need to be checked:

V501 There are identical sub-expressions "uF-> GetData (). IsBool ()" variation_utils.cpp 1711
V501 There are identical sub-expressions "uF-> GetData (). IsBool ()" variation_utils.cpp 1735

V766 An item with the same key 'kArgRemote' has already been added. blast_args.cpp 3262

void
CBlastAppArgs::x_IssueWarningsForIgnoredOptions(const CArgs& args)
{
  set<string> can_override;
  ....
  can_override.insert(kArgOutputFormat);
  can_override.insert(kArgNumDescriptions);
  can_override.insert(kArgNumAlignments);
  can_override.insert(kArgMaxTargetSequences);
  can_override.insert(kArgRemote);               // <=
  can_override.insert(kArgNumThreads);
  can_override.insert(kArgInputSearchStrategy);
  can_override.insert(kArgRemote);               // <=
  can_override.insert("remote_verbose");
  can_override.insert("verbose");
  ....
}

The analyzer detected the addition of 2 identical values to the set container . Recall that this container stores only unique values, so duplicates are not added to it.

Code like the one above is often written using the copy-paste method. There may just be an extra value, or perhaps the author forgot to rename one of the variables when copied. When removing an extra call, the insert code is slightly optimized, which, however, is not essential. It is much more important that there may be a serious mistake hiding due to a missing element in the set.

V523 The 'then' statement is the subsequent code fragment. vcf_reader.cpp 1105

bool
CVcfReader::xAssignFeatureLocationSet(....)
{
  ....
  if (data.m_SetType == CVcfData::ST_ALL_DEL) {
    if (data.m_strRef.size() == 1) {
      //deletion of a single base
      pFeat->SetLocation().SetPnt().SetPoint(data.m_iPos-1);
      pFeat->SetLocation().SetPnt().SetId(*pId);
    }
    else {
      pFeat->SetLocation().SetInt().SetFrom(data.m_iPos-1);
      //-1 for 0-based, //another -1 for inclusive end-point ( i.e. [], not [) )
      pFeat->SetLocation().SetInt().SetTo( 
         data.m_iPos -1 + data.m_strRef.length() - 1); 
      pFeat->SetLocation().SetInt().SetId(*pId);
    }
    returntrue;
  }
  //default: For MNV's we will use the single starting point//NB: For references of size >=2, this location will not//match the reference allele.  Future Variation-ref//normalization code will address these issues,//and obviate the need for this code altogether.if (data.m_strRef.size() == 1) {
    //deletion of a single base
    pFeat->SetLocation().SetPnt().SetPoint(data.m_iPos-1);
    pFeat->SetLocation().SetPnt().SetId(*pId);
  }
  else {
    pFeat->SetLocation().SetInt().SetFrom(data.m_iPos-1);
    pFeat->SetLocation().SetInt().SetTo( 
      data.m_iPos -1 + data.m_strRef.length() - 1); 
    pFeat->SetLocation().SetInt().SetId(*pId);
  }
  returntrue;
}

The function contains large and completely identical code fragments. However, they contain different accompanying comments. The code is not optimally written, complicated, and possibly contains an error.

The entire list of suspicious places with an if-else operator looks like this:

V523 The 'then' statement is equivalent to the 'else' statement. blk.c 2142
V523 The 'then' statement is the subsequent code fragment. odbc.c 379
V523 The 'then' statement is the subsequent code fragment. odbc.c 1414
V523 The 'then' statement is equivalent to the 'else' statement. seqdbvol.cpp 1922
V523 The 'then' statement is equivalent to the 'else' statement. seqdb_demo.cpp 466
V523 The 'then' statement is the subsequent code fragment. blast_engine.c 1917
V523 The 'then' statement is equivalent to the 'else' statement. blast_filter.c 420
V523 The 'then' statement is equivalent to the 'else' statement. blast_parameters.c 636
V523 The 'then' statement is equivalent to the 'else' statement. unordered_spliter.cpp 684
V523 The 'then' statement is equivalent to the 'else' statement. bme.cpp 333
V523 The 'then' statement is equivalent to the 'else' statement. gme.cpp 484

/ * with security is best be pedantic * /

V597 The compiler couldn’t need the function call, which is used to flush 'passwd_buf' buffer. The memset_s () function should be used to erase the private data. challenge.c 366

/**
 * Crypt a given password using schema required for NTLMv1 authentication
 * @param passwd clear text domain password
 * @param challenge challenge data given by server
 * @param flags NTLM flags from server side
 * @param answer buffer where to store crypted password
 */voidtds_answer_challenge(....){
#define MAX_PW_SZ 14
  ....
  if (ntlm_v == 1) {
    ....
    /* with security is best be pedantic */memset(hash, 0, sizeof(hash));
    memset(passwd_buf, 0, sizeof(passwd_buf));
    memset(ntlm2_challenge, 0, sizeof(ntlm2_challenge));
  } else {
    ....
  }
}

As you probably already guessed, in the section title an amusing comment about security from the code is used.

In short, the memset function will be deleted by the compiler, because cleared buffers are no longer used. And data such as hash or passwd_buf , in fact, will not be overwritten with zeros. For more information about this unobvious compiler mechanism, see the article " Safely cleans private data ."

V597 The compiler couldn’t delete the memset function call, which is used to flush the answer object. The memset_s () function should be used to erase the private data. challenge.c 561

static TDSRET
tds7_send_auth(....){
  ....
  /* for security reason clear structure */memset(&answer, 0, sizeof(TDSANSWER));
  return tds_flush_packet(tds);
}

That was not the only example with comments about "security." Judging by the comments, we can assume that security is really important for the project. Therefore, I’m enclosing a whole list of problems that are not small:

V597 The compiler couldn’t delete the memset function call, which is used to flush the heap object. The memset_s () function should be used to erase the private data. ncbi_heapmgr.c 1300
V597 The compiler couldn’t delete the memset function call, which is used to flush the context. The memset_s () function should be used to erase the private data. challenge.c 167
V597 The compiler couldn’t delete the memset function call, which is used to flush the ks object. The memset_s () function should be used to erase the private data. challenge.c 339
V597 The compiler could delete the memset function call, which is used to flush the md5_ctx object. The memset_s () function should be used to erase the private data. challenge.c 353
V597 The compiler couldn’t delete the memset function call, which is used to flush 'hash' buffer. The memset_s () function should be used to erase the private data. challenge.c 365
V597 The compiler couldn’t delete the memset function call, which is used to flush the ks object. The memset_s () function should be used to erase the private data. challenge.c 406
V597 The compiler couldn’t make it the function call, which is used to flush 'ntlm_v2_response' object. The memset_s () function should be used to erase the private data. login.c 795
V597 The compiler couldn’t delete the memset function call, which is used to flush the answer object. The memset_s () function should be used to erase the private data. login.c 801
V597 The compiler couldn’t have the function call, which is used to flush 'packet' buffer. The memset_s () function should be used to erase the private data. numeric.c 256
V597 The compiler couldn’t have the function call, which is used to flush 'packet' buffer. The memset_s () function should be used to erase the private data. numeric.c 110
V597 The compiler couldn’t delete the memset function call, which is used to flush the pwd buffer. The memset_s () function should be used to erase the private data. getpassarg.c 50
V597 The compiler couldn’t delete the memset function call, which is used to flush the context. The memset_s () function should be used to erase the private data. challenge.c 188
V597 The compiler couldn’t delete the memset function call, which is used to flush the buf buffer. The memset_s () function should be used to erase the private data. challenge.c 243
V597 The compiler couldn’t make it the function call, which is used to flush 'ntlm_v2_hash' buffer. The memset_s () function should be used to erase the private data. challenge.c 309
V597 The compiler could delete the memset function call, which is used to flush the md5_ctx object. The memset_s () function should be used to erase the private data. challenge.c 354
V597 The compiler couldn’t need the function call, which is used to flush 'passwd_buf' buffer. The memset_s () function should be used to erase the private data. challenge.c 380
V597 The compiler couldn’t delete the memset function call, which is used to flush the ks object. The memset_s () function should be used to erase the private data. challenge.c 393
V597 The compiler couldn’t delete the memset function call, which is used to flush 'hash' buffer. The memset_s () function should be used to erase the private data. challenge.c 394
V597 The compiler couldn’t have the function call, which is used to flush the ntlm2_challenge buffer. The memset_s () function should be used to erase the private data. challenge.c 395
V597 The compiler couldn’t delete the memset function call, which is used to flush the ks object. The memset_s () function should be used to erase the private data. challenge.c 419
V597 The compiler couldn’t make it the function call, which is used to flush 'ntlm_v2_response' object. The memset_s () function should be used to erase the private data. challenge.c 556

Suspicious cycles

V534 It is likely that a variable is being compared inside the 'for' operator. Consider reviewing 'i'. taxFormat.cpp 569

void CTaxFormat::x_LoadTaxTree(void)
{
  ....
  for(size_t i = 0; i < alignTaxids.size(); i++) {
    int tax_id = alignTaxids[i];
    ....
    for(size_t j = 0; i < taxInfo.seqInfoList.size(); j++) {
      SSeqInfo* seqInfo = taxInfo.seqInfoList[j];
      seqInfo->taxid = newTaxid;
    }
    ....
  }
  ....
}

I think, in the condition of the inner cycle, variable i was randomly crammed. Instead, the variable j should be used .

V535 The variable 'i' is being used for the loop. Check lines: 302, 309. sls_alp.cpp 309

alp::~alp()
{
  ....
  if(d_alp_states)
  {
    for(i=0;i<=d_nalp;i++)           // <=
    {
      if(i<=d_alp_states->d_dim)
      {
        if(d_alp_states->d_elem[i])
        {
          for(i=0;i<=d_nalp;i++)     // <=
          {
            ....
  ....
}

Two nested identical cycles, in which the global counter is also reset, look well, very suspiciously. Developers should check out what's going on here.

Abnormal Array Indexing

V520 The comma operator ',' in array index expression '[- i2, - k]'. nw_spliced_aligner16.cpp 564

void CSplicedAligner16::x_DoBackTrace (
    const Uint2* backtrace_matrix, CNWAligner::SAlignInOut* data,
    int i_global_max, int j_global_max)
{
  ....
  while(intron_length < m_IntronMinSize || (Key & donor) == 0) {
      Key = backtrace_matrix[--i2, --k];
      ++intron_length;
      data->m_transcript.push_back(eTS_Intron);
  }
  ....
}

At once I will say that there seems to be no error here (so far, lol). Consider the following line:

Key = backtrace_matrix[--i2, --k];

The word 'matrix' and double indexing may suggest that the array is two-dimensional, but it is not. This is a regular pointer to an array of integers. But the diagnosis of the V520 did not just appear. Programmers are really confused about how to index two-dimensional arrays.

In this case, the author simply decided to save on a single line of code, although he could write this:

--i2;
Key = backtrace_matrix[--k];

V661 A suspicious expression 'A [B == C]'. Probably meant 'A [B] == C'. ncbi_service_connector.c 180

static EHTTP_HeaderParse s_ParseHeader(constchar* header, ....){
  ....
  if (sscanf(header, "%u.%u.%u.%u%n", &i1, &i2, &i3, &i4, &n) < 4
      ||  sscanf(header + n, "%hu%x%n", &uuu->port, &tkt, &m) < 2
      || (header[m += n]  &&  !(header[m] == '$')  &&
          !isspace((unsignedchar)((header + m)
                                   [header[m] == '$'])))) {
      break/*failed - unreadable connection info*/;
  }
  ....
}

Another example of code in which I tried for a long time to understand what was happening: D. The isspace () function checks the character with the index m , but if this character is '$', then the character with the index m + 1 is passed to the function . In this case, the comparison with '$' was already in advance. Perhaps there is no error here, but the code can be accurately rewritten more clearly.

V557 Array overrun is possible. The 'row' index is pointing beyond array bound. aln_reader.cpp 412

bool CAlnReader::x_IsGap(TNumrow row, TSeqPos pos, conststring& residue)
{
  if (m_MiddleSections.size() == 0) {
    x_CalculateMiddleSections();
  }
  if (row > m_MiddleSections.size()) {
      returnfalse;
  }
  if (pos < m_MiddleSections[row].first) {
    ....
  }
  ....
}

Here there is a serious mistake. The correct row index check should be like this:

if (row >= m_MiddleSections.size()) {
  returnfalse;
}

Otherwise, it is possible to access data outside of the MiddleSections vector .

Many more places like this:

V557 Array overrun is possible. The 'i' index is pointing beyond array bound. resource_pool.hpp 388
V557 Array overrun is possible. The 'row' index is pointing beyond array bound. aln_reader.cpp 418
V557 Array overrun is possible. The 'fmt_idx' index is pointing beyond array bound. seq_writer.cpp 384
V557 Array overrun is possible. The 'fmt_idx' index is pointing beyond array bound. blastdb_formatter.cpp 183
V557 Array overrun is possible. The 'num' index is beyond array bound. newcleanupp.cpp 13035

How to earn distrust of functions

V570 The 'm_onClickFunction' variable is assigned to itself. alngraphic.hpp 103

voidSetOnClickFunctionName(string onClickFunction){
  m_onClickFunction = m_onClickFunction;
}

Here there is nothing to even comment on. One can only sympathize with the person who clicked something, clicked, but nothing has changed.

Two more cases of assigning variables to myself will give a list:

V570 The 'iter-> level' variable is assigned to itself. align_format_util.cpp 189
V570 The 'd_elements_values [ind]' variable is assigned to itself. sls_alp_data.cpp 1416

V763 Parameter 'w1' is always rewritten in function body before being used. bmfunc.h 5363

/// Bit COUNT functortemplate<typename W> structbit_COUNT
{W operator()(W w1, W w2){
    w1 = 0;
    BM_INCWORD_BITCOUNT(w1, w2);
    return w1;
  }
};

The function in which the argument is ground immediately upon entering the function may be misleading by the developers using it. The code should be double-checked.

Errors when designing classes

V688 The 'm_qsrc' function argument. compart_matching.cpp 873

classCElementaryMatching:public CObject
{
  ....
  ISequenceSource * m_qsrc;
  ....
  voidx_CreateIndex(ISequenceSource *m_qsrc, EIndexMode index_more, ....);
  voidx_CreateRemapData(ISequenceSource *m_qsrc, EIndexMode mode);
  voidx_LoadRemapData(ISequenceSource *m_qsrc, conststring& sdb);
  ....
};

Immediately 3 functions of the class contain arguments whose names coincide with the field of the class. This can lead to errors in the function bodies: the programmer may think that he is working with a member of a class, actually changing the value of a local variable.

V614 Uninitialized variable 'm_BitSet' used. SnpBitAttributes.hpp 187

/// SNP bit attribute container.classCSnpBitAttributes
{public:
  ....
private:
  /// Internal storage for bits.
  Uint8 m_BitSet;
};
inline CSnpBitAttributes::CSnpBitAttributes(Uint8 bits) : m_BitSet(bits)
{
}
inline CSnpBitAttributes::CSnpBitAttributes(constvector<char>& octet_string)
{
  auto count = sizeof(m_BitSet);
  auto byte = octet_string.end();
  do
    m_BitSet = (m_BitSet << 8) | *--byte;
  while (--count > 0);
}

One of the constructors works inappropriately with the m_BitSet variable . The fact is that the variable is uninitialized. Its “garbage” value is used at the first iteration of the loop, after which initialization occurs. This is a very serious mistake, leading to an undefined program behavior.

V603 The object was not used. If you wish to call the constructor, 'this-> SIntervalComparisonResult :: SIntervalComparisonResult (....)' should be used. compare_feats.hpp 100

//This struct keeps the result of comparison of two exonsstructSIntervalComparisonResult : CObject
{
public:
  SIntervalComparisonResult(unsigned pos1, unsigned pos2,
                            FCompareLocs result, int pos_comparison = 0) 
  : m_exon_ordinal1(pos1), m_exon_ordinal2(pos2),
    m_result(result), m_position_comparison(pos_comparison) {}
  SIntervalComparisonResult()
  {
    SIntervalComparisonResult(0, 0, fCmp_Unknown, 0);
  }
  ....
};

A long time ago I did not encounter such errors when checking projects. But the problem is still relevant. The mistake is that calling a parameterized constructor in this way results in the creation and deletion of a temporary object. And class fields remain uninitialized. Another constructor should be called via the initialization list (see Delegating constructor ).

V591 Non-void function should return a value. bio_tree.hpp 266

/// Recursive assignment
CBioNode& operator=(const CBioNode& tree)
{
  TParent::operator=(tree);
  TBioTree* pt = (TBioTree*)tree.GetParentTree();
  SetParentTree(pt);
}

The analyzer considers that there is not enough line in the overloaded operator:

return *this;

V670 The uninitialized class member 'm_OutBlobIdOrData' is used to initialize the 'm_StdOut' member. Remember that members are initialized in the order of declarations inside a class. remote_app.hpp 215

classNCBI_XCONNECT_EXPORTCRemoteAppResult
{public:
  CRemoteAppResult(CNetCacheAPI::TInstance netcache_api,
          size_t max_inline_size = kMaxBlobInlineSize) :
      m_NetCacheAPI(netcache_api),
      m_RetCode(-1),
      m_StdOut(netcache_api, m_OutBlobIdOrData, m_OutBlobSize),
      m_OutBlobSize(0),
      m_StdErr(netcache_api, m_ErrBlobIdOrData, m_ErrBlobSize),
      m_ErrBlobSize(0),
      m_StorageType(eBlobStorage),
      m_MaxInlineSize(max_inline_size)
  {
  }
  ....
};

On this fragment of the code 3 warnings of the analyzer are issued immediately. Class fields are initialized not in the order in which they are listed in the initialization list, but in how they are declared in the class. The classic reason for the error is that not all programmers remember or know about this rule. Here and in the initialization list is just the wrong order. One gets the feeling that the list of fields was entered in a random order.

V746 Object slicing. By reference rather than by value. cobalt.cpp 247

void 
CMultiAligner::SetQueries(constvector< CRef<objects::CBioseq> >& queries)
{
  ....
  try {
    seq_loc->SetId(*it->GetSeqId());
  }
  catch (objects::CObjMgrException e) {
    NCBI_THROW(CMultiAlignerException, eInvalidInput,
               (string)"Missing seq-id in bioseq. " + e.GetMsg());
  }
  m_tQueries.push_back(seq_loc);
  ....
}

Intercepting exceptions by value can result in the loss of some exception information due to the creation of a new object. It is much better and safer to catch an exception by reference.

Similar places:

V746 Object slicing. By reference rather than by value. agp_validate_reader.cpp 562
V746 Object slicing. By reference rather than by value. aln_build_app.cpp 320
V746 Object slicing. By reference rather than by value. aln_test_app.cpp 458
V746 Object slicing. By reference rather than by value. cobalt.cpp 691
V746 Object slicing. By reference rather than by value. cobalt.cpp 719
V746 Object slicing. By reference rather than by value. cobalt.cpp 728
V746 Object slicing. By reference rather than by value. cobalt.cpp 732

About unreachable code and other code execution issues

V779 Unreachable code detected. It is possible that an error is present. merge_tree_core.cpp 627

bool CMergeTree::x_FindBefores_Up_Iter(....)
{
    ....
    FirstFrame->Curr = StartCurr;
    FirstFrame->Returned = false;
    FirstFrame->VisitCount = 0;
    FrameStack.push_back(FirstFrame);
    while(!FrameStack.empty()) {
        ....
        if(Rel == CEquivRange::eAfter) {
            Frame->Returned = false;
            FrameStack.pop_back();
            continue;
        } 
        elseif(Rel == CEquivRange::eBefore) {
            ....
            continue;
        }
        else {
            if(Frame->VisitCount == 0) {
                ....
                continue;
            } else {
                ....
                continue;
            }
        }
        Frame->Returned = false; // <=
        FrameStack.pop_back();
        continue;
    }  // end stack loop
    FirstFrame->ChildFrames.clear();
    return FirstFrame->Returned;
}

The code of the conditional operator is written in such a way that absolutely all branches of the code end with the continue operator . This led to the fact that in a while loop several lines of unreachable code were formed. These lines look very suspicious. Most likely, such a problem arose after the code was refactored, and now a careful code-review is required here.

V519 The 'interval interval' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 454, 456. aln_writer.cpp 456

void CAlnWriter::AddGaps(....)
{
  ....
  switch(exon_chunk->Which()) {
  case CSpliced_exon_chunk::e_Match:
      interval_width = exon_chunk->GetMatch();
  case CSpliced_exon_chunk::e_Mismatch:
      interval_width = exon_chunk->GetMismatch();
  case CSpliced_exon_chunk::e_Diag:
      interval_width = exon_chunk->GetDiag();
      genomic_string.append(....);
      product_string.append(....);
      genomic_pos += interval_width;
      product_pos += interval_width/res_width;
      break;
    ....
  }
  ....
}

The variable interval_width is overwritten several times, because There are no break statements in the case branches . Though a classic, but very bad mistake. A few more suspicious places:

V779 Unreachable code detected. It is possible that an error is present. dbapi_driver_utils.cpp 351
V779 Unreachable code detected. It is possible that an error is present. net.c 780
V779 Unreachable code detected. It is possible that an error is present. bcp.c 1495
V779 Unreachable code detected. It is possible that an error is present. remote_blast.cpp 1470
V779 Unreachable code detected. It is possible that an error is present. remote_blast.cpp 1522

V571 Recurring check. The 'if (m_QueryOpts-> filtering_options)' condition was already verified in line 703. blast_options_local_priv.hpp 713

inlinevoid
CBlastOptionsLocal::SetFilterString(constchar* f)
{
  ....
  if (m_QueryOpts->filtering_options)      // <=
  {
    SBlastFilterOptions* old_opts = m_QueryOpts->filtering_options;
    m_QueryOpts->filtering_options = NULL;
    SBlastFilterOptionsMerge(&(m_QueryOpts->filtering_options),
      old_opts, new_opts);
    old_opts = SBlastFilterOptionsFree(old_opts);
    new_opts = SBlastFilterOptionsFree(new_opts);
  } 
  else
  {
     if (m_QueryOpts->filtering_options)   // <=
         m_QueryOpts->filtering_options = 
             SBlastFilterOptionsFree(m_QueryOpts->filtering_options);
     m_QueryOpts->filtering_options = new_opts;
     new_opts = NULL;
  }
  ....
}

Obviously, the else branch requires rewriting. I have a few ideas on what they wanted to do with the m_QueryOpts-> filtering_options pointer , but the code is still kind of tangled. I appeal to the authors of the code.

Well, the problem does not come alone:

V571 Recurring check. The 'if (sleeptime)' condition was already verified in line 205. request_control.cpp 208
V571 Recurring check. The 'if (assignValue.empty ())' condition was already verified in line 712. classstr.cpp 718

Read errors

V739 EOF should not be compared with a value of the 'char' type. The 'linestring [0]' should be of the 'int' type. alnread.c 3509

static EBool
s_AfrpInitLineData(
  ....
  char* linestring = readfunc (pfile);
  ....
  while (linestring != NULL  &&  linestring [0] != EOF) {
    s_TrimSpace (&linestring);
    ....
  }
  ....
}

Characters to be compared with EOF should not be stored in variables of type char . Otherwise, there is a risk that a character with a value of 0xFF (255) becomes -1 and will be interpreted in the same way as the end of file (EOF). Also (just in case) it is worth checking the implementation of the readfunc function .

V663 Infinite loop is possible. The 'cin.eof ()' condition is not a loop from the loop. Consider adding the 'cin.fail ()' function call to the conditional expression. ncbicgi.cpp 1564

typedefstd::istream CNcbiIstream;
void CCgiRequest::Serialize(CNcbiOstream& os) const
{
  ....
  CNcbiIstream* istrm = GetInputStream();
  if (istrm) {
    char buf[1024];
    while(!istrm->eof()) {
      istrm->read(buf, sizeof(buf));
      os.write(buf, istrm->gcount());
    }
  }
}

The analyzer has detected a potential error due to which an infinite loop may occur. In the event of a failure in reading the data, a call to the eof () function will always return false . To complete the loop in this case, additional checking of the value returned by the function fail () is necessary .

Mistakes

V502 Perhaps the '?:' Operator was different. The '?:' Operator has a operator. ncbi_connutil.c 1135

staticconstchar* x_ClientAddress(constchar* client_host,
                                   int/*bool*/ local_host){
  ....
  if ((client_host == c  &&  x_IsSufficientAddress(client_host))
      ||  !(ip = *c  &&  !local_host
            ? SOCK_gethostbyname(c)
            : SOCK_GetLocalHostAddress(eDefault))
      ||  SOCK_ntoa(ip, addr, sizeof(addr)) != 0
      ||  !(s = (char*) malloc(strlen(client_host) + strlen(addr) + 3))) {
      return client_host/*least we can do :-/*/;
  }
  ....
}

Pay attention to the expression:

!local_host ? SOCK_gethostbyname(c) : SOCK_GetLocalHostAddress(eDefault)

It is not calculated as the programmer expected, because the whole expression looks like this:

ip = *c  && !local_host ? SOCK_gethostbyname(c) : SOCK_GetLocalHostAddress(...)

The priority of && operator is higher than that of ?: . For this reason, the code is not executed as intended.

V561 It's probably better to assign value to 'seq' variable than to declare it anew. Previous declaration: validator.cpp, line 490. validator.cpp 492

bool CValidator::IsSeqLocCorrectlyOrdered(const CSeq_loc& loc, CScope& scope)
{
  CBioseq_Handle seq;
  try {
    CBioseq_Handle seq = scope.GetBioseqHandle(loc);
  } catch (CObjMgrException& ) {
    // no way to tellreturntrue;
  } catch (const exception& ) {
    // no way to tellreturntrue;
  }
  if (seq  &&  seq.GetInst_Topology() == CSeq_inst::eTopology_circular) {
    // no way to check if topology is circularreturntrue;
  }
  return CheckConsecutiveIntervals(loc, scope, x_IsCorrectlyOrdered);
}

Due to the fact that the programmer declared the new variable seq inside the try / catch section, the other variable seq remains uninitialized and is used below by code.

V562 It's a bit to compare value with a value of 0: ((((status) & 0x7f) == 0)! = 0. ncbi_process.cpp 111

bool CProcess::CExitInfo::IsExited(void) const
{
    EXIT_INFO_CHECK;
    if (state != eExitInfo_Terminated) {
        returnfalse;
    }
#if   defined(NCBI_OS_UNIX)return WIFEXITED(status) != 0;
#elif defined(NCBI_OS_MSWIN)// The process always terminates with exit codereturntrue;
#endif
}

Nothing foreshadowed trouble, but WIFEXITED turned out to be a macro, opening like this:

return (((status) & 0x7f) == 0) != 0;

It turns out that the function returns the opposite value.

The code found another such function for which a warning was issued:

V562 It's a bit to compare the value of a bool type with a value of 0. ncbi_process.cpp 126

V595 The 'dst_len' pointer was used before it was verified against nullptr. Check lines: 309, 315. zlib.cpp 309

bool CZipCompression::CompressBuffer(
  constvoid* src_buf, size_t  src_len,
  void*       dst_buf, size_t  dst_size,
  /* out */size_t* dst_len)
{
  *dst_len = 0;
  // Check parametersif (!src_len  &&  !F_ISSET(fAllowEmptyData)) {
    src_buf = NULL;
  }
  if (!src_buf || !dst_buf || !dst_len) {
    SetError(Z_STREAM_ERROR, "bad argument");
    ERR_COMPRESS(48, FormatErrorMessage("CZipCompression::CompressBuffer"));
    returnfalse;
  }
  ....
}

The dst_len pointer is dereferenced at the very beginning of the function, while further on the code is checked for equality to zero. An error is made in the code that leads to undefined behavior if the dst_len pointer turns out to be nullptr .

V590 Consider inspecting the 'ch! =' \ 0 '&& ch ==' '' expression. The expression is misprint. cleanup_utils.cpp 580

boolAsn2gnbkCompressSpaces(string& val){
  ....
  while (ch != '\0' && ch == ' ') {
    ptr++;
    ch = *ptr;
  }
  ....
}

The condition for stopping the loop depends only on whether the character ch is a space or not. The expression can be simplified to this:

while (ch == ' ') {
  ....
}

Conclusion

The use of computer programs in scientific research helps and will help to make discoveries. Let's hope that the most important of them will not be missed because of any typos.

I invite the developers of the NCBI Genome Workbench project to contact us, and we will provide a full report issued by the PVS-Studio analyzer.

I hope that this small code study will help to correct many errors and, in general, improve the reliability of the project. Try running PVS-Studio on the code of your projects, if you haven’t done so yet. You might like it :).

If you want to share this article with an English-speaking audience, then please use the link to the translation: Svyatoslav Razmyslov. NCBI Genome Workbench: Scientific Research under Threat

Tags:

﻿NCBI Genome Workbench: Scientific Research Under Threat