Valgrind is good, but not enough

    Not so long ago we tried to demonstrate the benefits of the PVS-Studio static analyzer to one of the companies. Nothing came of it. But in the course of the correspondence, I prepared a detailed answer about the methodologies of static and dynamic analysis. Now I have decided to publish this answer as a small article. I think readers may find the text interesting, and we may well use this article when talking to new potential customers.

    So, in the course of the correspondence, a question was asked that went roughly as follows:

    We have already experimented with static analyzers and concluded that their accuracy is much worse than that of plain valgrind. So it is unclear what static analysis is for. It produces a lot of false positives and usually finds no errors beyond those that a valgrind run already catches.

    I prepared the following answer, which I quote here with only minor corrections:

    It is quite difficult to demonstrate the advantages of static analysis on two small projects. Firstly, they are of high quality. Secondly, static analysis is primarily aimed at finding and eliminating errors in new code. Thirdly, the error density in small projects is lower than in large ones (explanation).

    Searching for something in long-stable code is a rather thankless task. The point of static analysis is to prevent many errors at the earliest stages. Yes, most of these errors can be found by other methods: they will be noticed by the programmer himself, by extensive tests, or by testers. In the worst case, users will report them. But in any case, that is a waste of time. Many typos, Copy-Paste errors, and other defects can be eliminated early on with static analysis. Its main value is finding many errors right after the code is written. Finding an error at any later stage is many times more expensive.

    For some reason, at this point everyone immediately claims that their programmers do not make typos or Copy-Paste mistakes. That is not true. They do. Everyone does: http://www.viva64.com/en/b/0260/

    Well, suppose we have convinced you that static analysis can find some errors. A fair question arises: is it needed at all when there are tools like valgrind? After all, they really do produce fewer false positives.

    Unfortunately, yes, it is needed. There is no single technology that can detect all types of errors. Sad as it is, to improve quality you have to use different types of tools so that they complement each other.

    We have already written about how static analysis helps other technologies. For example, we described the difference between static and dynamic code analysis in this note: http://www.viva64.com/en/b/0248/

    Here is another note on how static analysis complements unit testing: http://www.viva64.com/en/a/0080/

    However, so as not to stay completely abstract, let us try to illustrate the difference between static and dynamic analysis with examples. For instance, take this interesting fragment from the constructor of the SlowScanner class:
    class SlowScanner {
      ....
      explicit SlowScanner(Fsm& fsm)
      {
        ....
        Fill(m_letters,
             m_letters + sizeof(m_letters)/sizeof(*m_letters), 0);
        ....
      }
      ....
      size_t* m_letters;
      ....
    };

    The PVS-Studio analyzer generates a warning: V514 Dividing sizeof a pointer 'sizeof (m_letters)' by another value. There is a probability of logical error presence. slow.h 238

    Most likely, the 'm_letters' class member was once an array of fixed size. This is just an assumption, of course, but a very likely one. For example, at first it was probably written something like this: size_t m_letters[MAX_COUNT];. Back then, this expression correctly computed the array's size:
    sizeof(m_letters)/sizeof(*m_letters)

    Then the array became dynamic, and the 'm_letters' variable turned into a plain pointer. Now the expression "sizeof(m_letters) / sizeof(*m_letters)" always evaluates to one. On a 32-bit system, both a pointer and the size_t type are 4 bytes; on a 64-bit system, both are 8 bytes. But whether we divide 4 by 4 or 8 by 8, we always get 1.

    Thus, the Fill() function zeros only one element of the array. The error may not manifest itself at all if the memory happens to be already zeroed or if the uninitialized elements are never used. That is what makes it so insidious. But a read of an uninitialized array element may well occur.
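
    To make the effect tangible, here is a minimal sketch (the names are illustrative, not taken from the project):

    #include <cstddef>
    #include <cstdio>

    int main()
    {
      std::size_t fixed[16];        // what m_letters used to be
      std::size_t *dynamic = fixed; // what m_letters became after refactoring

      // For a real array the expression yields the element count: 16.
      std::printf("%zu\n", sizeof(fixed) / sizeof(*fixed));

      // For a pointer it divides the pointer size by sizeof(std::size_t),
      // which gives 1 on both 32-bit and 64-bit platforms.
      std::printf("%zu\n", sizeof(dynamic) / sizeof(*dynamic));
      return 0;
    }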

    Can a dynamic analyzer find this error? I do not know for sure. Perhaps it can detect a read from uninitialized memory. Then why did it stay silent? Here we come to one of the important differences between static and dynamic analysis.

    Most likely, this code branch is rarely used, or at least it is not covered by the tests. As a result, the dynamic analyzer simply never checks this code and does not see the error. This is the weak point of dynamic analysis: it is hard to cover all possible code branches, so rarely used code remains unchecked, especially handlers for errors and other non-standard situations.

    Static analysis checks all branches that could theoretically receive control. Therefore, it can find errors regardless of how often a given piece of code is executed.

    By the way, let us digress a little from the topic. Besides our analyzer, we also offer code audit services. The result of such an audit is a document with a set of recommendations for improvement, which can well be included in a coding standard. We already have experience in this kind of work. For example, to avoid errors related to computing an array's size, we would recommend a special technique (borrowed from Chromium):
    template <typename T, size_t N>
    char (&ArraySizeHelper(T (&array)[N]))[N];
    #define arraysize(array) (sizeof(ArraySizeHelper(array)))

    The 'arraysize' macro cannot be applied to an ordinary pointer; a compilation error will occur. This protects us from accidental errors: if an array suddenly turns into a pointer, the compiler will not let the place where its size is computed slip through unnoticed.
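
    A quick usage sketch (illustrative names, assuming the macro above is in scope):

    int values[8];
    int *ptr = values;

    size_t n1 = arraysize(values); // fine: n1 == 8
    size_t n2 = arraysize(ptr);    // compilation error: 'ptr' is not an array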

    Let us return to static and dynamic analysis. Take a look at this function, for example:
    inline RECODE_RESULT _rune2hex(wchar32 in,
      char* out, size_t out_size, size_t &out_writed)
    {
        static const char hex_digs[]="0123456789ABCDEF";
        bool leading = true;
        out_writed = 0;
        RECODE_RESULT res = RECODE_OK;
        for (int i = 7; i >=0; i--){
            unsigned char h = (unsigned char)(in>>(i*4) & 0x0F);
            if (h || !leading || i==0){
                if (out_writed + 1 >= out_size){
                    res = RECODE_EOOUTPUT;
                    break;
                }
                out[out_writed++] = hex_digs[h];
            }
        }
        return res;
    }

    From the point of view of dynamic analysis, there is nothing suspicious here. In turn, the PVS-Studio static analyzer suggests paying attention to the 'leading' variable: V560 A part of conditional expression is always false: !leading. recyr_int.hh 220

    I do not think there is an actual error here; the 'leading' variable probably just became superfluous after refactoring. But what if not? What if some code was never added? This is a place worth looking at. And if the variable really is superfluous, it should be deleted so that it confuses neither the analyzer nor the people who will maintain this code.
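
    If the variable really was meant to suppress only leading zeros, the intended loop body may have looked something like this (purely our guess, not a confirmed fix):

    for (int i = 7; i >= 0; i--) {
      unsigned char h = (unsigned char)(in >> (i * 4) & 0x0F);
      if (h || !leading || i == 0) {
        if (out_writed + 1 >= out_size) {
          res = RECODE_EOOUTPUT;
          break;
        }
        out[out_writed++] = hex_digs[h];
        leading = false; // once a digit is emitted, later zeros are no longer leading
      }
    }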

    Warnings that part of an expression is always constant may seem uninteresting. If so, look at the examples of errors found with the V560 diagnostic, and you will be surprised at what turns up in code: http://www.viva64.com/en/examples/V560/

    Dynamic analysis cannot find such errors; it simply has nothing to catch here. These are just incorrect logical expressions.

    Unfortunately, the proposed projects do not let us comprehensively show the advantages of a static analyzer. So let us turn to one of the libraries included in the project. After all, in a sense, an error in a library is also an error in the project itself.

    Consider the sslDeriveKeys function working with private data:
    int32 sslDeriveKeys(ssl_t *ssl)
    {
      ....
      unsigned char buf[SSL_MD5_HASH_SIZE + SSL_SHA1_HASH_SIZE];
      ....
      memset(buf, 0x0, SSL_MD5_HASH_SIZE + SSL_SHA1_HASH_SIZE);
      psFree(ssl->sec.premaster);
      ssl->sec.premaster = NULL;
      ssl->sec.premasterSize = 0;
    skipPremaster:
      if (createKeyBlock(ssl, ssl->sec.clientRandom,
            ssl->sec.serverRandom,
            ssl->sec.masterSecret, SSL_HS_MASTER_SIZE) < 0)
      {
        matrixStrDebugMsg("Unable to create key block\n", NULL);
        return -1;
      }
      return SSL_HS_MASTER_SIZE;
    }

    A dynamic analyzer will not find anything here: from the language's point of view, the code is absolutely correct. To find the error, you need to think in higher-level patterns, which static analyzers can do.

    We are interested in the local array 'buf'. Since it holds private data, the function tries to zero it with the memset() function before returning. And this is exactly where the error lies.

    Look closely: after the memset() call, the local array 'buf' is no longer used. This means the compiler is allowed to remove the memset() call, because from the point of view of the C/C++ language it has no observable effect. Moreover, the compiler is not merely allowed to do this; in release builds it really will remove the call.

    As a result, the private data will remain in memory and, in theory, may end up where it should not. Thus, the project becomes a little less secure because of an error in a third-party library.

    PVS-Studio will issue the following warning: V597 The compiler could delete the 'memset' function call, which is used to flush 'buf' buffer. The RtlSecureZeroMemory () function should be used to erase the private data. sslv3.c 123
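
    RtlSecureZeroMemory() is Windows-specific. On other platforms, one widely used workaround is to call memset() through a volatile function pointer, which the optimizer cannot prove safe to remove. A minimal sketch of that idea (a well-known pattern, not code from the library):

    #include <cstring>

    // The volatile function pointer prevents the compiler from proving the
    // call is a dead store, so the zeroing survives optimization.
    static void *(*volatile memset_ptr)(void *, int, std::size_t) = std::memset;

    void secure_wipe(void *buf, std::size_t len)
    {
      memset_ptr(buf, 0, len);
    }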

    This error is a potential vulnerability. It may seem utterly insignificant, but it can lead to very unpleasant consequences, up to fragments of private data being sent over the network. How such miracles happen is described in an article by Dmitry Meshcheryakov, an ABBYY specialist: http://habrahabr.ru/company/abbyy/blog/127259/

    I hope we have managed to show the differences between static and dynamic code analyzers. The two approaches complement each other well. The fact that static analysis produces many false positives is not a fatal problem: they can be worked through and eliminated by configuring the analyzer. If you are interested, we can do such work to bring the number of warnings down to a comfortable level.

    If you are interested in us, we suggest outlining the next steps of possible cooperation; we will gladly demonstrate the analyzer's capabilities on large, real-life projects.
