Technologies used in the PVS-Studio code analyzer to search for errors and potential vulnerabilities

    Technology and magic

    A brief description of the technologies used in the PVS-Studio tool, which allow it to detect a large number of error patterns and potential vulnerabilities effectively. The article describes the implementation of the analyzer for C and C++ code, but the information also applies to the modules responsible for analyzing C# and Java code.

    Introduction


    A common misconception is that static code analyzers are fairly simple programs that search for code patterns using regular expressions. This is far from the truth. Moreover, the vast majority of errors simply cannot be detected with regular expressions.

    This misconception grew out of programmers' experience with some tools that existed 10-20 years ago. Those tools often really did come down to searching for dangerous code patterns and functions such as strcpy, strcat, and so on. RATS is a typical representative of this class of tools.

    Although such tools could be useful, they were generally rather crude and ineffective. Since then, many programmers have remembered static analyzers as largely useless tools that hinder work more than they help.

    Time has passed, and static analyzers have become complex solutions that perform deep code analysis and find errors that remain in the code even after a careful code review. Unfortunately, because of past negative experience, many programmers still consider static analysis useless and are in no hurry to introduce it into their development process.

    In this article I will try to improve the situation a little. I ask readers to spend 15 minutes getting acquainted with the technologies that the PVS-Studio static code analyzer uses to detect errors. Perhaps afterwards you will look at static analysis tools in a new way and want to apply them in your work.

    Data Flow Analysis


    Data flow analysis makes it possible to find a variety of errors, among them array index out of bounds, memory leaks, always true/false conditions, null pointer dereferences, and so on.

    Data flow analysis can also be used to find places where unchecked data that entered the program from outside is used. An attacker can craft input data that makes the program behave the way the attacker needs. In other words, the attacker can exploit a missing input check as a vulnerability. To find uses of unchecked data, PVS-Studio implements the specialized diagnostic V1010, which continues to be improved.

    Data flow analysis consists of calculating the possible values of variables at various points in a program. For example, if a pointer is dereferenced and it is known that it may be null at that point, then this is an error, and the static analyzer will report it.

    Let's look at a practical example of using data flow analysis to find errors. Here is a function from the Protocol Buffers (protobuf) project that validates a date.

    static const int kDaysInMonth[13] = {
      0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31
    };
    bool ValidateDateTime(const DateTime& time) {
      if (time.year   < 1 || time.year   > 9999 ||
          time.month  < 1 || time.month  > 12 ||
          time.day    < 1 || time.day    > 31 ||
          time.hour   < 0 || time.hour   > 23 ||
          time.minute < 0 || time.minute > 59 ||
          time.second < 0 || time.second > 59) {
        return false;
      }
      if (time.month == 2 && IsLeapYear(time.year)) {
        return time.month <= kDaysInMonth[time.month] + 1;
      } else {
        return time.month <= kDaysInMonth[time.month];
      }
    }

    The PVS-Studio analyzer found two logical errors in the function at once and issues the following messages:

    • V547 / CWE-571 Expression 'time.month <= kDaysInMonth[time.month] + 1' is always true. time.cc 83
    • V547 / CWE-571 Expression 'time.month <= kDaysInMonth[time.month]' is always true. time.cc 85

    Pay attention to the subexpression "time.month < 1 || time.month > 12". If the month value lies outside the range [1..12], the function returns. The analyzer takes this into account and knows that if execution reaches the second if statement, the month value definitely lies in the range [1..12]. Similarly, it knows the ranges of the other variables (year, day, and so on), but they are not of interest to us now.

    Now let's look at the two identical array element accesses: kDaysInMonth[time.month].

    The array is initialized statically, and the analyzer knows the values of all its elements:

    static const int kDaysInMonth[13] = {
      0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31
    };

    Since months are numbered from 1, the analyzer ignores the 0 at the beginning of the array. It follows that a value in the range [28..31] can be read from the array.

    Depending on whether the year is a leap year, 1 is added to the number of days. This is not of interest to us now either. What matters are the comparisons themselves:

    time.month <= kDaysInMonth[time.month] + 1;
    time.month <= kDaysInMonth[time.month];

    The range [1..12] (the month number) is compared with the number of days in the month.

    Given that in the first case the month is always February (time.month == 2), the following ranges are compared:

    • 2 <= 29
    • [1..12] <= [28..31]

    As you can see, the result of the comparison is always true, which is what the PVS-Studio analyzer warns about. Indeed, the code contains two identical typos: the left side of each expression should use the day member of the class, not month.

    The correct code should be:

    if (time.month == 2 && IsLeapYear(time.year)) {
      return time.day <= kDaysInMonth[time.month] + 1;
    } else {
      return time.day <= kDaysInMonth[time.month];
    }

    This error was also described earlier in the article "February 31".
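    The range reasoning in this section can be sketched as a small interval comparison. This is only an illustration under simplifying assumptions: the Range and Verdict types and the LessOrEqual function are hypothetical names invented here, not part of PVS-Studio, and a real analyzer tracks far richer value sets.

```cpp
#include <cassert>

// Hypothetical closed integer interval [lo, hi]; not PVS-Studio's real data structure.
struct Range {
  int lo;
  int hi;
};

enum class Verdict { AlwaysTrue, AlwaysFalse, Unknown };

// Decides whether 'a <= b' holds for every pair of values, for no pair, or only for some.
Verdict LessOrEqual(Range a, Range b) {
  if (a.hi <= b.lo) return Verdict::AlwaysTrue;   // the largest a is still <= the smallest b
  if (a.lo > b.hi) return Verdict::AlwaysFalse;   // the smallest a already exceeds the largest b
  return Verdict::Unknown;
}

// Usage: the ranges known after the first 'if' in ValidateDateTime.
void CheckDateExample() {
  const Range month{1, 12};
  const Range day{1, 31};
  const Range daysInMonth{28, 31};  // elements 1..12 of the table
  // The typo compares the month: the condition can never be false.
  assert(LessOrEqual(month, daysInMonth) == Verdict::AlwaysTrue);
  // The intended code compares the day: the condition is a real check.
  assert(LessOrEqual(day, daysInMonth) == Verdict::Unknown);
}
```

    With the month on the left side the comparison is decided before the program ever runs, which is the situation the V547 messages describe. With the day on the left side the ranges overlap, so nothing suspicious remains.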

    Symbolic Execution


    The previous section described a method in which the analyzer calculates the possible values of variables. However, finding some errors does not require knowing the values of variables. Symbolic execution involves solving equations in symbolic form.

    I did not find a suitable example in our error database, so let's consider a synthetic code sample.

    int Foo(int A, int B) {
      if (A == B)
        return 10 / (A - B);
      return 1;
    }

    The PVS-Studio analyzer issues the warning: V609 / CWE-369 Divide by zero. Denominator 'A - B' == 0. test.cpp 12

    The values of the variables A and B are unknown to the analyzer. But the analyzer knows that at the moment the expression 10 / (A - B) is evaluated, A and B are equal. Therefore, a division by 0 will occur.
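    The idea of reasoning about relations between variables rather than their values can be sketched as follows. This is a deliberately tiny illustration: PathFacts and DifferenceIsZero are hypothetical names, the fact store only remembers direct equalities (it does not even derive transitive ones), and nothing here reflects how PVS-Studio is actually implemented.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Hypothetical fact store: pairs of variables known to be equal on the
// current execution path. Variables are symbols; no concrete values are kept.
class PathFacts {
 public:
  void AssumeEqual(const std::string& a, const std::string& b) {
    equal_.insert(Ordered(a, b));
  }

  bool KnownEqual(const std::string& a, const std::string& b) const {
    return a == b || equal_.count(Ordered(a, b)) > 0;
  }

 private:
  // Stores each pair in a canonical order so (A, B) and (B, A) are the same fact.
  static std::pair<std::string, std::string> Ordered(const std::string& a,
                                                     const std::string& b) {
    return a < b ? std::make_pair(a, b) : std::make_pair(b, a);
  }

  std::set<std::pair<std::string, std::string>> equal_;
};

// 'lhs - rhs' is provably zero when both operands are known to be equal.
bool DifferenceIsZero(const PathFacts& facts, const std::string& lhs,
                      const std::string& rhs) {
  return facts.KnownEqual(lhs, rhs);
}

// Usage: the Foo(int A, int B) example.
void CheckFooExample() {
  PathFacts facts;
  assert(!DifferenceIsZero(facts, "A", "B"));  // before 'if (A == B)': nothing is known
  facts.AssumeEqual("A", "B");                 // inside the 'then' branch
  assert(DifferenceIsZero(facts, "A", "B"));   // so 10 / (A - B) divides by zero
}
```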

    I said that the values of A and B are unknown. In the general case, this is true. However, if the analyzer sees a function call with specific values of the actual arguments, it takes them into account. Consider an example:

    int Div(int X) {
      return 10 / X;
    }
    void Foo() {
      for (int i = 0; i < 5; ++i)
        Div(i);
    }

    The PVS-Studio analyzer warns about division by zero: V609 CWE-628 Divide by zero. Denominator 'X' == 0. The 'Div' function processes value '[0..4]'. Inspect the first argument. Check lines: 106, 110. consoleapplication2017.cpp 106

    Here a mixture of technologies is already at work: data flow analysis, symbolic execution, and automatic annotation of methods (we will look at this technology in the next section). The analyzer sees that the variable X is used as a divisor in the Div function. Based on this, a special annotation is automatically generated for the Div function. The analyzer then takes into account that the value range [0..4] is passed to the function as the argument X, and concludes that a division by 0 will occur.
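    A minimal sketch of how a per-function summary and an argument range might be combined is shown below. FunctionSummary, ArgRange and CallMayDivideByZero are hypothetical names introduced for illustration; the real annotations are certainly more detailed.

```cpp
#include <cassert>

// Hypothetical closed range of values an argument can take at a call site.
struct ArgRange {
  int lo;
  int hi;
};

// Hypothetical summary built once from the body of a function and reused at every call.
struct FunctionSummary {
  bool first_arg_is_divisor;  // true for Div: its body contains '10 / X'
};

// A call is suspicious when the callee divides by its first argument
// and the range passed for that argument contains zero.
bool CallMayDivideByZero(const FunctionSummary& callee, ArgRange first_arg) {
  return callee.first_arg_is_divisor && first_arg.lo <= 0 && 0 <= first_arg.hi;
}

// Usage: the loop 'for (int i = 0; i < 5; ++i) Div(i);' passes the range [0..4].
void CheckDivExample() {
  const FunctionSummary div_summary{true};
  assert(CallMayDivideByZero(div_summary, ArgRange{0, 4}));
  // Starting the loop from 1 would remove zero from the range.
  assert(!CallMayDivideByZero(div_summary, ArgRange{1, 4}));
}
```

    The point of the summary is that the body of Div is examined once, and each call site only needs a cheap check against the stored facts.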

    Method Annotations


    Our team has annotated thousands of functions and classes provided in:

    • WinAPI
    • the C standard library
    • the Standard Template Library (STL)
    • glibc (GNU C Library)
    • Qt
    • MFC
    • zlib
    • libpng
    • OpenSSL
    • and so on

    All functions are annotated manually, which makes it possible to specify many characteristics that matter for finding errors. For example, the annotation specifies that the size of the buffer passed to the fread function must be no less than the number of bytes to be read from the file. It also specifies the relationship between the 2nd and 3rd arguments and the value the function can return. It all looks like this:

    [Figure: PVS-Studio function markup]

    Thanks to this annotation, two errors are immediately detected in the following code, which uses the fread function.

    void Foo(FILE *f) {
      char buf[100];
      size_t i = fread(buf, sizeof(char), 1000, f);
      buf[i] = 1;
      ....
    }

    PVS-Studio warnings:
    • V512 CWE-119 A call of the 'fread' function will lead to overflow of the buffer 'buf'. test.cpp 116
    • V557 CWE-787 Array overrun is possible. The value of 'i' index could reach 1000. test.cpp 117

    First, the analyzer multiplied the 2nd and 3rd actual arguments and calculated that the function can read up to 1000 bytes of data. The buffer, however, is only 100 bytes long, so it may overflow.

    Second, since the function can read up to 1000 elements, the range of possible values for the variable i is [0..1000]. Accordingly, the array may be accessed at an invalid index.
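    The two checks described above can be sketched in a few lines. The names ReadCallFacts, AnalyzeReadCall and IndexMayOverrun are hypothetical and exist only for this illustration; the sketch assumes an fread-like function that writes at most element_size * element_count bytes and returns at most element_count.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical facts derived from an annotation of an fread-like call:
// fread(buffer, element_size, element_count, stream).
struct ReadCallFacts {
  std::size_t max_bytes_written;  // element_size * element_count
  std::size_t max_return_value;   // the function returns at most element_count
  bool may_overflow_buffer;       // more bytes may be written than the buffer holds
};

ReadCallFacts AnalyzeReadCall(std::size_t buffer_bytes, std::size_t element_size,
                              std::size_t element_count) {
  ReadCallFacts facts;
  facts.max_bytes_written = element_size * element_count;
  facts.max_return_value = element_count;
  facts.may_overflow_buffer = facts.max_bytes_written > buffer_bytes;
  return facts;
}

// An index in [0..max_index] is safe only if max_index is below the array size.
bool IndexMayOverrun(std::size_t max_index, std::size_t array_elements) {
  return max_index >= array_elements;
}

// Usage: 'char buf[100]; size_t i = fread(buf, sizeof(char), 1000, f); buf[i] = 1;'
void CheckFreadExample() {
  const ReadCallFacts facts = AnalyzeReadCall(100, sizeof(char), 1000);
  assert(facts.may_overflow_buffer);                      // corresponds to V512
  assert(IndexMayOverrun(facts.max_return_value, 100));   // corresponds to V557
}
```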

    Let's look at another simple example of an error that can be found thanks to the markup of the memset function. Here is a code fragment from the CryEngine V project.

    void EnableFloatExceptions(....) {
      ....
      CONTEXT ctx;
      memset(&ctx, sizeof(ctx), 0);
      ....
    }

    The PVS-Studio analyzer found a typo: V575 The 'memset' function processes '0' elements. Inspect the third argument. crythreadutil_win32.h 294

    The 2nd and 3rd arguments of the function are swapped. As a result, the function processes 0 bytes and does nothing. The analyzer notices this anomaly and warns the programmer about it. We described this error earlier in the article "The long-awaited check of CryEngine V".
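    For comparison, here is what the intended call presumably looks like, with the fill value second and the byte count third. The Context structure is a hypothetical stand-in for the WinAPI CONTEXT type so that the sketch compiles on any platform.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for the WinAPI CONTEXT structure.
struct Context {
  unsigned flags;
  unsigned char regs[16];
};

// Returns a Context whose bytes are all zero.
Context MakeZeroedContext() {
  Context ctx;
  std::memset(&ctx, 0, sizeof(ctx));  // memset(destination, fill value, byte count)
  return ctx;
}
```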

    The PVS-Studio analyzer is not limited to the annotations we write by hand. It also tries to create annotations on its own by studying function bodies. This makes it possible to find errors caused by incorrect use of functions. For example, the analyzer remembers that a function can return nullptr. If the pointer returned by this function is used without a prior check, the analyzer will warn about it. Example:

    int GlobalInt;
    int *Get() {
      return (rand() % 2) ? nullptr : &GlobalInt;
    }
    void Use() {
      *Get() = 1;
    }

    Warning: V522 CWE-690 There might be dereferencing of a potential null pointer 'Get()'. test.cpp 129

    Note. The search for the error just discussed can be approached the other way around: remember nothing, and each time a call to the Get function is encountered, analyze the function body knowing the actual arguments. In theory this algorithm finds more errors, but it has exponential complexity. Analysis time grows by hundreds or thousands of times, and we consider this approach a dead end from a practical point of view. In PVS-Studio, we are developing automatic annotation of functions instead.

    Pattern Matching (pattern-based analysis)


    At first glance, pattern matching may look like a search using regular expressions. In fact it is not, and everything is much more complicated.

    First, as I have already said, regular expressions are not suited to this task at all. Second, analyzers work not with lines of text but with syntax trees, which makes it possible to recognize more complex and higher-level error patterns.

    Let's consider two examples, one simpler and one more complicated. I found the first error while checking the Android source code.

    void TagMonitor::parseTagsToMonitor(String8 tagNames) {
      std::lock_guard<std::mutex> lock(mMonitorMutex);
      if (ssize_t idx = tagNames.find("3a") != -1) {
        ssize_t end = tagNames.find(",", idx);
        char* start = tagNames.lockBuffer(tagNames.size());
        start[idx] = '\0';
        ....
      }
      ....
    }

    The PVS-Studio analyzer recognizes a classic error pattern caused by the programmer's misunderstanding of operator precedence in C++: V593 / CWE-783 Consider reviewing the expression of the 'A = B != C' kind. The expression is calculated as following: 'A = (B != C)'. TagMonitor.cpp 50

    Let's take a close look at this line:

    if (ssize_t idx = tagNames.find("3a") != -1) {

    The programmer assumes that the assignment is performed first and only then the comparison with -1. In fact, the comparison happens first, so idx receives the result of the comparison (0 or 1) rather than the position that was found. A classic. This error is discussed in more detail in the article about checking Android (see the "Other errors" chapter).
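    To illustrate why a syntax tree makes this pattern easy to recognize, here is a sketch of a matcher over a miniature tree. The Node type, the node kinds and ConditionAssignsComparison are hypothetical and heavily simplified; a real diagnostic works on a full C++ syntax tree and presumably applies additional heuristics, for example to ignore expressions where the programmer added parentheses explicitly.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical miniature syntax tree node; a real C++ front end is far richer.
struct Node {
  std::string kind;  // "assign", "not_equal", "call", "name", "literal"
  std::unique_ptr<Node> left;
  std::unique_ptr<Node> right;
};

using NodePtr = std::unique_ptr<Node>;

NodePtr Leaf(const std::string& kind) {
  return NodePtr(new Node{kind, nullptr, nullptr});
}

NodePtr Binary(const std::string& kind, NodePtr left, NodePtr right) {
  return NodePtr(new Node{kind, std::move(left), std::move(right)});
}

// Matches a condition of the shape 'A = (B != C)': the root is an assignment
// whose right-hand side is a comparison.
bool ConditionAssignsComparison(const Node& condition) {
  return condition.kind == "assign" && condition.right != nullptr &&
         condition.right->kind == "not_equal";
}

void CheckPrecedenceExample() {
  // How the compiler parses the line: idx = (find("3a") != -1)
  NodePtr actual = Binary("assign", Leaf("name"),
                          Binary("not_equal", Leaf("call"), Leaf("literal")));
  // What the programmer meant: (idx = find("3a")) != -1
  NodePtr intended = Binary("not_equal",
                            Binary("assign", Leaf("name"), Leaf("call")),
                            Leaf("literal"));
  assert(ConditionAssignsComparison(*actual));
  assert(!ConditionAssignsComparison(*intended));
}
```

    The two trees contain the same tokens in almost the same textual order, which is exactly why a text-level search struggles here, while the tree shape differs at the root.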

    Now consider the higher-level version of the pattern matching.

    static inline void sha1ProcessChunk(....) {
      ....
      quint8 chunkBuffer[64];
      ....
    #ifdef SHA1_WIPE_VARIABLES
      ....
      memset(chunkBuffer, 0, 64);
    #endif
    }

    PVS-Studio warning: V597 CWE-14 The compiler could delete the 'memset' function call, which is used to flush 'chunkBuffer' buffer. The RtlSecureZeroMemory() function should be used to erase the private data. sha1.cpp 189

    The essence of the problem is that after the buffer is filled with zeros using the memset function, it is not used anywhere. When the code is built with optimization flags, the compiler will decide that this function call is redundant and remove it. It has the right to do so, since from the point of view of the C++ language the call has no observable effect on the program's behavior. Immediately after the chunkBuffer buffer is filled, the sha1ProcessChunk function finishes. Since the buffer is created on the stack, it becomes unavailable after the function exits. Therefore, from the compiler's point of view, filling it with zeros makes no sense.

    As a result, private data will remain somewhere on the stack, which can lead to trouble. This topic is discussed in more detail in the article "Safe cleaning of private data".

    This is an example of high-level pattern matching. First, the analyzer must be aware that this security defect exists; the Common Weakness Enumeration classifies it as CWE-14: Compiler Removal of Code to Clear Buffers.

    Second, it must find all places in the code where a buffer is created on the stack, erased with the memset function, and then never used again.
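    The warning recommends RtlSecureZeroMemory, which is specific to Windows. As a portable illustration of the same idea, a commonly used technique is to write the zeros through a volatile pointer, which compilers in practice do not optimize away. SecureZero is a hypothetical helper name, and this sketch is not a substitute for a vetted platform function where one is available.

```cpp
#include <cassert>
#include <cstddef>

// Zeroes 'size' bytes starting at 'data' through a volatile pointer, so each
// store is treated as a volatile access that the optimizer is expected to keep.
void SecureZero(void* data, std::size_t size) {
  volatile unsigned char* bytes = static_cast<volatile unsigned char*>(data);
  for (std::size_t i = 0; i < size; ++i) {
    bytes[i] = 0;
  }
}

// Usage: wipe a stack buffer right before the function returns.
void ProcessSecret() {
  unsigned char secret[64];
  for (unsigned char& b : secret) b = 0x5A;  // pretend this is sensitive data
  // ... work with the data ...
  SecureZero(secret, sizeof(secret));
}
```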

    Conclusion


    As you can see, static analysis is a very interesting and useful methodology. It makes it possible to eliminate a large number of errors and potential vulnerabilities at the earliest stages (see SAST). If static analysis has not fully won you over yet, I invite you to read our blog, where we regularly examine errors found by PVS-Studio in various projects. You will not remain indifferent.

    We will be happy to see your company among our customers and help make your applications better, more reliable and safer.



    If you want to share this article with an English-speaking audience, then please use the link to the translation: Andrey Karpov. Technologies used in the PVS-Studio code analyzer for finding bugs and potential vulnerabilities .
