The dangers of using multi-character constants

    During code analysis, PVS-Studio performs data-flow analysis and tracks the values of variables. These values are taken from constants or inferred from conditional expressions; we call them virtual values. We recently taught virtual values to handle multi-character constants, and that work led to a new diagnostic rule.

    Introduction


    The value of a multi-character literal is implementation-defined, so different compilers may encode these literals differently. For example, GCC and Clang compute the value from the order of the characters in the literal, while MSVC rearranges them depending on the kind of character (ordinary character or escape sequence).

    For example, the literal 'T\x65s\x74' is encoded differently depending on the compiler. We had to add similar logic to the analyzer, and as a result we created a new diagnostic rule, V1039, which detects such literals in code. They are dangerous in cross-platform projects that are built with several compilers.

    The V1039 diagnostic


    Consider an example. The code below behaves differently depending on the compiler used to build it:

    #include <stdio.h>
    void foo(int c)
    {
      if (c == 'T\x65s\x74')                       // <= V1039
      {
        printf("Compiled with GCC or Clang.\n");
      }
      else
      {
        printf("It's another compiler (for example, MSVC).\n");
      }
    }
    int main(int argc, char** argv)
    {
      foo('Test');
      return 0;
    }

    Built with different compilers, this program prints different messages.

    In a project that uses a single compiler, the problem goes unnoticed, but porting the code can expose it. It is therefore better to replace such literals with plain numeric constants, for example, to change 'Test' to 0x54657374.

    To demonstrate the difference between compilers, we wrote a small utility that takes sequences of 3 and 4 characters, such as 'GHI' and 'GHIJ', and prints their in-memory representation after compilation.

    Utility code:

    #include <stdio.h>
    typedef int char_t;
    void PrintBytes(const char* format, char_t lit)
    {
      printf("%20s : ", format);
      const unsigned char *ptr = (const unsigned char*)&lit;
      for (int i = sizeof(lit); i--;)
      {
        printf("%c", *ptr++);
      }
      putchar('\n');
    }
    int main(int argc, char** argv)
    {
      printf("Hex codes are: G(%02X) H(%02X) I(%02X) J(%02X)\n",'G','H','I','J');
      PrintBytes("'GHIJ'", 'GHIJ');
      PrintBytes("'\\x47\\x48\\x49\\x4A'", '\x47\x48\x49\x4A');
      PrintBytes("'G\\x48\\x49\\x4A'", 'G\x48\x49\x4A');
      PrintBytes("'GH\\x49\\x4A'", 'GH\x49\x4A');
      PrintBytes("'G\\x48I\\x4A'", 'G\x48I\x4A');
      PrintBytes("'GHI\\x4A'", 'GHI\x4A');
      PrintBytes("'GHI'", 'GHI');
      PrintBytes("'\\x47\\x48\\x49'", '\x47\x48\x49');
      PrintBytes("'GH\\x49'", 'GH\x49');
      PrintBytes("'\\x47H\\x49'", '\x47H\x49');
      PrintBytes("'\\x47HI'", '\x47HI');
      return 0;
    }

    The output of the utility compiled with Visual C++:

    Hex codes are: G(47) H(48) I(49) J(4A)
                  'GHIJ' : JIHG
      '\x47\x48\x49\x4A' : GHIJ
         'G\x48\x49\x4A' : HGIJ
            'GH\x49\x4A' : JIHG
            'G\x48I\x4A' : JIHG
               'GHI\x4A' : JIHG
                   'GHI' : IHG
          '\x47\x48\x49' : GHI
                'GH\x49' : IHG
             '\x47H\x49' : HGI
                '\x47HI' : IHG

    The output of the utility compiled with GCC or Clang:

    Hex codes are: G(47) H(48) I(49) J(4A)
                  'GHIJ' : JIHG
      '\x47\x48\x49\x4A' : JIHG
         'G\x48\x49\x4A' : JIHG
            'GH\x49\x4A' : JIHG
            'G\x48I\x4A' : JIHG
               'GHI\x4A' : JIHG
                   'GHI' : IHG
          '\x47\x48\x49' : IHG
                'GH\x49' : IHG
             '\x47H\x49' : IHG
                '\x47HI' : IHG

    Conclusion


    The V1039 diagnostic was added in PVS-Studio 7.03, which was released recently. You can download the latest version of the analyzer from the download page.



    If you want to share this article with an English-speaking audience, then please use the link to the translation: Svyatoslav Razmyslov. The dangers of using multi-character constants
