For those who need an analysis of C ++ code in an IT startup

The article describes the open and free VivaCore library, which allows you to parse and analyze code in C / C ++. The library can be useful for developers starting their startups in the field of creating such tools as building documentation for the code, specific language extensions, counting metrics, and so on.

Instead of entry. Stop discussing saws, let's create something better

As part of Intel’s blog, I want to support the theme of creating “complex” IT startups. I’m more interested in hearing and discussing just such topics. Perhaps over time, the ISN platform will become just such a place where the discussion and the emergence of new startups in the field of parallel programming, the creation of their own libraries and other technologies of this kind will begin. And maybe on Khabrahabar there will be fewer posts about the achievements of foreign companies and about cuts in ours, and there will be more interesting articles about his own experience, about who does what and how he does it.

VivaCore's place in the universe

Immediately clearly identify the location of the VivaCore library . VivaCore library was created on the basis of another open library OpenC ++ , which has not been developed for many years [1]. VivaCore is a set of patches, fixes, crutches, props and props that allowed the OpenC ++ library to parse the source code compiled by Visual C ++ 8.0, 9.0, 10.0. That is, to support specific extensions, as well as new constructs that appeared in C ++ 0x and implemented in Visual C ++ 10.0 [2].

VivaCore is not a complete library. We develop VivaCore as a mechanism on which we create static code analyzers included in PVS-Studio. That is, VivaCore is this part of the PVS-Studio project, which we decided to make open in order to enable others to use the improved and extended version of OpenC ++.

VivaCore has a lot of disadvantages. Analysis of some template constructions is not completely possible, there is no documentation, it is impossible to work with code in Unicode format, and so on. If you need full-fledged support for parsing and analyzing the C / C ++ language, or you plan to create your own compiler / development environment, then the VivaCore library is not suitable for you. In this case, you should use professional libraries, for example - EDG. The license price of such a library varies between $ 40,000 and $ 250,000 per year. This price is too high for many startup projects that are not ready for such investments, without confidence in the success of the project. In this case, VivaCore library can be a good compromise. It is free, and although not perfect, but at a fairly good level allows you to work with C / C ++. This level will be sufficient for most code tools.

It should be mentioned that there is an opportunity to try to build your project on the basis of GCC source codes or other open systems. These options will also have both strengths and weaknesses. Something will be easier to implement, something more complicated. Legal aspects also make their adjustments. I just want to talk about the VivaCore library, and research, comparison and selection is left to the reader.

If VivaCore library is still of interest to you, then this article will allow you to get to know it better and talk about possible areas of its application. Archive with source code (project for VS2010) is available for download here: http://www.viva64.com/en/vivacore-library/. We update it from time to time. I just want to ask in advance not to overwhelm me with various questions on working with VivaCore, if it is related to training. If you have a full-fledged project, then of course I will try to suggest or we can conclude an agreement on the implementation of the functionality you require. I want to insure against the influx of student questions that occurs 2 times a year when they are given assignments for term papers / diplomas related to the creation of parsers and the like. :)

Key Terms

Before proceeding, I will briefly give a definition of some terms.

Preprocessing is a mechanism that looks at the input ".c / .cpp" file, executing preprocessor directives in it, including the contents of other files specified in the #include directives and so on. The result is a file that does not contain preprocessor directives, all used macros are expanded, the contents of the corresponding files are substituted for the #include directives. A preprocessing result file usually has the suffix ".i". The result of the preprocessing is called a translation unit.

Parsing- This is the process of analyzing the input sequence of characters in order to parse the grammatical structure. Usually parsing is divided into two levels: lexical analysis and grammar analysis.

Lexical analysis is the process of processing an input sequence of characters in order to obtain a sequence of characters called tokens (or "tokens") at the output. Each token can conditionally be represented as a structure containing the type of token and, if necessary, the corresponding value. For C ++, the tokens are "class", "int", "-", "{" and so on.

Grammar analysis (grammar analysis) is the process of comparing the linear sequence of lexemes (words) of a language with its formal grammar.

Abstract Syntax Tree (AST) is a finite, labeled, oriented tree in which the internal vertices are mapped to the operators of the programming language, and the leaves are mapped to the corresponding operands. Thus, leaves are empty operators and represent only variables and constants. An abstract syntax tree differs from a parse tree (DT or parse tree - PT) in that there are no nodes for those syntax rules that do not affect the semantics of the program. Grouping brackets are a classic example of this absence, since in AST the grouping of operands is explicitly defined by the structure of the tree.

Metaprogramming- the creation of programs that create other programs as a result of their work, or modify or supplement themselves during execution [3]. In metaprogramming, two main areas can be distinguished: code generation and self-modifying code. Next, we will consider metaprogramming as generating source code in C / C ++.

Syntax tree traversal - traversal of all vertices and leaves of the syntax tree in order to collect various kinds of information, analysis or modification.

What is VivaCore?

VivaCore library is an open-source project built on the basis of an older library - OpenC ++ (OpenCxx). The VivaCore library is implemented in C ++ and is a project intended for compilation in Visual Studio 2010. However, no specific Visual C ++ compiler extension is used and after a little adaptation the project can be built by another modern compiler.

The VivaCore library was created and developed by the staff of OOO Program Verification Systems. The VivaCore code analysis library has a certificate of state registration of computer programs N 2008610480.

You can use VivaCore library free and free. The only licensing restriction is the need to indicate that your project is based on the OpenC ++ libraries and its extension - VivaCore.

First of all, the VivaCore library may be of interest to small companies (startups) that create or plan to create tools for working with code. It’s naturally not possible to list all of the acceptable areas and methods of application, but I’ll still name a number of directions to show VivaCore from different angles. In brackets, examples of products related to this class of solutions are indicated as explanations. So, using VivaCore it is possible to develop:

code refactoring tools (VisualAssist, DevExpress Refactoring, JetBrains Resharper);
static analyzers for general and specialized purposes (Viva64, lint, Gimpel Software PC-Lint, Parasoft C ++ test);
dynamic code analyzers (Compuware BoundsChecker, AutomatedQA AQTime);
extensions of C / C ++ languages, including support for metaprogramming (OpenTS);
automated code testing (Parasoft C ++ test)
code transformations, for example, for optimization;
syntax highlighting (Whole Tomato Software VisualAssist, any modern development environment);
code documentation building systems (Synopsis, Doxygen);
tools for controlling changes in the source code or analyzing the evolution of changes;
search for duplicate code at the level of grammatical constructions of the language;
counting metrics (C and C ++ Code Counter - UDP);
support for coding standards (Gimpel Software PC-Lint);
tools that facilitate the migration of code to other software and hardware platforms (Viva64);
automatic code generation;
code visualizers, systems for building dependency diagrams (Source-Navigator, CppDepend);
code formatting (Ocher SourceStyler).

Difference of VivaCore library from OpenC ++ library

The main difference between the VivaCore library and OpenC ++ is that it is a living project and continues to actively increase functionality. The OpenC ++ library, unfortunately, has not been developed for a long time. The latest library change dates back to 2004. And the last change related to the support of new keywords dates back to 2003. This fix is a failed attempt to add the wchar_t data type, which introduced five errors of various types.

We list the new key functionalities implemented in the VivaCore library compared to OpenC ++:

The classic C language is supported. A different set of tokens is used, which makes it possible to name variables with the name “class” or declare a function in the classic C style: PureC_Foo (ptr) char * ptr; {...}.
A lot of work has been done to support the specific features of the C ++ language syntax used in development in the VisualStudio 2005/2008/2010 environment. For example, the library processes the keywords __noop, __if_exists, __ptr32, __pragma, __interface and so on.
Some new constructs are supported that are available in the C ++ language standard of 1998, but did not manage to get into OpenC ++. In particular, support for calling template functions using the word template: object.template foo <int> () ;.
The C ++ 0x language standard is supported at the level at which Visual C ++ and Intel C ++ compilers support it.
The calculation of the values of literal constants is implemented.
The library is adapted and optimized to work on 64-bit systems.
Fixed a large number of errors and shortcomings. There are a lot of them and listing them here, there is no impossibility.
Parsing OpenMP directives is supported. True, most of the work with them is performed by VivaMP code, which is absent in VivaCore. But if that, write - we will prompt, we will help.
Implemented coding of long types. Previously, any type was encoded with a special string no longer than 127 characters, which at times was not enough. As a result, on libraries such as boost or loki, the OpenC ++ library “went crazy” and worked incorrectly.

General structure of VivaCore library

The general functional structure of the VivaCore library is shown in Figure 1. Figure 1. General structure of the VivaCore library. Consider the functional blocks of the library in the order in which they process the source code of the program received at the input, as shown in Figure 2. Consider what actions each functional block performs, what information it can be obtained and how it can be modified for specific purposes. Figure 2 - The sequence of code processing.

Figure 2 - The sequence of code processing.

1) Input subsystem

VivaCore library can correctly use only the source C / C ++ code previously processed by the preprocessor. Now, to generate the preprocessed files in PVS-Studio, the Visual C ++ compiler is used. After its work, a processed file with the extension “i” is obtained, with which VivaCore works.

In certain cases, you can submit raw C / C ++ files to the input, but in this case you should work with VivaCore no further than the level of file splitting into tokens. This may well be enough to count metrics or other purposes. But you should not try to build and analyze a parse tree (PT), since the result is likely to be of little use for processing.

Having the preprocessed code, the user can transfer it to the data input subsystem in the form of a file or buffer in memory. The purpose of the input subsystem is to arrange the transferred data in the internal structures of the VivaCore library. The input subsystem also accepts configuration data that reports what is considered system libraries and what are user libraries.

2) Preprocessor subsystem

I would like to emphasize that this subsystem does not perform preprocessing of the code in its classical sense. As mentioned earlier, the preprocessed code should already be submitted to the VivaCore library. The considered subsystem serves for the following tasks:

Splitting the program text into lines and breaking them into two logical groups. The first group includes system code (compiler library code, and so on). To the second user code, which is of interest for analysis. As a result, when developing a static analyzer, the user gets the opportunity to decide whether he will analyze the code of the system libraries or not.
Specialized modification of program text in memory. An example is the removal of constructs from a specific development environment from code that are not related to the C or C ++ languages. For example, the Viva64 analyzer removes such key constructs as SA_Success or SA_FormatString, which are available in the Visual Studio header files and are intended for the static analyzer built into Visual Studio (we are talking about Code Analysis for C / C ++).

3) Lexical analyzer (Lexer)

So we got to those levels of data processing that are of practical interest to developers. Having analyzed the code into tokens, the user has the opportunity to count many metrics, implement a specific syntax highlighting algorithm in various applications.

VivaCore lexical analyzer parses the program text into a set of objects of the Token type (see the Token.h file), which contain information about the type of token, its location in the program text, and length. Token types are listed in the tokennames.h file. Examples of types of tokens:

CLASS - a keyword of the language "class"

WCHAR - a keyword of the language "wchar_t"

If necessary, the user can expand the set of tokens. This may be in demand if specific syntax of a particular language implementation is supported or when developing your own language extension.

When adding tokens, you need to declare them in the tokennames.h file and add to the tables “table” / “tableC” / tableC0xx in the Lex.cc file. The first table is for processing C ++ files, and the second for C files, the third for C ++ 0x. The reason for having multiple tables is due to the fact that the set of tokens in C, C ++, C ++ 0x is different. For example, in C language there is no CLASS token, since in C the word "class" is not a key word and can denote the name of a variable.

As an experiment, learning VivaCore, or practical purposes, you can get the list of tokens in the form of unstructured text or using the DumpEx function in the following formatted form:

258 LC_ID 5
258 lc_id 5
91 [1
262 6 1
93] 1
59; one
303 struct 6
123 {1
282 char 4
42 * 1
258 locale 6

4) Grammar analyzer (Parser)

The grammar analyzer is designed to build a derivation tree (DT), which can be further analyzed and transformed. Please note that the grammar analyzer of the VivaCore library does not build an abstract syntax tree (AST), but a parse tree. This makes it easier to implement support for metaprogram constructions that can be added by the user to the C or C ++ language.

Building a tree in the VivaCore library occurs in the functions of the Parser class. Nodes and leaves of a tree are objects whose classes are inherited from the base classes NonLeaf and Leaf. Figure 3 shows part of the class hierarchy used to represent the tree. Figure 3. Part of the class hierarchy used to build the parse tree.

Figure 3. Part of the class hierarchy used to build the parse tree.

As you can see from the figure, the Ptree class is the base class for everyone else and serves to organize a single interface for working with other classes. The Ptree class has a set of pure virtual functions implemented in descendants. For example, the function "virtual bool IsLeaf () const = 0;" is implemented in the NonLeaf and Leaf classes. In practice, classes only implement this function and are needed to make the hierarchy of classes more logical and beautiful.

Since working with a tree takes up a significant amount of the library, Ptree has a large set of functions for working with tree nodes. For convenience, these functions are analogues of the functions of working with lists in the Lisp language. Here are some of them: Car, Cdr, Cadr, Cddr, LastNth, Length, Eq.

To get a general idea of the operation of the grammar analyzer, as an example, we give a parse tree, which will be built from the following code:

int MyFoo (const float value)
{
  if (value <1.0)
    return sizeof (unsigned long *);
  return value * 4.0f <10.0f? 0-1
}

Unfortunately, the whole parsing tree cannot be represented, therefore, we will depict it in parts in Figures 4.1-4.4. Figure 4.1. Color codes for semantic tree nodes. Figure 4.2. Presentation of the function header. Figure 4.3. Representation of body function. Figure 4.4. Representation of body function.

Figure 4.1. Color codes for semantic tree nodes.

Figure 4.2. Presentation of the function header.

Figure 4.3. Representation of body function.

Figure 4.4. Representation of body function.

Another important component of the analyzer's work should be mentioned. This is obtaining information about the types of various objects (functions, variables, and so on) that is implemented in the Encoding class. Type information is presented as a specially encoded string, the format of which can be found in the Encoding.cc file. There is also a special TypeInfo class in the library that allows you to retrieve type information. For example, using functions such as IsFunction, IsPointerType, IsBuiltInType, you can easily identify the type of element being processed.

A description of approaches to adding new types of nodes or leaves is a non-trivial task and cannot be outlined in this review article. A rational solution is to choose one of the classes, for example, PtreeExprStatement and view all the places in the code where the objects of this class are created, work with them, and so on.

The parse tree obtained at the end can be saved in the ".s / .cpp" file format, which, however, makes little sense. This feature will make sense after changing the parse tree, which can happen in the next steps. Having saved the tree now in the form of program code, we get exactly what we got at the input. However, this can be quite useful for testing changes made to the lexer and parser.

Of great interest is the ability to save the tree for further processing in an arbitrary format implemented by the user. An example is the following textual representation of the code that was given earlier:

PtreeDeclaration: [
  0
  NonLeaf: [
    LeafINT: int
  ]
  PtreeDeclarator: [
    Leaf: MyFoo
    Leaf :(
    NonLeaf: [
      NonLeaf: [
        NonLeaf: [
          LeafCONST: const
          NonLeaf: [
            LeafFLOAT: float
          ]
        ]  
        PtreeDeclarator: [
          Leaf: value
        ]
      ]
    ]
    Leaf :)
  ]
  [{  
    NonLeaf: [
      PtreeIfStatement: [
        LeafReserved: if
        Leaf :(
        PtreeInfixExpr: [
          LeafName: value
          Leaf: <
          Leaf: 1.0
        ]
        Leaf :)
        PtreeReturnStatement: [
          LeafReserved: return
          PtreeSizeofExpr: [
            Leaf: sizeof
            Leaf :(
            NonLeaf: [
              NonLeaf: [
                LeafUNSIGNED: unsigned
                LeafLONG: long
              ]
              PtreeDeclarator: [
                Leaf: *
              ]
            ]
            Leaf :)
          ]
          Leaf :;
        ]
      ]
      PtreeReturnStatement: [
        LeafReserved: return
        PtreeCondExpr: [
          PtreeInfixExpr: [
            PtreeInfixExpr: [
              LeafName: value
              Leaf: *
              Leaf: 4.0f
            ]
            Leaf: <
            Leaf: 10.0f
          ]
          Leaf :?
          Leaf: 0
          Leaf ::
          Leaf: 1
        ]
        Leaf :;
      ]
    ]
    Leaf:}
  }]
]

This format is shown just for an example.

5) Bypassing the parse tree

For developers of static code analyzers or code documentation building systems, the most interesting step is to go through the parse tree using the Walker, ClassWalker, and ClassBodyWalker classes. You can bypass the parse tree several times, which allows you to create systems that modify the code in several passes, or to analyze, taking into account already accumulated knowledge during previous tree walks.

The Walker class is used to bypass the basic C / C ++ language constructs.

The ClassWalker class inherits from the Walker class and adds functionality related to the specifics of the classes present in C ++.

Note. To be honest, even in OpenC ++ the functionality of these classes was mixed, and in VivaCore the Walker and ClassWalker classes have grown even more. They can be combined into one, but there is no sense from such work.

When it is necessary to parse the class body, objects of the ClassBodyWalker class are temporarily created and used.

If you do not make any changes to the VivaCore library, then there will be a simple pass through all the elements of the tree. In this case, the tree itself will not change.

If the user implements functionality that will modify the vertices of the tree, the library can rebuild the tree. For example, consider the code that translates unary operations:

Ptree * ClassWalker :: TranslateUnary (Ptree * exp)
{
  using namespace PtreeUtil;
  Ptree * unaryop = exp-> Car ();
  Ptree * right = PtreeUtil :: Second (exp);
  Ptree * right2 = Translate (right);
  if (right == right2)
    return exp;
  else
    return
      new (GC_QuickAlloc)
      PtreeUnaryExpr (unaryop, PtreeUtil :: List (right2));
}

Please note that if, by translating the expression to the right of the unary operation, the resulting tree will be changed, then the unary operation node will be changed (recreated). Which in turn may entail the restructuring of higher nodes.

For clarity, consider this example in more detail.

Processing of the node, which is a unary operation on some expression and of the PtreeUnaryExpr type, begins. The first element in the list that is retrieved using the exp-> Car () operation is the directly unary operation. The second element extracted using PtreeUtil :: Second (exp) is the expression to which the unary operation is applied.

The expression is translated and the result is placed in the variable right2. If this address is different from the existing one, then this means that the expression has been changed. In this case, a new object of type PtreeUnaryExpr is created, which will be returned from the TranslateUnary function. Otherwise, nothing changes and the same object is returned as it entered.

If the user needs to collect information when traversing the tree, or to modify it, it will most naturally inherit from the ClassWalker and ClassBodyWalker classes.

Let us show the simplest example taken from the Viva64 static analyzer, in which specialized analysis occurs when passing through the throw operator:

Ptree * VivaWalker :: TranslateThrow (Ptree * p) {
  Ptree * result = ClassWalker :: TranslateThrow (p);
  Ptree * oprnd = PtreeUtil :: Second (result);
  // If oprnd == nullptr, then this is "throw;".
  if (oprnd! = nullptr) { 
    if (! CreateWiseType (oprnd)) {
      return result;
    }
    if (IsErrorActive (115) &&
        ! ApplyRuleN10 (oprnd-> m_wiseType.m_simpleType))
    {
      AddError (VivaErrors :: V115 (), p, 115);
    }
  }
  return result;
}

First, using ClassWalker :: TranslateThrow (p), a standard node translation is performed. Then the necessary analysis is performed. Everything is simple and very elegant.

Speaking about tree traversal, it should also be said about the very important class Environment, which provides information about the types of various objects in different scopes.

An example of using the Environment class represented by an env object to get the declTypeInfo object type:

TypeInfo declTypeInfo;
if (env-> Lookup (decl, declTypeInfo)) {
  ...
}

6) Support for metaprogramming

There are languages for which metaprogramming is naturally an integral part. An example is the Nemerle language, which can be found in the article “Metaprogramming in Nemerle” [4]. But in the case of C / C ++, everything is more complicated, and in them metaprogramming is implemented in the following two ways:

Templates in C ++ and a preprocessor in C. This path has many limitations.
External language tools. The generator language is compiled in such a way that automatically or with minimal effort on the part of the programmer, implement the paradigm rules or the necessary special functions. In fact, a higher-level programming language is being created. The VivaCore library can be used to create such a system.

The OpenC ++ library, on which VivaCore is built, was originally conceived specifically for converting C ++ code. The library was part of a system that allows you to use a specific version of the C ++ language.

Also, based on OpenC ++, the OpenTS runtime environment for the T ++ programming language was created at the Institute of Software Systems of the Russian Academy of Sciences . This is a C ++ language, in which additional constructions are introduced for automatic parallelization of code sections. For simplicity, I will call it a kind of OpenMP technology analog. This example demonstrates the possibility of using the OpenC ++ library for metaprogramming tasks. VivaCore library inherited these features accordingly.

Under metaprogramming within the VivaCore library, one should understand the possibility of expanding the syntax and functionality of the C / C ++ language in order to create its own programming language. The new metalanguage can be implemented as an intermediate link between the preprocessor and the compiler. In the general case, a functioning scheme can be represented as shown in Figure 5. Figure 5. Participation of the metalanguage translator in the compilation process.

Figure 5. The participation of the metalanguage translator in the compilation process.

VivaCore allows you to convert the program as follows. A parsing tree is being built. Then there is a traversal of tree nodes and those nodes that are constructs of a new language turn into C / C ++ language constructs. New subtrees are built with the necessary functionality. In this case, the parent nodes begin to point not to nodes with metalanguage constructs, but to these created subtrees from C / C ++ language elements (see also “Walking the Parse Tree” above).

If necessary, there can be several such passages, which allows you to create some new language constructs from other new language constructs.

After processing, the new tree can be saved in the text of the program in C / C ++, which will then be assembled using the compiler.

Note. In order not to deceive readers, I must say right away that unfortunately this is all theoretically. We did not test how many of our edits and enhancements to OpenC ++ affected the mechanisms for transforming programs. We do not use this mechanism and, as a result, do not test. Unfortunately, I am sure that there are errors and deficiencies in it, as a result of which the text of the program at the output will not correspond to the text of the program at the input. I am a practitioner, I do not believe in luck in such matters and I know that mistakes should be there. Therefore, if you begin to create a transformation tool, be prepared for this and do not scold us. Better write, maybe together we will improve the world (in the sense of a library).

For more information on metaprogramming, refer to the available documentation for the OpenC ++ library [5].

7) Saving results

You can save the necessary information at any stage of the source code processing inside the VivaCore library. In particular, we mentioned that the resulting and modified parse tree can be saved as program text or in any other format. We will not repeat. It is also clear that one can approach the task of collecting the necessary information, for example, during static analysis or counting metrics, in a variety of ways.

VivaVisualCode Demo Project

To more clearly show how you can use the VivaCore library, we created a demo project VivaVisualCode, available for download at http://www.viva64.com/en/vivacore-library/ .

VivaVisualCode graphically displays the constructed parsing tree and allows you to see some information on its nodes. Figure 6. Example of a parsing tree built by VivaVisualCode for the code “float Value = 10.0 * 20.0;”.

Figure 6. Example of a parsing tree built by VivaVisualCode for the & quot; float Value = 10.0 * 20.0; & quot; code.

Instead of a conclusion

I also want to offer readers an article by Evgeny Zuev " Rare profession " about his experience in developing a compiler. This article has nothing to do with this post, but it is very entertaining, so I recommend it.

Bibliographic list

OpenC ++ library. http://www.viva64.com/go.php?url=16
Andrey Karpov. Static analysis of C ++ code and the new C ++ 0x language standard. http://www.viva64.com/art-2-1-1708094805.html
Jonathan Bartlett. The Art of Metaprogramming, Part 1: Introduction to Metaprogramming. http://www.viva64.com/go.php?url=39
Kamil Skalski. Metaprogramming in Nemerle. http://www.viva64.com/go.php?url=40
Grzegorz Jakack. OpenC ++ - A C ++ Metacompiler and Introspection Library. http://www.viva64.com/go.php?url=41

Tags: