An example of parsing C++ code using libclang in Python

    In a personal C++ project, I needed to get information about object types at run time. C++ has a built-in Run-Time Type Information (RTTI) mechanism, and my first thought was of course to use it, but I decided to write my own implementation: I needed only a small part of its functionality and didn't want to pull in the whole built-in mechanism. I also wanted to practice the new features of C++17, which I was not particularly familiar with.


    In this post I will provide an example of working with the libclang parser in the Python language.


    I will omit the details of my RTTI implementation. The following points matter here:


    • Each class or structure that can provide information about its type must inherit from the interface IRttiTypeIdProvider;
    • Each such class (if it is not abstract) must contain the macro RTTI_HAS_TYPE_ID, which adds a static field holding a pointer to an RttiTypeId object. To obtain a type identifier, you can then write MyClass::__typeId or call the getTypeId method on a specific instance of the class at run time.

    Example:


    #pragma once
    #include <string>
    #include "RTTI.h"
    struct BaseNode : public IRttiTypeIdProvider {
        virtual ~BaseNode() = default;
        bool bypass = false;
    };
    struct SourceNode : public BaseNode {
        RTTI_HAS_TYPE_ID
        std::string inputFilePath;
    };
    struct DestinationNode : public BaseNode {
        RTTI_HAS_TYPE_ID
        bool includeDebugInfo = false;
        std::string outputFilePath;
    };
    struct MultiplierNode : public BaseNode {
        RTTI_HAS_TYPE_ID
        double multiplier;
    };
    struct InverterNode : public BaseNode {
        RTTI_HAS_TYPE_ID
    };

    This was already usable, but after a while I needed information about the fields of these classes: the field name, offset, and size. That means manually writing, somewhere in a .cpp file, a structure describing each field of every class of interest. After I wrote several macros, the description of a type and its fields looked like this:


    RTTI_PROVIDER_BEGIN_TYPE(SourceNode)
    (
        RTTI_DEFINE_FIELD(SourceNode, bypass)
        RTTI_DEFINE_FIELD(SourceNode, inputFilePath)
    )
    RTTI_PROVIDER_END_TYPE()
    RTTI_PROVIDER_BEGIN_TYPE(DestinationNode)
    (
        RTTI_DEFINE_FIELD(DestinationNode, bypass)
        RTTI_DEFINE_FIELD(DestinationNode, includeDebugInfo)
        RTTI_DEFINE_FIELD(DestinationNode, outputFilePath)
    )
    RTTI_PROVIDER_END_TYPE()
    RTTI_PROVIDER_BEGIN_TYPE(MultiplierNode)
    (
        RTTI_DEFINE_FIELD(MultiplierNode, bypass)
        RTTI_DEFINE_FIELD(MultiplierNode, multiplier)
    )
    RTTI_PROVIDER_END_TYPE()
    RTTI_PROVIDER_BEGIN_TYPE(InverterNode)
    (
        RTTI_DEFINE_FIELD(InverterNode, bypass)
    )
    RTTI_PROVIDER_END_TYPE()

    And this is only for 4 classes. What problems can be identified?


    1. When copying blocks of code by hand, it is easy to miss a class name in a field definition (say you copy the SourceNode block for DestinationNode but forget to change SourceNode to DestinationNode in one of the fields). The compiler will accept it and the application may not even crash, but the field information will be wrong. And if you write or read data based on such a field, everything will blow up (or so they say; I don't want to check it myself).
    2. If you add a field to the base class, you need to update ALL the entries.
    3. If you change the name or the order of the fields in a class, you have to remember to update the name and order in this pile of code.

    But the main thing is that all of this has to be written by hand. When it comes to such monotonous code, I get very lazy and look for a way to generate it automatically, even if that takes more time and effort than writing it manually.


    Python helps me here; I write scripts in it to solve such problems. But we are dealing not just with template text but with text derived from C++ source code. We need a tool that extracts information about C++ code, and libclang will help us with this.


    libclang is a high-level C interface to Clang. It provides an API for tools that parse source code into an abstract syntax tree (AST), load already parsed ASTs, traverse the AST, map physical source locations to elements within the AST, and use other facilities of the Clang tool set.

    As the description says, libclang provides a C interface, so working with it from Python requires a binding library. At the time of writing there is no official binding package for Python, but there is an unofficial one: https://github.com/ethanhs/clang.


    Install it through the package manager:


    pip install clang
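
    One practical detail to check after installation: to my knowledge this package contains only the Python bindings, so the libclang shared library itself must already be present on the system. Below is a minimal sketch of pointing the bindings at it, assuming libclang is installed where the system loader can find it. The helper names locate_libclang and configure_bindings are made up for this example; Config.set_library_file is the call the bindings provide for this purpose.

```python
import ctypes.util


def locate_libclang():
    """Ask the system loader for a libclang shared library; None if not found."""
    return ctypes.util.find_library("clang")


def configure_bindings(library_file):
    """Tell the bindings which libclang to load (call before any other use)."""
    # Imported here so that this module can be loaded even without the bindings.
    import clang.cindex
    clang.cindex.Config.set_library_file(library_file)


found = locate_libclang()
print("libclang:", found if found else "not found, pass an explicit path instead")
```

    If the lookup returns nothing, configure_bindings can be given an explicit path to the library file instead.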

    The library's source code is commented. But to understand how libclang is organized, you need to read the libclang documentation. There are few examples of using the library, and no comments explaining why things work the way they do. Those who already have experience with libclang will have fewer questions, but I had none, so I had to dig through the code a fair amount and poke around in the debugger.


    Let's start with a simple example:


    import clang.cindex
    index = clang.cindex.Index.create()
    translation_unit = index.parse('my_source.cpp', args=['-std=c++17'])
    for i in translation_unit.get_tokens(extent=translation_unit.cursor.extent):
        print(i.kind)

    This creates an object of type Index that can parse a file with C++ code. The parse method returns an object of type TranslationUnit, which represents a translation unit. Its root cursor (translation_unit.cursor) is an AST node, and each AST node stores its position in the source code (extent). We loop over all the tokens in the TranslationUnit and print the kind of each token (the kind property).
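
    The kind alone says little, so it can help to print each token's text next to it. The sketch below formats objects that have the same two attributes as clang.cindex.Token (kind and spelling). To keep it runnable without libclang it uses hand-written stand-ins for the tokens of `class X {};`; describe_tokens and fake_token are names invented for this example.

```python
from types import SimpleNamespace


def describe_tokens(tokens):
    """Return one "KIND 'spelling'" string per token."""
    return [f"{token.kind.name} {token.spelling!r}" for token in tokens]


def fake_token(kind_name, spelling):
    """Build a stand-in with the two attributes of clang.cindex.Token used above."""
    return SimpleNamespace(kind=SimpleNamespace(name=kind_name), spelling=spelling)


# Hand-written stand-ins for the tokens of `class X {};`.
sample_tokens = [
    fake_token("KEYWORD", "class"),
    fake_token("IDENTIFIER", "X"),
    fake_token("PUNCTUATION", "{"),
    fake_token("PUNCTUATION", "}"),
    fake_token("PUNCTUATION", ";"),
]

for line in describe_tokens(sample_tokens):
    print(line)
```

    With the real bindings, the same function could be called as describe_tokens(translation_unit.get_tokens(extent=translation_unit.cursor.extent)).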


    For example, take the following C ++ code:


    class X {};
    class Y {};
    class Z : public X {};

    Script execution result:
    TokenKind.KEYWORD
    TokenKind.IDENTIFIER
    TokenKind.PUNCTUATION
    TokenKind.PUNCTUATION
    TokenKind.PUNCTUATION
    TokenKind.KEYWORD
    TokenKind.IDENTIFIER
    TokenKind.PUNCTUATION
    TokenKind.PUNCTUATION
    TokenKind.PUNCTUATION
    TokenKind.KEYWORD
    TokenKind.IDENTIFIER
    TokenKind.PUNCTUATION
    TokenKind.KEYWORD
    TokenKind.IDENTIFIER
    TokenKind.PUNCTUATION
    TokenKind.PUNCTUATION
    TokenKind.PUNCTUATION

    Now let's deal with the AST. Before writing Python code, let's see what we can expect from the clang parser. Run clang in AST dump mode:


    clang++ -cc1 -ast-dump my_source.cpp

    The result of the command:
    TranslationUnitDecl 0xaaaa9b9fa8 <<invalid sloc>> <invalid sloc>
    |-TypedefDecl 0xaaaa9ba880 <<invalid sloc>> <invalid sloc> implicit __int128_t '__int128'
    | `-BuiltinType 0xaaaa9ba540 '__int128'
    |-TypedefDecl 0xaaaa9ba8e8 <<invalid sloc>> <invalid sloc> implicit __uint128_t 'unsigned __int128'
    | `-BuiltinType 0xaaaa9ba560 'unsigned __int128'
    |-TypedefDecl 0xaaaa9bac48 <<invalid sloc>> <invalid sloc> implicit __NSConstantString '__NSConstantString_tag'
    | `-RecordType 0xaaaa9ba9d0 '__NSConstantString_tag'
    |   `-CXXRecord 0xaaaa9ba938 '__NSConstantString_tag'
    |-TypedefDecl 0xaaaa9e6570 <<invalid sloc>> <invalid sloc> implicit __builtin_ms_va_list 'char *'
    | `-PointerType 0xaaaa9e6530 'char *'
    |   `-BuiltinType 0xaaaa9ba040 'char'
    |-TypedefDecl 0xaaaa9e65d8 <<invalid sloc>> <invalid sloc> implicit __builtin_va_list 'char *'
    | `-PointerType 0xaaaa9e6530 'char *'
    |   `-BuiltinType 0xaaaa9ba040 'char'
    |-CXXRecordDecl 0xaaaa9e6628 <my_source.cpp:1:1, col:10> col:7 referenced class X definition
    | |-DefinitionData pass_in_registers empty aggregate standard_layout trivially_copyable pod trivial literal has_constexpr_non_copy_move_ctor can_const_default_init
    | | |-DefaultConstructor exists trivial constexpr needs_implicit defaulted_is_constexpr
    | | |-CopyConstructor simple trivial has_const_param needs_implicit implicit_has_const_param
    | | |-MoveConstructor exists simple trivial needs_implicit
    | | |-CopyAssignment trivial has_const_param needs_implicit implicit_has_const_param
    | | |-MoveAssignment exists simple trivial needs_implicit
    | | `-Destructor simple irrelevant trivial needs_implicit
    | `-CXXRecordDecl 0xaaaa9e6748 <col:1, col:7> col:7 implicit class X
    |-CXXRecordDecl 0xaaaa9e6800 <line:3:1, col:10> col:7 class Y definition
    | |-DefinitionData pass_in_registers empty aggregate standard_layout trivially_copyable pod trivial literal has_constexpr_non_copy_move_ctor can_const_default_init
    | | |-DefaultConstructor exists trivial constexpr needs_implicit defaulted_is_constexpr
    | | |-CopyConstructor simple trivial has_const_param needs_implicit implicit_has_const_param
    | | |-MoveConstructor exists simple trivial needs_implicit
    | | |-CopyAssignment trivial has_const_param needs_implicit implicit_has_const_param
    | | |-MoveAssignment exists simple trivial needs_implicit
    | | `-Destructor simple irrelevant trivial needs_implicit
    | `-CXXRecordDecl 0xaaaa9e6928 <col:1, col:7> col:7 implicit class Y
    `-CXXRecordDecl 0xaaaa9e69e0 <line:5:1, col:21> col:7 class Z definition
      |-DefinitionData pass_in_registers empty standard_layout trivially_copyable trivial literal has_constexpr_non_copy_move_ctor can_const_default_init
      | |-DefaultConstructor exists trivial constexpr needs_implicit defaulted_is_constexpr
      | |-CopyConstructor simple trivial has_const_param needs_implicit implicit_has_const_param
      | |-MoveConstructor exists simple trivial needs_implicit
      | |-CopyAssignment trivial has_const_param needs_implicit implicit_has_const_param
      | |-MoveAssignment exists simple trivial needs_implicit
      | `-Destructor simple irrelevant trivial needs_implicit
      |-public 'X'
      `-CXXRecordDecl 0xaaaa9e6b48 <col:1, col:7> col:7 implicit class Z

    Here CXXRecordDecl is the type of node that represents a class declaration. You may notice that there are more such nodes than classes in the source file. The extra ones are the nested nodes marked implicit: inside each class definition clang records an implicit declaration of the class's own name. The base class of Z appears in the dump as the line public 'X'. When the tree is walked through libclang, the base class shows up as a separate node that refers to the declaration of X, and the declaration it points to can be obtained from that node.
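
    A similar, much shorter tree can be printed from Python by walking the cursors recursively. The sketch below relies only on the kind, spelling and get_children() members of clang.cindex.Cursor. To stay runnable without libclang it walks a hand-built stand-in tree; dump_ast and fake_cursor are names invented here, and the spelling given to the base specifier is a placeholder, since the real value may differ between libclang versions.

```python
from types import SimpleNamespace


def dump_ast(cursor, depth=0, lines=None):
    """Collect an indented 'KIND spelling' line for a cursor and all its children."""
    if lines is None:
        lines = []
    lines.append("%s%s %s" % ("  " * depth, cursor.kind.name, cursor.spelling))
    for child in cursor.get_children():
        dump_ast(child, depth + 1, lines)
    return lines


def fake_cursor(kind_name, spelling, children=()):
    """Stand-in exposing the three Cursor members that dump_ast uses."""
    return SimpleNamespace(
        kind=SimpleNamespace(name=kind_name),
        spelling=spelling,
        get_children=lambda: iter(children),
    )


# Hand-built approximation of the tree for the three classes X, Y and Z.
tree = fake_cursor("TRANSLATION_UNIT", "my_source.cpp", [
    fake_cursor("CLASS_DECL", "X"),
    fake_cursor("CLASS_DECL", "Y"),
    fake_cursor("CLASS_DECL", "Z", [fake_cursor("CXX_BASE_SPECIFIER", "X")]),
])

ast_lines = dump_ast(tree)
print("\n".join(ast_lines))
```

    With the real bindings, the root would be translation_unit.cursor.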


    Now we will write a script that lists the classes in the source file:


    import clang.cindex
    import typing

    index = clang.cindex.Index.create()
    translation_unit = index.parse('my_source.cpp', args=['-std=c++17'])

    def filter_node_list_by_node_kind(
        nodes: typing.Iterable[clang.cindex.Cursor],
        kinds: list
    ) -> typing.Iterable[clang.cindex.Cursor]:
        # Keep only the cursors whose kind is one of the requested kinds.
        result = []
        for i in nodes:
            if i.kind in kinds:
                result.append(i)
        return result

    all_classes = filter_node_list_by_node_kind(
        translation_unit.cursor.get_children(),
        [clang.cindex.CursorKind.CLASS_DECL, clang.cindex.CursorKind.STRUCT_DECL])
    for i in all_classes:
        print(i.spelling)
    

    The class name is stored in the spelling property. For other kinds of nodes, the value of spelling may include type modifiers, but for a class or structure declaration it contains the bare name.


    Output:


    X
    Y
    Z

    When building the AST, clang also parses the files pulled in via #include. Try adding #include <string> to the source, and the dump will grow to about 84 thousand lines, which is clearly too much for our task.


    To view the AST dump of such files from the command line, it is better to remove all #include directives first. Bring them back once you have studied the AST and have an idea of the hierarchy and types in the file of interest.
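
    Since included headers are parsed too, it is worth checking that clang actually found them. As far as I can tell, index.parse does not raise on a missing header: it records a diagnostic and returns whatever AST it could build. The sketch below formats objects shaped like clang.cindex.Diagnostic (severity, location, spelling). It runs on hand-written stand-ins rather than real parser output, so the messages are illustrative only; format_errors and fake_diagnostic are names invented here.

```python
from types import SimpleNamespace

# Severity values as defined on clang.cindex.Diagnostic:
# Ignored=0, Note=1, Warning=2, Error=3, Fatal=4.
ERROR = 3


def format_errors(diagnostics, min_severity=ERROR):
    """Return 'file:line:column: message' for each diagnostic at or above min_severity."""
    messages = []
    for diag in diagnostics:
        if diag.severity < min_severity:
            continue
        location = diag.location
        file_name = location.file.name if location.file is not None else "<no file>"
        messages.append(f"{file_name}:{location.line}:{location.column}: {diag.spelling}")
    return messages


def fake_diagnostic(severity, file_name, line, column, spelling):
    """Stand-in with the Diagnostic attributes that format_errors reads."""
    location = SimpleNamespace(file=SimpleNamespace(name=file_name), line=line, column=column)
    return SimpleNamespace(severity=severity, location=location, spelling=spelling)


# Illustrative data: one warning (filtered out) and one fatal error.
sample_diagnostics = [
    fake_diagnostic(2, "my_source.cpp", 3, 7, "unused variable 'x'"),
    fake_diagnostic(4, "my_source.cpp", 1, 10, "'string' file not found"),
]
errors = format_errors(sample_diagnostics)
print("\n".join(errors))
```

    With the real bindings, the input would be translation_unit.diagnostics. If the list of errors is not empty, the include paths probably need to be passed to parse through args.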


    In the script, to keep only the AST nodes that belong to the source file itself rather than to files pulled in via #include, you can add the following function that filters by file:


    def filter_node_list_by_file(
        nodes: typing.Iterable[clang.cindex.Cursor],
        file_name: str
    ) -> typing.Iterable[clang.cindex.Cursor]:
        result = []
        for i in nodes:
            # location.file is None for nodes without a source location
            if i.location.file is not None and i.location.file.name == file_name:
                result.append(i)
        return result
    ...
    filtered_ast = filter_node_list_by_file(translation_unit.cursor.get_children(), translation_unit.spelling)

    Now we can move on to extracting fields. Below is the full code that builds the list of fields, taking inheritance into account, and generates text from a template. There is nothing clang-specific in it, so I will not comment on it.


    Full script code:
    import clang.cindex
    import typing

    index = clang.cindex.Index.create()
    # '-x c++' makes clang treat the .h file as C++ rather than C
    translation_unit = index.parse('Input.h', args=['-x', 'c++', '-std=c++17'])

    def filter_node_list_by_file(
        nodes: typing.Iterable[clang.cindex.Cursor],
        file_name: str
    ) -> typing.Iterable[clang.cindex.Cursor]:
        result = []
        for i in nodes:
            # location.file is None for nodes without a source location
            if i.location.file is not None and i.location.file.name == file_name:
                result.append(i)
        return result

    def filter_node_list_by_node_kind(
        nodes: typing.Iterable[clang.cindex.Cursor],
        kinds: list
    ) -> typing.Iterable[clang.cindex.Cursor]:
        result = []
        for i in nodes:
            if i.kind in kinds:
                result.append(i)
        return result

    def is_exposed_field(node):
        return node.access_specifier == clang.cindex.AccessSpecifier.PUBLIC

    def find_all_exposed_fields(
        cursor: clang.cindex.Cursor
    ):
        result = []
        field_declarations = filter_node_list_by_node_kind(cursor.get_children(), [clang.cindex.CursorKind.FIELD_DECL])
        for i in field_declarations:
            if not is_exposed_field(i):
                continue
            result.append(i.displayname)
        return result

    source_nodes = filter_node_list_by_file(translation_unit.cursor.get_children(), translation_unit.spelling)
    all_classes = filter_node_list_by_node_kind(source_nodes, [clang.cindex.CursorKind.CLASS_DECL, clang.cindex.CursorKind.STRUCT_DECL])
    class_inheritance_map = {}
    class_field_map = {}
    for i in all_classes:
        bases = []
        for node in i.get_children():
            if node.kind == clang.cindex.CursorKind.CXX_BASE_SPECIFIER:
                # the base specifier refers to the declaration of the base class
                bases.append(node.referenced)
        class_inheritance_map[i.spelling] = bases
    for i in all_classes:
        fields = find_all_exposed_fields(i)
        class_field_map[i.spelling] = fields

    def populate_field_list_recursively(class_name: str):
        field_list = class_field_map.get(class_name)
        if field_list is None:
            return []
        base_classes = class_inheritance_map[class_name]
        for i in base_classes:
            # fields of the base classes go before the class's own fields
            field_list = populate_field_list_recursively(i.spelling) + field_list
        return field_list

    rtti_map = {}
    for class_name, class_list in class_inheritance_map.items():
        rtti_map[class_name] = populate_field_list_recursively(class_name)
    for class_name, field_list in rtti_map.items():
        wrapper_template = """\
    RTTI_PROVIDER_BEGIN_TYPE(%s)
    (
    %s
    )
    RTTI_PROVIDER_END_TYPE()
    """
        rendered_fields = []
        for f in field_list:
            rendered_fields.append("    RTTI_DEFINE_FIELD(%s, %s)" % (class_name, f))
        print(wrapper_template % (class_name, ",\n".join(rendered_fields)))
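
    The inheritance step is the least obvious part of the script, so here it is in isolation, on plain dictionaries instead of clang cursors. This is a sketch: the data is typed in by hand from the node classes shown earlier, and collect_fields is a name invented for this example. A class that is not in own_fields (here IRttiTypeIdProvider, which lives in another header and is filtered out by file) contributes no fields.

```python
def collect_fields(class_name, own_fields, bases):
    """Return the fields of all base classes first, then the class's own fields."""
    if class_name not in own_fields:
        return []
    merged = []
    for base in bases.get(class_name, []):
        merged.extend(collect_fields(base, own_fields, bases))
    return merged + own_fields[class_name]


# Hand-written data mirroring the example header.
own_fields = {
    "BaseNode": ["bypass"],
    "SourceNode": ["inputFilePath"],
    "DestinationNode": ["includeDebugInfo", "outputFilePath"],
    "MultiplierNode": ["multiplier"],
    "InverterNode": [],
}
bases = {
    "BaseNode": ["IRttiTypeIdProvider"],
    "SourceNode": ["BaseNode"],
    "DestinationNode": ["BaseNode"],
    "MultiplierNode": ["BaseNode"],
    "InverterNode": ["BaseNode"],
}

for name in own_fields:
    print(name, collect_fields(name, own_fields, bases))
```

    For single inheritance this gives the same order as the script. With several base classes it keeps them in declaration order, which is an assumption about what is wanted.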
    

    This script does not check whether a class has RTTI. So after getting the result, you will have to manually remove the blocks describing classes without RTTI. But that is a trifle.
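
    If the manual cleanup becomes annoying, one possible way to automate it is to look for the static member that RTTI_HAS_TYPE_ID adds. The sketch below assumes that the member is called __typeId (as in MyClass::__typeId above), that RTTI.h is found during parsing so the macro really expands, and that libclang reports a static data member as a VAR_DECL child of the class. It runs on hand-built stand-ins; has_rtti and fake_cursor are names invented here.

```python
from types import SimpleNamespace

# Assumed name of the static member added by the RTTI_HAS_TYPE_ID macro.
TYPE_ID_MEMBER = "__typeId"


def has_rtti(class_cursor):
    """True if the class directly declares a static member named TYPE_ID_MEMBER."""
    return any(
        child.kind.name == "VAR_DECL" and child.spelling == TYPE_ID_MEMBER
        for child in class_cursor.get_children()
    )


def fake_cursor(kind_name, spelling, children=()):
    """Stand-in exposing the Cursor members that has_rtti uses."""
    return SimpleNamespace(
        kind=SimpleNamespace(name=kind_name),
        spelling=spelling,
        get_children=lambda: iter(children),
    )


# BaseNode has no macro, SourceNode does (hand-built approximation).
base_node = fake_cursor("STRUCT_DECL", "BaseNode", [fake_cursor("FIELD_DECL", "bypass")])
source_node = fake_cursor("STRUCT_DECL", "SourceNode", [
    fake_cursor("VAR_DECL", TYPE_ID_MEMBER),
    fake_cursor("FIELD_DECL", "inputFilePath"),
])

with_rtti = [c.spelling for c in (base_node, source_node) if has_rtti(c)]
print(with_rtti)
```

    In the script, the same check could be applied to all_classes before the maps are built.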


    I hope this will be useful to someone and save some time. All the code is posted on GitHub.

