Pylint from the inside. How he does it

    Different assistants in writing a cool code just surround us, linters, tipchekery, utilities for finding vulnerabilities, all with us. We are accustomed to and use without going into details, like a “black box”. For example, few people understand the principles of Pylint - one of these indispensable tools for optimizing and improving code in Python.

    But Maxim Mazaev knows how important it is to understand his tools, and he told us at  Moscow Python Conf ++ . Using real examples, he showed how knowledge of the internal structure of Pylint and its plug-ins helped reduce the time for code review, improve the quality of the code, and generally improve the efficiency of development. Below is a transcript instruction.



    Why do we need Pylint?


    If you already use it, the question may arise: “Why know what Pylint has inside, how can this knowledge help?”

    Usually, developers write code, launch a linter, get messages on how to improve, how to make the code more beautiful, and make the proposed changes. Now it is more convenient to read the code and it is not a shame to show colleagues.

    For a long time, in CIAN, this was exactly how they worked with Pylint, with a few additions: they changed configurations, removed extra rules, increased the maximum string length.

    But at some point we ran into a problem, to solve which, I had to dig deep inside Pylint and figure out how it works. What is the problem and how it was solved, read on.


    About the speaker: Maxim Mazaev ( backslash ), 5 years in development, works in CIAN. Examines deep Python, asynchrony and functional programming.

    About CIAN


    The majority believes that CIAN is a real estate agency with realtors and are very surprised when they find out that programmers work instead of realtors.

    We are a technical company, in which there are no realtors, but there are a lot of programmers.

    • 1 million unique users per day.
    • The largest bulletin board on selling and renting real estate in Moscow and St. Petersburg. In 2018, we reached the federal level and work across Russia.
    • Nearly 100 people in the development team, of whom 30 write Python code every day.

    Every day, hundreds and thousands of lines of new code go into production. The requirements for code are quite simple:

    • Code of decent quality.
    • Stylistic uniformity. All developers should write approximately similar code, without a “vinaigrette” in the repositories.

    To achieve this, of course, need a code review.

    Code review


    Code review in CIAN takes place in two stages:

    1. The first stage is automated . A Jenkins robot runs tests, runs Pylint and checks API consistency between microservices, since we use microservices. If at this stage the tests fail or the linter shows something strange, then this is the reason for rejecting the pull request and sending the code for revision.
    2. If the first stage is successful, then the second stage comes - approval from  two developers . They can evaluate how good the code is in terms of business logic, approve a pull request, or return the code for revision.


    Problems code review


    Pull request may fail code review due to:

    • errors in business logic, when the developer is ineffective or incorrectly solved the problem;
    • problems with code style.

    What could be the problems with the style if the linter checks the code?

    Everyone who writes in Python knows that there is a guide to writing PEP-8 code . Like any standard, PEP-8 is quite common and for us, as developers, this is not enough. The standard wants to be specified in some places, and expanded in others.

    Therefore, we came up with our own internal arrangements on how the code should look and work, and called them  "Decline Cian Proposals" .



    “Decline Cian Proposals” is a set of rules, now there are about 15. Each of these rules is the basis for a pull request to be rejected and sent for revision.

    What prevents productive code review?


    There is one problem with our internal rules - the linter does not know about them, and it would be strange if he knew - they are internal.
    A developer who performs a task should always remember and keep the rules in mind. If he forgets one of the rules, then in the process of code review, the reviewers will point out the problem, the task will be sent for revision and the release time of the task will increase. After completion and correction of errors, the verifiers need to remember what was in the task, switch the context.

    It creates a problem for both the developer and the verifiers. As a result, the speed of code review is critically reduced. Instead of analyzing the logic of the code, the verifiers begin to analyze the visual style, that is, they do the work of the linter: they scan the code line by line and look for inconsistencies in the indents, in the import format.

    We would like to get rid of this problem.

    And do not write us your linter?


    It seems that the problem will be solved by a tool that will know about all internal agreements and will be able to check the code for their execution. It turns out we need your linter?

    Not really. The idea is stupid because we already use Pylint. This is a convenient linter, it is pleasant to developers and is built into all processes: it runs in Jenkins, generates beautiful reports that are completely satisfied and come in the form of comments in pull request. Everything is fine, the second linter is not needed .

    So how to solve the problem if we don’t want to write our own linter?

    Write a plugin for Pylint


    For Pylint, you can write plugins, they are called checkers. Under each internal rule, you can write your checker, which will check it.

    Consider two examples of such checkers.

    Example number 1


    At some point, it was found that the code contains a lot of comments like “TODO” - promises to refactor, delete unnecessary code or rewrite it nicely, but not now, but later. There is a problem with such comments - they absolutely do not oblige to anything.

    Problem


    The developer wrote a promise, exhaled and went with peace of mind to engage in the next task.


    Eventually:

    • comments with promises hang for years and are not fulfilled;
    • the code is littered;
    • technical debt accumulated over the years.

    For example, the developer promised to remove something 3 years ago after a successful release, but did the release happen in 3 years? It is possible that yes. Does code delete in this case? This is a big question, but most likely not.

    Solution: write your checker for Pylint


    It is impossible to forbid developers to write such comments, but you can make additional work: create a task to refine the promise in the tracker. Then we will not forget about it.

    We need to find all comments of the form TODO and make sure that in each of them there is a link to the task in Jira. Let's write.

    What is a checker in terms of Pylint? This is a class that inherits from the base class of the checker and implements a certain interface.

    classTodoIssueChecker(BaseChecker):
        _ _implements_ _ = IRawChecker
    

    In our case, this is the IRawChecker  - the so-called “raw” checker.

    A raw checker is iterated over file lines and can perform a specific action on a line. In our case, in each line, the checker will look for something similar to a comment and a link to the task.

    For the checker, you need to define a list of messages that he will issue:

    msgs = {
        'С9999': ('Комментарии с TODO без ссылки на задачу',
            issue-code-in-todo', 
            'Длинное описание')}
    

    The message has:

    • the description is short and long;
    • checker code and a short mnemonic name that defines what the message is.

    The message code is “C1234”, in which:

    • The first letter is clearly standardized for different types of messages: [C] onvention; [W] arning; [E] yy; [F] atal; [R] efactoring. Thanks to the letter, the report immediately shows what is happening: a reminder of the agreements or fatal problems that need to be urgently addressed.
    • 4 random numbers unique within Pylint.

    The code is needed to disable the check if it becomes unnecessary. You can write Pylint: disable and a short alphanumeric code or mnemonic name:

    # Pylint: disable=C9999# Pylint: disable=issue-code-in-todo

    The authors of Pylint recommend to abandon the alphanumeric code and use the mnemonic, it is more visual.

    The next step is to define a method called process_module .



    The name is very important. The method should be called that way, because then Pylint will call it.

    The node parameter is passed to the module . In this case, no matter what it is or what type it is, it is only important to remember that the node has a stream method that returns a file line by line.

    You can go through the file and check the presence of comments and links to the task for each line. If there is a comment, but there is no link, then throw out a warning like 'issue-code-in-todo'with checker code and line number. The algorithm is quite simple.

    Register the checker so that Pylint knows about it. This is done by the register function :

    defregister(linter: Pylinter) -> None:
        linter. register_checker ( 
        TodoIssueChecker(linter)
        )
    

    • An instance of the Pylint comes into the function.
    • It calls the register_checker method.
    • We pass the checker itself to the method.

    Important point: the checker module must be in PYTHONPATH, so that Pylint can import it later.

    The registered checker is checked with a test file with comments without reference to the tasks.

    $ cat work.ру
    # T0D0: Удалю через неделю, честно-честно!
    $ pylint work.ру --load-plugins todo_checker
    …
    

    For the test, we launch Pylint, transfer the module to it, transfer the checker using the load-plugins parameter , and launch two phases inside the linter.

    Phase 1. Initialization of plugins


    • All modules where there are plugins are imported. Pylint has internal checkers and external ones. They all come together and imported.
    • Register - module.register (self) . For each checker, the register function is called, to which the Pylint instance is passed.
    • Checks are performed: for the validity of the parameters, for the presence of messages, options and reports in the correct format.

    Phase 2. Disassemble the checker pool


    After phase 1 there is a whole list of checkers of different types:

    • AST checker;
    • Raw checker;
    • Token checker.



    From the list we select those that belong to the interface of the “raw” checker: we look at which checkers implement the IRawChecker interface and take them for ourselves.

    For each selected checker, call the method checker.process_module (module) , and run the check.

    Result


    Run the checker on the test file again:

    $ cat work.ру
    # T0D0: Удалю через неделю, честно-честно!
    $ pylint work, ру --load-plugins todo_checker 
    С: 0,0: Комментарий с T0D0 без ссылки на задачу
    (issue-code-in-todo)
    

    A message appears stating that there is a comment with TODO and no link to the task.

    The problem is solved and now in the process of code review, developers do not need to scan the code with their eyes, find comments, write a reminder to the code author that there is an agreement and it is desirable to leave a link. Everything happens is automated and code review runs a little faster.

    Example No. 2. keyword-arguments


    There are functions that take positional arguments. If there are a lot of arguments, then when they call the function it is not very clear where the argument is and why it is needed.

    Problem


    For example, we have a function:

    get_offer_by_cian_id(
        "sale",
        Тrue,
        859483,
    )
    

    In the code there is sale and  True and it is not clear what they mean. It is much more convenient when functions with many arguments would be called only with named arguments:

    get_offer_by_cian_id( 
        deal_type="sale",
        truncate=True, 
        cian_id=859483, 
    )
    

    This is a good code in which it is immediately clear where a parameter is and we will not confuse their sequence. Let's try to write a checker that checks such cases.

    The “raw” checker used in the previous example is very difficult to write for such a case. You can add super-complex regular expressions, but such code is hard to read. It’s good that Pylint makes it possible to write another type of checker based on the abstract AST syntax tree , and we’ll use it.

    Lyrical digression about AST


    An AST or abstract syntax tree is a representation of code in the form of a tree, where the vertex is the operands, and the leaves are the operators.

    For example, a function call, where there is one positional argument and two named ones, is transformed into an abstract tree:


    Here there is a vertex with the type Call and it has:

    • function attributes with the name func;
    • a list of positional args arguments, where there is a node with type Const and a value of 112;
    • list of named arguments.

    The challenge in this case is:

    • Find in the module all nodes with type Call (function call).
    • Count the total number of arguments that the function takes.
    • If there are more than 2 arguments, then make sure that there are no positional arguments in the node.
    • If there are positional arguments, then show a warning.


    Саll(
        func=Name(name='get_offer'), 
        args=[Const(value=1298880)],
        keywords=[
            …
        ]))]
    

    From the point of view of Pylint, an AST-based checker is a class that inherits from the basic checker class and implements the IAstroidChecker interface :

    classNonKeywordArgsChecker(BaseChecker):
        -_ _implements_ _ = IAstroidChecker
    

    As in the first example, the list of messages contains the checker description, message code, and short mnemonic name:

    msgs = {
        'С9191': ('Краткое описание',
            keyword-only-args',
            'Длинное описание')}
    

    The next step is to define the visit_call method :

    defvisit_call(self, node: Call)

    The method need not be so called. The most important thing in it is the prefix visit_, and then comes the name of the vertex that interests us, with a small letter.

    • The AST parser walks through the tree and for each vertex it looks to see if the check_ interface has a visit_ <Name> interface.
    • If yes, then cause it.
    • Recursively passes through all her children.
    • When leaving the node, calls the method le_ve_ <Name>.

    In this example, the visit_call method will receive a node with a Call type as input and see if it has more than two arguments and positional arguments are present to throw out the warning and pass the code and the node itself.

    defvisit_call(self, n):if node.args and len(node.args + node.keywords) > 2: 
            self.add_message(
                'keyword-only-args', 
                    node=node
            )
    

    We register the checker, as in the previous example: we transfer the Pylint instance, call the register_checker, passing the checker itself, and start it.

    defregister(linter: Pylinter) -> None: 
        linter.register_checker( 
        TodoIssueChecker(linter)
        )
    

    This is an example of a test function call, where there are 3 arguments and only one of them is named:

    $ cat work.ру
    get_offers(1, True, deal_type="sale")
    $ Pylint work.py --load-plugins non_kwargs_checker
    …
    

    This is a function that is potentially called wrong from our point of view. Launch Pylint.

    Phase 1 initialization of plug-ins is completely repeated, as in the previous example.

    Phase 2. Analysis of the module on the AST


    Code understands AST-tree. Parsing is done by the Astroid library .

    Why Astroid, not AST (stdlib)


    Astroid inside itself does not use the standard Python AST module, but the  typed AST parser typed_ast , characterized in that it supports the TIPs from the PEP 484. Typed_ast  is a branch of the AST fork, which develops in parallel. Interestingly, there are the same bugs that are in AST, and are repaired in parallel.

    from module import Entity
    deffoo(bar):# type: (Entity) -> None return

    Previously, Astroid used a standard AST module in which it was possible to encounter the problem of using the tiepts defined in the comments used in the second Python. If you check such code through Pylint, then up to a certain point, it cursed unused imports, because the imported class Entity is only present in the comment.

    At some point on GitHub, Guido Van Rossum came to Astroid and said: “Guys, you have a Pylint that swears at such cases, and we have a typed AST parser that supports this all. Let's be friends!"

    Work has begun to boil! 2 years have passed, this spring, Pylint has already switched to a typed AST parser and stopped swearing at such things. Tayphint imports are no longer marked as unused.

    Astroid uses an AST parser to parse the code into a tree, and then does some interesting things when building it. For example, if you use import * , it will import everything by star, and add to locals to prevent errors with unused imports.

    Transform pluginsused in cases where there are some complex models based on meta-classes, when all attributes are generated dynamically. In this case, Astroid is very difficult to understand what is meant. When checking, Pylint will swear that there is no such attribute in the models when it is accessed, and using the Transform plugins you can solve the problem:

    • Help Astroid modify the abstract tree and understand the dynamic nature of Python.
    • Supplement the AST with useful information.

    A typical example is pylint-django . When working with complex django-models, the linter often swears at unknown attributes. Pylint-django just solves this problem.

    Phase 3. Disassemble the checker pool


    We return to the checker. Again we have a list of checkers, of which we find those that implement the AST checker interface.

    Phase 4. Disassemble checkers by node type.


    Then we find methods for each checker, they can be of two types:

    • visit_ <node name>
    • lеве_ <node name>.

    It would be nice while walking on a tree to know for which node which checkers you need to call. Therefore, they understand the dictionary, where the key is the name of the node, the value is a list of those checkers who are interested in entering this node.

    _visit_methods = dict( 
        <Имя ноды> : [checker1, checker2 ... checkerN]
    )
    

    The same with leave-methods: a key in the form of a node name, a list of checkers who are interested in the fact of leaving the node.

    _leave_methods = dict( 
        <Имя ноды>: [checker1, checker2 ... checkerN]
    ) 
    

    Launch Pylint. It shows a warning that we have a function with more than two arguments and a positional argument in it:

    $ cat work.ру
    get_offers(1, True, deal_type="sale")
    $ Pylint work.py --load-plugins non_kwargs_checker 
    C: 0, 0: Функция c >2 аргументами вызывается с позиционными аргументами (keyword-only-args)
    

    Problem solved. Now, code review programmers do not need to read the arguments of the function; the linker will do this for them. We saved our time , time for code review and tasks are faster in production.

    And write the tests?


    Pylint allows unit checkers to be tested and is very simple. From the point of view of the linter, the test-checker looks like a class that inherits from the abstract CheckerTestCase . It is necessary to specify the checker that is checked.

    classTestNonKwArgsChecker(CheckerTestCase): 
        CHECKER_CLASS = NonKeywordArgsChecker
    

    Step 1. Create a test AST node from the part of the code that we check.

    node = astroid.extract_node( 
        "get_offers(3, 'magic', 'args')"
    )
    

    Step 2. We check that the checker, going to the node, either throws, or does not throw the corresponding message:

    with self.assertAddsMessages(message):
        self.checker.visit_call(node)
    

    Tokenchecker


    There is another type of checker called TokenChecker . It works on the principle of a lexical analyzer. In Python, there is a tokenize module that does the work of a lexical scanner and breaks the code into a list of tokens. It might look something like this:


    The names of variables, functions, and keywords become tokens of the type NAME, and the delimiters, parentheses, and colons become tokens of the type OP. In addition, there are separate tokens for indentation, line feed and reverse translation.

    How Pylint works with TokenChecker:

    • The module under test is tokenized.
    • A huge list of tokens is passed to all checkers that implement ITokenChecker and the process_tokens (tokens) method is called .

    We have not found the use of TokenChecker, but there are some examples that Pylint itself uses:

    • Spell Check . For example, you can take all text-type tokens and look at lexical literacy, check words from stop-word lists, etc.
    • Check indents , spaces.
    • Work with strings . For example, you can verify that Unicode literals are not used in Python 3 or make sure that only ASCI characters are present in the byte string.

    findings


    We had a problem with code review. The developers did the work of the linter, wasting their time on meaningless scanning of the code and informing the author about errors. With Pylint, we:

    • We transferred the routine checks to the linter, implemented internal agreements in it.
    • Increased speed and quality code review.
    • Reduced the number of rejected pull requests, and the time for passing tasks to production has decreased.

    A simple checker is written in half an hour, and complex in a few hours. Checker saves much more time than it takes to write and fight back for a few undone pull request.

    You can learn more about Pylint and how to write checkers for it in the official documentation , but in terms of writing checkers it is rather poor. For example, about TokenChecker there is only a mention that it is there, but not about how to write the checker itself. More information is in the Pylint source on GitHub . You can see what checkers are in the standard package and be inspired to write your own.

    Knowledge of the internal device Pylint saves man-hours, simplifies
    operation and improves the code. Save your time, write good code and
    use linters.

    The next Moscow Python Conf ++ conference will take place on April 5, 2019  and you can book an early birf ticket now. It will be even better to collect your thoughts and apply for a report, then the visit will be free, and the bonus will be pleasant buns, including coaching for making the report.

    Our conference is a platform for meeting with like-minded people, engines of the industry, for communication and discussion of favorite Python developers: backend and web, data collection and processing, AI / ML, testing, IoT. How it went in the fall in the video report on our Python Channel and subscribe to the channel - we will soon post the best reports from the conference to the public.

    Also popular now: