Finding typos in **kwargs

    As the project I am actively involved in grows, I began to run into more and more typos in function argument names, like the one in the picture on the right. Such errors were especially expensive to debug in class constructors, when a misspelled base-class parameter was passed down a long inheritance chain, or not passed at all. Redesigning the interfaces to use dedicated user structures like namedtuple instead of **kwargs had several problems:

    • Worse user experience. You need to pass a specially constructed object to the function, and it is not clear what to do with optional arguments.
    • More complicated development. When inheriting classes, you also need to inherit the corresponding argument structures. That does not work with namedtuple, so you have to write your own tricky class, which is a pile of implementation work.
    • And most importantly, it still did not fully protect against typos in the names.

    The solution I finally came to cannot cover 100% of all possible cases, but in the necessary 80% (in my project, 100%) it does its job perfectly. In short, it consists of analyzing the function's source (byte)code, building a matrix of distances between the "real" names found there and the names passed from outside, and printing warnings according to the given criteria. Sources.

    TDD


    So, first let us state the task precisely. The following example should print five “suspicious” warnings:

    def foo(arg1, arg2=1, **kwargs):
        kwa1 = kwargs["foo"]
        kwa2 = kwargs.get("bar", 200)
        kwa3 = kwargs.get("baz") or 3000
        return arg1 + arg2 + kwa1 + kwa2 + kwa3
    res = foo(0, arg3=100, foo=10, fo=2, bard=3, bas=4, last=5)
    

    1. arg3 was passed instead of arg2
    2. bas was passed instead of bar or baz
    3. bard was passed instead of bar
    4. fo was passed in addition to foo
    5. last is simply superfluous

    Similarly, the example with classes and inheritance should produce the same warnings plus one more (bog passed instead of boo):

    class Foo(object):
        def __init__(self, arg1, arg2=1, **kwargs):
            self.kwa0 = arg2
            self.kwa1 = kwargs["foo"]
            self.kwa2 = kwargs.get("bar", 200)
            self.kwa3 = kwargs.get("baz") or 3000
    class Bar(Foo):
        def __init__(self, arg1, arg2=1, **kwargs):
            super(Bar, self).__init__(arg1, arg2, **kwargs)
            self.kwa4 = kwargs.get("boo")
    bar = Bar(0, arg3=100, foo=10, fo=2, bard=3, bas=4, last=5, bog=6)
    

    Problem Solving Plan


    • For the first example, with a function, we will make a smart decorator; for the second, with classes, a metaclass. They should share all the complex internal logic and are essentially no different, so first we build an internal micro API and only then the user-facing API on top of it. The decorator is named detect_misprints and the metaclass KeywordArgsMisprintsDetector (heavy Java / C# legacy, yeah).
    • The idea of the solution was to analyze the bytecode and build a distance matrix. These are independent steps, so the micro API consists of two corresponding functions; I called them get_kwarg_names and check_misprints (a rough sketch of their signatures follows right after this list).
    • To analyze the code we use the standard inspect and dis modules; to compute the distance between strings, pyxDamerauLevenshtein. The project required compatibility with Python 2, Python 3 and PyPy, and as you can see, the dependencies do not spoil things: they satisfy all of these requirements.
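    To make the structure concrete, here is a rough sketch of what that micro API might look like; the function names come from the article, but the exact signatures are my assumptions, not necessarily those in the repository:

    # Rough shape of the micro API; signatures are assumptions, the real ones
    # live in the linked repository.

    def get_kwarg_names(func):
        """Return the set of **kwargs keys that `func` actually reads
        (via kwargs["key"] or kwargs.get("key", ...))."""

    def check_misprints(real_names, passed_kwargs, tolerance, warn):
        """Compare the names in `passed_kwargs` against `real_names` and call
        `warn(message)` for every name that is unknown or suspiciously similar
        to a real one."""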

    get_kwarg_names (extract names from code)


    There should be a wall of code here, but I would rather just give a link to it. The function takes a function as input and should return the set of keyword argument names found in it. I am not much of a commenter, so I will briefly go over the main points.
    The first thing to do is find out whether the function has **kwargs at all; if not, we return an empty set. Next we determine the actual name of the "double star", because **kwargs is merely a widespread convention. After that the logic, as often happens in version-portable code, splits, but not into the usual Python 2 / Python 3 branches: it splits into <3.4 and >=3.4. The point is that sane disassembly support (along with a major refactoring of dis) appeared in 3.4; before that, strangely enough, without third-party modules you could only print the Python bytecode to stdout (sic!). The dis.get_instructions() function returns a generator of instances describing every bytecode instruction of the analyzed object. As far as I understand, the only reliable description of the bytecode is the list of its opcodes, which is sad, because how specific constructs expand into opcodes had to be determined experimentally.
    We will match two patterns: var = kwargs["key"] and kwargs.get("key"[, default]).

    >>> from dis import dis
    >>> def foo(**kwargs):
    ...     return kwargs["key"]
    ...
    >>> dis(foo)
      2           0 LOAD_FAST                0 (kwargs)
                  3 LOAD_CONST               1 ('key')
                  6 BINARY_SUBSCR
                  7 RETURN_VALUE
    >>> def foo(**kwargs):
    ...     return kwargs.get("key", 0)
    ...
    >>> dis(foo)
      2           0 LOAD_FAST                0 (kwargs)
                  3 LOAD_ATTR                0 (get)
                  6 LOAD_CONST               1 ('key')
                  9 LOAD_CONST               2 (0)
                 12 CALL_FUNCTION            2 (2 positional, 0 keyword pair)
                 15 RETURN_VALUE
    

    As you can see, in the first case it is a combination of LOAD_FAST + LOAD_CONST, in the second LOAD_FAST + LOAD_ATTR + LOAD_CONST. Instead of "kwargs" in the instruction arguments we must look for the name of the "double star" found at the beginning. For a detailed description of the bytecode I refer you to people who actually know it; we will just get things done, that is, move on.
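    For illustration, here is a minimal sketch of such a scan over dis.get_instructions() on Python 3.4+. It is not the implementation from the repository: it matches only the two patterns above, uses getfullargspec just to find the name of the "double star", and on newer interpreters (where kwargs.get compiles to LOAD_METHOD and the opcodes changed again later) it would need extra cases.

    import dis
    import inspect

    def get_kwarg_names(func):
        # Sketch only: find which **kwargs keys the function reads.
        spec = inspect.getfullargspec(func)
        if spec.varkw is None:            # no **kwargs at all
            return set()
        star_star = spec.varkw            # actual name of the "double star"
        names = set()
        instructions = list(dis.get_instructions(func))
        for i, instr in enumerate(instructions):
            if instr.opname != "LOAD_FAST" or instr.argval != star_star:
                continue
            nxt = instructions[i + 1] if i + 1 < len(instructions) else None
            nxt2 = instructions[i + 2] if i + 2 < len(instructions) else None
            # kwargs["key"]          -> LOAD_FAST + LOAD_CONST (+ BINARY_SUBSCR)
            if nxt is not None and nxt.opname == "LOAD_CONST" and isinstance(nxt.argval, str):
                names.add(nxt.argval)
            # kwargs.get("key", ...) -> LOAD_FAST + LOAD_ATTR/LOAD_METHOD (get) + LOAD_CONST
            elif (nxt is not None and nxt.opname in ("LOAD_ATTR", "LOAD_METHOD")
                  and nxt.argval == "get"
                  and nxt2 is not None and nxt2.opname == "LOAD_CONST"
                  and isinstance(nxt2.argval, str)):
                names.add(nxt2.argval)
        return names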
    Then comes an ugly workaround for older versions of Python, based on regular expressions. Using inspect.getsourcelines() we get the list of the function's source lines and run a precompiled regex over each of them. This method is even worse than bytecode analysis: for example, expressions spanning several lines, or several statements joined by a semicolon, are not detected in the current form. Well, that is why it is a workaround, so as not to strain too much... This part can objectively be improved, though, pull requests are welcome :)
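    Such a fallback might look roughly like this (the regexes here are my approximations, with the same blind spots for multi-line and semicolon-joined expressions, and the helper name is made up):

    import inspect
    import re

    def get_kwarg_names_by_regex(func, star_star="kwargs"):
        # Sketch only: scan source lines for kwargs["key"] and kwargs.get("key", ...).
        subscript_re = re.compile(r"\b%s\[(['\"])(.+?)\1\]" % star_star)
        get_re = re.compile(r"\b%s\.get\(\s*(['\"])(.+?)\1" % star_star)
        lines, _ = inspect.getsourcelines(func)
        names = set()
        for line in lines:
            for regex in (subscript_re, get_re):
                for match in regex.finditer(line):
                    names.add(match.group(2))
        return names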

    check_misprints (distance matrix)


    Code. As input we get the result of the previous step, the keyword arguments that were passed, a mysterious tolerance, and a function to warn with. For each passed argument we need to find the edit distance to each "real" name, i.e. one found during the bytecode analysis. In fact, there is no need to blindly compute the entire matrix: if a perfect match has already been found, we can stop. And, of course, the matrix is symmetric, so only half of it needs to be computed. I think it could be optimized further, but with a typical number of kwargs below 30, O(n²) works just fine. We will use the Damerau-Levenshtein distance as a well-known, popular metric that the author understands :) It has been written about on Habr, for example here. Several Python packages implement it; I chose pyxDamerauLevenshtein for the portability of Cython, which it is written in, and for its optimal linear memory consumption.
    The rest is a technicality: if not even a remotely similar "real" name was found for an argument, we declare it categorically useless. If there are several matches with a distance below tolerance, we voice our vague suspicions.
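    Put together, check_misprints might look roughly like this; pyxdameraulevenshtein does export damerau_levenshtein_distance, but the warning texts and the exact tolerance semantics below are my guesses rather than the repository's:

    from pyxdameraulevenshtein import damerau_levenshtein_distance

    def check_misprints(real_names, passed_kwargs, tolerance, warn):
        # Sketch only: warn about unknown and suspiciously similar keyword names.
        for passed in passed_kwargs:
            if passed in real_names:
                continue                   # perfect match, nothing to compute
            similar = sorted(real for real in real_names
                             if damerau_levenshtein_distance(passed, real) <= tolerance)
            if not similar:
                warn("%s is not a recognized keyword argument" % passed)
            else:
                warn("%s looks like a misprint of %s" % (passed, ", ".join(similar)))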

    detect_misprints


    A classic decorator: it precomputes the "real" keyword argument names once and calls check_misprints on every invocation.
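    In sketch form, on top of the hypothetical pieces above (the tolerance value and warning via print are assumptions):

    import functools

    def detect_misprints(func):
        # Sketch of the decorator: analyze once, check on every call.
        real_names = get_kwarg_names(func)     # "real" names, computed at decoration time

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            check_misprints(real_names, kwargs, tolerance=2, warn=print)  # assumed defaults
            return func(*args, **kwargs)

        return wrapper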

    KeywordArgsMisprintsDetector


    Our metaclass intercepts the moment the class type is created (__init__, where once in its lifetime it computes the "real" names) and the moment a class instance is created (__call__, which calls check_misprints). The only subtlety is that a class has an MRO and base classes whose constructors may also use **kwargs, so in __init__ we have to walk all the base classes and add each one's argument names to the common set.
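    A sketch of that metaclass, again built on the hypothetical pieces above (tolerance and print are, once more, assumptions):

    import inspect

    class KeywordArgsMisprintsDetector(type):
        # Sketch: gather "real" kwarg names once per class, check each instantiation.

        def __init__(cls, name, bases, namespace):
            super(KeywordArgsMisprintsDetector, cls).__init__(name, bases, namespace)
            cls._kwarg_names = set()
            for klass in cls.__mro__:              # base classes may use **kwargs too
                ctor = klass.__dict__.get("__init__")
                if inspect.isfunction(ctor):       # skip object.__init__ and other slot wrappers
                    cls._kwarg_names |= get_kwarg_names(ctor)

        def __call__(cls, *args, **kwargs):
            check_misprints(cls._kwarg_names, kwargs, tolerance=2, warn=print)
            return super(KeywordArgsMisprintsDetector, cls).__call__(*args, **kwargs)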

    How to use


    Just add the decorator described above to a function, or the metaclass to a class:
    @detect_misprints
    def foo(**kwargs):
        ...
    @six.add_metaclass(KeywordArgsMisprintsDetector)
    class Foo(object):
        def __init__(self, **kwargs):
            ...
    

    Summary


    I have looked at one way to deal with typos in **kwargs names, and in my case it solved all the problems and satisfied all the requirements. First we analyze the function's bytecode (or just its source code on older versions of Python), then we build a matrix of distances between the names actually used inside the function and the ones passed by the user. The distance is Damerau-Levenshtein, and in the end we print a warning for two kinds of errors: when an argument is completely out of place, and when it merely looks like one of the "real" ones.
    The source code from the article is published on GitHub. I will be happy to receive corrections and improvements. I would also like to hear your opinion on whether this creation should be uploaded to PyPI.
