What I learned about optimization in Python

Original author: Gregory Szorc



I have used Python more than any other programming language over the last 4-5 years. Python is the predominant language for Firefox build tooling, testing, and CI. Mercurial is also mostly written in Python. I have also written many of my side projects in it.

Along the way I have picked up some knowledge about Python performance and about tools for optimizing it. In this article I would like to share that knowledge.

My experience with Python is mostly with the CPython interpreter, especially CPython 2.7. Not all of my observations apply to every Python distribution, and their relevance may differ across Python versions. I will try to point this out as I go. Remember that this article is not a detailed overview of Python performance: I will only talk about what I have encountered myself.

Startup and module import overhead


Starting a Python interpreter and importing modules takes a noticeable amount of time when you count in milliseconds.

If any of your projects needs to run hundreds or thousands of Python processes, this per-process delay of milliseconds adds up to several seconds.

If you use Python for CLI tools, this overhead can cause noticeable lag. If your CLI tool needs to feel instant, spinning up a Python interpreter on every invocation makes that goal hard to reach.

I have written about this problem before: several of my past posts discuss it, for example in 2014, in May 2018, and in October 2018.

There is not much you can do to reduce startup delay: the fix lies in the Python interpreter itself, since it controls the execution that takes too long. The best you can do is disable the import of the site module when invoking Python, to avoid running extra Python code at startup. Many applications rely on functionality from site.py, however, so use this at your own risk.
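
One quick way to see this on your own machine is a small timing sketch (assuming a python3 on PATH; numbers vary by machine and version). On CPython 3.7+ you can also run python3 -X importtime script.py for a per-import breakdown.

import subprocess
import time

# Time bare interpreter startup with and without the site module;
# -S tells Python to skip importing site.
for args in (["python3", "-c", "pass"], ["python3", "-S", "-c", "pass"]):
    start = time.perf_counter()
    subprocess.run(args, check=True)
    print(args, f"{time.perf_counter() - start:.3f}s")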

We should also consider the problem of importing modules. What good is an interpreter that doesn't run any code? Code is most often made available to the interpreter through modules.

Importing a module involves several steps, and each of them is a potential source of overhead and delay.

Some delay comes from locating modules and reading their data. As I demonstrated with PyOxidizer, replacing filesystem-based module lookup and loading with an architecturally simpler solution (reading module data from an in-memory data structure) lets you import the Python standard library in 70-80% of the time it originally took. Having one module per file on the filesystem increases filesystem load and can slow a Python application down during the critical first milliseconds of execution. Solutions like PyOxidizer help avoid this. I hope the Python community sees the costs of the current approach and considers moving to module distribution mechanisms that do not depend so heavily on one file per module.

Another source of import overhead is code that executes in a module at import time. Some modules have code outside their functions and classes, at the top level, and it runs whenever the module is imported. Executing this code makes importing more expensive. The workaround: don't run all the code at import time, only run it when it is actually needed. Python 3.7 supports a module-level __getattr__ that is called when an attribute of the module is not found. It can be used to lazily populate module attributes on first access, as in the sketch below.
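
Here is a minimal sketch of that mechanism (the module name and the expensive computation are made up):

# lazymod.py: module-level __getattr__ (PEP 562, Python 3.7+). The
# expensive work runs on first attribute access, not at import time.

def _compute_table():
    return {i: i * i for i in range(1_000_000)}   # stand-in for real work

def __getattr__(name):
    if name == "TABLE":
        value = _compute_table()
        globals()["TABLE"] = value   # cache so __getattr__ is not called again
        return value
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")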

Another way to attack import slowdown is to import modules lazily. Instead of loading a module when the import statement runs, you register a custom importer that returns a stub instead. On first access, the stub loads the actual module and "mutates" to become that module.
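
The standard library ships one implementation of this idea, importlib.util.LazyLoader; the sketch below follows the documented recipe (it illustrates the technique and is not what Mercurial uses):

import importlib.util
import sys

def lazy_import(name):
    # The returned module is a stub; its body executes on first
    # attribute access rather than here.
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)
    return module

json = lazy_import("json")   # module body has not run yet
json.dumps({"a": 1})         # first access triggers the real load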

In an application that imports a few dozen modules, you can save tens of milliseconds by bypassing the filesystem and by not executing unneeded parts of modules (modules are usually imported wholesale, even though only certain functions from them are used).

Lazy module importing is fragile. Many modules use the pattern try: import foo / except ImportError:. A lazy module importer can never raise ImportError here, because doing so would require searching the filesystem to find out whether the module exists at all. That would add extra work and time, so lazy importers simply don't do it! The problem is rather annoying: Mercurial's lazy module importer keeps a list of modules that cannot be imported lazily and has to special-case them. Another problem is the syntax from foo import x, y, which also defeats lazy importing when foo is a module (as opposed to a package), because the module has to be imported anyway in order to return references to x and y.
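
Concretely, the problematic pattern looks like this (fast_json is a hypothetical optional dependency):

# A lazy importer cannot take the except branch without eagerly
# checking the filesystem for fast_json, which defeats its purpose.
try:
    import fast_json as json   # hypothetical optional accelerator
except ImportError:
    import json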

PyOxidizer has a fixed set of modules baked into the binary, so it can raise ImportError efficiently. The module-level __getattr__ from Python 3.7 gives lazy module importers additional flexibility. I hope to build a reliable lazy importer into PyOxidizer to automate some of this.

The best way to avoid interpreter startup delays entirely is to run a persistent background Python process. If you run Python as a daemon process (say, for a web server), you can do exactly that. The solution Mercurial uses is a background process that speaks a command server protocol: hg is a C (now Rust) executable that connects to this background process and sends it commands. Getting the command server approach right takes a lot of work; it is extremely fragile and raises security concerns. I am considering shipping a command server with PyOxidizer, so that executables get its advantages out of the box and the hard work of solving this problem can be done once, in the PyOxidizer project.
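
For flavor, here is a tiny, hypothetical sketch of the general idea (Unix-only, and not Mercurial's actual protocol): a long-lived daemon accepts commands over a Unix socket, so each invocation skips interpreter startup and module imports.

import socketserver

class CommandHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One command line in, one result out; real dispatch goes here.
        command = self.rfile.readline().strip().decode("utf-8")
        self.wfile.write(f"ran: {command}\n".encode("utf-8"))

if __name__ == "__main__":
    # A thin client (like Mercurial's C/Rust hg) connects to this
    # socket and sends commands instead of spawning an interpreter.
    with socketserver.UnixStreamServer("/tmp/cmdserver.sock", CommandHandler) as server:
        server.serve_forever()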

Function call overhead


Calling a function in Python is relatively slow. (This observation applies much less to PyPy, which can JIT-compile code.)

I have seen dozens of patches to Mercurial that inline and combine code to avoid function call overhead. In the current development cycle, some effort went into reducing the number of function calls made when updating the progress bar. (We use progress bars for any operation that may take a while, so the user understands what is happening.) Caching function call results and avoiding simple function lookups saves tens to hundreds of milliseconds when something executes a million times, for example.

If you have tight loops or recursive functions in Python where hundreds of thousands or more function calls can occur, be aware of the overhead of each individual call, because it adds up quickly. Consider inlining simple functions and combining functions to avoid the overhead.
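
A hypothetical micro-benchmark makes the per-call cost visible (absolute numbers vary by machine and CPython version):

import timeit

setup = "def add(a, b):\n    return a + b"

# The same arithmetic with and without a function call per iteration;
# the version that calls add() is typically measurably slower.
inlined = timeit.timeit("t = 0\nfor i in range(10000):\n    t = t + i", number=1000)
called = timeit.timeit("t = 0\nfor i in range(10000):\n    t = add(t, i)", setup=setup, number=1000)
print(f"inlined: {inlined:.3f}s  calling add(): {called:.3f}s")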

Attribute lookup overhead


This problem is related to function call overhead, since the underlying mechanics are much the same!

Resolving attributes in Python can be slow. (Again, this is faster in PyPy.) Dealing with this problem is something we do a lot in Mercurial.

Let's say you have the following code:

obj = MyObject()
total = 0
for i in range(len(obj.member)):
    total += obj.member[i]

Set aside the fact that there are more efficient ways to write this example (for instance, total = sum(obj.member)), and note that the loop has to resolve obj.member on every iteration. Python has a relatively sophisticated mechanism for resolving attributes. For simple types it can be fast enough. But for complex types, attribute access can implicitly call __getattr__, __getattribute__, various other dunder methods, and even user-defined @property functions. What looks like a quick attribute lookup can turn into several function calls, which means extra overhead. And the overhead compounds if you chain lookups such as obj.member1.member2.member3, etc.
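
For instance, if the hypothetical MyObject from the example defines member as a @property, then every obj.member in the loop is really a function call:

class MyObject:
    # member looks like a plain attribute, but a function runs on
    # every access.
    def __init__(self):
        self._data = list(range(100))

    @property
    def member(self):
        return self._data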

Every attribute resolution carries overhead. And since nearly everything in Python is backed by a dictionary, every attribute lookup is effectively a dictionary lookup. We know from basic data structures that a dictionary lookup is not as fast as, say, an index lookup. Yes, CPython has a few tricks that mitigate dictionary lookup overhead. But the main point I want to make is that every attribute lookup is a potential performance leak.

For tight loops, especially those that can run hundreds of thousands of iterations or more, you can avoid this measurable attribute lookup overhead by aliasing the value to a local variable. Consider the following example:

obj = MyObject()
total = 0
member = obj.member
for i in range(len(member)):
    total += member[i]

Of course, this is only safe if obj.member is not reassigned inside the loop. If it is, the local variable will keep a reference to the old object, and things may blow up.

The same trick works for object methods. Instead of

obj = MyObject()
for i in range(1000000):
    obj.process(i)

You can do the following:

obj = MyObject()
fn = obj.process
for i in range(1000000):
    fn(i)

It is also worth noting that when an attribute lookup ends in a method call (as in the previous example), Python 3.7 is relatively faster than earlier releases. But I am fairly sure the overhead there is dominated by the function call itself, not by the attribute lookup. So everything will still run faster if you drop redundant attribute lookups.

Finally, since an attribute lookup is usually cheaper than a function call, attribute lookups are generally less of a problem than function call overhead. You typically need to eliminate a lot of them to notice a significant performance change. That said, once you start accessing several attributes inside a loop, you can easily reach 10 or 20 lookups per iteration before any function is even called. And loops of only thousands or tens of thousands of iterations can quickly produce hundreds of thousands or millions of attribute lookups. So be careful!

Object overhead


From the interpreter's point of view, every value is an object. In CPython, each value is a PyObject structure. Each object the interpreter manages lives on the heap and carries its own memory holding a reference count, the object's type, and other bookkeeping. Each object is reclaimed through garbage collection. So every new object adds overhead from reference counting, garbage collection, and so on. (Again, PyPy can avoid much of this burden, since it is smarter about the lifetimes of short-lived values.)

As a rule, the more unique values and Python objects you create, the slower everything runs.

Say you iterate over a collection of one million items, calling a function on each item and receiving a tuple back:

for x in my_collection:
    a, b, c, d, e, f, g, h = process(x)

In this example, process() returns an 8-tuple. Whether or not we discard the return value, the tuple requires creating at least 9 Python values: 1 for the tuple itself and 8 for its members. In real life there may be fewer values, if process() returns references to existing objects, or more, if the members are not simple types and need many PyObjects to represent them. My point is that under the hood, the interpreter is juggling a lot of objects just to represent certain constructs.

In my experience, these overheads only matter for operations that would see speed gains when implemented in a native language such as C or Rust. The problem is that the CPython interpreter simply cannot execute bytecode fast enough for object count overhead to be the bottleneck: you are far more likely to lose performance to function calls, heavy computation, and so on before object overhead becomes visible. There are, of course, a few exceptions, such as constructing tuples or dictionaries of several values.

A concrete example of this overhead is Mercurial's C code that parses low-level data structures. For raw parsing speed, the C code runs an order of magnitude faster than CPython. But as soon as that C code creates PyObjects to represent its results, the speed drops severalfold. In other words, the overhead comes from creating and managing the Python objects needed to make the data usable from Python code.

One way around this problem is to produce fewer Python objects. If you only need a single value, have the function return just that value rather than a tuple or a dictionary of N items. But keep watching for function call overhead while you do!
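
A hypothetical sketch of the difference (the record layout is made up):

# Materializing a tuple per record creates one tuple plus references
# to its members for every record; returning one field creates far
# fewer objects.
def process_full(records):
    return [(r["a"], r["b"], r["c"]) for r in records]

def process_a_only(records):
    return [r["a"] for r in records]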

If you have a lot of performance-critical code that uses the CPython C API, and values that need to be shared between modules, do without Python types: represent the data as C structures, and use pre-compiled code to access those structures instead of going through the CPython C API. By avoiding the CPython C API for data access, you shed a large amount of overhead.

Treating values as data (rather than exposing functions to access everything) is the better approach for a Pythonista anyway. Another workaround for compiled code is to instantiate PyObjects lazily. If you create a custom type (PyTypeObject) in C to represent complex values, you can define tp_members or tp_getset fields to register custom C functions that resolve an attribute's value. If you are writing a parser, say, and know that consumers will only access a subset of the parsed fields, you can quickly construct a type holding the raw data, return that type, and use a C attribute-lookup function that creates PyObjects on demand. You can even defer parsing until the attribute is accessed, saving the work entirely if parsing is never needed! This technique is quite rare, since it requires writing non-trivial code, but it pays off.
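
The same idea can be sketched in pure Python (the C version would use tp_getset instead; the record format here is made up):

class LazyRecord:
    # Keep the raw bytes and only parse a field the first time it is
    # accessed; if it is never accessed, the parse never happens.
    def __init__(self, raw):
        self._raw = raw
        self._name = None

    @property
    def name(self):
        if self._name is None:
            self._name = self._raw.split(b",", 1)[0].decode("utf-8")
        return self._name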

Pre-sizing collections


This applies to the CPython C API.

When creating collections such as lists or dictionaries, use PyList_New() + PyList_SET_ITEM() to populate a new collection whose size is known at creation time. The collection's storage is pre-sized for the final number of elements, which skips the capacity checks that would otherwise happen on each insert. When creating collections of thousands of elements, this saves real work!
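
The advice above concerns the C API, but the same principle can be sketched at the Python level:

# Pre-sized: storage for all n slots is allocated once up front.
n = 100_000
items = [None] * n
for i in range(n):
    items[i] = i * i

# Grow-as-you-go: append() must check capacity and occasionally
# reallocate and copy as the list grows.
grown = []
for i in range(n):
    grown.append(i * i)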

Using Zero-copy in the C API


The Python C API really likes to create copies of data rather than return references to it. For example, PyBytes_FromStringAndSize() copies the char* into memory owned by Python. If you do this for a large number of values or for big chunks of data, you could be talking about gigabytes of memory I/O and the associated load on the allocator.

If you need to write high-performance code, you should familiarize yourself with the buffer protocol and related types such as memoryview.

The buffer protocol is implemented by Python types and lets the interpreter cast a type to and from bytes. It also allows C code to obtain a void* handle of a known size, which makes it possible to associate any address in memory with a PyObject. Many functions that work with binary data transparently accept any object implementing the buffer protocol. And if you want to accept any object that can be treated as bytes, use the s*, y*, or w* format units when parsing function arguments.

By using the buffer protocol, you give the interpreter the best chance available to use zero-copy operations and avoid copying extra bytes around in memory.

By using Python types such as memoryview, you also let Python access regions of memory by reference instead of by copy.
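
A small pure-Python sketch of the difference:

import hashlib

data = bytes(16 * 1024 * 1024)   # stand-in for a large binary buffer

copied = data[1024:8 * 1024 * 1024]            # bytes slice: copies ~8 MB
view = memoryview(data)[1024:8 * 1024 * 1024]  # memoryview slice: zero-copy

# APIs that speak the buffer protocol accept either transparently.
assert hashlib.sha256(view).digest() == hashlib.sha256(copied).digest()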

If your Python program moves gigabytes of data around, judicious use of Python types that support zero-copy will save you from big performance losses. I once found python-zstandard to be faster than the Python LZ4 bindings (when it should be the other way around), because I made heavy use of the buffer protocol and avoided excessive memory I/O in python-zstandard!

Conclusion


In this article I have tried to cover some of what I have learned while optimizing Python programs over several years. To repeat: it is by no means a comprehensive overview of Python performance techniques. I admit that my use of Python is probably more demanding than most, and my recommendations will not apply to every program. You should by no means go off and mass-edit your Python code to remove, say, attribute lookups after reading this article. As always with performance optimization, fix the places where the code is measurably slow first. I highly recommend py-spy for profiling Python applications. That said, don't forget about the cost of low-level activity in Python, such as function calls and attribute lookups. If you have a known hot loop, experiment with the suggestions in this article and see whether you can measure an improvement!

Finally, this article should not be read as an attack on Python or its overall performance. Yes, you can argue about whether Python should or should not be used in particular situations because of its performance characteristics. At the same time, Python is remarkably versatile, especially together with PyPy, which delivers exceptional performance for a dynamic programming language. Python's performance is probably good enough for most people. For better or worse, I have always used Python for cases that sit outside the norm. Here I wanted to share my experience, so that others get a clearer picture of what life "on the front line" can look like. And maybe, just maybe, I can nudge the smart people who work on Python distributions to think about these problems.

