How to make your code 80 times faster

Original author: Antonio Cuni

We are starting the third intake of the Python Developer course, which means there is an open lesson ahead (these partially replace the old-format open days), where you can find interesting material from our instructors. We have also found another interesting article to share, this time about speeding up Python code.

Go.

Everyone is pleased when PyPy speeds up their code by a factor of two. I want to share a short, personal story showing that PyPy is capable of much more.

DISCLAIMER: this is not a miracle cure for all occasions. It worked in this specific case, but it may not be as effective in many others. Still, the method is interesting, and the steps described here are the ones I actually applied during development, in the same order, which makes this article a real-life example of PyPy optimization.

I experimented with evolutionary algorithms several months ago. The plan was ambitious: to automatically evolve logic capable of controlling a (simulated) quadcopter, i.e. a PID controller (spoiler: it doesn't fly).



The idea is that, given a population of random creatures, in each generation the fittest ones survive and reproduce with small, random mutations.

However, for the purposes of this post, the original task is not that important, so let's go straight to the code. To control the quadcopter, a creature uses the method run_step, which is called at each delta_t (full code):

import numpy as np

class Creature(object):
    INPUTS = 2  # z_setpoint, current z position
    OUTPUTS = 1 # PWM for all 4 motors
    STATE_VARS = 1
    ...
    def run_step(self, inputs):
        # state: [state_vars ... inputs]
        # out_values: [state_vars ... outputs]
        self.state[self.STATE_VARS:] = inputs
        out_values = np.dot(self.matrix, self.state) + self.constant
        self.state[:self.STATE_VARS] = out_values[:self.STATE_VARS]
        outputs = out_values[self.STATE_VARS:]
        return outputs

  • inputs is a numpy array containing the setpoint and the current position along the Z axis;
  • outputs is a numpy array containing the thrust to transmit to the motors. To keep things simple, all 4 motors are constrained to the same thrust, so the quadcopter only moves up and down along the Z axis;
  • self.state contains arbitrary values of unknown meaning, which are carried over from one step to the next;
  • self.matrix and self.constant contain the actual logic. By putting the "right" values into them, in theory we could get a perfectly tuned PID controller. They are randomly mutated between generations.
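To make these semantics concrete, here is a minimal, self-contained sketch of the class. The __init__ shown below is my own assumption (the original listing is abridged); with a matrix of all ones and a zero constant, a single step is easy to follow by hand:

```python
import numpy as np

class Creature(object):
    INPUTS = 2      # z_setpoint, current z position
    OUTPUTS = 1     # PWM for all 4 motors
    STATE_VARS = 1

    def __init__(self, matrix, constant):
        # matrix shape: (STATE_VARS + OUTPUTS, STATE_VARS + INPUTS) == (2, 3)
        self.matrix = matrix
        self.constant = constant
        self.state = np.zeros(self.STATE_VARS + self.INPUTS)

    def run_step(self, inputs):
        self.state[self.STATE_VARS:] = inputs
        out_values = np.dot(self.matrix, self.state) + self.constant
        self.state[:self.STATE_VARS] = out_values[:self.STATE_VARS]
        return out_values[self.STATE_VARS:]

c = Creature(np.ones((2, 3)), np.zeros(2))
outputs = c.run_step(np.array([1.0, 0.5]))  # z_setpoint=1.0, z=0.5
# state was [0, 1, 0.5], so each output is 0 + 1 + 0.5 = 1.5
```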

run_step is called at 100 Hz (in virtual simulation time). Each generation consists of 500 creatures, and each of them is tested for 12 virtual seconds. Thus, each generation involves 600,000 calls to run_step.
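The 600,000 figure is simple arithmetic over the numbers above:

```python
population = 500        # creatures per generation
virtual_seconds = 12    # virtual test time per creature
hz = 100                # run_step calls per virtual second

calls_per_generation = population * virtual_seconds * hz
print(calls_per_generation)  # 600000
```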

First, I tried running the code on CPython; here is the result:

$ python -m ev.main
Generation   1: ... [population = 500]  [12.06 secs]
Generation   2: ... [population = 500]  [6.13 secs]
Generation   3: ... [population = 500]  [6.11 secs]
Generation   4: ... [population = 500]  [6.09 secs]
Generation   5: ... [population = 500]  [6.18 secs]
Generation   6: ... [population = 500]  [6.26 secs]

That is, ~6.15 seconds per generation, excluding the first.

Then, I tried PyPy 5.9:

$ pypy -m ev.main
Generation   1: ... [population = 500]  [63.90 secs]
Generation   2: ... [population = 500]  [33.92 secs]
Generation   3: ... [population = 500]  [34.21 secs]
Generation   4: ... [population = 500]  [33.75 secs]

Ouch! It turned out ~5.5 times slower than CPython. This was partly expected: numpy is based on cpyext, which is notoriously slow. (In fact, we are working on this: on the cpyext-avoid-roundtrip branch the result is already better than with CPython, but we'll talk about this in another post.)

Let's try to avoid cpyext. The first obvious step is to use numpypy instead of numpy (there is a hack that lets you use only the micronumpy part). Let's check whether the speed improves:

$ pypy -m ev.main   # using numpypy
Generation   1: ... [population = 500]  [5.60 secs]
Generation   2: ... [population = 500]  [2.90 secs]
Generation   3: ... [population = 500]  [2.78 secs]
Generation   4: ... [population = 500]  [2.69 secs]
Generation   5: ... [population = 500]  [2.72 secs]
Generation   6: ... [population = 500]  [2.73 secs]

On average, ~2.7 seconds: 12 times faster than PyPy + numpy and more than twice as fast as the original CPython. At this point, many people would stop and go around telling everyone how great PyPy is.

But let's try to improve the result. As usual, we analyze what takes the most time. Here is a link to the vmprof profile. We spend a lot of time inside numpypy, and we allocate tons of temporary arrays to store the results of the various operations.

In addition, let's look at the jit trace and find the run function: this is the loop in which we spend most of the time; it consists of 1796 operations. The operations for the line np.dot(...) + self.constant are between lines 1217 and 1456. Below is an excerpt that calls np.dot(...). Most of the operations cost nothing, but at line 1232 we see a call to the RPython function descr_dot; looking at the implementation, we see that it creates a new W_NDimArray to store the result, which means it has to do a malloc():



The implementation of the + self.constant part is interesting: the call to W_NDimArray.descr_add was inlined by the JIT, so it is easier to see what is happening. In particular, we see a call to __0_alloc_with_del____, which allocates the W_NDimArray for the result, and a raw_malloc, which allocates the actual array. Then comes a long list of 149 simple operations that set the fields of the resulting array, create an iterator, and finally call call_assembler: this is the actual logic for the addition, which was JITted separately; call_assembler is one of the operations used to make JIT-to-JIT calls:



All of this is rather suboptimal: in our case, we know that self.matrix always has size (2, 3), which means we do a huge amount of work, including two malloc() calls for temporary arrays, just to call two functions that perform a total of 6 multiplications and 6 additions. Note that this is not the JIT's fault: CPython + numpy has to do the same, only inside hidden C calls.

A well-known compiler optimization, loop unrolling, could probably solve this problem. From a compiler's point of view, unrolling a loop is always risky: if the matrix is too large, you end up with a huge, shapeless blob of code that is useless if the array size keeps changing. This is one of the main reasons why the PyPy JIT does not even try to do it in this case.

However, we know that the matrix is small and always of the same size, so we can unroll the loop manually:

class SpecializedCreature(Creature):
    def __init__(self, *args, **kwargs):
        Creature.__init__(self, *args, **kwargs)
        # store the data in a plain Python list
        self.data = list(self.matrix.ravel()) + list(self.constant)
        self.data_state = [0.0]
        assert self.matrix.shape == (2, 3)
        assert len(self.data) == 8
    def run_step(self, inputs):
        # state: [state_vars ... inputs]
        # out_values: [state_vars, ... outputs]
        k0, k1, k2, q0, q1, q2, c0, c1 = self.data
        s0 = self.data_state[0]
        z_sp, z = inputs
        #
        # compute the output
        out0 = s0*k0 + z_sp*k1 + z*k2 + c0
        out1 = s0*q0 + z_sp*q1 + z*q2 + c1
        #
        self.data_state[0] = out0
        outputs = [out1]
        return outputs

The code also contains a sanity check to make sure that the computed values match those produced by Creature.run_step.
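Such a sanity check can be sketched as follows: with a fixed, hypothetical (2, 3) matrix and constant, the unrolled arithmetic must produce exactly the same output as the numpy dot product. The helper functions here are my own illustration, not the project's code:

```python
import numpy as np

# hypothetical fixed parameters: a (2, 3) matrix and a constant of length 2
matrix = np.array([[0.1, 0.2, 0.3],
                   [0.4, 0.5, 0.6]])
constant = np.array([0.7, 0.8])

def run_step_numpy(state, inputs):
    # the original, generic implementation
    state[1:] = inputs
    out_values = np.dot(matrix, state) + constant
    state[0] = out_values[0]
    return out_values[1:]

def run_step_unrolled(data, data_state, inputs):
    # the manually unrolled version: same arithmetic, no numpy
    k0, k1, k2, q0, q1, q2, c0, c1 = data
    s0 = data_state[0]
    z_sp, z = inputs
    out0 = s0*k0 + z_sp*k1 + z*k2 + c0
    out1 = s0*q0 + z_sp*q1 + z*q2 + c1
    data_state[0] = out0
    return [out1]

state = np.zeros(3)
data = list(matrix.ravel()) + list(constant)
data_state = [0.0]
a = run_step_numpy(state, np.array([1.0, 0.5]))
b = run_step_unrolled(data, data_state, [1.0, 0.5])
assert abs(a[0] - b[0]) < 1e-12  # both paths agree
```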

Let's check how it works. First with CPython:

$ python -m ev.main
Generation   1: ... [population = 500]  [7.61 secs]
Generation   2: ... [population = 500]  [3.96 secs]
Generation   3: ... [population = 500]  [3.79 secs]
Generation   4: ... [population = 500]  [3.74 secs]
Generation   5: ... [population = 500]  [3.84 secs]
Generation   6: ... [population = 500]  [3.69 secs]

It looks good: 60% faster than the original CPython + numpy implementation. Let's check PyPy:

Generation   1: ... [population = 500]  [0.39 secs]
Generation   2: ... [population = 500]  [0.10 secs]
Generation   3: ... [population = 500]  [0.11 secs]
Generation   4: ... [population = 500]  [0.09 secs]
Generation   5: ... [population = 500]  [0.08 secs]
Generation   6: ... [population = 500]  [0.12 secs]
Generation   7: ... [population = 500]  [0.09 secs]
Generation   8: ... [population = 500]  [0.08 secs]
Generation   9: ... [population = 500]  [0.08 secs]
Generation  10: ... [population = 500]  [0.08 secs]
Generation  11: ... [population = 500]  [0.08 secs]
Generation  12: ... [population = 500]  [0.07 secs]
Generation  13: ... [population = 500]  [0.07 secs]
Generation  14: ... [population = 500]  [0.08 secs]
Generation  15: ... [population = 500]  [0.07 secs]

Yes, it is not a typo. After a couple of generations, the time stabilizes at around ~0.07-0.08 seconds per generation, which is about 80 (eighty) times faster than the CPython + numpy implementation, and 35-40 times faster than the naive PyPy + numpy.
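As a quick arithmetic check of the claimed factors (taking ~0.075 s as the midpoint of the stabilized 0.07-0.08 s range):

```python
cpython_numpy = 6.15       # secs/generation, CPython + numpy
pypy_numpypy = 2.7         # secs/generation, PyPy + numpypy
pypy_specialized = 0.075   # midpoint of the stabilized range

print(round(cpython_numpy / pypy_specialized))  # 82, i.e. "about 80"
print(round(pypy_numpypy / pypy_specialized))   # 36, within the 35-40 range
```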

Let's look at the trace once more: there are no more expensive calls, and definitely no temporary malloc()s. The core of the logic is in lines 386-416, where we see fast C-level multiplications and additions: float_mul and float_add are translated directly into the mulsd and addsd x86 instructions.

As I said, this is a very specific example, and the method is not universal: unfortunately, you should not expect an 80-fold speedup of arbitrary code. However, it clearly shows the potential of PyPy for high-speed computation. And most importantly, this is not a toy benchmark designed specifically to make PyPy look good: it is a small but realistic example.

How to reproduce the result

$ git clone https://github.com/antocuni/evolvingcopter
$ cd evolvingcopter
$ {python,pypy} -m ev.main --no-specialized --no-numpypy
$ {python,pypy} -m ev.main --no-specialized
$ {python,pypy} -m ev.main

THE END
