Python performance benchmark for computational tasks

    Motivation


    Recently, version 0.34 of Numba, an optimizing JIT compiler library for Python, was released. And there is reason to cheer: it brings the long-awaited annotation semantics and a set of methods for organizing parallel computations, built on Intel's ParallelAccelerator technology.

    In this article I want to share the results of a first computing-speed test of this library on a modern machine with a quad-core processor.

    Introduction


    Python is currently used very actively in scientific computing, and in Machine Learning it is practically one of the standards. But if you look a little deeper, almost everywhere Python serves as a wrapper over lower-level libraries written mostly in C/C++. Is it possible to write really fast and parallel code in pure Python?

    Consider a very simple task. Suppose we are given two sets of N points in three-dimensional space, p and q. We need to compute a matrix based on the pairwise distances between all the points:

    $R(i, j) = \frac{1}{1 + \|p(i) - q(j)\|_2}$


    For all tests we take N = 5000. The computation time is averaged over 10 runs.
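
    For concreteness, the test setup can be sketched as follows (the timing helper is my own illustration, not the original benchmark harness; the point layout, p of shape (N, 3) and q of shape (3, N), is the one the implementations below assume):

    import time
    import numpy as np

    N = 5000
    p = np.random.rand(N, 3)   # one point per row
    q = np.random.rand(3, N)   # one point per column (note the transposed layout)

    def benchmark(fn, *args, repeats=10):
        # Average wall-clock time over `repeats` runs, in milliseconds
        start = time.perf_counter()
        for _ in range(repeats):
            fn(*args)
        return (time.perf_counter() - start) / repeats * 1000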

    C++ implementation


    As a reference point, we take the following implementation in C++:

    #include <cmath>
    #include <vector>

    struct Point { double x, y, z; };

    // Matrix is assumed to be a dense matrix type with an (i, j) accessor
    void getR(std::vector<Point> & p, std::vector<Point> & q, Matrix & R)
    {
        #pragma omp parallel for
        for (int i = 0; i < (int)p.size(); i++) {
            for (int j = 0; j < (int)q.size(); j++) {
                // Declared inside the loop so each OpenMP thread gets private copies
                double rx = p[i].x - q[j].x;
                double ry = p[i].y - q[j].y;
                double rz = p[i].z - q[j].z;
                R(i, j) = 1 / (1 + std::sqrt(rx * rx + ry * ry + rz * rz));
            }
        }
    }
    

    The outer loop over the points of p is executed in parallel using OpenMP (this requires compiling with OpenMP support enabled, e.g. -fopenmp on GCC/Clang).

    Execution time: 44 ms.

    Pure Python


    Let's start the speed test with pure Python code:

    import math
    import numpy as np

    def get_R(p, q):
        # p has shape (N, 3); q has shape (3, N)
        R = np.empty((p.shape[0], q.shape[1]))
        for i in range(p.shape[0]):
            for j in range(q.shape[1]):
                rx = p[i, 0] - q[0, j]
                ry = p[i, 1] - q[1, j]
                rz = p[i, 2] - q[2, j]
                R[i, j] = 1 / (1 + math.sqrt(rx * rx + ry * ry + rz * rz))
        return R
    

    Runtime: 52,861 ms, more than 1000 times slower than the base implementation.

    Python is an interpreted language, and it uses the GIL (Global Interpreter Lock) internally, which makes thread-level parallelization impossible at the level of the code itself. It is all very slow. Now let's start speeding it all up.

    Python + NumPy + SciPy


    The slowness of Python for numerical problems was recognized a long time ago, and the answer to it was the NumPy library. The ideology of NumPy is in many ways close to MATLAB, a universally recognized tool for scientific computing.

    We stop thinking iteratively and start treating matrices and vectors as atomic objects of computation. All operations on matrices and vectors are then performed at the lower level by high-performance linear algebra libraries such as Intel MKL or ATLAS.

    The implementation on NumPy looks like this:

    def get_R_numpy(p, q):
        # The (N, 1) and (1, N) slices below are broadcast to (N, N) matrices
        Rx = p[:, 0:1] - q[0:1]
        Ry = p[:, 1:2] - q[1:2]
        Rz = p[:, 2:3] - q[2:3]
        R = 1 / (1 + np.sqrt(Rx * Rx + Ry * Ry + Rz * Rz))
        return R
    

    In this implementation there is not a single explicit loop!
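
    The loops disappear thanks to NumPy broadcasting: the (N, 1) and (1, N) slices are automatically expanded to a common (N, N) shape. A tiny illustration:

    import numpy as np

    a = np.arange(3).reshape(3, 1)  # shape (3, 1)
    b = np.arange(4).reshape(1, 4)  # shape (1, 4)
    print((a - b).shape)            # (3, 4): both operands were broadcast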

    Runtime: 839 ms, which is 19 times slower than the base implementation.

    Moreover, NumPy and SciPy provide a huge number of built-in functions. The SciPy implementation of this task looks like this:

    import scipy.spatial.distance as spd

    def get_R_scipy(p, q):
        # cdist computes the full (N, N) matrix of pairwise Euclidean distances
        D = spd.cdist(p, q.T)
        R = 1 / (1 + D)
        return R
    

    Runtime: 353 ms, which is 8 times slower than the base implementation.

    For most tasks this is already an acceptable time. The price is a change in the way of thinking: the code now has to be assembled from basic linear algebra operations. Sometimes the result looks very elegant, but sometimes you have to invent various tricks; one such trick is sketched below.
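
    One classic trick of this kind (my own illustration, not from the original benchmark) is to expand the squared distance as ||p - q||^2 = ||p||^2 - 2 p·q + ||q||^2, so that the cross term becomes a single matrix product handled by the BLAS backend:

    import numpy as np

    def get_R_dot(p, q):
        # p: (N, 3), q: (3, N)
        p2 = (p * p).sum(axis=1)[:, None]   # (N, 1) squared norms of the rows of p
        q2 = (q * q).sum(axis=0)[None, :]   # (1, N) squared norms of the columns of q
        D2 = p2 - 2 * np.dot(p, q) + q2     # (N, N) squared pairwise distances
        # Clip tiny negative values caused by floating-point round-off
        return 1 / (1 + np.sqrt(np.maximum(D2, 0)))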

    But what about parallelism? Here it is implicit: we hope that at the lower level all operations on matrices and vectors are implemented efficiently and in parallel.
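
    Whether that hope is justified depends on how NumPy was built. One quick way to see which BLAS/LAPACK backend your installation links against (a diagnostic aside, not from the original article):

    import numpy as np

    # Prints the BLAS/LAPACK libraries this NumPy build was linked against
    np.show_config()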

    But what if our code does not fit into linear algebra, or we want explicit parallelization?

    Python + Cython


    This is where Cython comes in. Cython is a special language that allows you to embed C-like code inside regular Python code. Cython translates this code into .c files, which are compiled into Python extension modules. These modules can then be called transparently from other parts of the Python code. The Cython implementation looks like this:

    import numpy as np
    cimport cython
    from cython.parallel import prange
    from libc.math cimport sqrt

    @cython.wraparound(False)
    @cython.nonecheck(False)
    @cython.boundscheck(False)
    def get_R_cython_mp(py_p, py_q):
        py_R = np.empty((py_p.shape[0], py_q.shape[1]))
        cdef int nP = py_p.shape[0]
        cdef int nQ = py_q.shape[1]
        cdef double[:, :] p = py_p
        cdef double[:, :] q = py_q
        cdef double[:, :] R = py_R
        cdef int i, j
        cdef double rx, ry, rz
        with nogil:
            for i in prange(nP):
                for j in range(nQ):
                    rx = p[i, 0] - q[0, j]
                    ry = p[i, 1] - q[1, j]
                    rz = p[i, 2] - q[2, j]
                    R[i, j] = 1 / (1 + sqrt(rx * rx + ry * ry + rz * rz))
        return py_R
    

    What is going on here? The function takes Python NumPy objects as input, converts them into typed Cython memoryviews, releases the GIL, and then executes the outer loop in parallel using the special 'prange' construct.
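
    For completeness: to actually build such a module, the generated C code must be compiled with OpenMP enabled, or 'prange' will silently run sequentially. A minimal setup.py sketch (the file name get_r.pyx and the GCC/Clang flag -fopenmp are my assumptions; MSVC uses /openmp instead):

    from setuptools import Extension, setup
    from Cython.Build import cythonize

    ext = Extension(
        "get_r",                          # module name (assumed)
        sources=["get_r.pyx"],            # the Cython source above (assumed file name)
        extra_compile_args=["-fopenmp"],  # enable OpenMP so prange actually runs in parallel
        extra_link_args=["-fopenmp"],
    )

    setup(ext_modules=cythonize([ext]))

    The module is then built with 'python setup.py build_ext --inplace' and imported like any other Python module.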

    Runtime: 76 ms, which is 1.75 times slower than the base implementation.

    Well, we have almost caught up with the base implementation, and we are now writing explicitly parallel code. But the cost is less readable code: we have moved away from pure Python.

    In general, most numerical computations are written this way: the bulk of the code is in NumPy, and the speed-critical spots are moved into separate modules and implemented in Cython.

    Python + Numba


    We have come a long way. We started with pure Python, then took the path of matrix-computation magic, then plunged into a special extension language. It is time to return to where we started. So, the Python + Numba implementation:

    import math
    import numpy as np
    from numba import jit, prange, float64

    @jit(float64[:, :](float64[:, :], float64[:, :]), nopython=True, parallel=True)
    def get_R_numba_mp(p, q):
        R = np.empty((p.shape[0], q.shape[1]))
        for i in prange(p.shape[0]):  # the outer loop runs in parallel
            for j in range(q.shape[1]):
                rx = p[i, 0] - q[0, j]
                ry = p[i, 1] - q[1, j]
                rz = p[i, 2] - q[2, j]
                R[i, j] = 1 / (1 + math.sqrt(rx * rx + ry * ry + rz * rz))
        return R
    

    Execution time: 46 ms, which almost matches the base implementation!

    And all we had to do to the original slow Python code was:

    • add an annotation specifying the argument types and parallel execution
    • replace 'range' with 'prange' to execute the outer loop in parallel

    Numba is a very interesting project: an LLVM-based optimizing JIT compiler for Python. It is completely transparent to use, with no separate modules required. All you need to do is add a few annotations to the speed-critical methods.
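
    The explicit signature in the example above is optional: Numba can also infer the types lazily at the first call. A minimal sketch of this lighter style (my own toy example, not from the original article):

    import numpy as np
    from numba import jit, prange

    @jit(nopython=True, parallel=True)  # no signature: types are inferred on the first call
    def row_sums(a):
        out = np.empty(a.shape[0])
        for i in prange(a.shape[0]):    # the outer loop runs in parallel
            s = 0.0
            for j in range(a.shape[1]):
                s += a[i, j]
            out[i] = s
        return out

    row_sums(np.random.rand(4, 3))  # the first call triggers compilation for float64 arrays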

    In summary, the execution times are as follows:

    C++      Python       NumPy     SciPy     Cython    Numba
    44 ms    52,861 ms    839 ms    353 ms    76 ms     46 ms
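
    As a final sanity check, it is worth verifying that all implementations actually produce the same matrix; a quick sketch (my own addition, reusing p and q from the setup above):

    import numpy as np

    R_ref = get_R_numpy(p, q)
    assert np.allclose(get_R(p, q), R_ref)
    assert np.allclose(get_R_scipy(p, q), R_ref)
    assert np.allclose(get_R_numba_mp(p, q), R_ref)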
