How can one simplify and speed up the computation of a feedforward neural network?

    Hello, dear readers. A lot has been written and said about neural networks, mostly about how and where to apply them. Yet two important issues receive surprisingly little attention: a) how to simplify and speed up the computation of a neural network (a single evaluation of the exponential by the library functions of programming languages usually takes at least 15-20 processor instructions), and b) what, at least in part, is the logic of the trained network: in practice, the huge matrices of weights and biases obtained after training do little to reveal the patterns the network has found (they remain hidden, and extracting them is a separate and sometimes very important task). I will describe one of my approaches to both issues for ordinary feedforward neural networks.

    A bit of theory


    From a mathematical point of view, a feedforward network is one very large function of the network inputs, the weight coefficients, and the neuron biases. In each neuron of a layer, the layer's input values (vector X) are multiplied by the neuron's weights (vector $ W_i $) and summed with a bias $ B_i $

    $ s_i = W_iX + B_i $


    and the result is fed into the activation function $ A(s_i) $, which forms the outputs of the layer's neurons.
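    To make the notation concrete, here is a minimal sketch of how one such layer is typically evaluated (my own illustration, not code from the article); each neuron computes $ s_i = W_iX + B_i $ and passes it through an exponential sigmoid as $ A(s_i) $:

    #include <cmath>
    #include <vector>

    // One feedforward layer: W holds one weight vector per neuron, B holds one bias per neuron.
    std::vector<double> layerForward(const std::vector<std::vector<double>>& W,
                                     const std::vector<double>& B,
                                     const std::vector<double>& X)   // layer inputs
    {
        std::vector<double> out(W.size());
        for (size_t i = 0; i < W.size(); ++i) {
            double s = B[i];                                  // s_i = W_i * X + B_i
            for (size_t j = 0; j < X.size(); ++j)
                s += W[i][j] * X[j];
            out[i] = 1.0 / (1.0 + std::exp(-s));              // A(s_i): exponential sigmoid, one exp per neuron
        }
        return out;
    }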

    Activation functions are not always cheap to compute; for example, they often contain exponentials (the exponential sigmoid, the hyperbolic tangent). If you look at the assembler code that implements the exponential, you will find, first, many checks that are not always needed and, second, that the exponential itself is usually computed in at least two operations:

    $ \exp(v) = 2^{v \cdot \log_2(e)} $


    Therefore, if we want to speed up the computation of the network, the first task is to simplify the computation of the activation function. We can sacrifice a little quality in exchange for a gain in speed by approximately replacing the classical activation function with a simpler function that gives roughly the same results on the available input data. Generally speaking, this is a classical approximation problem: we have a set of values computed by the original function A(s), and we pick a simpler function that reproduces them closely. Such a simple function a(s) can be an ordinary polynomial, a polynomial in negative powers, or something similar. I used four types of such functions:

    $ a(s) = b_0 + b_1 s + b_2 s^2 + ... + b_n s^n $;
    $ a(s) = b_0 + b_1 / s + b_2 / s^2 + ... + b_n / s^n $;
    $ a(s) = b_0 + b_1 s^{0.5} + b_2 s^{1} + b_3 s^{1.5} + ... + b_n s^{0.5n} $;
    $ a(s) = b_0 + b_1 / s^{0.5} + b_2 / s^{1} + b_3 / s^{1.5} + ... + b_n / s^{0.5n} $;
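    For the first (polynomial) family, the coefficients $ b_0, ..., b_n $ can be fitted to the original activation on a sample of pre-activation values. Here is a minimal self-contained sketch (my own illustration; the fitting range [-4, 4] and the cubic degree are assumptions): it fits a cubic to the exponential sigmoid via the least-squares normal equations and reports the worst deviation:

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <utility>
    #include <vector>

    // Solve a small dense linear system M*x = r by Gaussian elimination with partial pivoting.
    static std::vector<double> solve(std::vector<std::vector<double>> M, std::vector<double> r) {
        const int n = (int)r.size();
        for (int col = 0; col < n; ++col) {
            int piv = col;
            for (int i = col + 1; i < n; ++i)
                if (std::fabs(M[i][col]) > std::fabs(M[piv][col])) piv = i;
            std::swap(M[col], M[piv]);
            std::swap(r[col], r[piv]);
            for (int i = col + 1; i < n; ++i) {
                double f = M[i][col] / M[col][col];
                for (int j = col; j < n; ++j) M[i][j] -= f * M[col][j];
                r[i] -= f * r[col];
            }
        }
        std::vector<double> x(n);
        for (int i = n - 1; i >= 0; --i) {
            double s = r[i];
            for (int j = i + 1; j < n; ++j) s -= M[i][j] * x[j];
            x[i] = s / M[i][i];
        }
        return x;
    }

    int main() {
        const int degree = 3;               // n in a(s) = b_0 + b_1*s + ... + b_n*s^n
        const double lo = -4.0, hi = 4.0;   // assumed range of pre-activation values s
        const int samples = 200;

        // Normal equations (V^T V) b = V^T y, where V is the Vandermonde matrix of the sample points.
        std::vector<std::vector<double>> VtV(degree + 1, std::vector<double>(degree + 1, 0.0));
        std::vector<double> Vty(degree + 1, 0.0);
        for (int k = 0; k < samples; ++k) {
            double s = lo + (hi - lo) * k / (samples - 1);
            double y = 1.0 / (1.0 + std::exp(-s));           // original activation: exponential sigmoid
            std::vector<double> pw(degree + 1);
            double p = 1.0;
            for (int i = 0; i <= degree; ++i) { pw[i] = p; p *= s; }
            for (int i = 0; i <= degree; ++i) {
                Vty[i] += pw[i] * y;
                for (int j = 0; j <= degree; ++j) VtV[i][j] += pw[i] * pw[j];
            }
        }
        std::vector<double> b = solve(VtV, Vty);

        // Report the coefficients and the worst-case deviation on the fitted range.
        double maxErr = 0.0;
        for (int k = 0; k < samples; ++k) {
            double s = lo + (hi - lo) * k / (samples - 1);
            double approx = 0.0, p = 1.0;
            for (int i = 0; i <= degree; ++i) { approx += b[i] * p; p *= s; }
            maxErr = std::max(maxErr, std::fabs(approx - 1.0 / (1.0 + std::exp(-s))));
        }
        for (int i = 0; i <= degree; ++i) std::printf("b_%d = %g\n", i, b[i]);
        std::printf("max |a(s) - sigmoid(s)| on [%g, %g] = %g\n", lo, hi, maxErr);
        return 0;
    }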

    Suppose that for each neuron we have managed to replace the activation function with a slightly simpler one; this can be done, for example, by applying the least squares method. Such a substitution by itself will probably not give a very big gain. But then we can try another trick:

    1. Write out analytically the huge function NET(X) computed by the network as a whole;
    2. Replace the original activation functions A(s) in NET(X) with the replacement functions a(s) obtained for them;
    3. Simplify the resulting NET(X) algebraically (or rather, use some ready-made code for symbolic simplification of expressions). This is now feasible, or at least much easier than trying to simplify the network with the original functions, such as exponentials. A tiny worked example is given right after this list.
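    As a tiny illustration of steps 1-3 (a made-up example, not the article's network), take a single hidden neuron with one input $x$, weight $ w_1 $ and bias $ B $, a linear output neuron with weight $ w_2 $ and bias $ c $, and a quadratic replacement $ a(s) = b_0 + b_1 s + b_2 s^2 $. Then

    $ NET(x) = w_2 a(w_1 x + B) + c = (w_2 b_2 w_1^2) x^2 + (w_2 b_1 w_1 + 2 w_2 b_2 w_1 B) x + (w_2 (b_0 + b_1 B + b_2 B^2) + c), $

    so after symbolic simplification the whole network collapses to three numeric coefficients of an ordinary quadratic in $x$, which is both cheap to evaluate and easy to read.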

    As a result, we get something simpler and perhaps a little more mathematically transparent; at that point we can already try to understand what kind of function the network actually implements.

    This is one way of explaining the logic of the constructed network.

    Of course, the described task only looks simple in words. To use it in my programs, I had to write my own code for symbolic simplification of expressions. In addition, I solved a more complex version of the problem, allowing each neuron with function A(s) to have several candidate replacement activation functions $ a_k(s) $; the overall task then also came down to enumerating the possible combinations of such functions and symbolically simplifying the network for each combination. Only parallelization of the computations made this feasible.
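    The enumeration itself is embarrassingly parallel: each combination of candidate functions can be scored independently. A rough sketch of that outer loop (my reconstruction, not the author's code; scoreCombination is a hypothetical placeholder standing in for the substitute-and-simplify step) might look like this, compiled with -fopenmp:

    #include <cstdio>
    #include <vector>

    int main() {
        const int numNeurons = 8;       // as in the example network described below
        const int numCandidates = 4;    // assumed number of candidate a_k(s) per neuron
        long long total = 1;
        for (int i = 0; i < numNeurons; ++i) total *= numCandidates;   // 4^8 = 65536 combinations

        // Hypothetical placeholder: the real program would substitute the chosen a_k(s) into NET(X),
        // simplify it symbolically and measure the deviation from the original network;
        // here it just returns a dummy value so the sketch is runnable.
        auto scoreCombination = [](const std::vector<int>& choice) {
            double score = 0.0;
            for (size_t i = 0; i < choice.size(); ++i) score += (choice[i] + 1) * 0.01 * (i + 1);
            return score;
        };

        double bestScore = 1e300;
        std::vector<int> bestChoice(numNeurons, 0);

    #pragma omp parallel
        {
            double localBest = 1e300;
            std::vector<int> localChoice(numNeurons, 0);
    #pragma omp for
            for (long long idx = 0; idx < total; ++idx) {
                // Decode the combination index into one candidate index per neuron (mixed radix).
                std::vector<int> choice(numNeurons);
                long long rem = idx;
                for (int i = 0; i < numNeurons; ++i) { choice[i] = (int)(rem % numCandidates); rem /= numCandidates; }

                double score = scoreCombination(choice);
                if (score < localBest) { localBest = score; localChoice = choice; }
            }
    #pragma omp critical
            if (localBest < bestScore) { bestScore = localBest; bestChoice = localChoice; }
        }

        std::printf("best score %g, chosen candidates per neuron:", bestScore);
        for (int c : bestChoice) std::printf(" %d", c);
        std::printf("\n");
        return 0;
    }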

    Result


    The result pleased me. I accelerated a three-layer network with three inputs and eight neurons (with their input weights and biases), using the "exponential sigmoid" activation function. Timing measurements showed a gain of about 40% in time without significant loss of quality.

    To illustrate, here are the data of the source network:

    [tables of the weights and biases of the first two layers]

    And in the third, output layer:

    [table of the output-layer weights and biases]

    If we denote the inputs as a, b and c, then, after the replacements and simplifications, the network function NET is computed like this:

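    // Precompute powers of the three inputs a, b, c that are reused repeatedly below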
    double a2 = a*a;
    double b2 = b*b;
    double c2 = c*c;
    double a3 = a2*a;
    double b3 = b2*b;
    double c3 = c2*c;
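    // z01 and z06 are intermediate sub-expressions left after the symbolic simplification;
    // z01 is reused inside z06 and in NET, z06 is reused in NET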
    double z01 = sqrt(-1.6302e-02+7.9324e-01*a+9.65149e-01*b+5.64151e-01*c);
    double z06 = sqrt(1.583708e+00-8.907654e-01*a-2.844379e-01*a2+1.050942e+00*a3+1.178096e+01*b-1.865618e+00*b*a-3.145465e+00*b*a2-5.777153e+00*b2+3.138123e+00*b2*a-1.043599e+00*b3+1.32778e+00*c+5.849582e-01*c*a-3.440382e+00*c*a2+1.838371e+00*c*b+6.864703e+00*c*b*a-3.42434e+00*c*b2-3.013361e-01*c2+3.754167e+00*c2*a-3.745404e+00*c2*b-1.365524e+00*c3+1.014237e-01*z01);
    double NET = (-1.477593e+00)/(z06)+1.370237e+00-6.303167e-02*a-1.495051e-03*a2+2.33748e-02*a3+5.558024e-02*b+1.178189e-02*b*a-6.996071e-02*b*a2+1.837937e-02*b2+6.97974e-02*b2*a-2.321149e-02*b3+7.924241e-02*c+3.392287e-03*c*a-7.652018e-02*c*a2-1.214263e-02*c*b+1.526831e-01*c*b*a-7.616337e-02*c*b2-1.915279e-03*c2+8.349931e-02*c2*a-8.33044e-02*c2*b-3.037166e-02*c3+1.949161e-02*z01;
    

    The gain, I repeat, is about 40% in time without much damage to quality. I think this approach can be applied where the speed of computing a neural network is critical, for example when it is evaluated repeatedly inside a double or triple loop. An example of such a problem: a numerical solution of an aerodynamics problem on a grid, where at each node the neural network computes some useful estimate, say a more accurate value of the turbulent viscosity. Then we have an outer loop over time, a double or triple loop over the spatial coordinates nested inside it, and inside all of that the neural network is evaluated. In such a case the simplification is more than appropriate and useful.
