Parallelizing recursive functions with OpenMP 3.0 tasks

    I recently stumbled upon the blog post "OpenMP 3.0 tasking model: as smooth as Cilk?". After reading it, I decided to check how well recursive functions parallelize with OpenMP 3.0 tasks. Let me remind you that before the third version of the standard there was no support for dynamic or irregular parallelism (for example, while loops or recursive functions).


    Computing Systems


    I'll start with a description of the computing systems that I used.
    Below is a list of the machines whose results I am allowed to publish; the rest have not yet gone into production.
    (image: table of the test systems)

    Parallelization process


    For testing, I took the classic example of computing Fibonacci numbers. The idea of parallelization was as follows: create an asynchronous task for each step of the recursion, and then synchronize before leaving the function. Unfortunately, with this naive approach I got no speedup at all, because the runtime overhead dominated. The solution was found here and turned out to be simple: create tasks only for the upper part of the recursion, and execute the rest serially. In other words, balance the runtime overhead against the useful work. Below is the piece of code responsible for this balancing; the SPLITTER constant controls the switch to serial execution. All source code can be downloaded here.
    INT_TYPE
    fib( INT_TYPE n )
    {
        INT_TYPE i, j;

        if( n < SPLITTER ) {
            /* Below the threshold: plain recursion, no tasking overhead. */
            return fib_serial( n );
        } else {
            /* Spawn each subproblem as an independent task. */
            #pragma omp task shared( i )
            i = fib( n - 1 );
            #pragma omp task shared( j )
            j = fib( n - 2 );
            /* Wait for both child tasks before combining the results. */
            #pragma omp taskwait
            return i + j;
        }
    }
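
    The task pragmas above only take effect inside a parallel region, with a single thread making the root call. The downloadable source contains the actual harness; here is a minimal sketch of how it might look, assuming INT_TYPE is an unsigned 64-bit integer and fib_serial is the plain recursion (both are my assumptions, the real source may differ):

    #include <stdio.h>
    #include <omp.h>

    typedef unsigned long long INT_TYPE;   /* assumption: unsigned 64-bit */
    #define SPLITTER 19                    /* threshold for switching to serial recursion */

    /* Plain recursion for the lower levels: no tasking overhead. */
    INT_TYPE fib_serial( INT_TYPE n ) {
        return n < 2 ? n : fib_serial( n - 1 ) + fib_serial( n - 2 );
    }

    INT_TYPE fib( INT_TYPE n );   /* the task-based version above */

    int main( void ) {
        INT_TYPE result;
        double t = omp_get_wtime();
        #pragma omp parallel      /* create the thread team */
        #pragma omp single        /* one thread makes the root call, the rest run tasks */
        result = fib( 41 );
        printf( "fib(41) = %llu, %.3f s\n", result, omp_get_wtime() - t );
        return 0;
    }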


    But it is not so simple: the parallelization efficiency differs for different SPLITTER values. By trial and error, I picked two splitter values, 19 and 32, at which the parallelization efficiency approaches one.
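
    Rather than guessing, one could sweep the threshold and time each value. A hypothetical sketch (it assumes SPLITTER is replaced by a global variable named splitter that fib() reads; the original macro-based source does not work this way):

    #include <stdio.h>
    #include <omp.h>

    typedef unsigned long long INT_TYPE;   /* as assumed earlier */

    extern INT_TYPE fib( INT_TYPE n );
    INT_TYPE splitter;                     /* hypothetical runtime replacement for SPLITTER */

    void sweep_splitter( void ) {
        for( splitter = 10; splitter <= 40; splitter += 2 ) {
            double t = omp_get_wtime();
            #pragma omp parallel
            #pragma omp single
            fib( 41 );
            printf( "splitter = %2llu: %.3f s\n", splitter, omp_get_wtime() - t );
        }
    }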

    Compilation


    Of all the software, only the compilers are of interest here. I had three at my disposal: Intel® Composer XE for Linux, Microsoft Visual C++ 2010, and GCC. Visual C++ was ruled out immediately, since it does not support OpenMP 3.0 (at the moment it only supports OpenMP 2.5). GCC has supported OpenMP 3.0 since version 4.4; I used 4.5.0.
    Below are the compilation times and the sizes of the executables, linked both dynamically and statically. Note that, unlike GCC, Intel® Composer XE is installed on a remote machine, which inflates its compilation time.

    (image: table of compilation times and executable sizes)

    The command lines for GCC and Intel® Composer XE are as follows:
    # gcc ./fib_task.c -fopenmp
    # gcc ./fib_task.c -fopenmp -static
    # icc ./fib_task.c -openmp -static
    # icc ./fib_task.c -openmp

    Runtime and Scalability


    All measurements were made while computing the 50th Fibonacci number, for SPLITTER = 19 and SPLITTER = 32.
    (image: runtime charts for SPLITTER = 19 and SPLITTER = 32)

    Speedup and parallelization efficiency


    (image: speedup graph for SPLITTER = 32)
    This graph shows the speedup for SPLITTER = 32. Efficiency, in turn, is the speedup divided by the number of threads.
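
    For reference, the conventional definitions behind these plots (my formulation, not read off the charts): if T_1 is the single-thread runtime and T_p the runtime on p threads, then

        S(p) = T_1 / T_p        (speedup)
        E(p) = S(p) / p         (parallelization efficiency)

    So, for example, a speedup of 6 on 8 threads gives an efficiency of 0.75.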

    Source and executable files


    As mentioned above, the source file can be downloaded here. In this example, the 41st Fibonacci number is computed. To reproduce the measurements, change the variable n to 50 and the variable expected to 12586269025UL. You can also download the prebuilt executables right away.
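
    For clarity, the two edits would look something like this (variable names as described above; the exact layout of the source may differ):

    INT_TYPE n = 50;                      /* was 41 */
    INT_TYPE expected = 12586269025UL;    /* fib(50), used to verify the result */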

    Conclusions


    Actually, the charts with runtimes and speedups speak for themselves.
    • For Composer XE, the parallelization efficiency is ~1 as long as the number of threads does not exceed the number of physical cores.
    • For GCC 4.5.0, the parallelization efficiency is ~0.5 as long as the number of threads does not exceed the number of physical cores.

    The rest of the conclusions I leave to you, since everyone is interested in different things.

    Afterword


    I found an article on SpringerLink, "An Extension to Improve OpenMP Tasking Control", which proposes introducing a new final clause that acts as a balancer. In my opinion, this is a sound proposal. One could go further and add balancing clauses that perform adaptive balancing at run time.
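
    For illustration, here is how fib() might look with the proposed final clause (it was later adopted in OpenMP 3.1; this sketch is mine, not taken from the paper). Once a task is created as final, every task generated inside it executes immediately, so the clause takes over the role of the explicit SPLITTER branch:

    INT_TYPE
    fib( INT_TYPE n )
    {
        INT_TYPE i, j;
        if( n < 2 )
            return n;
        /* Below the threshold the tasks become final and all their
           descendants run undeferred, i.e. without tasking overhead. */
        #pragma omp task shared( i ) final( n < SPLITTER )
        i = fib( n - 1 );
        #pragma omp task shared( j ) final( n < SPLITTER )
        j = fib( n - 2 );
        #pragma omp taskwait
        return i + j;
    }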

    Please refer to the Optimization Notice page for more information on performance and optimization in Intel software products.
