Optimal Options for x86 GCC

          It is widely believed that GCC lags behind other compilers in performance. In this article, we will try to figure out what basic GCC compiler optimizations should be used to achieve acceptable performance.

    What are the default options in GCC?

          (1) By default, GCC uses the optimization level “-O0”. This level is clearly not optimal for performance and is not recommended for compiling the final product.
    GCC does not detect the architecture of the machine it runs on unless the option “-march=native” is passed. Otherwise it falls back to the architecture specified when GCC itself was configured. To see how your GCC was configured, run:
       gcc -v
       "Configured with: [PATH]/configure ... --with-arch=corei7 --with-cpu=corei7 ..."

    This means that GCC will add “-march=corei7” to your options (unless a different architecture is specified).
    Most GCC builds for x86 (with x86-64 as the baseline on 64-bit Linux) add “-mtune=generic -march=x86-64” to the given options, because no architecture was specified at configuration time. You can always see every option passed when GCC is run, including its internal options, with the command:
       echo "int main() {return 0;}" | gcc [OPTIONS] -xc -v -Q -

    In summary, the commonly used command:
       gcc -O2 test.c

    will build “test.c” without any architecture-specific optimizations. This can cause a significant drop in performance relative to code optimized for the target architecture. Disabled or limited vectorization and suboptimal instruction scheduling are the most common causes of the slowdown when the architecture is not specified or is specified incorrectly.
    To target the architecture of the machine you are compiling on, build like this:
       gcc -O2 test.c -march=native
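
    One way to see what a given architecture option actually enabled is to look at the instruction-set macros that GCC predefines. The following is a minimal sketch, not part of the original measurements: the helper name enabled_simd_level is hypothetical, while __SSE2__, __AVX__, __AVX2__ and __AVX512F__ are macros GCC defines when the corresponding extension is enabled.

```c
/* Report the widest SIMD extension the compiler was allowed to use
   for this translation unit, based on GCC's predefined ISA macros.
   enabled_simd_level is a hypothetical helper name. */
const char *enabled_simd_level(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "no SSE2 (or not an x86 target)";
#endif
}
```

    Printing the result from a program built once with plain “-O2” and once with an added architecture option shows whether the two builds really target different instruction sets.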

    Specifying the target architecture is important for performance. The only exception is programs that spend almost all of their run time in library functions: GLIBC can select the best version of a function for the given architecture at run time. Note that with static linking, some GLIBC functions have no architecture-specific versions, so dynamic linking is the better choice when the speed of GLIBC functions matters.
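    The idea of choosing an implementation at run time can be sketched in a few lines of C. This is only an illustration of the general technique, not GLIBC's own mechanism: the names sum_baseline, sum_tuned and pick_sum are hypothetical, and the GCC built-ins __builtin_cpu_init and __builtin_cpu_supports are assumed to be available (they appeared in GCC 4.8, so they are newer than the compiler version measured in this article).

```c
#include <stddef.h>

/* Hypothetical baseline implementation; runs on any CPU. */
long sum_baseline(const int *v, size_t n)
{
    long s = 0;
    size_t i;
    for (i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Hypothetical tuned variant. In a real project this function would live
   in a file compiled with extra ISA options; the body is kept identical
   here so that the sketch stays short and portable. */
long sum_tuned(const int *v, size_t n)
{
    long s = 0;
    size_t i;
    for (i = 0; i < n; i++)
        s += v[i];
    return s;
}

typedef long (*sum_fn)(const int *, size_t);

/* Pick an implementation once, based on what the running CPU supports. */
sum_fn pick_sum(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return sum_tuned;
#endif
    return sum_baseline;
}
```

    A caller would store the result of pick_sum() in a function pointer at startup and call through it afterwards, so the check is paid only once.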
          (2) By default, most GCC builds for x86 use the x87 floating-point model in 32-bit mode, because they were configured without “--with-fpmath=sse”. Only if the GCC configuration contains “--with-fpmath=sse”:
       gcc -v
       "Configured with: [PATH]/configure ... --with-fpmath=sse ..."

    will the compiler use the SSE model by default. In all other cases, it is better to add the option “-mfpmath=sse” when building in 32-bit mode.
    So the commonly used command:
       gcc -O2 -m32 test.c

    can lead to significant performance losses in code with floating-point arithmetic. The correct command is therefore:
       gcc -O2 -m32 test.c -mfpmath=sse

    Adding the option “-mfpmath=sse” is important in 32-bit mode! The exception is a compiler whose configuration contains “--with-fpmath=sse”.
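
    A quick way to check which floating-point model a build uses is the C99 macro FLT_EVAL_METHOD from <float.h>. The sketch below is a hedged illustration: the helper names fp_eval_method and describe_eval_method are hypothetical, and the expectation that 32-bit x86 builds typically report 2 for x87 and 0 for SSE (with SSE2 enabled) is an assumption about common GCC builds, not a result from this article.

```c
#include <float.h>

/* Return the compiler's floating-point evaluation method, or -1 if
   <float.h> does not define FLT_EVAL_METHOD (for example in pre-C99 mode). */
int fp_eval_method(void)
{
#ifdef FLT_EVAL_METHOD
    return (int)FLT_EVAL_METHOD;
#else
    return -1;
#endif
}

/* Human-readable meaning of the C99 FLT_EVAL_METHOD values. */
const char *describe_eval_method(int method)
{
    switch (method) {
    case 0:  return "operations use the precision of their own type";
    case 1:  return "float and double operations are evaluated as double";
    case 2:  return "all operations are evaluated as long double";
    default: return "indeterminate or implementation-defined";
    }
}
```

    Building the same file with and without “-mfpmath=sse” in 32-bit mode and comparing the reported value shows whether the option took effect.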

    32-bit mode or 64-bit?

          32-bit mode is usually chosen to reduce memory consumption and, as a result, to speed up memory access (more data fits in the cache).
    In 64-bit mode (compared to 32-bit), the number of available general-purpose registers increases from 6 to 14, and the number of XMM registers from 8 to 16. In addition, all 64-bit architectures support the SSE2 extension, so in 64-bit mode you do not need to add the “-mfpmath=sse” option.
    It is recommended to use 64-bit mode for computational tasks and 32-bit mode for mobile applications.
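    The memory argument is easy to see on a pointer-heavy data structure. The following sketch uses hypothetical names (list_node, pointer_bytes, list_node_bytes); on x86 Linux one would expect a 4-byte pointer and an 8-byte node with “-m32”, and an 8-byte pointer and a 16-byte node with “-m64”.

```c
#include <stddef.h>

/* A typical pointer-heavy structure: one link plus a small payload. */
struct list_node {
    struct list_node *next;
    int value;
};

/* Size of a data pointer in the current compilation mode. */
size_t pointer_bytes(void)
{
    return sizeof(void *);
}

/* Size of one list node, including any alignment padding. */
size_t list_node_bytes(void)
{
    return sizeof(struct list_node);
}
```

    With twice as many nodes per cache line, a list traversal in 32-bit mode touches less memory for the same number of elements.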

    How to get maximum performance?

          There is no single set of options that guarantees maximum performance, but GCC has many options worth trying. The tables below list recommended options together with the estimated gains relative to “-O2” on Intel Atom and 2nd Generation Intel Core i7 processors. The estimates are based on the geometric mean of the results for a particular set of benchmarks compiled with GCC version 4.7. It is also assumed that the compiler was configured for x86-64 generic.
         Estimated performance gain for mobile applications relative to “-O2” (32-bit mode only, since it is the main mode for the mobile segment):
    -m32 -mfpmath=sse                                             ~5%
    -m32 -mfpmath=sse -Ofast -flto                                ~36%
    -m32 -mfpmath=sse -Ofast -flto -march=native                  ~40%
    -m32 -mfpmath=sse -Ofast -flto -march=native -funroll-loops   ~43%

         Estimated performance gain for computational tasks relative to “-O2” (32-bit mode):
    -m32 -mfpmath=sse                                             ~4%
    -m32 -mfpmath=sse -Ofast -flto                                ~21%
    -m32 -mfpmath=sse -Ofast -flto -march=native                  ~25%
    -m32 -mfpmath=sse -Ofast -flto -march=native -funroll-loops   ~24%

         Estimated performance gain for computational tasks relative to “-O2” (64-bit mode):
    -m64 -Ofast -flto                                ~17%
    -m64 -Ofast -flto -march=native                  ~21%
    -m64 -Ofast -flto -march=native -funroll-loops   ~22%

    The advantage of 64-bit mode over 32-bit mode for computational tasks built with the “-O2 -mfpmath=sse” options is about 5%.
    All the data in this article are estimates based on the results of a specific set of benchmarks.
    The options used in the article are described below. Full description: http://gcc.gnu.org/onlinedocs/gcc-4.7.1/gcc/Optimize-Options.html
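    To produce comparable numbers for your own code, each option set can be timed against the “-O2” build. The sketch below is an assumption-laden illustration: the helper names cpu_seconds and gain_percent are hypothetical, and the gain is assumed to be the ratio of run times minus one, since the article does not state the exact formula behind its percentages.

```c
#include <time.h>

/* Run fn() reps times and return the elapsed CPU time in seconds. */
double cpu_seconds(void (*fn)(void), int reps)
{
    int i;
    clock_t start = clock();
    for (i = 0; i < reps; i++)
        fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Gain of a new build over the baseline build, in percent.
   A result of 36.0 means the new build is 36% faster (assumed definition). */
double gain_percent(double baseline_seconds, double new_seconds)
{
    return (baseline_seconds / new_seconds - 1.0) * 100.0;
}
```

    For example, a kernel that takes 1.36 s when built with “-O2” and 1.00 s with another option set corresponds to a gain of about 36% under this definition.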
    • "-Ofast", like "-O3 -ffast-math", enables a higher optimization level and more aggressive optimizations of arithmetic (for example, reassociation of floating-point operations)
    • "-flto" enables link-time (intermodule) optimizations
    • "-m32" selects 32-bit mode
    • "-mfpmath=sse" enables the use of XMM registers for floating-point arithmetic (instead of the x87 register stack)
    • "-funroll-loops" enables loop unrolling
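
    Loop unrolling and floating-point reassociation can be illustrated by hand. The sketch below is a conceptual picture of the transformation, not the code GCC actually generates; the function names sum_simple and sum_unrolled4 are hypothetical. The second version adds the elements in a different order, which can change the last bits of a floating-point result, and that is why GCC reorders such sums only under options like “-ffast-math”.

```c
#include <stddef.h>

/* Straightforward reduction: one accumulator, strict left-to-right order. */
float sum_simple(const float *v, size_t n)
{
    float s = 0.0f;
    size_t i;
    for (i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Hand-written equivalent of unrolling by 4 with reassociation:
   four independent partial sums that are combined at the end. */
float sum_unrolled4(const float *v, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}
```

    The four partial sums do not depend on each other, so the processor can overlap the additions or the compiler can map them onto a single XMM register.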
