
Should I install Gentoo for speed?
Perhaps you have heard someone say: "I plan to install Gentoo, because it will make better use of my processor's capabilities and squeeze the most out of it." Well, let's figure it out.


What optimizations for the processor are there
Usually this means using additional instruction sets such as MMX, SSE, AES and AVX when compiling applications. However, if you dig deeper, there are other optimizations, and not only for applications.
I have identified the following optimization groups:
- Code optimizations
- Compile-time code optimization for additional x86 instruction sets: MMX, SSE, AES, ATA, AVX, etc.
- Compile-time code optimization based on static analysis: tail-recursion elimination, removal of unused code, dropping of pointless conditions, etc.
- Optimizations for a better processor cache hit rate.
- Kernel-level code optimizations: cryptographic routines from the Cryptographic API.
Optimizations for additional instruction sets are best covered on the page Intel 386 and AMD x86-64 GCC Options. MMX became available with the Pentium MMX, then AMD introduced 3DNow!, then SSE appeared in the Pentium III, and it went on from there. Intel Haswell, released this year, supports MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA, BMI, BMI2 and F16C.
Floating-point arithmetic (FPU) is also indirectly related to additional instruction sets, because the compiler can use SSE for it. SSE is faster than x87 instructions and does not block MMX (x87 and MMX share the same registers). More on this below.
Another important point should be noted. When the abbreviations SSE, MMX and AES come up, people familiar with them often picture the hard life of C programmers who hand-write assembly to support these instructions in their software. In fact, there are three ways to use these instruction sets: automatically, by the compiler during static code analysis; manually, through the compiler's built-in functions; and manually, through inline assembly (for example: How to optimize code for MMX processors). Exactly when GCC uses these instruction sets automatically, once they are allowed, is not clear, but the manual states plainly that such cases exist (hopefully someone will describe them in the comments).
Optimizations based on static code analysis are best covered on the Options That Control Optimization page. There are a great many of them, so to avoid confusion they are grouped into meta flags: O0, O1, O2, O3, Ofast. More information on these flags can be found in the articles linked below.
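To see what a meta flag actually switches on, GCC can print its optimizer settings with `-Q --help=optimizers`. The sketch below is only a small filter over that listing. The function name `enabled_opts` is made up for illustration, and it assumes the listing marks active options with `[enabled]`, as current GCC versions do.

```shell
# enabled_opts: read a `gcc -Q --help=optimizers` listing on stdin and
# print the names of the options marked [enabled], one per line.
# (Hypothetical helper name; the listing format is an assumption.)
enabled_opts() {
    awk '$2 == "[enabled]" { print $1 }'
}

# Intended use (needs gcc installed), comparing two levels:
#   gcc -Q -O2 --help=optimizers | enabled_opts > O2.txt
#   gcc -Q -O3 --help=optimizers | enabled_opts > O3.txt
#   diff O2.txt O3.txt
```

Diffing the two files shows which individual options a higher level adds on a given compiler version.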
Optimizations for a better processor cache hit rate. This cannot be explained briefly, so I refer the reader to another article: Bubbles, caches, and branch predictors. I will only say that programmers can use Intel VTune Performance Analyzer and AMD CodeAnalyst to find places that can be optimized. Reportedly, ICC (the Intel C++ compiler) can apply such optimizations automatically in some cases; as for how GCC fares today, I hope knowledgeable people will add that in the comments.
Kernel-level code optimizations. They speed up functions in the Cryptographic API framework, such as AES, Twofish and others, by using additional instruction sets such as SSE, AVX and AES. These functions can be used by other kernel modules and can also be called from applications.
With the theory covered, let's move on to how it is used in practice.
If you have Ubuntu
Suppose you are running Ubuntu. Depending on whether the operating system is 32-bit or 64-bit, you choose packages with the suffix i386 or amd64 (example). i386 does not mean that the package will run on any processor from the 386 onward; it simply means that the package is intended for the 32-bit x86 platform. Likewise, amd64 means support for the 64-bit x86-64 platform. We can easily check this by typing in the console:
gcc -dumpmachine
On 32-bit Ubuntu 12.04 LTS Server we will see i686-linux-gnu, and on 64-bit we should see x86_64-linux-gnu.
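The relationship between the compiler's target triplet and the package suffix can be sketched as a small shell function. The name `triplet_to_deb_arch` is hypothetical, and only the two x86 cases discussed here are handled.

```shell
# triplet_to_deb_arch: map a GCC target triplet to the Ubuntu package
# architecture suffix. Hypothetical helper; only x86 is covered.
triplet_to_deb_arch() {
    case "$1" in
        x86_64-*)    echo amd64 ;;
        i[3-6]86-*)  echo i386 ;;
        *)           echo unknown ;;
    esac
}

triplet_to_deb_arch i686-linux-gnu      # prints: i386
triplet_to_deb_arch x86_64-linux-gnu    # prints: amd64
```

On a real system it could be called as `triplet_to_deb_arch "$(gcc -dumpmachine)"`. Note that an i686 triplet still maps to the i386 suffix, which is exactly why the suffix says nothing about the minimum processor.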
Suppose you have a 32-bit Pentium 4. It supports MMX, SSE and SSE2, but these were not used when the packages were built, because the same packages must also run on an Intel Celeron, which has only MMX, and possibly even on a Pentium Pro, which does not even have MMX.
Additional instruction sets are used only by packages that detect the processor at run time and switch to a faster algorithm for it. The good news is that almost all multimedia packages do this.
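On Linux, the instruction sets a given machine offers can be read from the `flags` line of /proc/cpuinfo. A minimal sketch of such a check follows; the function name `cpu_has` is made up, and it accepts a file path so that it can be tried on sample data.

```shell
# cpu_has FLAG [FILE]: succeed if FLAG appears in the first "flags" line
# of a cpuinfo-style file (default: /proc/cpuinfo). Hypothetical helper.
cpu_has() {
    awk -v want="$1" '
        /^flags[ \t]*:/ {
            sub(/^flags[ \t]*:/, "")          # keep only the flag words
            for (i = 1; i <= NF; i++) if ($i == want) found = 1
            exit                              # one CPU entry is enough
        }
        END { exit (found ? 0 : 1) }
    ' "${2:-/proc/cpuinfo}"
}

# Example on a sample line, so that it runs anywhere:
sample=$(mktemp)
printf 'flags\t\t: fpu mmx sse sse2\n' > "$sample"
cpu_has sse2 "$sample" && echo "sse2: yes"
cpu_has avx  "$sample" || echo "avx: no"
rm -f "$sample"
```

Without the second argument, `cpu_has sse2` checks the machine it runs on.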
It is also unclear which code optimizations 32-bit Ubuntu was built with. Judging by GCC's output, the enabled set is a bit of -O1, a bit of -O2 and a bit of -O3. If the packages for a given Ubuntu release are built on that same system with the default compilation options, then apparently they are not built in the most optimal (or rational) way.
Finally, the functions in the kernel Cryptographic API are not optimized. Implementations that use additional instruction sets are present in the system only as modules, and only for i586 and AES (for VIA Nano), and they are not loaded by default. It is also unclear which i586 features those optimizations rely on.
On 64-bit Ubuntu 12.04, things are much better. First, on 64-bit systems gcc uses the MMX, SSE and SSE2 extensions by default, so the code can be optimized somewhat. Second, on x86-64 the default is -mfpmath=sse, which speeds up floating-point arithmetic.
Optimized kernel Cryptographic API functions for additional instruction sets are present in the system as modules, but are not loaded by default. At least they can be turned on.
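Which implementations the running kernel has registered can be read from /proc/crypto, where each entry lists an algorithm `name` and the `driver` that provides it. The sketch below filters that text for AES; the function name `aes_drivers` is hypothetical, and the driver names in the comments are illustrative rather than guaranteed for any particular kernel.

```shell
# aes_drivers: read /proc/crypto-style text on stdin and print the driver
# of every entry whose algorithm name is exactly "aes". Hypothetical helper.
aes_drivers() {
    awk -F: '
        {
            key = $1; val = $2
            gsub(/[ \t]/, "", key)                # "name   " -> "name"
            gsub(/^[ \t]+|[ \t]+$/, "", val)      # trim the value
        }
        key == "name"            { is_aes = (val == "aes") }
        key == "driver" && is_aes { print val }
    '
}

# Intended use on a live system:
#   aes_drivers < /proc/crypto
# A generic driver alone (for example aes-generic) suggests that no
# accelerated module is loaded yet.
```

If only a generic driver shows up, loading the matching module by hand (for instance `modprobe aesni_intel` on a processor with AES instructions, assuming the kernel ships that module) should add an accelerated entry to the list.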
And finally, gcc builds the packages with the same strange set of optimizations as on 32-bit Ubuntu.
If you have Gentoo
Most likely you have read this page of the manual, so you have set -O2 and -march=native (or the correct processor). But most likely, first, you did not go into the Cryptographic API section when configuring the kernel and did not enable the accelerated implementations, and at least AES is worth speeding up. Second, you most likely did not set the USE flags for the additional processor instructions available to you: 3dnow, mmx, sse, sse2, sse3. Or you did not set all of them. This means that in applications that enable these optimizations only through USE flags, you are left without the extra speedup.
In addition to the global flags, there are local flags that enable additional instructions in some applications, such as 3dnowext, ssse3, sse4, sse4_1, avx, avx128fma, avx256 and aes-ni. It is best to set every one that your processor supports.
The current amd64 stage3 enables these by default: bindist, mmx, sse, sse2. Unfortunately, bindist disables additional instructions in some packages for the sake of portability. If you need bindist, also use cpudetection to mitigate the drawbacks of the bindist flag in some applications.
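Choosing the instruction-set USE flags can be partly automated by reading the processor's flags. The sketch below prints a `USE="..."` line containing only the flags the CPU reports. The function name `use_flags_from_cpuinfo` is made up, it covers only a subset of the flags named above, and it follows the USE-flag convention described in this article; newer Gentoo trees keep these settings in the CPU_FLAGS_X86 variable instead.

```shell
# use_flags_from_cpuinfo: read cpuinfo-style text on stdin and print a
# USE="..." line with the instruction-set flags the CPU reports.
# Hypothetical helper; only a subset of flags is recognised.
use_flags_from_cpuinfo() {
    awk '
        /^flags[ \t]*:/ {
            sub(/^flags[ \t]*:/, "")
            out = ""
            for (i = 1; i <= NF; i++) {
                f = $i
                if (f == "pni") f = "sse3"   # the kernel reports SSE3 as "pni"
                if (f ~ /^(mmx|sse|sse2|sse3|ssse3|sse4_1|avx|3dnow|3dnowext)$/)
                    out = out (out == "" ? "" : " ") f
            }
            printf "USE=\"%s\"\n", out
            exit
        }
    '
}

# Intended use on a live system:
#   use_flags_from_cpuinfo < /proc/cpuinfo
```

The printed line is meant to be reviewed and merged by hand into the existing USE setting in make.conf, not appended blindly.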
Which Gentoo packages can get a boost?
app-arch/libzpaq
app-emulation/bochs
media-libs/freeverb3 (audio)
media-libs/libpostproc (video)
media-libs/libvpx (video VP8)
media-plugins/vdr-softdevice (video)
media-sound/mpg123
media-video/ffmpeg
media-video/libav
media-video/mplayer
media-video/mplayer2
media-video/vlc
net-libs/cyassl
net-misc/bfgminer (bitcoin)
sci-biology/raxml
sci-libs/fftw
sci-chemistry/gromacs
sys-fs/loop-aes
x11-libs/pixman
Additionally, the orc flag, which is already set by default, helps enable additional processor instructions in:
media-libs/gstreamer (audio + video)
Additionally, the cpudetection flag, which is not set by default, helps enable additional processor instructions on the fly in:
media-sound/jack-audio-connection-kit
media-video/ffmpeg
media-video/libav
media-video/mplayer
media-video/mplayer2
sci-libs/mpir
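Per-package USE flags on Gentoo are conventionally set in /etc/portage/package.use, one `category/package flag` entry per line. A minimal sketch that generates such entries for cpudetection follows; the function name `package_use_lines` is hypothetical, and the package selection is just a sample from the list above.

```shell
# package_use_lines FLAG PKG...: print one package.use entry per package.
# Hypothetical helper; it only formats text and does not touch any files.
package_use_lines() {
    flag=$1
    shift
    for pkg in "$@"; do
        printf '%s %s\n' "$pkg" "$flag"
    done
}

package_use_lines cpudetection \
    media-sound/jack-audio-connection-kit \
    media-video/ffmpeg \
    media-video/mplayer \
    sci-libs/mpir
```

The output can be reviewed and then added to package.use, after which the affected packages need to be rebuilt for the flag to take effect.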
Conclusions
- On 32-bit systems, the largest gain from Gentoo can be obtained on the latest processors.
- On 64-bit systems, the gain from Gentoo comes only from newer compiler versions and the -O2 optimization level.
- Even on Gentoo, and even after reading the official documentation, extra tuning does not hurt.
- Systems other than Gentoo can be sped up too.
Additional material
- Optimal Options for x86 GCC
- GCC x86 how to reduce code size
- "Why upgrade the GCC compiler?" ...
- Optimizing GCC Compilation with Gentoo as an Example
- Gentoo: Compilation Optimization Guide
- Hardware support for AES algorithm by modern processors
- How to Customize Your Ubuntu Kernel
- Ubuntu: Supported Hardware
- Ubuntu: Loadable_Modules
- Bubbles, caches, and branch predictors
- Ruby on your server may run 2 times slower due to RVM
ICC Material
- Using Alternative Compilers in Gentoo with Intel Compiler Suite as an Example
- GCC vs Intel C++ Compiler: building FineReader Engine for Linux
Addendum: user kekekeks shared a recipe for rebuilding some Ubuntu packages with optimizations.