MSVC Backend Updates in Visual Studio 2019 Preview 2: New Optimizations, OpenMP, and Build Throughput Improvements
In Visual Studio 2019 Preview 2 we have continued to improve the C++ backend with new features, new and improved optimizations, build throughput improvements, and quality of life changes.
- Added a new inlining command line switch: -Ob3. -Ob3 is a more aggressive version of -Ob2. -O2 (optimize the binary for speed) still implies -Ob2 by default, but this may change in the future. If you find the compiler is under-inlining, consider passing -O2 -Ob3.
- Added basic support for OpenMP SIMD vectorization, which is the most widely used OpenMP feature in machine learning (ML) libraries. Our case study is the Intel MKL-DNN library, which is used as a building block for other well-known open source ML libraries including TensorFlow. This can be turned on with a new CL switch, -openmp:experimental, which allows loops annotated with “#pragma omp simd” to potentially be vectorized. Vectorization is not guaranteed, and a warning is reported for loops that are annotated but not vectorized. No SIMD clauses are supported; they are ignored and a warning is reported.
- Added a new C++ exception handler, __CxxFrameHandler4, that reduces exception handling metadata overhead by 66%. This provides up to a 15% total binary size improvement on binaries that use large amounts of C++ exception handling. It is currently off by default; try it out by passing “/d2FH4” when compiling with cl.exe. Note that /d2FH4 is otherwise undocumented and is not supported long term. It is not currently supported for UWP apps because the UWP runtime does not have this feature yet.
- To support hand vectorization of loops containing calls to math library functions and certain other operations like integer division, MSVC now supports Short Vector Math Library (SVML) intrinsic functions that compute the vector equivalents. Support for 128-bit, 256-bit and 512-bit vectors is available for most functions, with the exceptions listed below. Note that these functions do not set errno. See the Intel Intrinsics Guide for definitions of the supported functions.
- Vector integer combined division and remainder is only available for 32-bit elements and 128-bit and 256-bit vector lengths. Use separate division and remainder functions for other element sizes and vector lengths.
- SVML square-root is only available in 128-bit and 256-bit vector lengths. You can use _mm512_sqrt_pd or _mm512_sqrt_ps functions for 512-bit vectors.
- Only 512-bit vector versions of rint and nearbyint functions are available. In many cases you can use round functions instead, e.g. use _mm256_round_ps(x, _MM_FROUND_CUR_DIRECTION) as a 256-bit vector version of rint, or _mm256_round_ps(x, _MM_FROUND_TO_NEAREST_INT) for nearbyint.
- Only 512-bit reciprocal is provided. You can compute the equivalent using set1 and div functions, e.g. 256-bit reciprocal could be computed as _mm256_div_ps(_mm256_set1_ps(1.0f), (x)).
- There are SVML functions for single-precision complex square-root, logarithm and exponentiation only in 128-bit and 256-bit vector lengths.
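As an illustration of the OpenMP SIMD item in the list above, here is a minimal sketch of a loop annotated with “#pragma omp simd”. The function name saxpy and its parameters are hypothetical and not taken from MKL-DNN or any other library. The pragma carries no clauses, since the post states that clauses are currently ignored with a warning. Whether the loop is actually vectorized is up to the compiler.

```cpp
#include <cassert>

// Hypothetical helper for illustration: computes y[i] = a * x[i] + y[i].
// Compile with "cl -O2 -openmp:experimental" to let MSVC consider
// vectorizing the annotated loop; vectorization is not guaranteed.
void saxpy(float a, const float* x, float* y, int n) {
    assert(n >= 0);
    // No clauses are used here: per the post, SIMD clauses are ignored.
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The loop body has no cross-iteration dependency, which is the kind of loop where the annotation is most likely to pay off. If the switch is not passed, the code still behaves as an ordinary scalar loop.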
New and Improved Optimizations
- Unrolled memsets and block initializations will now use SSE2 instructions (or AVX instructions if allowed). The size threshold for what will be unrolled has increased accordingly (compile for size with SSE2: unroll threshold moves from 31 to 63 bytes, compile for speed with SSE2: threshold moves from 79 to 159 bytes).
- Optimized the code-gen for small memsets, primarily targeted to initall-protected functions.
- Improved the SSA Optimizer’s redundant store elimination, with better escape analysis and handling of loops.
- The compiler recognizes memmove() as an intrinsic function and optimizes accordingly. This improves code generation for operations built on memmove(), including std::copy() and other higher-level library code such as std::vector and std::string construction.
- The optimizer does a better job of optimizing short, fixed-length memmove(), memcpy(), and memcmp() operations.
- Implemented a switch duplication optimization for better performance of switches inside hot loops. The switch jumps are duplicated to improve branch prediction accuracy and, consequently, run-time performance.
- Added constant-folding and arithmetic simplifications for expressions using SIMD (vector) intrinsics, for both float and integer forms. Most of the usual expression optimizations now handle SSE2 and AVX2 intrinsics, whether they come from user code or from automatic vectorization.
- Several new scalar fused multiply-add (FMA) patterns are identified with /arch:AVX2 /fp:fast. These include the following common expressions:
(x + 1.0) * y; (x - 1.0) * y; (1.0 - x) * y; (-1.0 - x) * y
- Sequences of code that initialize a __m128 SIMD (vector) value element-by-element are identified and replaced by a _mm_set_ps intrinsic. This allows the new SIMD optimizations to consider the value as part of expressions, useful especially if the value has only constant elements. A future update will support more value types.
- Common sub-expression elimination (CSE) is more effective in the presence of variables which may be modified in indirect ways because they have their address taken.
- Useless struct/class copies are being removed in several more cases, including copies to output parameters and functions returning an object. This optimization is especially effective in C++ programs that pass objects by value.
- Added a more powerful analysis for extracting information about variables from control flow (if/else/switch statements). It is used to remove branches that can be proven to be always true or false and to improve variable range estimation. Code using gsl::span sees improvements, because some unnecessary range checks are now removed.
- The devirtualization optimization will now have additional opportunities, such as when classes are defined in anonymous namespaces.
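To make the scalar FMA patterns from the list above concrete, the sketch below pairs each of the four expressions with an algebraically equivalent single fused multiply-add written with std::fma. The function names are hypothetical, and the exact instruction sequence MSVC emits is an assumption; the post only says these patterns are recognized under /arch:AVX2 /fp:fast. The two forms round differently in general (the fused form rounds once), which is consistent with the optimization requiring /fp:fast.

```cpp
#include <cassert>
#include <cmath>

// Source-level expressions named in the post.
double add_one_mul(double x, double y)     { return (x + 1.0) * y; }
double sub_one_mul(double x, double y)     { return (x - 1.0) * y; }
double one_sub_mul(double x, double y)     { return (1.0 - x) * y; }
double neg_one_sub_mul(double x, double y) { return (-1.0 - x) * y; }

// Equivalent fused forms (hypothetical names), using std::fma(a, b, c) = a*b + c.
double add_one_mul_fma(double x, double y)     { return std::fma(x, y, y); }    //  x*y + y
double sub_one_mul_fma(double x, double y)     { return std::fma(x, y, -y); }   //  x*y - y
double one_sub_mul_fma(double x, double y)     { return std::fma(-x, y, y); }   // -x*y + y
double neg_one_sub_mul_fma(double x, double y) { return std::fma(-x, y, -y); }  // -x*y - y

// Checks that both forms agree for inputs where every intermediate value
// is exactly representable, so rounding cannot cause a difference.
bool forms_agree(double x, double y) {
    return add_one_mul(x, y) == add_one_mul_fma(x, y)
        && sub_one_mul(x, y) == sub_one_mul_fma(x, y)
        && one_sub_mul(x, y) == one_sub_mul_fma(x, y)
        && neg_one_sub_mul(x, y) == neg_one_sub_mul_fma(x, y);
}
```

For small integer inputs such as x = 2.0 and y = 3.0 the two forms agree exactly; for arbitrary inputs they may differ in the last bit.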
Build Throughput Improvements
- Filter debug information during compilation based on referenced symbols and types to reduce debug section size and improve linker throughput. Updating from 15.9 to 16.0 can reduce the input size to the linker by up to 40%.
- Link time improvements in PDB type merging and creation.
- Updating from 15.9 to 16.0 can improve link times by up to 2X. For example, linking Chrome resulted in a 1.75X link time speedup when using /DEBUG:full, and a 1.4X link time speedup when using /DEBUG:fastlink.
Quality of Life Improvements
- The compiler displays file names and paths using user-provided casing where previously the compiler displayed lower-cased file names and paths.
- The new linker will now report potentially matched symbol(s) for unresolved symbols, like:
main.obj : error LNK2019: unresolved external symbol _foo referenced in function _main
  Hint on symbols that are defined and could potentially match:
    "int __cdecl foo(int)" (?foo@@YAHH@Z)
    "bool __cdecl foo(double)" (?foo@@YA_NN@Z)
    @foo@0
    foo@@4
main.exe : fatal error LNK1120: 1 unresolved externals
- When generating a static library, it is no longer required to pass the /LTCG flag to LIB.exe.
- Added a linker option /LINKREPROTARGET:[binary_name] to generate a link repro only for the specified binary. This allows %LINK_REPRO% or /LINKREPRO:[directory_name] to be set in a large build with multiple link steps, and the linker will generate the repro only for the binary specified in /LINKREPROTARGET.
We’d love for you to download Visual Studio 2019 and give it a try. As always, we welcome your feedback. We can be reached via the comments below or via email (firstname.lastname@example.org). If you encounter problems with Visual Studio or MSVC, or have a suggestion for us, please let us know through Help > Send Feedback > Report A Problem / Provide a Suggestion in the product, or via Developer Community. You can also find us on Twitter (@VisualC) and Facebook (msftvisualcpp).