H.265 / HEVC. Optimization for Intel architecture

Original author: Yang Lu

The current state of media codecs can be summed up in a few words: the simple solutions have been exhausted. Every year the material to be encoded gets more complex and the quality requirements get higher. When a brute-force approach no longer pays off, optimizing both encoding and playback for specific platforms, using their most advanced capabilities, becomes especially important. We will show what such optimization can achieve using the promising H.265 codec as an example. The target platform is Intel's server solution, the Xeon processor.

A brief description of H.265 / HEVC

H.265 / HEVC (High Efficiency Video Coding) is the latest video coding standard, developed jointly by the International Telecommunication Union (ITU-T) and ISO / IEC. Its goal is to increase compression efficiency and reduce data loss. Compared with its predecessor, H.264 / AVC, H.265 / HEVC achieves roughly twice the compression ratio at equal subjective image quality. This lets video providers deliver high-quality content with less load on the network.
The main functional innovations in H.265 are:
  • Dedicated features for random access and bitstream splicing. In H.264 / MPEG-4 AVC a bitstream must always begin with an IDR access unit, whereas HEVC supports random access.
  • The picture is divided into coding tree units (CTUs), each containing luma and chroma coding tree blocks (CTBs). All previous video coding standards used a fixed 16 × 16 array of luma samples. HEVC supports CTBs of different sizes, chosen according to the encoder's memory and processing-power constraints.
  • Each coding block (CB) can be recursively split into transform blocks (TBs); the split is defined by the residual quadtree. Unlike previous standards, HEVC allows a single TB to span multiple prediction blocks (PBs) in inter-predicted coding units (CUs).
  • Directional intra prediction with 33 different orientations for transform blocks (TBs) ranging in size from 4 × 4 to 32 × 32. The directions cover a 180-degree range, from the bottom-left diagonal to the top-right diagonal. HEVC also supports non-directional intra prediction modes (planar and DC).
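
The recursive splitting of a CTU described in the list above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not code from HM or any real encoder: the names `block_is_busy` and `split_cu` and the variance-based split rule are hypothetical, and a real encoder would decide each split by comparing rate-distortion costs.

```c
#include <assert.h>
#include <stdint.h>

#define CTU_SIZE 64   /* largest luma CTB size allowed by HEVC */
#define MIN_CU   8    /* smallest luma coding block size allowed by HEVC */

/* Hypothetical split criterion: a block is "busy" when the variance of its
 * samples exceeds var_threshold. Integer arithmetic only:
 * n*sum(p^2) - (sum p)^2 equals n^2 times the variance. */
static int block_is_busy(const uint8_t *pic, int stride, int x, int y,
                         int size, uint64_t var_threshold)
{
    uint64_t sum = 0, sq = 0;
    for (int j = 0; j < size; j++) {
        for (int i = 0; i < size; i++) {
            uint64_t p = pic[(y + j) * stride + (x + i)];
            sum += p;
            sq += p * p;
        }
    }
    uint64_t n = (uint64_t)size * (uint64_t)size;
    return n * sq - sum * sum > var_threshold * n * n;
}

/* Recursively split a square block into four quadrants while it is busy and
 * larger than MIN_CU. Returns the number of leaf coding units produced. */
static int split_cu(const uint8_t *pic, int stride, int x, int y,
                    int size, uint64_t var_threshold)
{
    assert(size >= MIN_CU && size <= CTU_SIZE);
    if (size > MIN_CU && block_is_busy(pic, stride, x, y, size, var_threshold)) {
        int half = size / 2;
        return split_cu(pic, stride, x,        y,        half, var_threshold)
             + split_cu(pic, stride, x + half, y,        half, var_threshold)
             + split_cu(pic, stride, x,        y + half, half, var_threshold)
             + split_cu(pic, stride, x + half, y + half, half, var_threshold);
    }
    return 1;   /* leaf: this block is coded as a single CU */
}
```

Calling `split_cu(ctu, CTU_SIZE, 0, 0, CTU_SIZE, 0)` on a completely flat 64 × 64 block yields a single CU, while a block with detail concentrated in one corner is split only along the path leading to that corner. This is the property that lets large CTBs save bits on smooth regions without losing precision on detailed ones.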

H.265 / HEVC places extremely high demands on the computing power of both client devices and back-end transcoding servers.

HEVC Performance Issues

The existing HEVC Test Model (HM) reference software implements only the basic functionality of the standard, and its performance is still far from what real-world use requires. The project has two main shortcomings:
  • No parallelization scheme.
  • Poorly tuned vectorization.

Figure 1. HM project profile - parallel operation of threads

Figure 2. HM project profile - resource-intensive code

Compared with H.264, the HM codec consumes 100 times more CPU resources on the server side and 10 times more on the client side.
H.265 / HEVC has attracted the attention of many companies and organizations around the world, which has driven work on optimizing its performance and on practical implementations. There are several open source projects:
  • OpenHEVC (compatible with HM10.0, decoder optimization)
  • x265 (compatible with HM, parallelization and vectorization)

To evaluate the x265 encoder on a platform with Intel® Xeon® processors (E5-2680, 2.7 GHz, 2 × 8 physical cores, codenamed Sandy Bridge), we encoded a 720p video at 24 frames per second. The x265 developers have done a great deal to parallelize the reference implementation at both the task and data levels. Even so, our test showed that the encoder used only 6 cores on a system with 32 logical cores (SMT enabled), so it does not fully exploit the resources of modern multi-core platforms.

Figure 3. CPU utilization in the x265 project

Figure 4. x265 project with Intel® SIMD configuration

Intel® SIMD instructions (auto-generated by the compiler) were also used in the x265 project, yielding a performance gain of more than 70%. Combined with further tuning of compiler options, the Intel compiler doubles performance on the IA platform. Even so, encoder performance remains far below what real-time encoding requires, especially for 1080p high-definition video.
Below we present the results that the Chinese company Strongene, with support from Intel engineers, achieved in optimizing its own H.265 / HEVC codec for various Intel platforms.

Optimizing HEVC for the Intel® Xeon® Platform

Most of the demanding video and image processing functions are compute-intensive operations on blocks of data, which can be optimized with Intel® SIMD vector instructions. Guided by profiling data, the most resource-intensive functions of the Strongene encoder can all be vectorized by hand with Intel SSE instructions: low-complexity motion-compensated interpolation; transpose-free integer transform; Hadamard transform; and computation of the sum of absolute differences (SAD) and sum of squared differences (SSD) with minimal memory overhead. The Intel SSE instructions are used as intrinsic functions, as shown in Figure 5.

Figure 5. Example of incorporating Intel® SIMD / SSE instructions in the Strongene codec.
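
The kind of intrinsic-based kernel described above can be illustrated with a SAD routine. The following is a minimal sketch, not Strongene's code: the function names `sad_scalar` and `sad_sse2` are hypothetical, the block width is assumed to be a multiple of 16, and the intrinsics are standard SSE2 ones from `<emmintrin.h>`. On compilers that do not define `__SSE2__` the vector path is replaced by the scalar one so that the sketch still builds.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Scalar reference: sum of absolute differences over a w x h block. */
static uint32_t sad_scalar(const uint8_t *a, int a_stride,
                           const uint8_t *b, int b_stride, int w, int h)
{
    uint32_t sum = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            sum += (uint32_t)abs(a[y * a_stride + x] - b[y * b_stride + x]);
    return sum;
}

/* SSE2 version: handles 16 pixels per iteration; w must be a multiple of 16. */
static uint32_t sad_sse2(const uint8_t *a, int a_stride,
                         const uint8_t *b, int b_stride, int w, int h)
{
    assert(w % 16 == 0);
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x += 16) {
            /* Unaligned loads, so rows may start at any address. */
            __m128i va = _mm_loadu_si128((const __m128i *)(a + y * a_stride + x));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + y * b_stride + x));
            /* PSADBW yields two partial sums, one per 64-bit half. */
            acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
        }
    }
    /* Add the upper 64-bit partial sum to the lower one and extract it. */
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return (uint32_t)_mm_cvtsi128_si32(acc);
#else
    return sad_scalar(a, a_stride, b, b_stride, w, h);   /* fallback path */
#endif
}
```

The whole inner loop of the scalar version (subtract, absolute value, accumulate) collapses into a single `_mm_sad_epu8` per 16 pixels, which is why SAD is one of the most rewarding kernels to vectorize by hand.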

Strongene developers rewrote all the resource-intensive functions to maximize encoder performance. Figure 6 shows our profiling data for a 1080p HEVC encoding scenario: 60% of the resource-intensive functions are executed with Intel SIMD instructions.

Figure 6. Results of profiling Strongene encoding functions.

Intel AVX2 instructions operate on 256-bit integer vectors and can in theory be twice as fast as the earlier Intel SSE code, which works with 128-bit vectors. The Intel AVX2 instruction set is supported by the Intel Xeon (Haswell) platform, launched in 2014. To evaluate the performance of Intel AVX2 intrinsics, we use a common kernel: computing the sum of absolute differences for a 64 × 64 block.
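
A 64 × 64 SAD kernel of the kind measured here might look as follows with AVX2 intrinsics. This is a hedged sketch rather than the code that was benchmarked: the names `sad64x64_ref` and `sad64x64_avx2` are hypothetical, both blocks are assumed to share one stride, and the vector path is compiled only when the compiler defines `__AVX2__` (for example with `-mavx2`); otherwise the function falls back to the scalar reference.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__AVX2__)
#include <immintrin.h>   /* AVX2 intrinsics */
#endif

/* Scalar reference SAD for a 64 x 64 block. */
static uint32_t sad64x64_ref(const uint8_t *a, const uint8_t *b, int stride)
{
    uint32_t sum = 0;
    for (int y = 0; y < 64; y++)
        for (int x = 0; x < 64; x++)
            sum += (uint32_t)abs(a[y * stride + x] - b[y * stride + x]);
    return sum;
}

/* AVX2 version: 32 pixels per instruction, i.e. two loads per 64-pixel row. */
static uint32_t sad64x64_avx2(const uint8_t *a, const uint8_t *b, int stride)
{
    assert(stride >= 64);
#if defined(__AVX2__)
    __m256i acc = _mm256_setzero_si256();
    for (int y = 0; y < 64; y++) {
        for (int x = 0; x < 64; x += 32) {
            __m256i va = _mm256_loadu_si256((const __m256i *)(a + y * stride + x));
            __m256i vb = _mm256_loadu_si256((const __m256i *)(b + y * stride + x));
            /* VPSADBW yields four partial sums, one per 64-bit lane. */
            acc = _mm256_add_epi64(acc, _mm256_sad_epu8(va, vb));
        }
    }
    /* Fold the four 64-bit partial sums into a single value. */
    __m128i lo = _mm256_castsi256_si128(acc);
    __m128i hi = _mm256_extracti128_si256(acc, 1);
    __m128i s  = _mm_add_epi64(lo, hi);
    s = _mm_add_epi64(s, _mm_srli_si128(s, 8));
    return (uint32_t)_mm_cvtsi128_si32(s);
#else
    return sad64x64_ref(a, b, stride);   /* built without AVX2 support */
#endif
}
```

Each `_mm256_sad_epu8` covers 32 bytes where its SSE2 counterpart covers 16, so the instruction count in the inner loop is halved. The measured gain is typically smaller than 2×, because loads and the final reduction do not shrink in the same proportion.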

Table 1. Intel® SSE and Intel® AVX2 implementation results (CPU cycles)

            Original code   Intel® SSE   Intel® AVX2
Run 1       98877           977          679
Run 2       98463           1092         690
Run 3       98152           978          679
Run 4       98003           943          679
Run 5       98118           954          678
Average     98322.6         988.8        681
Speedup     1.00            99.44        144.38

As Table 1 shows, the Intel SSE and Intel AVX2 code is roughly 100 times faster than the original, and the Intel AVX2 code gains a further 45% or so over Intel SSE (988.8 versus 681 cycles on average).
As we saw earlier, most existing implementations do not use all the cores of a multi-core platform. Building on the latest multi-core Intel Xeon architecture and on the dependencies between CTB-based parallel algorithms, Strongene developers proposed replacing the original OWF (overlapped wavefront) and WPP (wavefront parallel processing) methods with a parallel IFW framework, and then developed a three-level thread management scheme so that the IFW framework fully utilizes all CPU cores to speed up HEVC encoding.

Figure 7. Parallel operation of threads and CPU usage in Strongene encoder
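
To see why CTB dependencies limit parallelism, consider classic WPP, where a CTB can start only after its left neighbour and the top-right neighbour in the row above are done. The sketch below computes an idealised schedule for that rule. It is an illustration only: it models plain WPP, not Strongene's IFW framework (whose details are not given here), and the function name `wpp_schedule`, the unit cost per CTB, and the unlimited thread count are all assumptions.

```c
#include <assert.h>

#define MAX_ROWS 64
#define MAX_COLS 128

/* Fill step[r][c] with the earliest time step (1-based) at which CTB (r, c)
 * can finish, assuming each CTB takes one step and threads are unlimited.
 * Returns the step at which the last CTB of the frame finishes. */
static int wpp_schedule(int rows, int cols, int step[MAX_ROWS][MAX_COLS])
{
    assert(rows > 0 && rows <= MAX_ROWS && cols > 0 && cols <= MAX_COLS);
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            int ready = 0;
            /* Dependency 1: the left neighbour in the same row. */
            if (c > 0 && step[r][c - 1] > ready)
                ready = step[r][c - 1];
            /* Dependency 2: the top-right neighbour in the row above
             * (the last column uses the CTB directly above instead). */
            if (r > 0) {
                int tc = (c + 1 < cols) ? c + 1 : c;
                if (step[r - 1][tc] > ready)
                    ready = step[r - 1][tc];
            }
            step[r][c] = ready + 1;
        }
    }
    return step[rows - 1][cols - 1];
}
```

For a 1080p frame split into 64 × 64 CTBs (30 columns by 17 rows, 510 CTBs), this model finishes in 62 steps, so on average only about 8 CTBs are in flight even with unlimited threads. Under these idealised assumptions, a wavefront confined to a single frame cannot keep 32 logical cores busy, which is consistent with the motivation for a framework that extracts parallelism beyond it.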

By using the new parallel IFW framework at the task level and Intel SIMD instructions throughout at the data level, the Strongene encoder developers achieved a very significant performance gain on x86 processors for 1080p video, using the computing resources of all cores, as shown in Fig. 8.

Further tuning with SMT / HT

It is also interesting to see how codec performance depends on whether simultaneous multithreading (SMT), also known as Hyper-Threading (HT) Technology, is enabled. This technology is widely available on Intel architecture platforms.

Table 2. Strongene HEVC encoding speed on Intel® Xeon® platforms

As the table shows (highlighted in yellow), on the Ivy Bridge platform (Intel Xeon E5-2697 v2 processor) with SMT disabled, HEVC encoding of 1080p video runs in real time!

Having achieved this large performance gain, we continued to explore the capabilities of the Strongene HEVC encoder on the Ivy Bridge platform, focusing on bitrate and quality.

Table 3. Comparison of the performance of H.264 and H.265 codecs

Table 3 shows that the H.265 / HEVC codec reduces the bitrate by 50% while maintaining the same video quality.

H.265 / HEVC is likely to become the most popular video standard of the next decade, and many multimedia applications and products already support it. In this article we described a CPU-based, full-featured, real-time HEVC solution for Intel platforms that uses new IA technologies. Our optimized solution for Intel processors has been deployed at Xunlei, an Internet video services company, and will help the widespread adoption of H.265 / HEVC technology.
