The history of one processor's instructions. Part 1. Differences between the lddqu and movdqu assembly instructions


    In the not-so-distant year 2000, Intel brought the NetBurst microarchitecture to market with the Pentium 4 processors. In 2004, when Prescott-based processors appeared, the LDDQU instruction was introduced as part of the SSE3 instruction set.

    It was, however, intended for a single application area, namely video encoding. In more detail:

    In video encoding, the largest share of computation usually goes to motion estimation (ME), which compares blocks of the current frame with blocks of the previous frame and searches for the best match. Many metrics can be used to find the best match. The most common is the L1 metric, the sum of absolute differences. In motion estimation, loads of blocks from the previous frame are unaligned, while loads of blocks from the current frame are aligned. Unaligned loads incur two kinds of penalty:
    • the processing cost of accessing unaligned data;
    • the cost of cache-line splits.
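    The L1 metric and the search for the best match can be sketched in plain C. This is only an illustrative scalar sketch, not the encoder code the article refers to: the names `sad_block` and `best_offset` are hypothetical, and the search covers a single row of candidate offsets. It does show why previous-frame loads end up unaligned: the candidate offset is arbitrary.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences (the L1 metric) between two blocks of n bytes. */
static uint32_t sad_block(const uint8_t *cur, const uint8_t *prev, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int d = (int)cur[i] - (int)prev[i];
        sum += (uint32_t)(d < 0 ? -d : d);
    }
    return sum;
}

/* Hypothetical helper: tries `candidates` byte offsets into the previous
 * frame's row and returns the offset with the smallest SAD.
 * prev_row + off is generally NOT 16-byte aligned, even if prev_row is. */
static size_t best_offset(const uint8_t *cur, const uint8_t *prev_row,
                          size_t candidates, size_t n)
{
    size_t best = 0;
    uint32_t best_sad = UINT32_MAX;
    for (size_t off = 0; off < candidates; off++) {
        uint32_t s = sad_block(cur, prev_row + off, n);
        if (s < best_sad) {
            best_sad = s;
            best = off;
        }
    }
    return best;
}
```

    The caller must make sure that `prev_row` holds at least `candidates - 1 + n` bytes.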

    The NetBurst microarchitecture has no micro-operation for loading 128 bits of unaligned data. For this reason, 128-bit unaligned load instructions such as movups and movdqu are emulated in microcode with two 64-bit loads whose results are combined into a 128-bit value. On top of the cost of emulating the unaligned load, there is the cost of handling a cache-line split whenever the access crosses a 64-byte boundary.
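    Both costs can be modelled in a few lines of C. This is a rough sketch under stated assumptions: the `vec128` type and the function names are hypothetical, the 64-byte line size is the one mentioned in the text, and the code only mimics the idea of the microcode sequence, not its actual implementation.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64u  /* cache-line size in bytes, as stated in the text */

/* Hypothetical stand-in for a 128-bit register: two 64-bit halves. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} vec128;

/* Returns 1 if a 16-byte access starting at addr crosses a 64-byte boundary. */
static int splits_cache_line(uintptr_t addr)
{
    return (addr % CACHE_LINE) + 16u > CACHE_LINE;
}

/* Model of an unaligned 128-bit load done as two 64-bit loads
 * whose results are combined into one 128-bit value. */
static vec128 load128_as_two_64(const uint8_t *p)
{
    vec128 v;
    memcpy(&v.lo, p, 8);      /* first 64-bit load  */
    memcpy(&v.hi, p + 8, 8);  /* second 64-bit load */
    return v;
}
```

    For example, an access at offset 48 within a line ends exactly at the boundary and does not split, while an access at offset 49 does.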
    To solve the cache-line split problem for 128-bit unaligned loads, the lddqu instruction was added to the SSE3 instruction set. It loads a 32-byte block aligned on a 16-byte boundary and extracts the 16 bytes that correspond to the unaligned access. Because the instruction loads more bytes than requested, some usage restrictions apply. lddqu should not be used on regions of the address space with uncacheable (UC) or write-combining (USWC) memory. In addition, because of the way lddqu is implemented, it should not be used where store-to-load forwarding is expected. In situations where data is only loaded and UC or USWC memory is not involved, lddqu can be used in place of movdqu.
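    The behaviour described above (load an aligned 32-byte window, then extract 16 bytes) can be modelled in C as follows. This is only a sketch of the idea, not of the hardware: `lddqu_model` is a hypothetical name, and memory is modelled as a byte array whose index 0 is assumed to sit on a 16-byte boundary.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of the lddqu behaviour on NetBurst as described in the text.
 * mem:  byte array, index 0 assumed to be 16-byte aligned
 * addr: (possibly unaligned) offset of the 16 bytes wanted
 * Note that bytes [base, base + 32) are read even though only 16 are
 * requested, so all 32 must be readable. */
static void lddqu_model(const uint8_t *mem, size_t addr, uint8_t out[16])
{
    size_t base  = addr & ~(size_t)15;  /* round down to a 16-byte boundary */
    size_t shift = addr & 15;           /* position of the data in the window */
    uint8_t window[32];

    memcpy(window,      mem + base,      16);  /* first aligned 16-byte load  */
    memcpy(window + 16, mem + base + 16, 16);  /* second aligned 16-byte load */
    memcpy(out, window + shift, 16);           /* extract the requested bytes */
}
```

    The extra bytes touched by the window are what makes the instruction unsuitable for UC and USWC memory, where reads can have side effects or are expensive.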
    The listings below show an example of using the new instruction. The two code sequences are identical except that the old unaligned load (movdqu) is replaced with the new one (lddqu). Assuming that 25% of unaligned loads cross a cache-line boundary, the new instruction can improve motion estimation performance by up to 30%. MPEG-4 encoders have shown speedups of more than 10%.

    Motion estimation without SSE3:

        movdqa xmm0, <current frame>
        movdqu xmm1, <previous frame>
        psadbw xmm0, xmm1
        paddw xmm2, xmm0

    Motion estimation with SSE3:

        movdqa xmm0, <current frame>
        lddqu xmm1, <previous frame>
        psadbw xmm0, xmm1
        paddw xmm2, xmm0

    More details are available here: download.intel.com/technology/itj/2004/volume08issue01/art01_microarchitecture/vol8iss1_art01.pdf

    There are also some very interesting discussions of this topic.

    In summary, starting with Intel Core 2 (that is, the Core microarchitecture, which appeared in mid-2006, from Merom processors onward) and in all later models, the lddqu instruction does exactly the same thing as movdqu.
    In other words, if the processor supports the Supplemental Streaming SIMD Extensions 3 (SSSE3) instruction set, lddqu behaves the same as movdqu. If the processor supports SSE3 but not SSSE3, use lddqu (and do not forget the restrictions on memory types described above).
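    That rule of thumb can be written down as a small C function. This is a sketch, not a recommendation from Intel: the enum and the function name are hypothetical. The bit positions are those of CPUID leaf 1, register ECX (bit 0 for SSE3, bit 9 for SSSE3). How the ECX value is obtained is left to the caller, for example via `__get_cpuid` from `<cpuid.h>` on GCC or Clang.

```c
#include <stdint.h>

#define CPUID1_ECX_SSE3  (1u << 0)  /* CPUID.01H:ECX bit 0 */
#define CPUID1_ECX_SSSE3 (1u << 9)  /* CPUID.01H:ECX bit 9 */

/* Hypothetical result type: which unaligned 128-bit load to emit. */
typedef enum {
    LOAD_MOVDQU,
    LOAD_LDDQU
} unaligned_load;

/* SSE3 without SSSE3 -> lddqu is worth using (mind UC/USWC memory);
 * SSSE3 present      -> lddqu and movdqu behave the same, use movdqu;
 * no SSE3            -> lddqu is not available, use movdqu. */
static unaligned_load choose_unaligned_load(uint32_t cpuid1_ecx)
{
    int has_sse3  = (cpuid1_ecx & CPUID1_ECX_SSE3)  != 0;
    int has_ssse3 = (cpuid1_ecx & CPUID1_ECX_SSSE3) != 0;

    if (has_sse3 && !has_ssse3)
        return LOAD_LDDQU;
    return LOAD_MOVDQU;
}
```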

    Finally, regarding patents: note patent number 6721866, which also describes some details of the implementation and use.

    P.S. For reference, here is a useful article that collects data on all Intel microarchitectures: en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures
    (in other words, Wikipedia, as always)
