What SMT is and how it works in applications: pros and cons

    While indulging my graphomaniac habit with a detailed technical article about “Windows Performance Station”, I wanted to share my thoughts on the good and the bad that SMT brings to AMD and Intel processors, and how “Windows Performance Station” can help here.

    image

    For those who are interested in this topic, read on...


    So, for starters, let's define what SMT is.

    As Wikipedia tells us, SMT (simultaneous multithreading) is a technique in which several threads execute at the same time, rather than one after another as in “temporal multithreading”.

    Many people know this technology as “Intel Hyper-Threading”. Much has been written about it over the years, yet I keep running into developers, let alone ordinary users, who do not understand the essence of the “simultaneous” execution of several commands on one processor core and the problems it brings.

    First, let's talk about temporal multithreading. Before SMT arrived in the form of “Hyper-Threading”, “temporal multithreading” was used.

    Everything is simple here: imagine that we have one conveyor and one worker (a CPU core) who performs operations on numbers and writes down the result. Suppose he needs a screwdriver for some operations and a wrench for others. The operating system (OS) lays the work out on the conveyor in order: one screwdriver operation, then one wrench operation behind it. At any given moment the worker can use either the wrench or the screwdriver, but not both. By laying out different numbers of the two kinds of blocks, the OS sets the priority of operations coming from different applications: the proportion of one kind of block to another is exactly what we set when we assign a process priority. This is what all task managers do, including Windows Performance Station.
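    The time-sliced scheme above can be sketched in a few lines of Python. This is a toy model, not real scheduler code: the two queues stand for screwdriver and wrench operations, and the ratio stands for process priority.

```python
# Toy model of temporal multithreading: one worker (core) interleaves two
# task queues, and the OS-style scheduler decides the mix by a priority ratio.
# Queue contents and the ratio are illustrative, not from any real scheduler.

def schedule_time_sliced(queue_a, queue_b, ratio=(2, 1)):
    """Interleave two queues: `ratio[0]` tasks from A per `ratio[1]` from B."""
    order = []
    ia = ib = 0
    while ia < len(queue_a) or ib < len(queue_b):
        for _ in range(ratio[0]):          # higher-priority time slice
            if ia < len(queue_a):
                order.append(queue_a[ia]); ia += 1
        for _ in range(ratio[1]):          # lower-priority time slice
            if ib < len(queue_b):
                order.append(queue_b[ib]); ib += 1
    return order

# "screwdriver" ops vs "wrench" ops, laid out with a 2:1 priority
print(schedule_time_sliced(["s1", "s2", "s3", "s4"], ["w1", "w2"]))
# → ['s1', 's2', 'w1', 's3', 's4', 'w2']
```

Raising a process's priority in this model simply means giving its kind of block a larger share of the ratio.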

    image

    With the advent of SMT, the situation becomes a little more complicated.

    Imagine a conveyor and two workers who share one screwdriver and one wrench between them. At any given moment each of them can use either the screwdriver or the wrench, but not both. The conveyor is conditionally split lengthwise into two halves. SMT allows two numbers to be placed on such a conveyor at once, one for the screwdriver and one for the wrench, so the workers act like this:

    - The first worker receives a screwdriver operation, and the second, standing opposite, simultaneously receives a wrench operation, after which both write down their results.
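    This tool-sharing can also be sketched as a toy model. The rule here is an assumption made for illustration: when the two pending operations need different tools, both run in one cycle; when they need the same tool, one of them stalls.

```python
# Toy model of SMT resource sharing: two workers share one screwdriver ('s')
# and one wrench ('w'). Operations needing different tools overlap in one
# cycle; operations needing the same tool serialize. Illustrative only.

def run_smt(ops_a, ops_b):
    """ops_a/ops_b are lists of 's' or 'w'. Returns total cycles used."""
    cycles = 0
    ia = ib = 0
    while ia < len(ops_a) or ib < len(ops_b):
        cycles += 1
        a = ops_a[ia] if ia < len(ops_a) else None
        b = ops_b[ib] if ib < len(ops_b) else None
        if a is not None and b is not None and a != b:
            ia += 1; ib += 1   # different tools: both workers run this cycle
        elif a is not None:
            ia += 1            # same tool (or B idle): A runs, B stalls
        else:
            ib += 1
    return cycles

print(run_smt(["s", "s"], ["w", "w"]))  # → 2 (perfect overlap)
print(run_smt(["s", "s"], ["s", "s"]))  # → 4 (tool contention serializes)
```

The second call shows the SMT worst case: two threads of the same type gain nothing from sharing a core.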

    image

    Based on this, when the operations (A and B) sit on one side of the conveyor and (D and E) on the other, everything is fine; but when a computation chain is parallelized, two problems can occur:

    1. One side of the conveyor computes the action (A and B) = C, and the other computes (D and E) = C,
    i.e. one value of C must be written first and the second value of C after it, not both at the same time (a control conflict).

    2. One side of the conveyor computes the action (A and B) = C, and the other computes (A and C) = D,
    i.e. C must be computed first and only then D, not both at the same time (a data conflict).
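    Both conflicts reduce to a simple check over each operation's inputs and output. The sketch below encodes that check; the tuple representation of an operation is my own illustration, not how a real pipeline stores instructions.

```python
# Sketch of the two hazards above. An operation is (inputs, output); two
# "pipeline halves" may run in parallel only if they don't write the same
# destination and neither result feeds the other. Illustrative only.

def can_run_in_parallel(op1, op2):
    """Return False on a write conflict (problem 1) or a data
    dependency between the operations (problem 2)."""
    in1, out1 = op1
    in2, out2 = op2
    if out1 == out2:                     # problem 1: both write C
        return False
    if out1 in in2 or out2 in in1:       # problem 2: D needs C first
        return False
    return True

print(can_run_in_parallel((["A", "B"], "C"), (["D", "E"], "F")))  # → True
print(can_run_in_parallel((["A", "B"], "C"), (["D", "E"], "C")))  # → False
print(can_run_in_parallel((["A", "B"], "C"), (["A", "C"], "D")))  # → False
```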

    Both conflicts delay instruction execution and are resolved by executing the commands sequentially. To reduce these delays, processors gained two elements: the branch predictor and the processor cache.

    The branch predictor, as the name implies, makes predictions :) It predicts the probability of the first problem occurring, when different transformations are about to land on the same number.

    The processor cache, in turn, helps solve the second problem quickly: we postpone the expression (A and C) = D, write the result of (A and B) = C into the cache, and then immediately compute (A and C) = D from it.
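    That trick can be sketched as a tiny interpreter: each result is written into a dict standing in for the cache, so the next, dependent operation can read it immediately. The operation tuples and the AND function here are my own illustration.

```python
# Sketch of resolving the data conflict via a cache: run dependent operations
# in order, keeping each intermediate result in a dict (the "cache") so the
# next operation can consume it at once. Illustrative only.

def execute_with_cache(ops, initial_values):
    """ops: list of (input_names, output_name, fn). Returns the final cache."""
    cache = dict(initial_values)
    for input_names, output_name, fn in ops:
        cache[output_name] = fn(*(cache[name] for name in input_names))
    return cache

AND = lambda a, b: a & b

# (A and B) = C must finish before (A and C) = D can start
result = execute_with_cache(
    [(("A", "B"), "C", AND), (("A", "C"), "D", AND)],
    {"A": 1, "B": 1},
)
print(result["C"], result["D"])  # → 1 1
```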

    In fairness, it is worth clarifying that the pipeline-parallelization problem also appears in multi-core processors without SMT, but those do not suffer the idle moments caused by one screwdriver shared between two workers, because in our terminology each worker there has his own screwdriver and his own wrench.

    image

    All this dancing around the processor guessing how to parallelize the current operations leads to serious energy losses and tangible freezes when tasks of different types starve on SMT cores.

    In general, it is worth keeping in mind that Intel developed Hyper-Threading at the same time as its first multi-core Xeon processors, and this technology can essentially be seen as a compromise: a doubled pipeline bolted onto a single core.

    At the prompting of marketers, it is customary to praise how well a single core can perform several tasks at once and how productivity improves “in some usage scenarios”, while keeping silent about the problems inherent in the SMT concept.

    It is noteworthy that the promotional video on the Intel website emphasizes dual cores rather than Hyper-Threading; those who have read this far have probably already guessed why :)

    Image from video:

    image

    More accurate image:

    image



    What conclusion can be made here and what can be improved?

    The way Windows displays the load of logical cores muddles the picture of the actual workload of cores with SMT. If you see that two neighboring logical cores are each about 50% busy, it can mean one of two things:
    1) both cores are performing two parallel computations and are each loaded at 50% (everything is fine here);
    2) both cores are performing one computation in turns (as if the two workers passed the wrench to each other every clock cycle).
    Therefore, if you see that all the cores of your SMT processor sit at 50% load and never go higher, this most likely means the processor is utilized at 100% but is busy with a uniform task that cannot be split across SMT!
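    A rough way to read such numbers is to fold each pair of logical-core loads back into a physical-core estimate. The pairing scheme below, siblings (0,1), (2,3), and so on, is an assumption; real topology must be read from the OS or CPUID.

```python
# Sketch: why 50% on every logical core can mean a fully busy physical core.
# If one thread ping-pongs between two SMT siblings, each sibling shows ~50%
# while the physical core is saturated. Sibling pairing (0,1), (2,3), ... is
# assumed for illustration; query the OS for the real topology.

def physical_core_load(logical_loads):
    """Upper-bound estimate: a physical core is at most as busy as the sum
    of its two siblings' loads, capped at 100%."""
    pairs = zip(logical_loads[0::2], logical_loads[1::2])
    return [min(a + b, 100) for a, b in pairs]

print(physical_core_load([50, 50, 50, 50]))  # → [100, 100]
```

By this estimate, a uniform 50% across all logical cores is indistinguishable from full saturation, which is exactly the ambiguity described above.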

    Along with its obvious advantages, SMT brings freezes to time-sensitive tasks (video/music playback or FPS in games). That is why many gamers observe a drop in FPS when SMT / Hyper-Threading is enabled.

    As I wrote earlier, the Windows Performance Station application can sort the blocks laid out on the conveyor as early as the stage when the OS kernel schedules tasks. Using priorities and distributing processes across processor cores, it can lay out particular blocks on the conveyor in the right proportion and place blocks of different types on different logical cores so that different kinds of tasks do not starve. It is for this dynamic analysis that Windows Performance Station combines a neural network with the task manager: the neural network analyzes each task and, depending on the data it receives, distributes it according to different rules, so that each core in an SMT pair performs tasks of a different type.
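    The placement idea can be sketched as follows. The task classification here is done by hand and merely stands in for the neural network's output; the core numbering for one SMT pair is likewise an assumption for illustration.

```python
# Toy version of the placement rule: put tasks of different kinds on the two
# logical cores of one SMT pair so the siblings don't compete for the same
# execution units. The "compute"/"io" labels stand in for the classifier
# (the neural network in the article); core IDs 0 and 1 are illustrative.

def assign_to_smt_pair(tasks):
    """tasks: list of (name, kind). Returns {logical_core: [task names]}
    for one SMT pair with logical cores 0 and 1."""
    assignment = {0: [], 1: []}
    for name, kind in tasks:
        core = 0 if kind == "compute" else 1
        assignment[core].append(name)
    return assignment

print(assign_to_smt_pair([("render", "compute"), ("disk", "io"),
                          ("encode", "compute")]))
# → {0: ['render', 'encode'], 1: ['disk']}
```

A real implementation would then pin each group to its logical core (on Windows, via thread/process affinity), which is the part the application automates.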

    image

    Thanks to this approach, SMT processors under Windows handle multitasking and multi-threaded workloads more efficiently. And that is why we were very pleased to see SMT appear in the new AMD Ryzen processors.

    The “Windows Performance Station” application is free and contains no ads; it can be downloaded from our website using the link in the spoiler:


    You can read more about Windows Performance Station in my previous article.
    Windows Performance Station or how I taught a computer to work efficiently


    Many thanks to everyone who read to the end.
