Monsters after the holidays: AMD Threadripper 2990WX 32-Core and 2950X 16-Core



In the workstation world, performance is king. When it comes to processing data, throughput is the key factor: the more work a user can get through, the more projects are completed and the more contracts can be taken on. Workstation users often analyze where their systems bottleneck and are happy to throw resources at the problem, be it cores, memory, or graphics acceleration. The second-generation Threadripper, known as Threadripper 2, breaks the old limits on cores per dollar: the 2990WX provides 32 cores and 64 threads for only $1,799, while the 2950X offers 16 cores for $899. We tested both.

AMD Threadripper 2990WX 32-Core and 2950X 16-Core Review

Ever since AMD released the first-generation Ryzen with eight cores against Intel's four, there has been a long-running debate about how many cores make sense. The answer depends entirely on the workload: how many different tools the user intends to run at the same time. Because the workstation market covers a wide range of “heterogeneous” users, all of whom nonetheless need speed, offering a single option that suits everyone is simply unrealistic.

AMD's first-generation Threadripper, released in 2017, brought 16-core processors to the masses. Such core counts were previously available only on server platforms, and the new parts were rated as very competitive against Intel's 10-core offerings. AMD used its server platform with small tweaks to go after the competition's top-end halo product.

Intel's own workstation products, previously sold under names such as the E5-2687W and based on dual-socket server parts, were, quite simply, server processors. After launching its latest high-end desktop platform with up to 18 cores, Intel introduced the Xeon W series to replace the previous-generation E5-W parts. These offer up to 18 cores for about $2,500, although they require special chipsets and motherboards.

Today, AMD officially launches the second-generation Threadripper. The new processors enter the market extremely aggressively: the improved Zen+ microarchitecture brings a 3% increase in IPC, and the 12 nm process allows higher frequencies at lower power. AMD is also attacking the market on core count. Not only are the 12-core and 16-core processors replaced with new Zen+ models at higher frequencies, the company now also offers 24 and 32 cores, with the top part priced at $1,799. 32 cores for $1,799 versus 18 cores for almost $2,500 is a solid blow to the competition, isn't it?
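
To put the price comparison in numbers, here is a minimal Python sketch that divides the quoted prices by the core counts. The dictionary labels and the `price_per_core` helper are illustrative names, and the Intel figure uses the approximate $2,500 quoted above, so the result is only a rough guide.

```python
# Prices and core counts as quoted in the text; the Intel price is approximate.
chips = {
    "Threadripper 2990WX": {"cores": 32, "price_usd": 1799},
    "Intel 18-core part (approx.)": {"cores": 18, "price_usd": 2500},
}


def price_per_core(cores: int, price_usd: float) -> float:
    """Return the price divided by the number of cores."""
    return price_usd / cores


for name, spec in chips.items():
    print(f"{name}: ${price_per_core(**spec):.2f} per core")
```

On these figures the 18-core part costs more than twice as much per core as the 2990WX.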

How AMD supports 32 cores

To reach 32 cores, AMD's first-generation server processor line, EPYC, uses four silicon dies of eight cores each. These parts have eight memory channels and 128 PCIe 3.0 lanes. For the first-generation Threadripper, AMD disabled two of these dies, leaving 16 cores, four memory channels and 60 PCIe lanes. The final product was aimed at retail consumers.

To give users 32 cores, AMD uses the same 32-core EPYC silicon, updated to Zen+ on 12 nm for higher frequency and lower power. It is trimmed slightly to stay compatible with the first generation: four memory channels and 60 PCIe lanes. AMD, however, positions the product as an upgraded first-generation processor with more cores rather than as a cut-down server part. The reason is product segmentation, a tactic both companies have long used to market a broad product line.

As a result, one way to think about the new 32-core and 24-core second-generation chips is as bimodal: half of the chip has access to the full set of resources, just like the first-generation product, while the other half duplicates the same compute resources but sees additional latency to memory and PCIe. For any user who is limited by compute rather than by memory or PCIe, AMD has the better solution.

In our review we will see that this bimodal design has a significant impact on performance, both good and bad, and once again it all depends on the type of workload.

New AMD stack

AMD is officially entering the market with four second-generation Threadrippers. Two of them directly replace first-generation products: the 16-core 2950X replaces the 16-core 1950X, and the 12-core 2920X replaces the 12-core 1920X. These two processors are not bimodal, because only two of the four silicon dies on the package are active (the 16-core configuration is 8 + 0 + 8 + 0, the 12-core is 6 + 0 + 6 + 0). At the bottom of the stack remains the first-generation 8-core 1900X (4 + 0 + 4 + 0), which offers quad-channel memory and 60 PCIe lanes.

The two new processors are the 32-core 2990WX and the 24-core 2970WX. They have four cores per complex (8 + 8 + 8 + 8) and three cores per complex (6 + 6 + 6 + 6) respectively, and share the bimodal memory and PCIe arrangement described above. The branding changes as well: these are WX parts, presumably for “Workstation eXtreme”, which puts them on the same marketing line as the Radeon Pro WX family.
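
The per-die notation used here (for example 8 + 0 + 8 + 0) can be written down as a small Python table. This is only a bookkeeping sketch: the `DIE_CONFIGS` name and the helper functions are illustrative, and the two-threads-per-core factor is inferred from the "32 cores and 64 threads" figure quoted for the 2990WX.

```python
# Active cores on each of the four dies in the package, as listed in the text.
DIE_CONFIGS = {
    "2990WX": (8, 8, 8, 8),
    "2970WX": (6, 6, 6, 6),
    "2950X": (8, 0, 8, 0),
    "2920X": (6, 0, 6, 0),
    "1900X": (4, 0, 4, 0),
}

THREADS_PER_CORE = 2  # inferred from "32 cores and 64 threads"


def total_cores(dies: tuple) -> int:
    """Sum the active cores over all dies in the package."""
    return sum(dies)


def active_dies(dies: tuple) -> int:
    """Count the dies that have at least one active core."""
    return sum(1 for cores in dies if cores > 0)


for model, dies in DIE_CONFIGS.items():
    cores = total_cores(dies)
    print(f"{model}: {active_dies(dies)} active dies, "
          f"{cores} cores, {cores * THREADS_PER_CORE} threads")
```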

The AMD Ryzen Threadripper 2990WX is the new flagship with 32 cores and 64 threads, a base frequency of 3.0 GHz and a top turbo frequency of 4.2 GHz. The idle frequency of this processor is 2.0 GHz: during testing we saw 2.0 GHz on every core when unloaded.

The other WX-series product is the 2970WX, which disables one core per complex for a total of 24 cores. With the same frequencies as the 2990WX, and the same TDP, PCIe lanes and memory support, this processor will launch in October at $1,299. With fewer cores to load, we can expect it to hit turbo more often than its big 32-core brother.

In the X series, the TR 2950X is the 16-core replacement. It takes full advantage of the higher frequencies the new 12 nm process can bring: a base frequency of 3.5 GHz and a turbo of 4.4 GHz bring the previous-generation part to its knees. In effect, the 2950X looks like a well-overclocked Ryzen. There is a price advantage too: instead of $999, users can now get a 16-core processor for $899. The 2950X will be released at the end of the month, on August 31.

Finally, there is the 2920X, which replaces the 1920X and offers the same improvements as the other processors in the line. As with the 2950X, frequencies are up nicely over last year, with a base frequency of 3.5 GHz and a turbo of 4.3 GHz, all in a package with a 180 W TDP. The 2920X will be released in October at a retail price of $649.

Core to core, or design trade-offs

AMD's approach to these large processors is to take a small repeating unit, such as a 4-core complex or an 8-core silicon die (which contains two complexes), and place several of them in a single package to reach the required number of cores and threads. The upside is that many blocks are replicated along the way, such as memory channels and PCIe lanes. The downside is in how these cores and memory have to communicate with each other.

In a standard monolithic (single-die) design, each core sits on an internal interconnect with the memory controller and can reach main memory with low latency. The distance between the cores and the memory controller is usually short, and the routing mechanism (ring or mesh) determines throughput, latency and scalability. The final performance is usually a trade-off between these factors.

In a multi-die design, where each die has direct access to its own local memory but reaches the other memory through an extra hop, we are dealing with a non-uniform memory architecture, known as NUMA. Performance can then be limited by this non-uniform memory latency, so software must be “NUMA-aware” in order to optimize for both latency and throughput. Keep in mind that the extra hops between dies and memory controllers also cost power.
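
As a rough illustration of what “NUMA-aware” can mean in practice, the Python sketch below pins the current process to the CPUs of one NUMA node on Linux. It is an assumption-laden example rather than anything from the review: it relies on the Linux sysfs file `/sys/devices/system/node/nodeN/cpulist` and on `os.sched_setaffinity`, neither of which exists on other operating systems, and the helper names `parse_cpulist` and `pin_to_node` are made up here. It needs Python 3.9 or later.

```python
import os


def parse_cpulist(text: str) -> set[int]:
    """Parse a Linux cpulist string such as '0-7,32-39' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(node: int) -> set[int]:
    """Restrict the current process to the CPUs of one NUMA node (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus
```

Pinning only controls where threads run. Whether memory then ends up on the same node depends on the operating system's allocation policy, so treat this as a starting point.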

We ran into this earlier with the first-generation Threadripper, which has two active silicon dies in the package. If the required data sat in memory local to the other die, a hop was necessary. With the second-generation Threadripper, this gets considerably more complicated.

On the left is the 1950X / 2950X design with two active silicon dies. Each die has direct access to 32 PCIe lanes and two memory channels, which add up to 64 PCIe lanes in total (60 of them usable by devices) and four memory channels. Cores that use the memory and PCIe attached to their own die run faster than when they use resources attached to the other die.

In the 2990WX and 2970WX, the two previously “inactive” dies are now enabled, but they do not gain direct access to memory or PCIe. For these cores there is no “local” memory or connectivity: every access to main memory requires an extra hop. There are also additional die-to-die interconnects based on AMD's Infinity Fabric (IF), which consume power.
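
The difference between the two kinds of die can be expressed as a toy hop count in Python. This is a simplified model, not a measurement: it assumes every pair of dies is directly linked (so any remote access is exactly one hop), that memory hangs off dies 0 and 2 (the numbering is arbitrary), and that accesses are spread evenly over both memory-attached dies.

```python
MEMORY_DIES = (0, 2)  # assumed numbering: two of the four dies own memory channels


def hops(src_die: int, mem_die: int) -> int:
    """Hops from a core's die to a memory die, assuming all dies are directly linked."""
    return 0 if src_die == mem_die else 1


def best_case_hops(src_die: int) -> int:
    """Hops to the nearest memory-attached die."""
    return min(hops(src_die, m) for m in MEMORY_DIES)


def average_hops(src_die: int) -> float:
    """Mean hops if accesses are spread evenly over all memory-attached dies."""
    return sum(hops(src_die, m) for m in MEMORY_DIES) / len(MEMORY_DIES)


for die in range(4):
    print(f"die {die}: best case {best_case_hops(die)} hop(s), "
          f"average {average_hops(die):.1f}")
```

In this model the memory-attached dies can reach some memory with zero hops, while the two compute-only dies always pay at least one.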

The reason these additional cores have no direct access lies in the platform: the TR4 platform for Threadripper processors provides quad-channel memory and 60 PCIe lanes. If the other two dies had their own local memory and PCIe, new motherboards and memory layouts would be required.

Users may ask whether the design could be changed so that each silicon die gets one memory channel and one set of 16 PCIe lanes. Quite possibly. However, the platform is constrained by how the pins and traces on the socket and motherboards are routed. The firmware expects two memory channels per die, and there are also reasons related to power delivery. The motherboards currently on the market are simply not laid out this way. This will have a serious impact on performance, so keep it in mind when we get to the tests.

It is worth noting that the second-generation Threadripper and AMD's server platform, EPYC, are siblings. Both use the same package layout and socket, but EPYC has all eight memory channels and all 128 PCIe lanes enabled:

Where Threadripper 2 loses performance because some cores have no direct memory access, EPYC has direct memory available on every die. It requires more power, but offers a more uniform layout for core-to-core traffic.

Returning to Threadripper 2, it is important to understand how the chip gets loaded. AMD confirmed that, for the most part, the scheduler will first load the cores that are directly attached to memory before using the other cores. Each core is given a priority “weight” based on performance, thermals and power, and the cores closest to memory get the highest priority. As those cores fill up, their priority drops because of thermal inefficiency.
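
The "memory-attached cores first" rule can be sketched as a simple sort in Python. This is a deliberately reduced illustration and not AMD's actual weighting: it ignores the performance, thermal and power terms mentioned above, and the core numbering (eight consecutive cores per die, memory on dies 0 and 2) is a hypothetical layout chosen for the example. It needs Python 3.9 or later.

```python
CORES_PER_DIE = 8
MEMORY_DIES = {0, 2}  # hypothetical: dies 0 and 2 own the memory channels


def die_of(core: int) -> int:
    """Map a core id to its die under the assumed consecutive numbering."""
    return core // CORES_PER_DIE


def fill_order(num_cores: int = 32) -> list[int]:
    """Order cores so that those on memory-attached dies are loaded first."""
    # False sorts before True, so cores on memory dies come first;
    # the core id breaks ties to keep the order stable and readable.
    return sorted(range(num_cores),
                  key=lambda core: (die_of(core) not in MEMORY_DIES, core))


order = fill_order()
print("first 16 cores to load:", order[:16])
print("last 16 cores to load: ", order[16:])
```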

Precision Boost 2

The exact turbo behavior of each new processor is now determined by AMD's voltage-frequency scaling feature, Precision Boost 2. This feature, which we discussed in detail in the Ryzen 7 2700X review, relies on the available power to determine the frequency, rather than on a discrete lookup table of voltages and frequencies based on load. Depending on the capabilities of the system, frequency and voltage are shifted dynamically to use as much of the available power budget as possible at any given load.

This lets a processor extract more than a fixed lookup table would allow, since such a table has to be valid for every processor sold under that model number.

Precision Boost 2 works in conjunction with XFR2 (eXtended Frequency Range 2), which responds to the available thermal headroom. If a good cooler provides additional thermal budget, the processor can use more power before reaching its thermal limit and gain additional frequency. AMD claims that a good cooler in a cool environment can improve performance by more than 10% in some tests thanks to XFR2. Demonstrating this benefit was difficult for AMD, because Threadripper 2 launched in the middle of a European heatwave. Europe is known the world over for its lack of air conditioning, and when the ambient temperature exceeds 30ºC, the performance gains are limited. A review from Scandinavia may show better results than a review from the tropics.

Ultimately, this complicates the testing of Threadripper 2. With a turbo table, performance is deterministic and power consumption is simply a matter of the quality (binning) of each piece of silicon. With PB2 and XFR2, no two processors will perform in exactly the same way.

Fortunately for us, we ran most of our tests in an air-conditioned hotel, thanks to the Intel Data-Centric Innovation Summit, which took place the week before the processors launched.

Precision Boost Overdrive

The new processors support Precision Boost Overdrive (PBO), which looks at three key areas: package power, thermal design current and electrical design current. If any of these three areas shows unused headroom, the system will try to raise both frequency and voltage to increase performance. PBO combines “standard” overclocking, which raises all cores simultaneously, with the ability to push a single core higher to gain performance in mid-sized workloads. PBO also retains the power savings at idle and the “standard” performance behavior. Precision Boost Overdrive is enabled through Ryzen Master.

These "three key areas" are defined by AMD as follows:

Package (CPU) Power, or PPT: the maximum power the socket is allowed to draw, which depends on the power delivery to the socket;
Thermal Design Current, or TDC: the maximum current the motherboard's voltage regulators can deliver once they have warmed up to a steady-state temperature;
Electrical Design Current, or EDC: the maximum current the motherboard's voltage regulators can deliver in a peak condition.

By raising these limits, PBO gives PB2 more room to work with, which in turn lets the processor make fuller use of what the system can deliver.
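
The three limits can be pictured as a headroom check, sketched below in Python. Everything numeric here is hypothetical, and the rule itself is an assumption: the sketch takes the conservative reading that boosting continues only while all three limits still have room and stops as soon as any one is reached. The `Limits` class and function names are invented for this example and do not correspond to any AMD interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    ppt_w: float  # Package Power Tracking limit, watts
    tdc_a: float  # Thermal Design Current limit, amps (sustained)
    edc_a: float  # Electrical Design Current limit, amps (peak)


def headroom(limits: Limits, power_w: float,
             sustained_a: float, peak_a: float) -> dict:
    """Return the remaining margin against each of the three limits."""
    return {
        "PPT": limits.ppt_w - power_w,
        "TDC": limits.tdc_a - sustained_a,
        "EDC": limits.edc_a - peak_a,
    }


def may_boost(limits: Limits, power_w: float,
              sustained_a: float, peak_a: float) -> bool:
    """Assumed rule: keep boosting only while every limit still has margin."""
    margins = headroom(limits, power_w, sustained_a, peak_a)
    return all(margin > 0 for margin in margins.values())


# Hypothetical limits and telemetry, chosen only to exercise the logic.
stock = Limits(ppt_w=250.0, tdc_a=200.0, edc_a=300.0)
print(may_boost(stock, power_w=180.0, sustained_a=150.0, peak_a=220.0))
print(may_boost(stock, power_w=180.0, sustained_a=205.0, peak_a=220.0))
```

Raising any field of `Limits` in this model enlarges the region in which `may_boost` returns true, which is the sense in which PBO widens what PB2 can do.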

StoreMI

Along with the new Ryzen Threadripper 2 processors, users get access to the StoreMI software. It creates a tiered storage volume by combining DRAM, an SSD and an HDD into a single storage space. The software dynamically places data across up to 2 GB of DRAM, up to 256 GB of SSD (NVMe or SATA) and a spinning hard drive. This aims to give the best read and write performance when there is not enough room for everything on the fast storage device.

AMD initially offered this software as a $20 add-on for the Ryzen APU platform, and later made it free (with up to a 256 GB SSD) for users of Ryzen 2000-series processors. The offer now extends to Threadripper. AMD's own best-case figures show load times improving by 90%.

Feed me: Infinity Fabric needs more power

When the movement of data between cores and memory controllers changed from a ring topology to a mesh or to chiplets, communication between the cores became much more complex. Each core, or group of cores, now has to act as a router and determine the best path for the data when several “hops” are needed to reach the destination. As we saw with Intel's MoDe-X mesh at the Skylake-X launch, the design has to avoid contention to improve performance while keeping wires short to reduce power. In such systems the core-to-core fabric starts to consume a lot of power, sometimes more than the cores themselves.

To describe chip power, every consumer processor carries a nominal “TDP”, or thermal design power. Intel and AMD measure this value differently, based on different workloads and temperatures. Technically, TDP is the thermal power the cooler must be able to dissipate when the processor is fully loaded (and it is usually defined at the base frequency rather than at the all-core turbo frequency). Actual power consumption may be higher, depending on losses and on heat dissipated through the board, but for most purposes TDP and power consumption are treated as equal.

This means that the TDP ratings on modern processors, such as 65 W, 95 W, 105 W, 140 W, 180 W and now 250 W, should roughly indicate peak power consumption. However, not all of that power can go toward raising core frequency. Some of it is used by the memory controllers, by IO, and by integrated graphics (if present). The core-to-core interconnect has now become a major power consumer in its own right, and we want to know how much it draws.

To get a sense of scale, let's start with something simple and familiar to most users. Intel's latest Coffee Lake processors, such as the Core i7-8700K, use a ring bus design. A single ring connects all the cores and the memory controller: when data needs to move, it enters the ring and travels around until it reaches its destination. This interconnect is historically called the “uncore”, and it can run at a different frequency from the cores and scale its power as needed. The power distribution is as follows:

Despite its 95 W TDP, this processor at stock settings draws about 125 W at full load, well above the TDP (which is defined at the base frequency). What interests us, however, is the ratio of uncore power to total power. At light load the uncore accounts for only 4% of total power, rising to 7-9% as more cores are loaded. For simplicity, we will call this “a maximum of 10%.”

Now for something bigger: Intel's Skylake-X processors. In this design Intel uses its new mesh architecture, similar to MoDe-X, in which each group of processor elements has a small router that can pass a data packet on to a neighboring core or accept it itself, as needed.

This design allows the processor to scale, given that ring-based systems start to suffer from additional latency at around 14 cores. However, although the mesh runs at a lower frequency than the ring systems Intel used previously, it consumes much more power.

In this graph we see that uncore power on the mesh architecture starts at about 20% of total chip power and rises to 25-30% as more cores are loaded. As a result, between a quarter and a third of the chip's power goes to communication between the cores and memory.

For AMD, the situation is somewhat different. Within a single four-core complex, the connection between cores is relatively simple and uses a centralized crossbar, so core-to-core communication stays simple and lightweight. However, between two core complexes on the same die, or between a complex and the memory controller, the interconnect comes into play. This topology is not really a ring; it is based on an internal version of Infinity Fabric (IF).

IF is designed to scale across cores, dies and sockets. We can see what it does within a single die using the Ryzen 7 2700X, which has a TDP of 105 W.

The AMD part shows two very interesting data points. First, when the cores are lightly loaded, IF accounts for 43% of total processor power, compared with 4% for the i7-8700K and 19% for the i9-7980XE. With the chip fully loaded, however, that 43% falls to about 25%.

Second, and importantly, IF power changes relatively little as more cores are loaded, rising from ~17.6 W to ~25.7 W. On the big Intel chip, we saw it grow from ~13.8 W to more than 40 W in some cases.

The Ryzen Threadripper 2950X is the updated 16-core Threadripper, which uses a single link between its two silicon dies to carry data between core complexes.

In the diagram below, the red line represents IF power. Here, uncore power covers both the intra-die interconnect and the die-to-die interconnect.

As a percentage, Infinity Fabric accounts for 59% of total chip power when only two threads are loaded. Although both threads run on the same core (and in the same CCX), that CCX must have access to all of system memory, so both the die-to-die link and the intra-die interconnect are active.

Nevertheless, as more cores are loaded, IF power rises only slightly, from 34 W to 43 W, and its share gradually falls to about 25% of total chip power, similar to the 2700X.

Now we come to the 2990WX. Since all four silicon dies in the package are active, and each die needs an IF link to every other die, six IF links are required.
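
The count of six follows from linking every pair of dies: n dies need n(n-1)/2 links. A short Python check, with die labels that are purely illustrative:

```python
from itertools import combinations


def link_count(num_dies: int) -> int:
    """Number of links needed to connect every pair of dies directly."""
    return num_dies * (num_dies - 1) // 2


dies = ["die0", "die1", "die2", "die3"]  # illustrative labels
links = list(combinations(dies, 2))

for a, b in links:
    print(f"{a} <-> {b}")
print("links for 4 dies:", len(links))
print("links for 2 dies:", link_count(2))
```

With two active dies the same formula gives the single link described for the 2950X.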

In the diagram below, IF power is again shown in red. Note that two of the four silicon dies have no local DRAM. In theory, AMD could switch off the IF links to those dies when only a few cores are in use, although that would add latency through extra hops whenever the remaining IF links are congested. In practice, however, we see something odd.

First, consider power at low load. Here Infinity Fabric consumes 56.1 W out of a total of 76.7 W, which is 73% of processor power. Given that the single link on the 2950X consumes only 34 W, it is clear that the additional IF links are active here. There may be room for further power management in this area.

Looking at the graph, you will notice that our 2990WX sample never came close to its rated 250 W TDP, barely passing the 180 W mark at peak. We do not know why this happened. As the load on the cores increases, the IF share of power drops, gradually reaching 36%, and stays between 35% and 40% depending on the specific workload. That is, of course, more than the 25% seen on the 2700X and 2950X.

Since this is also our first review to include the EPYC 7601, how about taking a second data point from that processor? Built on the now-older first-generation Zen cores, EPYC has additional memory controllers and IO that also need power, and all of them fall under uncore power consumption.

Looking at the power numbers, as with the 2990WX, the curves become somewhat erratic once all the cores are loaded, and the uncore share fluctuates.

At low load, IF takes 66.2 W of the chip's 74.1 W total, a staggering 89%! As more cores become active, that 66.2 W rises to as much as 90 W at some points. The cores barely get 90 W of the 180 W TDP!
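
The light-load percentages quoted for the 2990WX and the EPYC 7601 can be reproduced from the wattages given in the text. A minimal Python sketch, where `uncore_share` and the sample labels are illustrative names:

```python
def uncore_share(uncore_w: float, total_w: float) -> float:
    """Return uncore (Infinity Fabric) power as a percentage of total chip power."""
    return 100.0 * uncore_w / total_w


# (IF power in watts, total chip power in watts) at light load, from the text.
samples = {
    "Threadripper 2990WX": (56.1, 76.7),
    "EPYC 7601": (66.2, 74.1),
}

for name, (uncore_w, total_w) in samples.items():
    share = uncore_share(uncore_w, total_w)
    print(f"{name}: {share:.1f}% of {total_w} W goes to the fabric, "
          f"leaving {total_w - uncore_w:.1f} W for everything else")
```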

This leads to an interesting question: if we are comparing the merits of one core against another on a purely academic basis, should uncore power be taken into account? For real-world analysis, undoubtedly yes, but for a purely academic one? Let me make a prediction:

After the battle over core counts, the next battle will be over the interconnect: low power, scalability and high performance. Process node scaling means nothing if the uncore accounts for 90% of total chip power.
