Growing disk capacity without steroids. A review of the Western Digital Ultrastar Data102 102-disk shelf and storage configurations built on it



    What makes big JBODs good?


    Western Digital's new JBOD with 102 12 TB disks turned out to be a powerful piece of hardware. Its design builds on the experience of two previous generations of 60-disk shelves.
    The Data102 is a rarity among giants of this class: it is well balanced in both capacity and performance.

    Why do we need such large disk enclosures when hyperconverged systems keep growing in popularity?

    In workloads where storage capacity requirements far exceed compute requirements, the customer's budget can balloon to incredible sizes. Here are just a few examples and scenarios:

    1. Replication factor 2 or 3, typical for scale-out systems, is anything but cheap once you are storing several petabytes of data.
    2. Intensive sequential read/write operations force a cluster node to reach beyond its local storage, which can cause problems such as long-tail latency. In this case, you have to be extremely careful about building the network.
    3. Distributed systems handle workloads like "many applications, each working with many of its own files" very well, but are mediocre at reads and writes from a tightly coupled cluster, especially in N-to-1 mode.
    4. For tasks like "double the depth of the video archive", it is much cheaper to add a large JBOD than to double the number of servers in the cluster.
    5. With external storage systems built on JBODs, we can clearly isolate capacity and performance for our priority applications by reserving specific disks, cache and ports for them, while keeping the necessary flexibility and scalability.

    As a rule, disk shelves of the Data102 class are developed by the disk manufacturers themselves, who know these disks inside out and are aware of all the pitfalls. In such devices vibration and cooling are well under control, and power consumption matches the real needs of data storage.

    What makes Western Digital's JBOD good?


    We are well aware that modular systems are limited in scalability by the capabilities of their controllers and that the network always adds latency. But at the same time, such systems offer a lower cost per IOps, per GBps and per TB of storage.

    There are two things that RAIDIX engineers love Data102 for:

    1. The JBOD doesn't just let you pack >1 PB of data into 4U. It is genuinely fast, and in streaming operations it is not inferior to many all-flash solutions: 4U, 1 PB and 23 GB/s is a good result for a disk array.
    2. The Data102 is easy to maintain and requires no tools, not even a screwdriver.

    Our test team hates screwdrivers; they already dream about them at night. When they heard that HGST/WD was making a 102-disk monster and pictured dealing with 408 little screws, the hard liquor in the nearest store nearly ran out.

    Their fears were unfounded. Looking out for the engineers, Western Digital came up with a new way of mounting the drives that makes maintenance easier. Disks are attached to the chassis with retaining clips, without bolts or screws. All disks are mechanically isolated by elastic mounts on the rear panel. New servo firmware and accelerometers compensate for vibration very well.

    What's in the box?


    Inside the box is the enclosure itself, populated with disks. The minimum purchase is 24 disks, and the solution scales in sets of 12 disks. This is done to ensure proper cooling and to fight vibration as effectively as possible.

    By the way, it was the development of two supporting technologies, IsoVibe and ArcticFlow, that made the birth of this new JBOD possible.

    IsoVibe consists of the following components:

    1. Specialized disk firmware that uses sensor data to control the servos and predictively reduce the level of vibration.
    2. Vibration-isolated connectors on the rear of the chassis (Fig. 1).
    3. And, of course, the special disk mounts that do not require screws.


    Fig. 1. Vibration-isolated connectors

    Temperature is the second factor after vibration that kills hard drives. At an average operating temperature above 55°C, the mean time between failures (MTBF) of a hard drive drops to half its rated value.
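    To put that in perspective, here is a small back-of-the-envelope sketch in Python of what halving the MTBF means for a fully populated 102-disk shelf. The 2.5-million-hour rated MTBF below is an assumed figure for illustration only, not a quoted specification of these particular drives.

        # Rough estimate of expected annual drive failures in a 102-disk shelf.
        # The rated MTBF value is an illustrative assumption, not a vendor spec.

        HOURS_PER_YEAR = 8760
        DISKS = 102
        RATED_MTBF_HOURS = 2_500_000  # assumed rated MTBF

        def failures_per_year(mtbf_hours: float, disks: int) -> float:
            """Expected failures per year assuming a constant failure rate."""
            afr = HOURS_PER_YEAR / mtbf_hours  # annualized failure rate per drive
            return afr * disks

        cool = failures_per_year(RATED_MTBF_HOURS, DISKS)
        hot = failures_per_year(RATED_MTBF_HOURS / 2, DISKS)  # above 55°C: MTBF halved

        print(f"Expected failures/year at rated MTBF: {cool:.2f}")
        print(f"Expected failures/year with MTBF halved: {hot:.2f}")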

    Servers with a large number of disks and big disk shelves suffer from poor cooling the most. Often the rear rows of disks run 20 or more degrees hotter than the disks located near the cold aisle.

    ArcticFlow is Western Digital's patented shelf cooling technology. Its idea is to create additional air ducts inside the chassis that let cold air reach the rear rows of disks directly from the cold aisle, bypassing the front rows.


    Fig. 2. How ArcticFlow works

    A separate stream of cold air cools the I/O modules and power supplies.

    The result is an excellent thermal map of a working shelf. The temperature spread between the front and rear rows of disks is only 10 degrees. The hottest disk runs at 49°C with a cold-aisle temperature of +35°C. Only 1.6 W is spent on cooling each drive, half as much as in other similar chassis. The fans are quieter, vibration is lower, and the drives live longer and run faster.


    Fig. 3. Thermal map of the Ultrastar Data102

    Given the power budget of 12 W per drive slot, the shelf can easily be made hybrid: 24 of the 102 drives can be SAS SSDs. They can be installed and used in hybrid mode, or you can configure SAS zoning and hand them over to a host that needs all-flash.

    The box also contains rails for rack mounting. Installing the JBOD takes a couple of physically strong engineers. Here is what they will face:

    • The shelf weighs 120 kg fully assembled, and 32 kg without disks
    • A rack deep enough for it starts at 1200 mm
    • And then add the SAS and power cables

    The JBOD mounts and cabling are designed so that it can be serviced hot. Also note the vertical installation of the I/O modules (IOM).

    Let's take a look at this system. At the front, everything is simple and concise.


    Fig. 4. Ultrastar Data 102. Front view

    One of the most interesting features of this JBOD is that the I/O modules are installed from the top!


    Fig. 5. Ultrastar Data 102


    Fig. 6. Ultrastar Data 102. Top view


    Fig. 7. Ultrastar Data 102. Top view without drives

    At the rear of the JBOD, each I/O module has 6 SAS 12G ports, which gives a total of 28,800 MBps of backend bandwidth. The ports can be used both for connecting to hosts and, in part, for cascading shelves. There are also two ports for powering the system (80+ Platinum rated 1600W CRPS).
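    As a quick sanity check of that backend figure, here is a short Python sketch of the arithmetic. The assumption is that the quoted total counts the 6 wide ports of a single I/O module, with each SAS 12G wide port carrying 4 lanes at roughly 1,200 MB/s of usable bandwidth per lane.

        # Back-of-the-envelope check of the quoted 28,800 MBps backend bandwidth.
        # Assumption: the figure refers to the 6 wide ports of one I/O module.

        LANES_PER_WIDE_PORT = 4   # an HD mini-SAS wide port carries 4 lanes
        MBPS_PER_LANE = 1200      # rough usable throughput of one 12 Gb/s SAS lane
        PORTS_PER_IOM = 6

        backend_mbps = PORTS_PER_IOM * LANES_PER_WIDE_PORT * MBPS_PER_LANE
        print(f"Backend bandwidth per IOM: {backend_mbps} MBps")  # -> 28800 MBps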


    Fig. 8. Ultrastar Data 102. Rear view

    Performance


    As we said, the Data102 is not just huge, it is fast! The vendor's test results are as follows:

    On 12 servers:
    Sequential load
    • Read = 24.2 GB/s max. @ 1 MB (237 MB/s per HDD max.)
    • Write = 23.9 GB/s max. @ 1 MB (234 MB/s per HDD max.)

    Random load
    • 4 KB read at queue depth = 128: >26k IOps
    • 4 KB write at queue depth = 1–128: >45k IOps

    On 6 servers:
    Sequential load
    • Read = 22.7 GB/s max. @ 1 MB (223 MB/s per HDD max.)
    • Write = 22.0 GB/s max. @ 1 MB (216 MB/s per HDD max.)

    Random load
    • 4 KB read at queue depth = 128: >26k IOps
    • 4 KB write at queue depth = 1–128: >45k IOps


    Fig. 9. Parallel load from 12 servers


    Fig. 10. Parallel load from 6 servers
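    If you want to generate a similar sequential load yourself, a minimal sketch along the lines below could serve as a starting point; it drives 1 MB sequential reads against raw JBOD disks with fio, launched from Python. The device names, queue depth and runtime are placeholders for illustration, and the sketch assumes fio is installed and that the listed block devices really are JBOD members, since it reads from them directly.

        import subprocess

        # Illustrative sketch: 1 MB sequential reads across several JBOD disks,
        # similar in spirit to the vendor's sequential test.
        # The device names are placeholders; verify them before running anything
        # against raw disks.
        devices = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]

        cmd = [
            "fio",
            "--name=seq-read",
            "--rw=read",               # sequential read
            "--bs=1M",                 # 1 MB blocks, as in the vendor test
            "--direct=1",              # bypass the page cache
            "--ioengine=libaio",
            "--iodepth=16",
            "--runtime=60",
            "--time_based",
            "--group_reporting",
            "--filename=" + ":".join(devices),  # fio accepts colon-separated devices
        ]

        subprocess.run(cmd, check=True)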

    Management


    There are two ways to manage the JBOD from the software side:

    1. Via SES
    2. Via Redfish

    Redfish lets you locate components by lighting their LEDs, read component "health" information, and update firmware.
    By the way, the chassis supports T10 Power Disable (Pin 3) for powering off and resetting individual drives.

    This comes in handy when a drive hangs the entire SAS bus.
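    To give a feel for what managing the shelf over Redfish might look like, here is a minimal Python sketch that lists chassis health and blinks the locate LED. The address and credentials are placeholders, and the exact resource paths and writable properties depend on the enclosure firmware, so treat this as an illustration of the standard Redfish chassis schema rather than a verified recipe for the Data102.

        import requests

        # Placeholders: replace with the enclosure's management address and credentials.
        BASE = "https://192.0.2.10"

        session = requests.Session()
        session.auth = ("admin", "password")
        session.verify = False  # many enclosures ship with self-signed certificates

        # Enumerate chassis resources and print their reported health.
        chassis_list = session.get(BASE + "/redfish/v1/Chassis").json()
        for member in chassis_list.get("Members", []):
            chassis = session.get(BASE + member["@odata.id"]).json()
            name = chassis.get("Name")
            health = chassis.get("Status", {}).get("Health")
            print(f"{name}: health = {health}")

            # Blink the locate LED, if the firmware exposes the standard property.
            session.patch(BASE + member["@odata.id"], json={"IndicatorLED": "Blinking"})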

    Typical configurations


    To take full advantage of such a JBOD, we need RAID controllers or software. This is where RAIDIX comes to the rescue.

    To build fault-tolerant storage, we need two storage nodes and one or more enclosures with SAS disks. If we do not need node failover, or we rely on data replication instead, we can connect a single server to the enclosure and use SATA disks.

    Dual controller configuration


    Almost any x86 server platform can be used as a controller for RAIDIX-based storage systems: Supermicro, AIC, Dell, Lenovo, HPE and many others. We are constantly certifying new hardware and porting our code to other architectures (for example, Elbrus and OpenPOWER).

    As an example, let's take a Supermicro platform and try to achieve the highest possible throughput and compute density. When sizing the servers, the key resource is the PCIe bus, which hosts the back-end and front-end controllers.

    We also need controllers to connect the disk shelf: at least two AVAGO 9300-8e. Alternatives: a pair of 9400-8e, or a single 9405W-16e, but the latter requires a full x16 slot.

    The next component is the slot for the cache synchronization channel. It can be InfiniBand or SAS. (For tasks where bandwidth and latency are not critical, synchronization through the enclosure itself is enough and no dedicated slot is needed.)

    And, of course, we need slots for host interfaces, of which there should also be at least two.

    In total, each controller needs at least 5 x8 slots (with no headroom for further scaling). For low-cost systems targeting 3–4 GB/s per node, two slots are enough.
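    To see why five x8 slots is a sensible baseline, here is a small Python sketch of the slot and bandwidth budget for one controller. The per-lane PCIe 3.0 throughput and the split of slots by role are assumptions made for illustration.

        # Rough PCIe slot/bandwidth budget for one RAIDIX controller.
        # Assumption: ~0.985 GB/s of usable bandwidth per PCIe 3.0 lane.

        PCIE3_GBPS_PER_LANE = 0.985

        # Slot role -> number of x8 slots assumed for that role
        slots = {
            "back-end HBAs (JBOD)": 2,
            "cache sync (IB or SAS)": 1,
            "front-end (host interfaces)": 2,
        }

        total_slots = sum(slots.values())
        total_bw = total_slots * 8 * PCIE3_GBPS_PER_LANE

        print(f"x8 slots needed: {total_slots}")             # -> 5
        print(f"Aggregate PCIe bandwidth: {total_bw:.0f} GB/s")
        for role, count in slots.items():
            bw = count * 8 * PCIE3_GBPS_PER_LANE
            print(f"  {role}: {count} slot(s), {bw:.1f} GB/s")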

    Controller Configuration Options


    Supermicro 6029P-TRT
    The controllers can be hosted on two 2U 6029P-TRT servers. They are not the richest in PCIe slots, but they have a standard motherboard without risers, and Micron NVDIMM-N modules are guaranteed to work on these boards, protecting the cache from power loss.

    To connect the drives we take the Broadcom 9400-8e. Dirty cache segments will be synchronized over 100 Gb InfiniBand.

    Attention! The configurations below are designed to squeeze maximum performance and functionality out of every available option. For your particular task the specification can be trimmed down considerably. Contact our partners.

    The configuration of the system we ended up with:

    No. | Name | Description | P/N | Qty per RAIDIX DC
    1 | Platform | SuperServer 6029P-TRT | SYS-6029P-TRT | 2
    2 | CPU | Intel Xeon Silver 4112 Processor | Intel Xeon Silver 4112 Processor | 4
    3 | Memory | 16GB PC4-21300 2666MHz DDR4 ECC Registered DIMM Micron MTA36ASF472PZ-2G6D1 | MEM-DR416L-CL06-ER26 | 12
    4 | System disk | SanDisk Extreme PRO 240GB | SDSSDXPS-240G-G25 | 4
    5 | Hot-swap 3.5" to 2.5" SATA/SAS drive trays | Tool-less black hot-swap 3.5-to-2.5 converter HDD drive tray (Red tab) | MCP-220-00118-0B | 4
    6 | HBA for cache sync | Mellanox ConnectX-4 VPI adapter card, EDR IB (100Gb/s), dual-port QSFP28, PCIe3.0 x16 | MCX456A-ECAT | 2
    7 | HBA for JBOD connection | Broadcom HBA 9400-8e Tri-Mode Storage Adapter | 05-50013-01 | 4
    8 | Ethernet patch cord | Ethernet patch cord for cache sync 0.5m | | 1
    9 | Cable for cache sync | Mellanox passive copper cable, VPI, EDR 1m | MCP1600-E001 | 2
    10 | HBA for host connection | Mellanox ConnectX-4 VPI adapter card, EDR IB (100Gb/s), dual-port QSFP28, PCIe3.0 x16 | MCX456A-ECAT | 2
    11 | SAS cable | Storage Enclosure Ultrastar Data102 Cable IO HD mini-SAS to HD mini-SAS 2m 2Pack | | 8
    12 | JBOD | Ultrastar Data102 | | 1
    13 | RAIDIX | RAIDIX 4.6 DC/NAS/iSCSI/FC/SAS/IB/SSD-cache/QoSmic/SanOpt/Extended 5 years support/unlimited disks | RX46DSMMC-NALL-SQ0S-P5 | 1

    Here is an example diagram:


    Fig. 11. Configuration based on Supermicro 6029P-TRT

    Supermicro 2029BT-DNR
    If we want to save space in the server room, Supermicro Twin systems, for example the 2029BT-DNR, can serve as the basis for the storage controllers. These systems have 3 PCIe slots and one IOM module each, and among the available IOM options there is the InfiniBand we need.

    Configuration:

    No. | Name | Description | P/N | Qty per RAIDIX DC
    1 | Platform | SuperServer 2029BT-DNR | SYS-2029BT-DNR | 1
    2 | CPU | Intel Xeon Silver 4112 Processor | Intel Xeon Silver 4112 Processor | 4
    3 | Memory | 16GB PC4-21300 2666MHz DDR4 ECC Registered DIMM Micron MTA36ASF472PZ-2G6D1 | MEM-DR416L-CL06-ER26 | 12
    4 | System disk | Supermicro SSD-DM032-PHI | SSD-DM032-PHI | 2
    5 | HBA for cache sync | Mellanox ConnectX-4 VPI adapter card, EDR IB (100Gb/s), dual-port QSFP28, PCIe3.0 x16 | MCX456A-ECAT | 2
    6 | HBA for JBOD connection | Broadcom HBA 9405W-16e Tri-Mode Storage Adapter | 05-50044-00 | 2
    7 | Ethernet patch cord | Ethernet patch cord for cache sync 0.5m | | 1
    8 | Cable for cache sync | Mellanox passive copper cable, VPI, EDR 1m | MCP1600-E001 | 2
    9 | HBA for host connection | Mellanox ConnectX-4 VPI adapter card, EDR IB (100Gb/s), dual-port QSFP28, PCIe3.0 x16 | MCX456A-ECAT | 2
    10 | SAS cable | Storage Enclosure Ultrastar Data102 Cable IO HD mini-SAS to HD mini-SAS 2m 2Pack | | 8
    11 | JBOD | Ultrastar Data102 | | 1
    12 | RAIDIX | RAIDIX 4.6 DC/NAS/iSCSI/FC/SAS/IB/SSD-cache/QoSmic/SanOpt/Extended 5 years support/unlimited disks | RX46DSMMC-NALL-SQ0S-P5 | 1

    Here is an example diagram:


    Fig. 12. Configuration based on Supermicro 2029BT-DNR

    Platform 1U
    Often there are tasks that demand maximum density for large volumes of data but do not require, for example, full controller fault tolerance. In that case we take a 1U system as the basis and connect the maximum number of disk shelves to it.

    Scale-Out System


    As the final exercise of our workout, let's build a horizontally scalable system based on HyperFS. To begin with, we pick two types of controllers: one for storing data and one for storing metadata.

    The storage controllers are SuperMicro 6029P-TRT.

    To store the metadata we use several SSDs in the enclosure, which we combine into a RAID and present to the MDC over the SAN. A single storage system can have up to 4 JBODs connected in cascade. In total, one deep rack holds X PB of data under a single namespace.
    No. | Name | Description | P/N | Qty per RAIDIX DC
    1 | Platform | SuperServer 6029P-TRT | SYS-6029P-TRT | 2
    2 | CPU | Intel Xeon Silver 4112 Processor | Intel Xeon Silver 4112 Processor | 4
    3 | Memory | 16GB PC4-21300 2666MHz DDR4 ECC Registered DIMM Micron MTA36ASF472PZ-2G6D1 | MEM-DR416L-CL06-ER26 | 16
    4 | System disk | SanDisk Extreme PRO 240GB | SDSSDXPS-240G-G25 | 4
    5 | Hot-swap 3.5" to 2.5" SATA/SAS drive trays | Tool-less black hot-swap 3.5-to-2.5 converter HDD drive tray (Red tab) | MCP-220-00118-0B | 4
    6 | HBA for cache sync | Mellanox ConnectX-4 VPI adapter card, EDR IB (100Gb/s), dual-port QSFP28, PCIe3.0 x16 | MCX456A-ECAT | 2
    7 | HBA for JBOD connection | Broadcom HBA 9400-8e Tri-Mode Storage Adapter | 05-50013-01 | 4
    8 | Ethernet patch cord | Ethernet patch cord for cache sync 0.5m | | 1
    9 | Cable for cache sync | Mellanox passive copper cable, VPI, EDR 1m | MCP1600-E001 | 2
    10 | HBA for host connection | Mellanox ConnectX-4 VPI adapter card, EDR IB (100Gb/s), dual-port QSFP28, PCIe3.0 x16 | MCX456A-ECAT | 2
    11 | SAS cable | Storage Enclosure Ultrastar Data102 Cable IO HD mini-SAS to HD mini-SAS 2m 2Pack | | 8
    12 | JBOD | Ultrastar Data102 | | 1
    13 | RAIDIX | RAIDIX 4.6 DC/NAS/iSCSI/FC/SAS/IB/SSD-cache/QoSmic/SanOpt/Extended 5 years support/unlimited disks | RX46DSMMC-NALL-SQ0S-P5 | 1
    14 | Platform (MDC HyperFS) | SuperServer 6028R-E1CR12L | SSG-6028R-E1CR12L | 1
    15 | CPU (MDC HyperFS) | Intel Xeon E5-2620v4 Processor | Intel Xeon E5-2620v4 Processor | 2
    16 | Memory (MDC HyperFS) | 32GB DDR4 Crucial CT32G4RFD424A 32Gb DIMM ECC Reg PC4-19200 CL17 2400MHz | CT32G4RFD424A | 4
    17 | System disk (MDC HyperFS) | SanDisk Extreme PRO 240GB | SDSSDXPS-240G-G25 | 2
    18 | Hot-swap 3.5" to 2.5" SATA/SAS drive trays (MDC HyperFS) | Tool-less black hot-swap 3.5-to-2.5 converter HDD drive tray (Red tab) | MCP-220-00118-0B | 2
    19 | HBA (MDC HyperFS) | Mellanox ConnectX-4 VPI adapter card, EDR IB (100Gb/s), dual-port QSFP28, PCIe3.0 x16 | MCX456A-ECAT | 1

    Here is an example connection diagram:


    Fig. 13. Scale-Out System Configuration

    Conclusion


    Working with large amounts of data, especially under write-intensive patterns, is a very hard task for storage systems, and the classic answer to it is buying shared-nothing scale-out systems. Western Digital's new JBOD together with RAIDIX software lets you build storage with several petabytes of capacity and several tens of GBps of performance far cheaper than with horizontally scalable systems, and we recommend taking a close look at this solution.

    UPD


    Added a system specification with Micron NVDIMM-N:
    No. | Name | Description | P/N | Qty per RAIDIX DC
    1 | Platform | SuperServer 6029P-TRT | SYS-6029P-TRT | 2
    2 | CPU | Intel Xeon Silver 4112 Processor | Intel Xeon Silver 4112 Processor | 4
    3 | Memory | 16GB PC4-21300 2666MHz DDR4 ECC Registered DIMM Micron MTA36ASF472PZ-2G6D1 | MEM-DR416L-CL06-ER26 | 12
    4 | NVRAM | 16GB (x72, ECC, SR) 288-Pin DDR4 Nonvolatile RDIMM MTA18ASF2G72PF1Z | MTA18ASF2G72PF1Z-2G6V21AB | 4
    5 | System disk | SanDisk Extreme PRO 240GB | SDSSDXPS-240G-G25 | 4
    6 | Hot-swap 3.5" to 2.5" SATA/SAS drive trays | Tool-less black hot-swap 3.5-to-2.5 converter HDD drive tray (Red tab) | MCP-220-00118-0B | 4
    7 | HBA for cache sync | Mellanox ConnectX-4 VPI adapter card, EDR IB (100Gb/s), dual-port QSFP28, PCIe3.0 x16 | MCX456A-ECAT | 2
    8 | HBA for JBOD connection | Broadcom HBA 9400-8e Tri-Mode Storage Adapter | 05-50013-01 | 4
    9 | Ethernet patch cord | Ethernet patch cord for cache sync 0.5m | | 1
    10 | Cable for cache sync | Mellanox passive copper cable, VPI, EDR 1m | MCP1600-E001 | 2
    11 | HBA for host connection | Mellanox ConnectX-4 VPI adapter card, EDR IB (100Gb/s), dual-port QSFP28, PCIe3.0 x16 | MCX456A-ECAT | 2
    12 | SAS cable | Storage Enclosure Ultrastar Data102 Cable IO HD mini-SAS to HD mini-SAS 2m 2Pack | | 8
    13 | JBOD | Ultrastar Data102 | | 1
    14 | RAIDIX | RAIDIX 4.6 DC/NAS/iSCSI/FC/SAS/IB/SSD-cache/QoSmic/SanOpt/Extended 5 years support/unlimited disks | RX46DSMMC-NALL-SQ0S-P5 | 1
