Inconvenient questions about RDMA architecture
We have accumulated a body of material on the architecture of Remote Direct Memory Access (RDMA). Assembling it made a number of points clearer, but the mechanisms behind some implementations remain matters of conjecture. Unfortunately, the problem of remote direct memory access is usually reduced to a simplified model of eliminating unnecessary copies. In the case of RDMA, however, we are dealing with something that gives cross-platform interaction a new quality, one that rests on such cornerstone concepts as InfiniBand and NUMA. (Hereinafter, by RDMA we mean RoCE, one of the current zero-copy implementations of direct memory access, carried over ordinary Ethernet transport.)
To understand the place of RDMA among the current problems of high-performance computing (and the related prospects for server design), we would like to find answers to the questions that have been left “behind the scenes” of the available documentation. (We mean, of course, the texts published by the leading manufacturers of RNIC controllers: Broadcom, Intel, and Mellanox.)
Classification of operations and performance
Today, two points remain unclear. The more important one is the memory access model. As is known, exchange over the network can be carried out using tagged requests (the Read and Write operations) and untagged requests (the Send operation). How does each model fare in terms of performance? And can the user, if not control the choice, then at least analyze the statistics of RDMA transactions broken down this way?
Open sources make it reasonable to assume that tagged requests let the requester control memory addressing on the remote platform. With their help, a single address space is formed from the physically separate local memories of the nodes that make up a cluster. Untagged requests appear to be oriented toward more traditional block data transfers.
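The difference between the two models can be sketched as a toy simulation. This is only an illustration under our own assumptions, not a real verbs API: the class `ToyResponder` and all of its method names are hypothetical. The one thing taken from the protocols themselves is the idea that a tagged request names its destination through a steering tag plus an offset (an STag in iWARP terms, an rkey with an address in InfiniBand/RoCE verbs), while an untagged Send lands in whatever receive buffer the remote application posted next.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyResponder:
    """Hypothetical stand-in for the remote side; not a real RDMA API."""
    regions: dict = field(default_factory=dict)       # steering tag -> registered buffer
    recv_queue: deque = field(default_factory=deque)  # buffers posted by the application

    def register(self, stag, size):
        """Expose a buffer that remote peers may address by tag and offset."""
        self.regions[stag] = bytearray(size)
        return self.regions[stag]

    def post_recv(self, size):
        """Queue a receive buffer for a future untagged Send."""
        buf = bytearray(size)
        self.recv_queue.append(buf)
        return buf

    def _check(self, stag, offset, length):
        buf = self.regions[stag]  # KeyError models an invalid tag
        if offset < 0 or offset + length > len(buf):
            raise ValueError("access outside the registered region")
        return buf

    def rdma_write(self, stag, offset, payload):
        """Tagged: the requester chooses where the data lands."""
        buf = self._check(stag, offset, len(payload))
        buf[offset:offset + len(payload)] = payload

    def rdma_read(self, stag, offset, length):
        """Tagged: the requester chooses which bytes to fetch."""
        buf = self._check(stag, offset, length)
        return bytes(buf[offset:offset + length])

    def send(self, payload):
        """Untagged: the data lands in the next posted receive buffer."""
        if not self.recv_queue:
            raise RuntimeError("receiver not ready: no receive buffer posted")
        buf = self.recv_queue.popleft()
        if len(payload) > len(buf):
            raise ValueError("payload larger than the posted receive buffer")
        buf[:len(payload)] = payload
        return buf


if __name__ == "__main__":
    peer = ToyResponder()
    region = peer.register(stag=0x10, size=16)
    peer.rdma_write(0x10, 4, b"abcd")   # requester picks offset 4
    posted = peer.post_recv(8)
    peer.send(b"hi")                    # responder's queue picks the buffer
    print(bytes(region), bytes(posted))
```

In the toy model the tagged path needs no cooperation from the remote application once the region is registered, whereas the untagged path fails unless a receive buffer was posted in advance. Whether this asymmetry translates into a measurable performance difference on real hardware is exactly the open question raised above.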
This is not idle curiosity: the results of a series of experiments indicate that the advantages and capabilities of the RDMA “black box” are, as expected, not unlimited. Massive data transfers become an overwhelming burden both for advanced technology and for the ever-living classics. See the Mellanox training video (note the fragment that starts at 29:04).
This is where deep understanding and fine tuning, the keys to efficiency, come to the fore. In this situation it is appropriate to ask whether performance criteria conflict with compatibility requirements. If they do, the effective use of RDMA technology will require a more radical software redesign, from low-level drivers to user applications.
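As for analyzing transaction statistics, on Linux the RDMA subsystem exposes per-port counters as plain text files under `/sys/class/infiniband/<device>/ports/<port>/`, in the `counters` and `hw_counters` directories. The sketch below snapshots whatever counters are present and reports the ones that changed between two snapshots. It assumes this sysfs layout and nothing more: which counters exist, and whether any of them separate Read/Write from Send traffic, depends on the driver and has to be checked on the actual system. The function names are our own.

```python
from pathlib import Path


def read_port_counters(sysfs_root="/sys/class/infiniband"):
    """Return {(device, port, group, counter): value} for every readable counter."""
    stats = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return stats  # no RDMA devices visible, or not a Linux system
    for dev in sorted(root.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        for port in sorted(ports_dir.iterdir()):
            for group in ("counters", "hw_counters"):
                group_dir = port / group
                if not group_dir.is_dir():
                    continue
                for entry in sorted(group_dir.iterdir()):
                    try:
                        value = int(entry.read_text().strip())
                    except (OSError, ValueError):
                        continue  # unreadable or non-numeric entry: skip it
                    stats[(dev.name, port.name, group, entry.name)] = value
    return stats


def delta(before, after):
    """Counters present in both snapshots whose value changed, with the difference."""
    return {
        key: after[key] - before[key]
        for key in after
        if key in before and after[key] != before[key]
    }


if __name__ == "__main__":
    import time

    first = read_port_counters()
    time.sleep(1.0)  # run the workload of interest during this interval
    for key, diff in sorted(delta(first, read_port_counters()).items()):
        print("/".join(key), diff)
```

Taking the difference of two snapshots around a test run shows which counters the workload actually moves. This is a coarse, port-level view and not a per-operation trace.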
Bus Master vs DMA Engine
Another perennial question: what exactly provides direct memory access on the local side? Recall that modern platforms have two models for performing operations of this kind:
- The decentralized model implies that the RNIC, operating in Bus Master mode, interacts with RAM on its own, performing read and write operations; in effect, the DMA controller is part of the RNIC.
- The centralized model uses the host's DMA Engine, which is part of Intel Xeon processors. It is a kind of I/O processor that provides hardware support for moving large amounts of data quickly, with elements of intelligent processing that are in demand for a number of devices, for example disk RAID arrays and network controllers. This unit replaced the “ancient” Intel 8237 DMA controller, whose architecture was developed even before the ISA bus appeared.
Although the DMA Engine is physically located on the CPU die, using it frees the CPU core, or rather cores, from routine I/O.
The architects of the RDMA protocol encapsulated the low-level aspects so effectively that we could not reach an unambiguous conclusion about which method (decentralized or centralized) the developers preferred. There are, however, strong arguments today in favor of Bus Master, because the sequence of memory access operations is driven by information arriving over the network. Still, the answer remains to be found: it is reasonable to assume that both options are supported and that everything depends on the platform configuration...
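The platform configuration can at least be inspected from user space on Linux. The sketch below makes two checks, both under stated assumptions. First, it reads the Bus Master Enable bit (bit 2 of the 16-bit Command register at offset 0x04 of the PCI configuration header) from a device's `config` file in sysfs. Second, it lists the channels that the kernel's dmaengine framework registers under `/sys/class/dma`. The function names are our own, and the PCI address in the example is a placeholder. Neither check proves which path a given RNIC uses for RDMA traffic: a set bit only shows that the device is allowed to initiate memory transactions itself, and a listed channel only shows that a host DMA engine is available.

```python
import struct
from pathlib import Path

PCI_COMMAND_OFFSET = 0x04    # 16-bit Command register in the PCI configuration header
PCI_COMMAND_MASTER = 0x0004  # bit 2: Bus Master Enable


def bus_master_enabled(config_space: bytes) -> bool:
    """Decode the Bus Master Enable bit from a raw configuration space dump."""
    if len(config_space) < PCI_COMMAND_OFFSET + 2:
        raise ValueError("configuration space dump is too short")
    (command,) = struct.unpack_from("<H", config_space, PCI_COMMAND_OFFSET)
    return bool(command & PCI_COMMAND_MASTER)


def device_is_bus_master(bdf: str, sysfs_root="/sys/bus/pci/devices") -> bool:
    """Check a device by PCI address, e.g. '0000:3b:00.0' (placeholder value)."""
    return bus_master_enabled((Path(sysfs_root) / bdf / "config").read_bytes())


def dma_engine_channels(sysfs_root="/sys/class/dma"):
    """Names of dmaengine channels registered by the kernel, if any."""
    root = Path(sysfs_root)
    return sorted(p.name for p in root.iterdir()) if root.is_dir() else []


if __name__ == "__main__":
    print("dmaengine channels:", dma_engine_channels())
```

The decoding is kept separate from the sysfs access so that it can be applied to a configuration dump obtained by any other means.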