Secure storage with DRBD9 and Proxmox (Part 1: NFS)

Probably anyone who has ever gone looking for high-performance software-defined storage has sooner or later heard of DRBD, and maybe even worked with it.
True, at the peak of popularity of Ceph and GlusterFS, which work well enough in principle and, more importantly, right out of the box, everyone rather forgot about it. Moreover, the previous version did not support replication across more than two nodes, which often led to split-brain problems and obviously did not add to its popularity.
The solution is indeed not new, but it is quite competitive. With relatively low CPU and RAM costs, DRBD provides really fast and reliable synchronization at the block device level. All this time LINBIT, the developers of DRBD, have not stood still and keep refining it. Starting with DRBD9, it ceases to be just a network mirror and becomes something more.
First, the idea of creating one distributed block device for several servers has receded into the background; LINBIT now tries to provide orchestration and management tools for many drbd devices in a cluster, created on top of LVM and ZFS volumes.
For example, DRBD9 supports up to 32 replicas, RDMA and diskless nodes, and the new orchestration tools let you use snapshots, online migration and much more.
Although DRBD9 has integration tools for Proxmox, Kubernetes, OpenStack and OpenNebula, at the moment they are in a transitional state: the new tools are not yet supported everywhere, and the old ones will soon be declared deprecated. The tools in question are DRBDmanage and Linstor.
I will take advantage of this moment and, rather than go into each of them in detail, look more closely at the configuration and working principles of DRBD9 itself. You will have to deal with it anyway, if only because a fail-safe configuration of the Linstor controller implies installing it on one of these devices.
In this article I would like to tell you about DRBD9 and how it can be used in Proxmox without third-party plug-ins.
DRBDmanage and Linstor
First, it is worth mentioning DRBDmanage once again, which is very well integrated into Proxmox. LINBIT provides a ready-made DRBDmanage plugin for Proxmox that lets you use all of its functions directly from the Proxmox interface.
It looks really awesome, but unfortunately it has some downsides:
- The LVM volume group or ZFS pool must be named drbdpool.
- It is impossible to use more than one pool per node.
- Due to the specifics of the solution, the controller volume can only be located on regular LVM and nothing else.
- Periodic glitches in dbus, which DRBDmanage relies on heavily to interact with the nodes.
As a result, LINBIT decided to replace all the complex logic of DRBDmanage with a simple application that communicates with the nodes over an ordinary TCP connection and works without any magic. This is how Linstor appeared.
Linstor really works very well. Unfortunately, the developers chose Java as the main language for linstor-server, but don't let that frighten you, since Linstor itself only deals with distributing DRBD configs and carving out LVM / ZFS volumes on the nodes.
Both solutions are free and distributed under the GPL3 license.
You can read about each of them and about setting up the above-mentioned Proxmox plugin on the official Proxmox wiki.
Failsafe NFS Server
Unfortunately, at the time of writing Linstor has a ready-made integration only with Kubernetes. Drivers for the rest (Proxmox, OpenNebula and OpenStack) are expected by the end of the year.
So for now there is no ready-made solution, and we don't like the old one anyway. Let's try to use DRBD9 the old-fashioned way to organize NFS access to a shared partition.
This solution is not without its advantages either, because an NFS server lets you organize concurrent access to the storage file system from several servers without resorting to complex cluster file systems with DLM, such as OCFS and GFS2.
In this setup you will be able to switch the Primary / Secondary roles of the nodes simply by migrating the container with the NFS server in the Proxmox interface.
You can also store any files inside this file system, including virtual disks and backups.
If you use Kubernetes, you can provide ReadWriteMany access for your PersistentVolumes.
Proxmox and LXC containers
Now the question is: why Proxmox?
In principle, we could build such a scheme with Kubernetes or with a conventional cluster manager. But Proxmox provides a ready-made, feature-rich and at the same time simple and intuitive interface for almost everything you need. It supports clustering out of the box and a softdog-based fencing mechanism. And with LXC containers you can achieve minimal switchover times.
The resulting solution will not have a single point of failure.
In essence, we will use Proxmox primarily as a cluster manager, where a single LXC container can be regarded as a service running in a classic HA cluster, the only difference being that the service also comes bundled with its own root file system. That is, you do not need to install several instances of the service on each server separately; you do it only once, inside the container.
If you have ever worked with cluster manager software and provided HA for applications, you will understand what I mean.
General scheme
Our solution will resemble the standard replication scheme of some database.
- We have three nodes
- Each node has a distributed drbd device.
- On top of the device is an ordinary file system (ext4).
- Only one server can be the master.
- The master runs the NFS server in an LXC container.
- All nodes access the device strictly via NFS.
- If necessary, the master can move to another node, along with the NFS server.
DRBD9 has one very cool feature that makes everything much easier: the drbd device automatically becomes Primary when it is mounted on a node. If the device is marked as Primary, any attempt to mount it on another node will result in an access error. This provides locking and guaranteed protection against simultaneous access to the device.
Why does this make everything so much easier? Because when the container is started, Proxmox automatically mounts this device and it becomes Primary on that node, and when the container is stopped, Proxmox unmounts the device and it becomes Secondary again.
Thus we no longer need to worry about switching the devices between Primary and Secondary; Proxmox will do this automatically. Hurray!
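To observe this automatic role switching, it can be handy to ask which node currently holds the device as Primary. The sketch below is only an illustration: the helper name primary_holder is hypothetical, and it assumes the `drbdadm status` text layout in which the local node is on an unindented "<resource> role:<Role>" line and each peer is on an indented "<peer> role:<Role>" line.

```shell
#!/bin/sh
# Print where the resource is currently Primary:
#   "local" for this node, the peer name for a peer, nothing if no node is Primary.
# Reads `drbdadm status <resource>` text on stdin (assumed layout, see above).
primary_holder() {
    awk '
        $2 == "role:Primary" {
            # An unindented line describes the local node, an indented one a peer.
            if ($0 ~ /^[^ \t]/) print "local"; else print $1
        }
    '
}

# Demonstration on sample text; on a real node: drbdadm status nfs1 | primary_holder
primary_holder <<'EOF'
nfs1 role:Secondary
  disk:UpToDate
  pve2 role:Primary
    peer-disk:UpToDate
  pve3 role:Secondary
    peer-disk:UpToDate
EOF
# prints: pve2
```

Running this before and after starting the container on a node should show the holder moving to that node, assuming the status format matches.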
DRBD setup
Well, with the idea sorted out, let's move on to the implementation.
By default the Linux kernel ships with the eighth version of the drbd module. Unfortunately it does not suit us, so we need to install the ninth version of the module.
Connect the LINBIT repository and install everything you need:
wget -O- https://packages.linbit.com/package-signing-pubkey.asc | apt-key add -
echo "deb http://packages.linbit.com/proxmox/ proxmox-5 drbd-9.0" \
  > /etc/apt/sources.list.d/linbit.list
apt-get update && apt-get -y install pve-headers drbd-dkms drbd-utils drbdtop
- pve-headers: kernel headers needed to build the module
- drbd-dkms: kernel module in DKMS format
- drbd-utils: basic utilities for managing DRBD
- drbdtop: an interactive tool like top, but only for DRBD
After installing the module, check if everything is fine with it:
# modprobe drbd
# cat /proc/drbd
version: 9.0.14-1 (api:2/proto:86-113)
If you see the eighth version in the output, something went wrong and the in-tree kernel module has been loaded. Check dkms status to find out the reason.
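The same check can be scripted, for example when provisioning several nodes. This is a minimal sketch, assuming the version line has the "version: X.Y.Z" shape shown above; the function names drbd_major_version and drbd_is_v9 are hypothetical helpers, not part of drbd-utils.

```shell
#!/bin/sh
# Print the major version number found in /proc/drbd-style text on stdin.
# Assumes a line of the form "version: 9.0.14-1 (api:...)".
drbd_major_version() {
    sed -n 's/^version: \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Succeed (exit status 0) only if the text on stdin reports DRBD 9 or newer.
drbd_is_v9() {
    major=$(drbd_major_version)
    [ -n "$major" ] && [ "$major" -ge 9 ]
}

# Demonstration on the sample line from the article.
# On a real node: drbd_is_v9 < /proc/drbd || dkms status
printf 'version: 9.0.14-1 (api:2/proto:86-113)\n' | drbd_is_v9 && echo "DRBD 9 module loaded"
```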
Each node will have the same drbd device running on top of an ordinary partition. First we need to prepare this partition for drbd on each node.
The backing storage can be any block device: an LVM volume, a zvol, a disk partition or an entire disk. In this article I will use a separate NVMe disk with a partition for drbd: /dev/nvme1n1p1
Note that device names tend to change from time to time, so it is better to get into the habit of using a persistent symlink to the device.
You can find such a symlink for /dev/nvme1n1p1 like this:
# find /dev/disk/ -lname '*/nvme1n1p1'
/dev/disk/by-partuuid/847b9713-8c00-48a1-8dff-f84c328b9da2
/dev/disk/by-path/pci-0000:0e:00.0-nvme-1-part1
/dev/disk/by-id/nvme-eui.0000000001000000e4d25c33da9f4d01-part1
/dev/disk/by-id/nvme-INTEL_SSDPEKKA010T7_BTPY703505FB1P0H-part1
We describe our resource on all three nodes:
# cat /etc/drbd.d/nfs1.res
resource nfs1 {
  meta-disk internal;
  device    /dev/drbd100;
  protocol  C;
  net { 
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  on pve1 {
    address   192.168.2.11:7000;
    disk      /dev/disk/by-partuuid/95e7eabb-436e-4585-94ea-961ceac936f7;
    node-id   0;
  }
  on pve2 {
    address   192.168.2.12:7000;
    disk      /dev/disk/by-partuuid/aa7490c0-fe1a-4b1f-ba3f-0ddee07dfee3;
    node-id   1;
  }
  on pve3 {
    address   192.168.2.13:7000;
    disk      /dev/disk/by-partuuid/847b9713-8c00-48a1-8dff-f84c328b9da2;
    node-id   2;
  }
  connection-mesh {
    hosts pve1 pve2 pve3;
  }
}
It is advisable to use a separate network for drbd synchronization.
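Since the disk line in each "on" section points to that node's own persistent symlink, a small helper for looking the symlink up may save some copy-and-paste. The following is a sketch only: stable_link is a hypothetical name, and it assumes readlink is available and that the links point to a target ending in the kernel device name, as in the find output earlier.

```shell
#!/bin/sh
# Print the first symlink in directory $1 whose target ends in "/$2".
# Returns non-zero if no such symlink exists.
stable_link() {
    for link in "$1"/*; do
        [ -L "$link" ] || continue          # skip anything that is not a symlink
        case "$(readlink "$link")" in
            */"$2") printf '%s\n' "$link"; return 0 ;;
        esac
    done
    return 1
}

# Example: look up the by-partuuid path for the partition used in this article.
stable_link /dev/disk/by-partuuid nvme1n1p1 || echo "no symlink found for nvme1n1p1"
```

The printed path is what would go into the disk line for the local node; the other nodes need their own lookups because partition UUIDs differ per disk.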
Now create the metadata for drbd and launch it:
# drbdadm create-md nfs1
initializing activity log
initializing bitmap (320 KB) to all zero
Writing meta data...
New drbd meta data block successfully created.
success
# drbdadm up nfs1
Repeat these steps on all three nodes and check the status:
# drbdadm status
nfs1 role:Secondary
  disk:Inconsistent
  pve2 role:Secondary
    peer-disk:Inconsistent
  pve3 role:Secondary
    peer-disk:Inconsistent
Now our disk is Inconsistent on all three nodes, because drbd does not know which disk should be taken as the original. We need to mark one of them as Primary so that its state is synchronized to the other nodes:
drbdadm primary --force nfs1
drbdadm secondary nfs1
Immediately after this, synchronization will begin:
# drbdadm status
nfs1 role:Secondary
  disk:UpToDate
  pve2 role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:26.66
  pve3 role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:14.20
We do not have to wait for it to finish and can carry out the following steps in parallel. They can be performed on any node, regardless of the current state of its local disk in DRBD: all requests will automatically be redirected to a device in the UpToDate state.
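If you do want to keep an eye on the initial synchronization, the done: field of the status output can be extracted with a few lines of awk. This is a sketch that assumes the status layout shown above (an indented peer line followed by a line containing done:<percent>); sync_progress is a hypothetical helper name.

```shell
#!/bin/sh
# Print "<peer> <percent>" for every peer that reports a done: field.
# Reads `drbdadm status` text on stdin; prints nothing once no peer is syncing.
sync_progress() {
    awk '
        /^[ \t]+[^ \t]+ role:/ { peer = $1 }      # remember the current peer name
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^done:/) print peer, substr($i, 6)
        }
    '
}

# Demonstration on the status output from this article.
# On a real node: drbdadm status nfs1 | sync_progress
sync_progress <<'EOF'
nfs1 role:Secondary
  disk:UpToDate
  pve2 role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:26.66
  pve3 role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:14.20
EOF
# prints:
# pve2 26.66
# pve3 14.20
```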
Do not forget to enable autostart of the drbd service on the nodes:
systemctl enable drbd.service
Configuring an LXC Container
I will skip the configuration of the three-node Proxmox cluster; this part is well described in the official wiki.
As I said before, our NFS server will run in an LXC container. The container itself will be kept on the device /dev/drbd100 we just created.
First we need to create a file system on it:
mkfs -t ext4 -O mmp -E mmp_update_interval=5 /dev/drbd100
Proxmox enables multi-mount protection (MMP) at the file system level by default. In principle we could do without it, because DRBD has its own protection and will simply refuse a second Primary for the device, but extra caution will not hurt.
Now download the Ubuntu template:
# wget http://download.proxmox.com/images/system/ubuntu-16.04-standard_16.04-1_amd64.tar.gz -P /var/lib/vz/template/cache/
And create our container from it:
pct create 101 local:vztmpl/ubuntu-16.04-standard_16.04-1_amd64.tar.gz \
  --hostname=nfs1 \
  --net0=name=eth0,bridge=vmbr0,gw=192.168.1.1,ip=192.168.1.11/24 \
  --rootfs=volume=/dev/drbd100,shared=1
In this command we specify that the root file system of our container lives on the device /dev/drbd100, and add the parameter shared=1 to allow migration of the container between nodes.
If something goes wrong, you can always fix it through the Proxmox interface or in the container config /etc/pve/lxc/101.conf.
Proxmox will unpack the template and prepare the container root file system for us. After that we can start our container:
pct start 101
Configuring the NFS server
By default, Proxmox does not allow the launch of an NFS server in a container, but there are several ways to allow this.
One of them is simply to add lxc.apparmor.profile: unconfined to the configuration of our container, /etc/pve/lxc/101.conf.
Alternatively, we can enable NFS for all containers permanently. To do this, update the standard LXC template on all nodes by adding the following lines to /etc/apparmor.d/lxc/lxc-default-cgns:
  mount fstype=nfs,
  mount fstype=nfs4,
  mount fstype=nfsd,
  mount fstype=rpc_pipefs,
After the changes, restart the container:
pct shutdown 101
pct start 101
Now let's log in to it:
pct exec 101 bash
Install updates and the NFS server:
apt-get update 
apt-get -y upgrade
apt-get -y install nfs-kernel-server
Create an export:
echo '/data *(rw,no_root_squash,no_subtree_check)' >> /etc/exports
mkdir /data
exportfs -a
HA Setup
At the time of writing, the Proxmox HA manager has a bug that prevents an HA container from shutting down cleanly. As a result, kernel-space NFS server processes that were not fully killed keep the drbd device from going to Secondary. If you have already run into this situation, do not panic: simply execute the following command on the node where the container was running, after which the drbd device should be "released" and go to Secondary:
killall -9 nfsd
To fix this bug, execute the following commands on all nodes:
sed -i 's/forceStop => 1,/forceStop => 0,/' /usr/share/perl5/PVE/HA/Resources/PVECT.pm
systemctl restart pve-ha-lrm.service
Now we can move on to the HA manager configuration. Create a separate HA group for our device:
ha-manager groupadd nfs1 --nodes pve1,pve2,pve3 --nofailback=1 --restricted=1
Our resource will run only on the nodes specified for this group. Add our container to this group:
ha-manager add ct:101 --group=nfs1 --max_relocate=3 --max_restart=3
That's all. Simple, isn't it?
The resulting NFS share can immediately be connected to Proxmox for storing and running other virtual machines and containers.
Recommendations and tuning
DRBD
As I noted above, it is always advisable to use a separate network for replication. 10-gigabit network adapters are highly desirable, otherwise you will be limited by the speed of the ports.
If replication seems too slow, try tuning some of the DRBD parameters. Here is the config that, in my opinion, is optimal for my 10G network:
# cat /etc/drbd.d/global_common.conf
global {
 usage-count yes;
 udev-always-use-vnr; 
}
common {
 handlers {
 }
 startup {
 }
 options {
 }
 disk {
  c-fill-target 10M;
  c-max-rate   720M;
  c-plan-ahead   10;
  c-min-rate    20M;
 }
 net {
  max-buffers     36k;
  sndbuf-size   1024k;
  rcvbuf-size   2048k;
 }
}
More information about each parameter can be found in the official DRBD documentation.
NFS server
To speed up the NFS server, it may help to increase the total number of running NFS server instances. The default is 8; in my case, raising this number to 64 helped.
To do this, set RPCNFSDCOUNT=64 in /etc/default/nfs-kernel-server.
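If you prefer to script this change, a sketch along the following lines could be used. The helper name set_nfsd_count is hypothetical; it assumes a Debian-style defaults file with one VAR=value assignment per line and GNU sed (for the -i option).

```shell
#!/bin/sh
# Set RPCNFSDCOUNT in the defaults file $1 to the value $2.
# Replaces an existing assignment, or appends one if the variable is absent.
set_nfsd_count() {
    file=$1
    count=$2
    if grep -q '^RPCNFSDCOUNT=' "$file"; then
        sed -i "s/^RPCNFSDCOUNT=.*/RPCNFSDCOUNT=$count/" "$file"
    else
        printf 'RPCNFSDCOUNT=%s\n' "$count" >> "$file"
    fi
}

# Demonstration on a temporary copy rather than the real file.
demo=$(mktemp)
printf 'RPCNFSDCOUNT=8\nRPCNFSDPRIORITY=0\n' > "$demo"
set_nfsd_count "$demo" 64
cat "$demo"
rm -f "$demo"
```

Inside the NFS container the call would be set_nfsd_count /etc/default/nfs-kernel-server 64.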
And restart the daemons:
systemctl restart nfs-utils
systemctl restart nfs-server
NFSv3 vs NFSv4
Do you know the difference between NFSv3 and NFSv4 ?
- NFSv3 is a stateless protocol; as a rule, it tolerates failures better and recovers faster.
- NFSv4 is a stateful protocol; it works faster and can be bound to specific TCP ports, but because it keeps state it is more sensitive to failures. It also supports Kerberos authentication and a bunch of other interesting features.
However, when you run showmount -e nfs_server, the NFSv3 protocol is used. Proxmox also uses NFSv3. NFSv3 is also commonly used for booting machines over the network.
In general, if you have no particular reason to use NFSv4, try to use NFSv3: thanks to its lack of state, it copes with failures less painfully.
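Because a Linux client may negotiate NFSv4 when no version is requested explicitly, it can be useful to check which version an existing mount actually uses. A minimal sketch, assuming the /proc/mounts format in which NFS mounts carry a vers= entry among their options; nfs_mount_version is a hypothetical helper name.

```shell
#!/bin/sh
# Print the NFS protocol version of the mount at mount point $1.
# Reads /proc/mounts-style text on stdin.
nfs_mount_version() {
    awk -v mp="$1" '
        $2 == mp && $3 ~ /^nfs/ {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^vers=/) print substr(opts[i], 6)
        }
    '
}

# Demonstration on a sample line; on a real client: nfs_mount_version /mnt < /proc/mounts
printf 'nfs_server:/share /mnt nfs rw,relatime,vers=3,proto=tcp 0 0\n' | nfs_mount_version /mnt
# prints: 3
```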
To mount the share using NFSv3, specify the option -o vers=3 for the mount command:
mount -o vers=3 nfs_server:/share /mnt
If you wish, you can disable NFSv4 on the server altogether. To do this, add the option --no-nfs-version 4 to the RPCNFSDCOUNT variable and restart the server, for example:
RPCNFSDCOUNT="64 --no-nfs-version 4"
iSCSI and LVM
In a similar way, a regular tgt daemon can be configured inside the container. iSCSI will give considerably better I/O performance, and the container will run more smoothly because the tgt server works entirely in user space.
Typically, an exported LUN is sliced into many pieces using LVM. However, there are several nuances to take into account, for example how LVM locking is provided when an exported volume group is shared by several hosts.
Perhaps I will describe these and other nuances in the next article.