Secure storage with DRBD9 and Proxmox (Part 2: iSCSI + LVM)
In a previous article, I looked at the possibility of building a fault-tolerant NFS server with DRBD and Proxmox. It turned out pretty well, but we will not stop there: now we will try to squeeze all the juice out of our storage.
In this article I will show how to create a fault-tolerant iSCSI target in a similar way, which we will then slice into small pieces with LVM and use for virtual machines.
This approach reduces the load and increases the speed of data access several times; it is especially beneficial when concurrent access to the data is not required, for example when you need to organize storage for virtual machines.
A few words about DRBD
DRBD is a fairly simple and mature solution; the code of the eighth version was accepted into the Linux kernel. In essence it is a network mirror, RAID1 over the network. The ninth version adds support for quorum and replication across more than two nodes.
In fact, it allows you to combine block devices on several physical nodes into a single shared network device.
Using DRBD you can achieve very interesting configurations. Today we will talk about iSCSI and LVM.
You can learn more about it in my previous article, where I described this solution in detail.
A couple of words about iSCSI
iSCSI is a protocol for delivering a block device over a network.
Unlike NBD, it supports authentication, handles network failures without problems, offers many other useful features, and, most importantly, shows very good performance.
There are a huge number of implementations; some of them are included in the kernel and require no special effort to configure and connect.
A couple of words about LVM
It is worth mentioning that LINBIT has its own solution for Proxmox; it should work out of the box and give a similar result. However, in this article I do not want to focus on Proxmox alone, but rather describe a more universal solution that suits both Proxmox and anything else. In this example Proxmox is used only as a means of container orchestration; in fact, you could replace it with another solution, for example, run the containers with the target in Kubernetes.
As for Proxmox specifically, it works fine with shared LUN and LVM, using only its own standard drivers.
The advantage of LVM is that using it is nothing revolutionary new or insufficiently battle-tested; on the contrary, it offers dry stability, which is usually what you want from storage. It is worth mentioning that LVM is quite actively used in other environments, for example in OpenNebula or Kubernetes, and is supported quite well there.
Thus you get universal storage that can be used in different systems (not only in Proxmox), using only ready-made drivers and without much need for manual tinkering.
Unfortunately, when choosing a storage solution you always have to make compromises. This solution will not give you the same flexibility as, for example, Ceph.
The virtual disk size is limited by the size of the LVM group, and the area allocated for a specific virtual disk is necessarily preallocated; this greatly improves the speed of data access, but rules out thin provisioning (where a virtual disk occupies less physical space than its nominal size). It is also worth mentioning that LVM performance sags quite a lot when snapshots are used, so the option of using them freely is often excluded.
Yes, LVM supports thin-provisioning pools, which do not have this drawback, but unfortunately they can only be used within a single node; there is no way to share one thin pool between several nodes in a cluster.
But despite these shortcomings, thanks to its simplicity LVM still does not let competitors bypass it and push it off the battlefield entirely.
With a fairly small overhead, LVM remains a very fast, stable and reasonably flexible solution.
General scheme
- We have three nodes.
- Each node has a distributed drbd device.
- On top of the drbd device, an LXC container with the iSCSI target is running.
- The target is connected to all three nodes.
- An LVM group is created on the connected target.
- If necessary, the LXC container can move to another node, together with the iSCSI target.
Setup
With the idea sorted out, let's move on to the implementation.
By default, the Linux kernel ships the drbd module of the eighth version; unfortunately it does not suit us, and we need to install the module of the ninth version.
Connect the LINBIT repository and install everything you need:
wget -O- https://packages.linbit.com/package-signing-pubkey.asc | apt-key add -
echo"deb http://packages.linbit.com/proxmox/ proxmox-5 drbd-9.0" \
> /etc/apt/sources.list.d/linbit.list
apt-get update && apt-get -y install pve-headers drbd-dkms drbd-utils drbdtop
- pve-headers - kernel headers needed to build the module
- drbd-dkms - the kernel module in DKMS format
- drbd-utils - basic utilities for managing DRBD
- drbdtop - an interactive top-like tool for DRBD only
After installing the module, check if everything is fine with it:
# modprobe drbd
# cat /proc/drbd
version: 9.0.14-1 (api:2/proto:86-113)
If you see the eighth version in the output of this command, something went wrong and the in-tree kernel module was loaded. Check dkms status to find out the reason.
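In that case, the following sequence usually helps (a minimal sketch; it assumes no drbd resources are up yet and that the DKMS build succeeded for the running kernel):
# verify that the drbd module was built and installed via DKMS
dkms status drbd
# reload the module so that the DKMS version (9.x) replaces the in-tree one (8.x)
modprobe -r drbd
modprobe drbd
cat /proc/drbd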
Each node will run the same drbd device on top of an ordinary partition. First we need to prepare this partition for drbd on each node.
Such a partition can be any block device: lvm, zvol, a disk partition or the entire disk. In this article I will use a separate nvme disk with a partition for drbd: /dev/nvme1n1p1
It is worth noting that device names tend to change from time to time, so it is better to get into the habit of using a persistent symlink to the device right away.
You can find such a symlink for /dev/nvme1n1p1 like this:
# find /dev/disk/ -lname '*/nvme1n1p1'
/dev/disk/by-partuuid/847b9713-8c00-48a1-8dff-f84c328b9da2
/dev/disk/by-path/pci-0000:0e:00.0-nvme-1-part1
/dev/disk/by-id/nvme-eui.0000000001000000e4d25c33da9f4d01-part1
/dev/disk/by-id/nvme-INTEL_SSDPEKKA010T7_BTPY703505FB1P0H-part1
We describe our resource on all three nodes:
# cat /etc/drbd.d/tgt1.res
resource tgt1 {
  meta-disk internal;
  device /dev/drbd100;
  protocol C;
  net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  on pve1 {
    address 192.168.2.11:7000;
    disk /dev/disk/by-partuuid/95e7eabb-436e-4585-94ea-961ceac936f7;
    node-id 0;
  }
  on pve2 {
    address 192.168.2.12:7000;
    disk /dev/disk/by-partuuid/aa7490c0-fe1a-4b1f-ba3f-0ddee07dfee3;
    node-id 1;
  }
  on pve3 {
    address 192.168.2.13:7000;
    disk /dev/disk/by-partuuid/847b9713-8c00-48a1-8dff-f84c328b9da2;
    node-id 2;
  }
  connection-mesh {
    hosts pve1 pve2 pve3;
  }
}
It is advisable to use a separate network for drbd synchronization.
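Before creating the metadata, it does not hurt to check that the configuration parses correctly on each node; drbdadm will print the parsed resource and complain about any syntax errors:
drbdadm dump tgt1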
Now create the metadata for drbd and launch it:
# drbdadm create-md tgt1
initializing activity log
initializing bitmap (320 KB) to all zero
Writing meta data...
New drbd meta data block successfully created.
success
# drbdadm up tgt1
Repeat these actions on all three nodes and check the status:
# drbdadm status
tgt1 role:Secondary
  disk:Inconsistent
  pve2 role:Secondary
    peer-disk:Inconsistent
  pve3 role:Secondary
    peer-disk:Inconsistent
Now our disk is Inconsistent on all three nodes, because drbd does not know which disk should be taken as the original. We have to mark one of them as Primary so that its state gets synchronized to the other nodes:
drbdadm primary --force tgt1
drbdadm secondary tgt1
The first command forcibly marks the node as Primary and its disk as UpToDate; the second returns it to Secondary right away, since we only needed it to designate the synchronization source. Immediately after this, synchronization will start:
# drbdadm status
tgt1 role:Secondary
  disk:UpToDate
  pve2 role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:26.66
  pve3 role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:14.20
We do not have to wait until it finishes; the next steps can be performed in parallel, on any node, regardless of the current state of its local disk in DRBD. All requests will automatically be redirected to a device in the UpToDate state.
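If you want to keep an eye on the synchronization progress while continuing, the drbdtop utility installed earlier is handy, or you can simply poll the status:
# interactive, top-like view of all DRBD resources
drbdtop
# or a simple periodic poll
watch -n1 drbdadm status tgt1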
Don't forget to enable autostart of the drbd service on the nodes:
systemctl enable drbd.service
Configure LXC Container
I will omit the configuration of the three-node Proxmox cluster itself; this part is well described in the official wiki.
As I said before, our iSCSI target will run in an LXC container. We will keep the container on the device /dev/drbd100 we have just created.
First we need to create a file system on it:
mkfs -t ext4 -O mmp -E mmp_update_interval=5 /dev/drbd100
Proxmox by default includes multimount protection at the file system level. In principle we could do without it, since DRBD has its own protection by default (it will simply forbid a second Primary for the device), but extra caution will not hurt us.
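If you want to make sure the mmp feature really made it into the new file system, you can inspect the superblock (a quick sanity check, not strictly required):
dumpe2fs -h /dev/drbd100 | grep -i mmp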
Now download the Ubuntu template:
# wget http://download.proxmox.com/images/system/ubuntu-16.04-standard_16.04-1_amd64.tar.gz -P /var/lib/vz/template/cache/
And create from it our container:
pct create 101 local:vztmpl/ubuntu-16.04-standard_16.04-1_amd64.tar.gz \
  --hostname=tgt1 \
  --net0=name=eth0,bridge=vmbr0,gw=192.168.1.1,ip=192.168.1.11/24 \
  --rootfs=volume=/dev/drbd100,shared=1
In this command, we specify that the root file system of our container will be located on the device /dev/drbd100, and we add the parameter shared=1 to allow migration of the container between nodes.
If something goes wrong, you can always fix it through the Proxmox interface or in the container config /etc/pve/lxc/101.conf
Proxmox will unpack the template and prepare the container root system for us. After that we can run our container:
pct start 101
Setting up the iSCSI target
Out of the whole set of available targets, I chose istgt, since it shows the best performance and runs in user space.
Now let's log in to our container:
pct exec 101 bash
Install updates and istgt:
apt-get update
apt-get -y upgrade
apt-get -y install istgt
Create the file that we will serve over the network:
mkdir -p /data
fallocate -l 740G /data/target1.img
Now we need to write the istgt config, /etc/istgt/istgt.conf:
[Global]
  Comment "Global section"
  NodeBase "iqn.2018-07.org.example.tgt1"
  PidFile /var/run/istgt.pid
  AuthFile /etc/istgt/auth.conf
  MediaDirectory /var/istgt
  LogFacility "local7"
  Timeout 30
  NopInInterval 20
  DiscoveryAuthMethod Auto
  MaxSessions 16
  MaxConnections 4
  MaxR2T 32
  MaxOutstandingR2T 16
  DefaultTime2Wait 2
  DefaultTime2Retain 60
  FirstBurstLength 262144
  MaxBurstLength 1048576
  MaxRecvDataSegmentLength 262144
  InitialR2T Yes
  ImmediateData Yes
  DataPDUInOrder Yes
  DataSequenceInOrder Yes
  ErrorRecoveryLevel 0

[UnitControl]
  Comment "Internal Logical Unit Controller"
  AuthMethod CHAP Mutual
  AuthGroup AuthGroup10000
  Portal UC1 127.0.0.1:3261
  Netmask 127.0.0.1

[PortalGroup1]
  Comment "SINGLE PORT TEST"
  Portal DA1 192.168.1.11:3260

[InitiatorGroup1]
  Comment "Initiator Group1"
  InitiatorName "ALL"
  Netmask 192.168.1.0/24

[LogicalUnit1]
  Comment "Hard Disk Sample"
  TargetName disk1
  TargetAlias "Data Disk1"
  Mapping PortalGroup1 InitiatorGroup1
  AuthMethod Auto
  AuthGroup AuthGroup1
  UseDigest Auto
  UnitType Disk
  LUN0 Storage /data/target1.img Auto
Restart istgt:
systemctl restart istgt
At this point, the target setup is complete.
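To make sure the target is actually reachable, you can run a discovery from any client with open-iscsi installed (a small sketch; the portal address comes from the config above, and istgt builds the full IQN from NodeBase and TargetName):
iscsiadm -m discovery -t sendtargets -p 192.168.1.11
# should report a target along the lines of iqn.2018-07.org.example.tgt1:disk1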
HA Setup
Now we can move on to the HA-manager configuration. Create a separate HA group for our device:
ha-manager groupadd tgt1 --nodes pve1,pve2,pve3 --nofailback=1 --restricted=1
Our resource will work only on the nodes specified for this group. Add our container to this group:
ha-manager add ct:101 --group=tgt1 --max_relocate=3 --max_restart=3
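You can check that the resource was picked up, and later test a manual failover, roughly like this (pve2 here is just an example target node):
# show HA resources and their current nodes
ha-manager status
# move the container (and the iSCSI target with it) to another node
ha-manager migrate ct:101 pve2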
Recommendations and tuning
DRBD
As I noted above, it is always advisable to use a separate network for replication. It is highly desirable to use 10-gigabit network adapters, otherwise you will run into the speed limit of the ports.
If replication seems too slow, try tuning some DRBD parameters. Here is the config that, in my opinion, is optimal for my 10G network:
# cat /etc/drbd.d/global_common.conf
global {
  usage-count yes;
  udev-always-use-vnr;
}
common {
  handlers {
  }
  startup {
  }
  options {
  }
  disk {
    c-fill-target 10M;
    c-max-rate 720M;
    c-plan-ahead 10;
    c-min-rate 20M;
  }
  net {
    max-buffers 36k;
    sndbuf-size 1024k;
    rcvbuf-size 2048k;
  }
}
You can get more information about each parameter from the official DRBD documentation.
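After editing global_common.conf, the new settings can usually be applied to running resources without downtime; drbdadm compares the configuration on disk with the running one and applies the difference:
drbdadm adjust tgt1
# or for all resources at once
drbdadm adjust all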
Open-iSCSI
Since we are not using multipathing, in our case it is recommended to disable periodic connection checks on the clients, as well as to increase the session recovery timeouts in /etc/iscsi/iscsid.conf:
node.conn[0].timeo.noop_out_interval = 0
node.conn[0].timeo.noop_out_timeout = 0
node.session.timeo.replacement_timeout = 86400
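These settings only apply to targets discovered after the change; for node records that already exist they can be updated in place, for example like this (same parameters and values as above):
iscsiadm -m node -o update -n node.session.timeo.replacement_timeout -v 86400
iscsiadm -m node -o update -n 'node.conn[0].timeo.noop_out_interval' -v 0
iscsiadm -m node -o update -n 'node.conn[0].timeo.noop_out_timeout' -v 0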
Usage
Proxmox
The resulting iSCSI target can be connected to Proxmox right away; do not forget to uncheck Use LUN Directly.
Immediately after that, LVM can be created on top of it; do not forget to tick the shared checkbox:
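If you prefer the command line, the same can be done roughly as follows (a hedged sketch: the storage IDs and the VG name are arbitrary, the IQN is assembled from NodeBase and TargetName of the istgt config, and /dev/sdb is a hypothetical name for the LUN as it appears on the node):
# register the iSCSI target as a storage; content none corresponds to unchecking "Use LUN Directly"
pvesm add iscsi tgt1-iscsi --portal 192.168.1.11 --target iqn.2018-07.org.example.tgt1:disk1 --content none
# once the LUN shows up as a local block device, create a volume group on it
vgcreate tgt1-vg /dev/sdb
# and register it as shared LVM storage
pvesm add lvm tgt1-lvm --vgname tgt1-vg --shared 1 --content images,rootdir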
Other environments
If you plan to use this solution in a different environment, you may need to install a cluster extension for LVM; at the moment there are two implementations: CLVM and lvmlockd.
Setting up CLVM is not trivial and requires a running cluster manager.
The second method, lvmlockd, is not yet fully tested and is only just starting to appear in stable repositories.
I recommend reading the excellent article about locking in LVM.
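For reference, the lvmlockd route boils down to roughly the following on each node (a rough sketch assuming the sanlock lock manager; package and service names may differ between distributions, so consult the lvmlockd(8) man page before relying on it):
# install the lock daemon and a lock manager (package names are distribution-dependent)
apt-get install -y lvm2-lockd sanlock
# enable it in /etc/lvm/lvm.conf:  use_lvmlockd = 1
systemctl enable --now lvmlockd sanlock
# create a shared VG on the exported LUN (/dev/sdb is a hypothetical device name)
vgcreate --shared tgt1-vg /dev/sdb
# start the lockspaces for shared VGs
vgchange --lock-start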
When using LVM with Proxmox, the cluster extension is not required, since volume management is handled by Proxmox itself, which updates and monitors LVM metadata on its own. The same goes for OpenNebula, as its official documentation clearly states.