Secure storage with DRBD9 and Proxmox (Part 2: iSCSI + LVM)
In a previous article, I looked at the possibility of building a fault-tolerant NFS server with DRBD and Proxmox. It turned out pretty well, but we will not stop there: now we will try to squeeze all the juice out of our storage.
In this article I will show how to create a fault-tolerant iSCSI target in a similar way, which we will then slice into small pieces with LVM and use for virtual machines.
This approach reduces the load and increases the speed of data access several times; it is especially beneficial when concurrent access to the data is not required, for example when you need to organize storage for virtual machines.
A few words about DRBD
DRBD is a fairly simple and mature solution; the code of the eighth version was accepted into the Linux kernel. In essence it is a network mirror, RAID1 over the network. The ninth version adds support for quorum and replication across more than two nodes.
In fact, it allows you to combine block devices on several physical nodes into a single shared network device.
Using DRBD you can achieve very interesting configurations. Today we will talk about iSCSI and LVM.
You can learn more about it in my previous article, where I described this solution in detail.
A couple of words about iSCSI
iSCSI is a protocol for delivering a block device over a network.
Unlike NBD, it supports authentication, handles network failures without problems, offers many other useful features, and, most importantly, shows very good performance.
There are a huge number of implementations; some of them are included in the kernel and require no special effort to configure and connect.
A couple of words about LVM
It is worth mentioning that LINBIT has its own solution for Proxmox; it should work out of the box and give a similar result. However, in this article I do not want to focus on Proxmox alone, but rather describe a more universal solution that suits both Proxmox and anything else. In this example Proxmox is used only as a means of container orchestration; in fact, you could replace it with another solution, for example, run the containers with the target in Kubernetes.
As for Proxmox specifically, it works fine with shared LUN and LVM, using only its own standard drivers.
The advantage of LVM is that using it is nothing revolutionary new or insufficiently battle-tested; on the contrary, it offers dry stability, which is usually what you want from storage. It is worth mentioning that LVM is quite actively used in other environments, for example in OpenNebula or Kubernetes, and is supported quite well there.
Thus you get universal storage that can be used in different systems (not only in Proxmox), using only ready-made drivers and without much need for manual tinkering.
Unfortunately, when choosing a storage solution you always have to make compromises. This solution will not give you the same flexibility as, for example, Ceph.
The virtual disk size is limited by the size of the LVM group, and the area allocated for a specific virtual disk is necessarily preallocated; this greatly improves the speed of data access, but rules out thin provisioning (where a virtual disk occupies less physical space than its nominal size). It is also worth mentioning that LVM performance sags quite a lot when snapshots are used, so the option of using them freely is often excluded.
Yes, LVM supports thin-provisioning pools, which do not have this drawback, but unfortunately they can only be used within a single node; there is no way to share one thin pool between several nodes in a cluster.
But despite these shortcomings, thanks to its simplicity LVM still does not let competitors bypass it and push it off the battlefield entirely.
With a fairly small overhead, LVM remains a very fast, stable and reasonably flexible solution.
General scheme
- We have three nodes.
- Each node has a distributed drbd device.
- On top of the drbd device, an LXC container with the iSCSI target is running.
- The target is connected to all three nodes.
- An LVM group is created on the connected target.
- If necessary, the LXC container can move to another node, together with the iSCSI target.
Setup
With the idea sorted out, let's move on to the implementation.
By default, the Linux kernel ships the drbd module of the eighth version; unfortunately it does not suit us, and we need to install the module of the ninth version.
Connect the LINBIT repository and install everything you need:
wget -O- https://packages.linbit.com/package-signing-pubkey.asc | apt-key add -
echo"deb http://packages.linbit.com/proxmox/ proxmox-5 drbd-9.0" \
> /etc/apt/sources.list.d/linbit.list
apt-get update && apt-get -y install pve-headers drbd-dkms drbd-utils drbdtop
- pve-headers - kernel headers needed to build the module
- drbd-dkms - the kernel module in DKMS format
- drbd-utils - basic utilities for managing DRBD
- drbdtop - an interactive top-like tool for DRBD only
After installing the module, check if everything is fine with it:
# modprobe drbd
# cat /proc/drbd
version: 9.0.14-1 (api:2/proto:86-113)
If you see the eighth version in the output of this command, something went wrong and the in-tree kernel module was loaded. Check dkms status to find out the reason.
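In that case, the following sequence usually helps (a minimal sketch; it assumes no drbd resources are up yet and that the DKMS build succeeded for the running kernel):
# verify that the drbd module was built and installed via DKMS
dkms status drbd
# reload the module so that the DKMS version (9.x) replaces the in-tree one (8.x)
modprobe -r drbd
modprobe drbd
cat /proc/drbd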
Each node will run the same drbd device on top of an ordinary partition. First we need to prepare this partition for drbd on each node.
Such a partition can be any block device: lvm, zvol, a disk partition or the entire disk. In this article I will use a separate nvme disk with a partition for drbd: /dev/nvme1n1p1
It is worth noting that device names tend to change from time to time, so it is better to get into the habit of using a persistent symlink to the device right away.
You can find such a symlink for /dev/nvme1n1p1 like this:
# find /dev/disk/ -lname '*/nvme1n1p1'
/dev/disk/by-partuuid/847b9713-8c00-48a1-8dff-f84c328b9da2
/dev/disk/by-path/pci-0000:0e:00.0-nvme-1-part1
/dev/disk/by-id/nvme-eui.0000000001000000e4d25c33da9f4d01-part1
/dev/disk/by-id/nvme-INTEL_SSDPEKKA010T7_BTPY703505FB1P0H-part1
We describe our resource on all three nodes:
# cat /etc/drbd.d/tgt1.res
resource tgt1 {
  meta-disk internal;
  device /dev/drbd100;
  protocol C;
  net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  on pve1 {
    address 192.168.2.11:7000;
    disk /dev/disk/by-partuuid/95e7eabb-436e-4585-94ea-961ceac936f7;
    node-id 0;
  }
  on pve2 {
    address 192.168.2.12:7000;
    disk /dev/disk/by-partuuid/aa7490c0-fe1a-4b1f-ba3f-0ddee07dfee3;
    node-id 1;
  }
  on pve3 {
    address 192.168.2.13:7000;
    disk /dev/disk/by-partuuid/847b9713-8c00-48a1-8dff-f84c328b9da2;
    node-id 2;
  }
  connection-mesh {
    hosts pve1 pve2 pve3;
  }
}
It is advisable to use a separate network for drbd synchronization.
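Before creating the metadata, it does not hurt to check that the configuration parses correctly on each node; drbdadm will print the parsed resource and complain about any syntax errors:
drbdadm dump tgt1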
Now create the metadata for drbd and launch it:
# drbdadm create-md tgt1
initializing activity log
initializing bitmap (320 KB) to all zero
Writing meta data...
New drbd meta data block successfully created.
success
# drbdadm up tgt1
Repeat these actions on all three nodes and check the status:
# drbdadm status
tgt1 role:Secondary
  disk:Inconsistent
  pve2 role:Secondary
    peer-disk:Inconsistent
  pve3 role:Secondary
    peer-disk:Inconsistent
Now our disk is Inconsistent on all three nodes, because drbd does not know which disk should be taken as the original. We have to mark one of them as Primary so that its state gets synchronized to the other nodes:
drbdadm primary --force tgt1
drbdadm secondary tgt1
The first command forcibly marks the node as Primary and its disk as UpToDate; the second returns it to Secondary right away, since we only needed it to designate the synchronization source. Immediately after this, synchronization will start:
# drbdadm status
tgt1 role:Secondary
  disk:UpToDate
  pve2 role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:26.66
  pve3 role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:14.20
We do not have to wait until it finishes; the next steps can be performed in parallel, on any node, regardless of the current state of its local disk in DRBD. All requests will automatically be redirected to a device in the UpToDate state.
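If you want to keep an eye on the synchronization progress while continuing, the drbdtop utility installed earlier is handy, or you can simply poll the status:
# interactive, top-like view of all DRBD resources
drbdtop
# or a simple periodic poll
watch -n1 drbdadm status tgt1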
Don't forget to enable autostart of the drbd service on the nodes:
systemctl enable drbd.service
Configure LXC Container
I will omit the configuration of the three-node Proxmox cluster itself; this part is well described in the official wiki.
As I said before, our iSCSI target will run in an LXC container. We will keep the container on the device /dev/drbd100 we have just created.
First we need to create a file system on it:
mkfs -t ext4 -O mmp -E mmp_update_interval=5 /dev/drbd100
Proxmox by default includes multimount protection at the file system level. In principle we could do without it, since DRBD has its own protection by default (it will simply forbid a second Primary for the device), but extra caution will not hurt us.
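If you want to make sure the mmp feature really made it into the new file system, you can inspect the superblock (a quick sanity check, not strictly required):
dumpe2fs -h /dev/drbd100 | grep -i mmp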
Now download the Ubuntu template:
# wget http://download.proxmox.com/images/system/ubuntu-16.04-standard_16.04-1_amd64.tar.gz -P /var/lib/vz/template/cache/
And create from it our container:
pct create 101 local:vztmpl/ubuntu-16.04-standard_16.04-1_amd64.tar.gz \
  --hostname=tgt1 \
  --net0=name=eth0,bridge=vmbr0,gw=192.168.1.1,ip=192.168.1.11/24 \
  --rootfs=volume=/dev/drbd100,shared=1
In this command, we specify that the root file system of our container will be located on the device /dev/drbd100, and we add the parameter shared=1 to allow migration of the container between nodes.
If something goes wrong, you can always fix it through the Proxmox interface or in the container config /etc/pve/lxc/101.conf
Proxmox will unpack the template and prepare the container root system for us. After that we can run our container:
pct start 101
Setting up the iSCSI target
Out of the whole set of available targets, I chose istgt, since it shows the best performance and runs in user space.
Now let's log in to our container:
pct exec 101 bash
Install updates and istgt:
apt-get update
apt-get -y upgrade
apt-get -y install istgt
Create the file that we will serve over the network:
mkdir -p /data
fallocate -l 740G /data/target1.img
Now we need to write the istgt config, /etc/istgt/istgt.conf:
[Global]
  Comment "Global section"
  NodeBase "iqn.2018-07.org.example.tgt1"
  PidFile /var/run/istgt.pid
  AuthFile /etc/istgt/auth.conf
  MediaDirectory /var/istgt
  LogFacility "local7"
  Timeout 30
  NopInInterval 20
  DiscoveryAuthMethod Auto
  MaxSessions 16
  MaxConnections 4
  MaxR2T 32
  MaxOutstandingR2T 16
  DefaultTime2Wait 2
  DefaultTime2Retain 60
  FirstBurstLength 262144
  MaxBurstLength 1048576
  MaxRecvDataSegmentLength 262144
  InitialR2T Yes
  ImmediateData Yes
  DataPDUInOrder Yes
  DataSequenceInOrder Yes
  ErrorRecoveryLevel 0

[UnitControl]
  Comment "Internal Logical Unit Controller"
  AuthMethod CHAP Mutual
  AuthGroup AuthGroup10000
  Portal UC1 127.0.0.1:3261
  Netmask 127.0.0.1

[PortalGroup1]
  Comment "SINGLE PORT TEST"
  Portal DA1 192.168.1.11:3260

[InitiatorGroup1]
  Comment "Initiator Group1"
  InitiatorName "ALL"
  Netmask 192.168.1.0/24

[LogicalUnit1]
  Comment "Hard Disk Sample"
  TargetName disk1
  TargetAlias "Data Disk1"
  Mapping PortalGroup1 InitiatorGroup1
  AuthMethod Auto
  AuthGroup AuthGroup1
  UseDigest Auto
  UnitType Disk
  LUN0 Storage /data/target1.img Auto
Restart istgt:
systemctl restart istgt
At this point, the target setup is complete.
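To make sure the target is actually reachable, you can run a discovery from any client with open-iscsi installed (a small sketch; the portal address comes from the config above, and istgt builds the full IQN from NodeBase and TargetName):
iscsiadm -m discovery -t sendtargets -p 192.168.1.11
# should report a target along the lines of iqn.2018-07.org.example.tgt1:disk1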
HA Setup
Now we can move on to the HA-manager configuration. Create a separate HA group for our device:
ha-manager groupadd tgt1 --nodes pve1,pve2,pve3 --nofailback=1 --restricted=1
Our resource will work only on the nodes specified for this group. Add our container to this group:
ha-manager add ct:101 --group=tgt1 --max_relocate=3 --max_restart=3
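You can check that the resource was picked up, and later test a manual failover, roughly like this (pve2 here is just an example target node):
# show HA resources and their current nodes
ha-manager status
# move the container (and the iSCSI target with it) to another node
ha-manager migrate ct:101 pve2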
Recommendations and tuning
DRBD
As I noted above, it is always advisable to use a separate network for replication. It is highly desirable to use 10-gigabit network adapters, otherwise you will run into the speed limit of the ports.
If replication seems too slow, try tuning some DRBD parameters. Here is the config that, in my opinion, is optimal for my 10G network:
# cat /etc/drbd.d/global_common.conf
global {
  usage-count yes;
  udev-always-use-vnr;
}
common {
  handlers {
  }
  startup {
  }
  options {
  }
  disk {
    c-fill-target 10M;
    c-max-rate 720M;
    c-plan-ahead 10;
    c-min-rate 20M;
  }
  net {
    max-buffers 36k;
    sndbuf-size 1024k;
    rcvbuf-size 2048k;
  }
}
You can get more information about each parameter from the official DRBD documentation.
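After editing global_common.conf, the new settings can usually be applied to running resources without downtime; drbdadm compares the configuration on disk with the running one and applies the difference:
drbdadm adjust tgt1
# or for all resources at once
drbdadm adjust all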
Open-iSCSI
Since we are not using multipathing, in our case it is recommended to disable periodic connection checks on the clients, as well as to increase the session recovery timeouts in /etc/iscsi/iscsid.conf:
node.conn[0].timeo.noop_out_interval = 0
node.conn[0].timeo.noop_out_timeout = 0
node.session.timeo.replacement_timeout = 86400
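These settings only apply to targets discovered after the change; for node records that already exist they can be updated in place, for example like this (same parameters and values as above):
iscsiadm -m node -o update -n node.session.timeo.replacement_timeout -v 86400
iscsiadm -m node -o update -n 'node.conn[0].timeo.noop_out_interval' -v 0
iscsiadm -m node -o update -n 'node.conn[0].timeo.noop_out_timeout' -v 0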
Usage
Proxmox
The resulting iSCSI target can be connected to Proxmox right away; do not forget to uncheck Use LUN Directly.
Immediately after that, LVM can be created on top of it; do not forget to tick the shared checkbox:
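If you prefer the command line, the same can be done roughly as follows (a hedged sketch: the storage IDs and the VG name are arbitrary, the IQN is assembled from NodeBase and TargetName of the istgt config, and /dev/sdb is a hypothetical name for the LUN as it appears on the node):
# register the iSCSI target as a storage; content none corresponds to unchecking "Use LUN Directly"
pvesm add iscsi tgt1-iscsi --portal 192.168.1.11 --target iqn.2018-07.org.example.tgt1:disk1 --content none
# once the LUN shows up as a local block device, create a volume group on it
vgcreate tgt1-vg /dev/sdb
# and register it as shared LVM storage
pvesm add lvm tgt1-lvm --vgname tgt1-vg --shared 1 --content images,rootdir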
Other environments
If you plan to use this solution in a different environment, you may need to install a cluster extension for LVM; at the moment there are two implementations: CLVM and lvmlockd.
Setting up CLVM is not trivial and requires a running cluster manager.
The second method, lvmlockd, is not yet fully tested and is only just starting to appear in stable repositories.
I recommend reading the excellent article about locking in LVM.
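For reference, the lvmlockd route boils down to roughly the following on each node (a rough sketch assuming the sanlock lock manager; package and service names may differ between distributions, so consult the lvmlockd(8) man page before relying on it):
# install the lock daemon and a lock manager (package names are distribution-dependent)
apt-get install -y lvm2-lockd sanlock
# enable it in /etc/lvm/lvm.conf:  use_lvmlockd = 1
systemctl enable --now lvmlockd sanlock
# create a shared VG on the exported LUN (/dev/sdb is a hypothetical device name)
vgcreate --shared tgt1-vg /dev/sdb
# start the lockspaces for shared VGs
vgchange --lock-start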
When using LVM with Proxmox, the cluster extension is not required, since volume management is handled by Proxmox itself, which updates and monitors LVM metadata on its own. The same goes for OpenNebula, as its official documentation clearly states.