Secure storage with DRBD9 and Proxmox (Part 2: iSCSI + LVM)



    In a previous article, I looked at building a fault-tolerant NFS server with DRBD and Proxmox. It turned out pretty well, but we won't stop there: now we'll try to squeeze all the juice out of our storage.


    In this article I will show how to create a fault-tolerant iSCSI target in a similar way, which we will then carve into small pieces with LVM and use for virtual machines.


    This approach reduces the load and increases data access speed several times over. It is especially beneficial when concurrent access to the data is not required, for example when you need to organize storage for virtual machines.


    A few words about DRBD


    DRBD is a fairly simple and mature solution; the code of the eighth version is accepted into the Linux kernel. In essence, it is a network mirror, RAID1 over the network. The ninth version adds support for quorum and replication across more than two nodes.


    In effect, it allows you to combine block devices on several physical nodes into one shared, replicated block device.


    Using DRBD you can achieve very interesting configurations. Today we will talk about iSCSI and LVM.


    You can learn more about it in my previous article, where I described this solution in detail.


    A couple of words about iSCSI


    iSCSI is a protocol for delivering a block device over a network.


    Unlike NBD, it supports authentication, handles network failures gracefully, offers many other useful features and, most importantly, shows very good performance.


    There is a huge number of implementations; some of them are included in the kernel and require no particular effort to configure and connect.


    A couple of words about LVM


    It is worth mentioning that LINBIT has its own solution for Proxmox; it should work out of the box and achieve a similar result, but in this article I don't want to focus on Proxmox alone. Instead I will describe a more universal approach that suits both Proxmox and anything else; in this example Proxmox is used only as a means of orchestrating containers, and you can in fact replace it with another solution, for example running the containers with the target in Kubernetes.


    As for Proxmox specifically, it works fine with a shared LUN and LVM, using only its own standard drivers.


    One advantage of LVM is that it is nothing revolutionary or insufficiently battle-tested; on the contrary, it offers the boring stability that is usually required of storage. It is worth mentioning that LVM is actively used in other environments as well, for example in OpenNebula or Kubernetes, and is quite well supported there.


    Thus, you get universal storage that can be used in different systems (not only Proxmox), using only off-the-shelf drivers and without much need to hack on it.


    Unfortunately, when choosing a storage solution you always have to make compromises. Here, this solution will not give you the same flexibility as, for example, Ceph.
    The virtual disk size is limited by the size of the LVM group, and the space allocated to a particular virtual disk is always preallocated. This greatly improves data access speed, but rules out thin provisioning (where a virtual disk occupies less physical space than its nominal size). It is also worth mentioning that LVM performance sags quite noticeably when snapshots are in use, so the option of using them freely is often ruled out.


    Yes, LVM supports thin-provisioning pools, which are free of this drawback, but unfortunately they can only be used within a single node; there is no way to share one thin pool between several nodes in a cluster.
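
    To make the difference concrete, here is a minimal sketch; the volume group name vg-tgt1 and the sizes are just examples of mine:


    # a regular, preallocated LV can live on the shared VG and be used from any node
    lvcreate -L 32G -n vm-101-disk-1 vg-tgt1

    # a thin pool is also possible, but it can only ever be active on a single node
    lvcreate --type thin-pool -L 200G -n thinpool vg-tgt1
    lvcreate -V 32G --thin -n vm-102-disk-1 vg-tgt1/thinpool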


    But despite these shortcomings, thanks to its simplicity LVM still does not let its competitors outflank it and push it off the battlefield entirely.


    With a fairly small overhead, LVM still represents a very fast, stable and reasonably flexible solution.


    General scheme


    • We have three nodes.
    • Each node has a distributed drbd device.
    • On top of the drbd device, an LXC container with an iSCSI target is running.
    • The target is connected to all three nodes.
    • An LVM group is created on top of the connected target.
    • If necessary, the LXC container can move to another node together with the iSCSI target.

    Setup


    With the idea sorted out, let's move on to the implementation.


    By default, the eighth-version drbd module ships with the Linux kernel; unfortunately it does not suit us, so we need to install the ninth-version module.


    Connect the LINBIT repository and install everything you need:


    wget -O- https://packages.linbit.com/package-signing-pubkey.asc | apt-key add - 
    echo"deb http://packages.linbit.com/proxmox/ proxmox-5 drbd-9.0" \
      > /etc/apt/sources.list.d/linbit.list
    apt-get update && apt-get -y install pve-headers drbd-dkms drbd-utils drbdtop

    • pve-headers - kernel headers needed to build the module
    • drbd-dkms - kernel module in DKMS format
    • drbd-utils - basic utilities for managing DRBD
    • drbdtop - an interactive top-like tool for DRBD

    After installing the module, check if everything is fine with it:


    # modprobe drbd
    # cat /proc/drbd
    version: 9.0.14-1 (api:2/proto:86-113)

    If you see the eighth version in the output, something went wrong and the in-tree kernel module is loaded. Check dkms status to find out the reason.
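
    A quick sanity check, assuming nothing is using DRBD yet, might look like this:


    dkms status | grep drbd       # the drbd/9.x module should show up as installed
    modprobe -r drbd              # unload the in-tree v8 module (fails if DRBD is in use)
    modprobe drbd
    grep 'version:' /proc/drbd    # should now report 9.x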


    Each node will run the same drbd device on top of ordinary partitions. First we need to prepare such a partition for drbd on each node.


    Such a partition can be any block device: an lvm volume, a zvol, a disk partition or the entire disk. In this article I will use a separate nvme disk with a partition for drbd: /dev/nvme1n1p1
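
    If the disk is not partitioned yet, a partition can be created, for example, like this (a sketch; substitute your own device name):


    parted /dev/nvme1n1 mklabel gpt
    parted /dev/nvme1n1 mkpart primary 0% 100%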


    It is worth noting that device names tend to change from time to time, so it is better to get into the habit of using a persistent symlink to the device right away.


    You can find such a symlink for /dev/nvme1n1p1 like this:


    # find /dev/disk/ -lname '*/nvme1n1p1'
    /dev/disk/by-partuuid/847b9713-8c00-48a1-8dff-f84c328b9da2
    /dev/disk/by-path/pci-0000:0e:00.0-nvme-1-part1
    /dev/disk/by-id/nvme-eui.0000000001000000e4d25c33da9f4d01-part1
    /dev/disk/by-id/nvme-INTEL_SSDPEKKA010T7_BTPY703505FB1P0H-part1

    We describe our resource on all three nodes:


    # cat /etc/drbd.d/tgt1.res
    resource tgt1 {
      meta-disk internal;
      device    /dev/drbd100;
      protocol  C;
      net { 
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
      }
      on pve1 {
        address   192.168.2.11:7000;
        disk      /dev/disk/by-partuuid/95e7eabb-436e-4585-94ea-961ceac936f7;
        node-id   0;
      }
      on pve2 {
        address   192.168.2.12:7000;
        disk      /dev/disk/by-partuuid/aa7490c0-fe1a-4b1f-ba3f-0ddee07dfee3;
        node-id   1;
      }
      on pve3 {
        address   192.168.2.13:7000;
        disk      /dev/disk/by-partuuid/847b9713-8c00-48a1-8dff-f84c328b9da2;
        node-id   2;
      }
      connection-mesh {
        hosts pve1 pve2 pve3;
      }
    }

    It is advisable to use a separate network for drbd synchronization.


    Now create the metadata for drbd and launch it:


    # drbdadm create-md tgt1
    initializing activity log
    initializing bitmap (320 KB) to all zero
    Writing meta data...
    New drbd meta data block successfully created.
    success
    # drbdadm up tgt1

    Repeat these actions on all three nodes and check the status:


    # drbdadm status
    tgt1 role:Secondary
      disk:Inconsistent
      pve2 role:Secondary
        peer-disk:Inconsistent
      pve3 role:Secondary
        peer-disk:Inconsistent

    Now our disk is Inconsistent on all three nodes because drbd does not know which disk should be taken as the original. We need to mark one of them as Primary so that its state is synchronized to the other nodes:


    drbdadm primary --force tgt1
    drbdadm secondary tgt1

    Immediately after this, synchronization will start:


    # drbdadm status
    tgt1 role:Secondary
      disk:UpToDate
      pve2 role:Secondary
        replication:SyncSource peer-disk:Inconsistent done:26.66
      pve3 role:Secondary
        replication:SyncSource peer-disk:Inconsistent done:14.20
    

    We don't have to wait for synchronization to finish; the next steps can be carried out in parallel. They can be performed on any node, regardless of the current state of its local disk in DRBD. All requests are automatically redirected to a device in the UpToDate state.
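
    If you want to keep an eye on the progress, the drbdtop utility installed earlier gives an interactive view, or you can simply poll the status (a trivial sketch):


    # interactive view
    drbdtop

    # or just poll the resource status every second
    watch -n1 drbdadm status tgt1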


    Don't forget to activate the autorun of the drbd service on the nodes:


    systemctl enable drbd.service

    Configure LXC Container


    I will skip the configuration of the three-node Proxmox cluster itself; this part is well described in the official wiki.


    As I said before, our iSCSI target will run in an LXC container. We will keep the container on the /dev/drbd100 device we just created.


    First we need to create a file system on it:


    mkfs -t ext4 -O mmp -E mmp_update_interval=5 /dev/drbd100

    Note that we enable multi-mount protection at the file-system level. In principle we could do without it, because DRBD has its own protection by default (it simply forbids a second Primary for the device), but caution won't hurt.
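
    You can quickly verify that MMP really ended up enabled on the new file system (a small check, run on the node where you created it):


    tune2fs -l /dev/drbd100 | grep -i mmp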


    Now download the Ubuntu template:


    # wget http://download.proxmox.com/images/system/ubuntu-16.04-standard_16.04-1_amd64.tar.gz -P /var/lib/vz/template/cache/

    And create our container from it:


    pct create 101 local:vztmpl/ubuntu-16.04-standard_16.04-1_amd64.tar.gz \
      --hostname=tgt1 \
      --net0=name=eth0,bridge=vmbr0,gw=192.168.1.1,ip=192.168.1.11/24 \
      --rootfs=volume=/dev/drbd100,shared=1

    In this command we specify that the root file system of our container will live on the device /dev/drbd100, and we add the parameter shared=1 to allow migration of the container between nodes.


    If something goes wrong, you can always fix it through the Proxmox interface or in the container config /etc/pve/lxc/101.conf


    Proxmox will unpack the template and prepare the container root system for us. After that we can run our container:


    pct start 101

    Setting up an iSCSI target


    Of the entire set of targets, I chose istgt, since it has the highest performance and works in user space.


    Now let's log in to our container:


    pct exec 101 bash

    Install updates and istgt:


    apt-get update 
    apt-get -y upgrade
    apt-get -y install istgt

    Create a file that we will serve over the network:


    mkdir -p /data
    fallocate -l 740G /data/target1.img

    Now we need to write the istgt config /etc/istgt/istgt.conf:


    [Global]
      Comment"Global section"
      NodeBase "iqn.2018-07.org.example.tgt1"
      PidFile /var/run/istgt.pid
      AuthFile /etc/istgt/auth.conf
      MediaDirectory /var/istgt
      LogFacility "local7"Timeout30
      NopInInterval 20
      DiscoveryAuthMethod Auto
      MaxSessions 16
      MaxConnections 4
      MaxR2T 32
      MaxOutstandingR2T 16
      DefaultTime2Wait 2
      DefaultTime2Retain 60
      FirstBurstLength 262144
      MaxBurstLength 1048576
      MaxRecvDataSegmentLength 262144
      InitialR2T Yes
      ImmediateData Yes
      DataPDUInOrder Yes
      DataSequenceInOrder Yes
      ErrorRecoveryLevel 0
    [UnitControl]
      Comment"Internal Logical Unit Controller"
      AuthMethod CHAP Mutual
      AuthGroup AuthGroup10000
      Portal UC1 127.0.0.1:3261
      Netmask 127.0.0.1
    [PortalGroup1]
      Comment"SINGLE PORT TEST"
      Portal DA1 192.168.1.11:3260
    [InitiatorGroup1]
      Comment"Initiator Group1"
      InitiatorName "ALL"
      Netmask 192.168.1.0/24
    [LogicalUnit1]
      Comment"Hard Disk Sample"
      TargetName disk1
      TargetAlias "Data Disk1"Mapping PortalGroup1 InitiatorGroup1
      AuthMethod Auto
      AuthGroup AuthGroup1
      UseDigest Auto
      UnitType Disk
      LUN0 Storage /data/target1.img Auto

    Restart istgt:


    systemctl restart istgt

    At this point, the target setup is complete.
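
    Before moving on it doesn't hurt to check that istgt is actually listening and that the target is discoverable. A small sketch; the second command is run from any node with open-iscsi installed:


    # inside the container: istgt should be listening on the iSCSI port
    ss -tln | grep 3260

    # from a node: the target IQN should show up in discovery
    iscsiadm -m discovery -t sendtargets -p 192.168.1.11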


    HA Setup


    Now we can move on to the HA-manager configuration. Create a separate HA group for our device:


    ha-manager groupadd tgt1 --nodes pve1,pve2,pve3 --nofailback=1 --restricted=1

    Our resource will work only on the nodes specified for this group. Add our container to this group:


    ha-manager add ct:101 --group=tgt1 --max_relocate=3 --max_restart=3
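
    To make sure the container is now managed by the HA stack, and to test a manual move between nodes (the target node here is just an example):


    ha-manager status                  # ct:101 should be listed among the managed services
    ha-manager relocate ct:101 pve2    # stop the container and start it on another node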

    Recommendations and tuning


    DRBD

    As I noted above, it is always advisable to use a separate network for replication. It is highly desirable to use 10-gigabit network adapters, otherwise you will run into the port speed limit.
    If replication seems too slow, try tuning some DRBD parameters. Here is the config that, in my opinion, is optimal for my 10G network:


    # cat /etc/drbd.d/global_common.conf
    global {
     usage-count yes;
     udev-always-use-vnr; 
    }
    common {
     handlers {
     }
     startup {
     }
     options {
     }
     disk {
      c-fill-target 10M;
      c-max-rate   720M;
      c-plan-ahead   10;
      c-min-rate    20M;
     }
     net {
      max-buffers     36k;
      sndbuf-size   1024k;
      rcvbuf-size   2048k;
     }
    }

    You can find more information about each parameter in the official DRBD documentation.


    Open-iSCSI

    Since we are not using multipathing, in our case it is recommended to disable periodic connection checks on the clients, as well as to increase session re-establishment timeouts, in /etc/iscsi/iscsid.conf:


    node.conn[0].timeo.noop_out_interval = 0
    node.conn[0].timeo.noop_out_timeout = 0
    node.session.timeo.replacement_timeout = 86400
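
    Keep in mind that iscsid.conf is only read when a node record is created; for an already configured target the setting can be updated per node record. A sketch assuming the IQN and portal from this article:


    iscsiadm -m node -T iqn.2018-07.org.example.tgt1:disk1 -p 192.168.1.11 \
      -o update -n node.session.timeo.replacement_timeout -v 86400
    # the change takes effect on the next login of this session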

    Usage


    Proxmox


    The resulting iSCSI target can be connected to Proxmox straight away; just don't forget to uncheck Use LUN Directly.



    Immediately after this it will be possible to create an LVM on top of it; don't forget to tick Shared:
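
    The same two steps can also be done from the command line with pvesm; a sketch in which the storage IDs (drbd-iscsi, drbd-lvm), the VG name and the device path are assumptions of mine:


    # register the iSCSI target as a storage (the LUN itself is not used directly)
    pvesm add iscsi drbd-iscsi --portal 192.168.1.11 \
      --target iqn.2018-07.org.example.tgt1:disk1 --content none

    # create a volume group on the exported LUN and add it as shared LVM storage
    pvcreate /dev/sdX                 # whatever device the LUN appears as on this node
    vgcreate drbd-vg /dev/sdX
    pvesm add lvm drbd-lvm --vgname drbd-vg --shared 1 --content images,rootdir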



    Other environments


    If you plan to use this solution in a different environment, you may need to install a cluster extension for LVM. At the moment there are two implementations: CLVM and lvmlockd.


    Setting up CLVM is not exactly trivial and requires a running cluster manager.
    The second method, lvmlockd, is not yet fully battle-tested and is only just starting to appear in stable repositories.
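
    For reference, with lvmlockd a shared volume group on the connected target would be created and started roughly like this (a sketch; lvmlockd plus a lock manager such as sanlock or dlm must already be running on every node, and the device name is a placeholder):


    pvcreate /dev/sdX                     # the iSCSI-attached device, name will differ
    vgcreate --shared vg-tgt1 /dev/sdX    # create the VG with shared locking
    vgchange --lock-start vg-tgt1         # start the lockspace (repeat on each node)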


    I recommend reading an excellent article on locking in LVM.


    When using LVM with Proxmox, the cluster extension is not required, since volume management is handled by Proxmox itself, which updates and monitors LVM metadata on its own. The same goes for OpenNebula, as the official documentation clearly states.

