Creating Trusted iSCSI Storage on Linux, Part 1

  • Tutorial
Part two


Today I will tell you how I created a budget fail-safe iSCSI storage from two Linux-based servers to serve the needs of a VMWare vSphere cluster. There were similar articles ( for example ), but my approach is somewhat different, and the solutions (the same heartbeat and iscsitarget) used there are already outdated.

The article is intended for experienced enough administrators who are not afraid of the phrase “patch and compile the kernel”, although some parts could be simplified and dispensed with without compilation, but I will write how I did it myself. I will skip some simple things so as not to inflate the material. The purpose of this article is to show the general principles rather than to outline everything step by step.


My requirements were simple: create a cluster for virtual machines that did not have a single point of failure. And as a bonus - the storage had to be able to encrypt data so that the enemies, having dragged the server, did not reach them.

VSphere was chosen as the hypervisor, as the most well-established and finished product, and iSCSI was chosen as the protocol, as it does not require additional financial injections in the form of FC or FCoE switches. With open source SAS targets, it’s rather tight, if not worse, so this option was also rejected.

Remained storage. Various branded solutions from leading vendors were discarded due to the high cost of both themselves and licenses for synchronous replication. So we will do it ourselves, at the same time and learn.

As software was selected:
  • Debian Wheezy + LTS Core 3.10
  • iSCSI Target SCST
  • DRBD for replication
  • Pacemaker for cluster resource management and monitoring
  • DM-Crypt core subsystem for encryption (AES-NI instructions in the processor will help us a lot)

As a result, in such a short torment, such a simple scheme was born:
It shows that each of the servers has 10 gigabit interfaces (2 built-in and 4 on additional network cards). 6 of them are connected to the switch stack (3 to each), and the remaining 4 to the neighbor server.
On them replication through DRBD will also go. Replication cards, if desired, can be replaced with 10-Gbps, but I had these on hand, so "I blinded it from what it was."

Thus, the sudden death of any of the cards will not lead to the complete inoperability of any of the subsystems.

Since the main task of these storages is reliable storage of large data (file server, mail archives, etc.), we selected servers with 3.5 "drives:

For business


I created two RAID10 arrays of 8 disks on each server.
I decided to refuse RAID6 since there was enough space, and the performance of RAID10 on random access tasks is higher. Plus, below the rebuild time and load in this case goes only to one disk, and not to the entire array at once.

In general, here everyone decides for himself.

Network part

With the iSCSI protocol, it makes no sense to use the Bonding / Etherchannel to speed it up.
The reason is simple - at the same time, hash functions are used to distribute packets over the channels, so it is very difficult to select such IP / MAC addresses so that the packet from IP1 to IP2 goes on one channel, and from IP1 to IP3 on another.
There is even a command on cisco that allows you to look at which of the Etherchannel interfaces the packet will fly:
# test etherchannel load-balance interface port-channel 1 ip
Would select Gi2/1/4 of Po1

Therefore, for our purposes, the use of several paths to the LUN is much better suited, which we will configure.

On the switch, I created 6 VLANs (one for each external interface of the server):
stack-3750x# sh vlan | i iSCSI
24   iSCSI_VLAN_1                     active
25   iSCSI_VLAN_2                     active
26   iSCSI_VLAN_3                     active
27   iSCSI_VLAN_4                     active
28   iSCSI_VLAN_5                     active
29   iSCSI_VLAN_6                     active

Interfaces were made trunked for versatility and something else will be seen later:
interface GigabitEthernet1/0/11
 description VMSTOR1-1
 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport nonegotiate
 flowcontrol receive desired
 spanning-tree portfast trunk

The MTU on the switch should be set to the maximum to reduce the load on the server (more packet -> less packets per second -> less interruption is generated). In my case, it is 9198:
(config)# system mtu jumbo 9198

ESXi does not support MTUs over 9000, so there is still some margin.

Each VLAN was assigned an address space, for simplicity it looks like this: 10.1. VLAN_ID .0 / 24 (for example, With a shortage of addresses, you can keep within smaller subnets, but it’s more convenient.

Each LUN will be represented by a separate iSCSI target, so each common target has been selected with “common” cluster addresses, which will be raised on the node serving this target at the moment: 10.1. VLAN_ID .10 and 10.1. VLAN_ID .20

Also, the servers will have permanent addresses for management, in my case they are ​​and .200 (in a separate VLAN)


So, here we install Debian on both servers in a minimal form, I will not dwell on this in detail.

Package assembly

I conducted the assembly on a separate virtual machine, so as not to clutter up the server with compilers and sources.
To build the kernel under Debian, it’s enough to put the build-essential meta-package and, perhaps, something else, I don’t remember exactly.

Download the latest kernel 3.10 from : and unpack it:
# cd /usr/src/
# wget
# tar xJf linux-3.10.27.tar.xz

Next, download through SVN the latest revision of the stable SCST branch, generate a patch for our kernel version and apply it:
# svn checkout svn:// scst-svn
# cd scst-svn
# scripts/generate-kernel-patch 3.10.27 > ../scst.patch
# cd linux-3.10.27
# patch -Np1 -i ../scst.patch

Now build the iscsi-scstd daemon:
# cd scst-svn/iscsi-scst/usr
# make

The resulting iscsi-scstd will need to be put on the server, for example, in / opt / scst

Next, we configure the kernel for our server.
Turn on encryption (if necessary).

Do not forget to include these options for SCST and DRBD:

We collect it in the form of a .deb package (for this you need to install the fakeroot, kernel-package and debhelper at the same time):
# fakeroot make-kpkg clean prepare
# fakeroot make-kpkg --us --uc --stem=kernel-scst --revision=1 kernel_image

At the output, we get the kernel-scst-image-3.10.27_1_amd64.deb

package. Next, we collect the package for DRBD:
# wget
# tar xzf drbd-8.4.4.tar.gz
# cd drbd-8.4.4
# dh_make --native --single
После вопроса жмакаем Enter

We change the debian / rules file to the following state (there is a standard file there, but it does not collect kernel modules):
#!/usr/bin/make -f
# Путь до исходников ядра
export KDIR="/usr/src/linux-3.10.27"
<тут два таба, без них не будет работать>
<и тут тоже>
        ./configure \
        --prefix=/usr \
        --localstatedir=/var \
        --sysconfdir=/etc \
        --with-pacemaker \
        --with-utils \
        --with-km \
        --with-udev \
        --with-distro=debian \
        --without-xen \
        --without-heartbeat \
        --without-legacy_utils \
        --without-rgmanager \
        dh $@

In the file, we will correct the SUBDIRS variable, remove the documentation from it , otherwise the package will not be collected with a curse on the documentation.

We collect:
# dpkg-buildpackage -us -uc -b

We get the drbd_8.4.4_amd64.deb package .

Everything, you don’t need to collect anything else, copy both packages to the servers and install:
# dpkg -i kernel-scst-image-3.10.27_1_amd64.deb
# dpkg -i drbd_8.4.4_amd64.deb

Server Configuration


The interfaces were renamed to /etc/udev/rules.d/70-persistent-net.rules as follows:
int1-6 go to the switch, and drbd1-4 go to the neighboring server.

/ etc / network / interfaces has an extremely frightening appearance, which is not even a dream in a nightmare:
auto lo
iface lo inet loopback
# Interfaces
auto int1
iface int1 inet manual
    up ip link set int1 mtu 9000 up
    down ip link set int1 down
auto int2
iface int2 inet manual
    up ip link set int2 mtu 9000 up
    down ip link set int2 down
auto int3
iface int3 inet manual
    up ip link set int3 mtu 9000 up
    down ip link set int3 down
auto int4
iface int4 inet manual
    up ip link set int4 mtu 9000 up
    down ip link set int4 down
auto int5
iface int5 inet manual
    up ip link set int5 mtu 9000 up
    down ip link set int5 down
auto int6
iface int6 inet manual
    up ip link set int6 mtu 9000 up
    down ip link set int6 down
# Management interface
auto int1.2
iface int1.2 inet manual
    up ip link set int1.2 mtu 1500 up
    down ip link set int1.2 down
    vlan_raw_device int1
auto int2.2
iface int2.2 inet manual
    up ip link set int2.2 mtu 1500 up
    down ip link set int2.2 down
    vlan_raw_device int2
auto int3.2
iface int3.2 inet manual
    up ip link set int3.2 mtu 1500 up
    down ip link set int3.2 down
    vlan_raw_device int3
auto int4.2
iface int4.2 inet manual
    up ip link set int4.2 mtu 1500 up
    down ip link set int4.2 down
    vlan_raw_device int4
auto int5.2
iface int5.2 inet manual
    up ip link set int5.2 mtu 1500 up
    down ip link set int5.2 down
    vlan_raw_device int5
auto int6.2
iface int6.2 inet manual
    up ip link set int6.2 mtu 1500 up
    down ip link set int6.2 down
    vlan_raw_device int6
auto bond_vlan2
iface bond_vlan2 inet static
    slaves int1.2 int2.2 int3.2 int4.2 int5.2 int6.2
    bond-mode active-backup
    bond-primary int1.2
    bond-miimon 100
    bond-downdelay 200
    bond-updelay 200
    mtu 1500
auto int1.24
iface int1.24 inet manual
    up ip link set int1.24 mtu 9000 up
    down ip link set int1.24 down
    vlan_raw_device int1
auto int2.25
iface int2.25 inet manual
    up ip link set int2.25 mtu 9000 up
    down ip link set int2.25 down
    vlan_raw_device int2
auto int3.26
iface int3.26 inet manual
    up ip link set int3.26 mtu 9000 up
    down ip link set int3.26 down
    vlan_raw_device int3
auto int4.27
iface int4.27 inet manual
    up ip link set int4.27 mtu 9000 up
    down ip link set int4.27 down
    vlan_raw_device int4
auto int5.28
iface int5.28 inet manual
    up ip link set int5.28 mtu 9000 up
    down ip link set int5.28 down
    vlan_raw_device int5
auto int6.29
iface int6.29 inet manual
    up ip link set int6.29 mtu 9000 up
    down ip link set int6.29 down
    vlan_raw_device int6
# DRBD bonding
auto bond_drbd
iface bond_drbd inet static
    slaves drbd1 drbd2 drbd3 drbd4
    bond-mode balance-rr
    mtu 9216

Since we also want to have fault tolerance for server management, we use a military trick: in bonding in active-backup mode, we collect not the interfaces themselves, but VLAN subinterfaces. Thus, the server will be available as long as at least one interface is running. This is redundant, but purcua would not be pa. And at the same time, the same interfaces can be freely used for iSCSI traffic.

For replication, the bond_drbd interface was created in balance-rr mode , in which packets are sent stupidly sequentially across all interfaces. He was assigned an address from the gray network / 24, but one could manage with / 30 or / 31 as there will be only two hosts.

Since this sometimes leads to the arrival of packets out of turn, we increase the buffer of extraordinary packets in/etc/sysctl.conf . Below I give the whole file, which options do I will not explain, for a very long time. You can read it yourself if you wish.
net.ipv4.tcp_reordering = 127
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.ipv4.tcp_rmem = 131072 524288 33554432
net.ipv4.tcp_wmem = 131072 524288 33554432
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_dsack = 0
net.ipv4.tcp_fin_timeout = 15
net.core.netdev_max_backlog = 300000
vm.min_free_kbytes = 720896

According to the test results, the replication interface produces somewhere around 3.7 Gb / s , which is quite acceptable.

Since we have a multi-core server, and network cards and a RAID controller can separate the processing of interrupts between several queues, a script was written that links interrupts to the kernels:
#!/usr/bin/perl -w
use strict;
use warnings;
my $irq = 77;
my $ifs = 11;
my $queues = 6;
my $skip = 1;
my @tmpl = ("0", "0", "0", "0", "0", "0");
print "Applying IRQ affinity...\n";
for(my $if = 0; $if < $ifs; $if++) {
    for(my $q = 0; $q < $queues; $q++, $irq++) {
        my @tmp = @tmpl;
        $tmp[$q] = 1;
        my $mask = join("", @tmp);
        my $hexmask = bin2hex($mask);
        #print $irq . " -> " . $hexmask . "\n";
        open(OUT, ">/proc/irq/".$irq."/smp_affinity");
        print OUT $hexmask."\n";
    $irq += $skip;
sub bin2hex {
    my ($bin) = @_;
    return sprintf('%x', oct("0b".scalar(reverse($bin))));


Before exporting disks, we will encrypt them and secure master keys for every fireman:
# cryptsetup luksFormat --cipher aes-cbc-essiv:sha256 --hash sha256 /dev/sdb
# cryptsetup luksFormat --cipher aes-cbc-essiv:sha256 --hash sha256 /dev/sdc
# cryptsetup luksHeaderBackup /dev/sdb  --header-backup-file /root/header_sdb.bin
# cryptsetup luksHeaderBackup /dev/sdc  --header-backup-file /root/header_sdc.bin

The password must be written on the inside of the skull and never be forgotten, and the key backups should be hidden to hell.
It must be borne in mind that after changing the password on the backup sections of the master key, it will be possible to decrypt the old password.

Next, a script was written to simplify decryption:
#!/usr/bin/perl -w
use strict;
use warnings;
use IO::Prompt;
my %crypto_map = (
    '1bd1f798-d105-4150-841b-f2751f419efc' => 'VM_STORAGE_1',
    'd7fcdb2b-88fd-4d19-89f3-5bdf3ddcc456' => 'VM_STORAGE_2'
my $i = 0;
my $passwd = prompt('Password: ', '-echo' => '*');
foreach my $dev (sort(keys(%crypto_map))) {
    if(-e '/dev/mapper/'.$crypto_map{$dev}) {
        print "Mapping '".$crypto_map{$dev}."' already exists, skipping\n";
    my $ret = system('echo "'.$passwd.'" | /usr/sbin/cryptsetup luksOpen /dev/disk/by-uuid/'.$dev.' '.$crypto_map{$dev});
    if($ret == 0) {
        print $i . ' Crypto mapping '.$dev.' => '.$crypto_map{$dev}.' added successfully' . "\n";
    } else {
        print $i . ' Failed to add mapping '.$dev.' => '.$crypto_map{$dev} . "\n";

The script works with UUIDs of disks to always uniquely identify a disk in the system without binding to / dev / sd * .

The encryption speed depends on the processor frequency and the number of cores, and the recording is parallelized better than reading. You can check how fast the server encrypts in the following simple way:
Создаем виртуальный диск, запись на который будет уходить в никуда, а чтение выдавать нули
# echo "0 268435456 zero" | dmsetup create zerodisk
Создаем на его основе шифрованный виртуальный диск
# cryptsetup --cipher aes-cbc-essiv:sha256 --hash sha256 create zerocrypt /dev/mapper/zerodisk
Enter passphrase: <любой пароль>
Меряем скорость:
# dd if=/dev/zero of=/dev/mapper/zerocrypt bs=1M count=16384
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB) copied, 38.3414 s, 448 MB/s
# dd of=/dev/null if=/dev/mapper/zerocrypt bs=1M count=16384
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB) copied, 74.5436 s, 230 MB/s

As you can see, the speeds are not so hot, but they will rarely be achieved in practice since usually random access prevails.

For comparison, the results of the same test on the new Xeon E3-1270 v3 on the Haswell core :
# dd if=/dev/zero of=/dev/mapper/zerocrypt bs=1M count=16384
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB) copied, 11.183 s, 1.5 GB/s
# dd of=/dev/null if=/dev/mapper/zerocrypt bs=1M count=16384
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB) copied, 19.4902 s, 881 MB/s

Well, here it’s much more fun. Frequency is a decisive factor, apparently.
And if you deactivate AES-NI, it will be several times slower.


We configure replication, configs from both ends should be 100% identical.

global {
    usage-count no;
common {
    protocol B;
    handlers {
    startup {
        wfc-timeout 10;
    disk {
        c-plan-ahead 0;
        al-extents 6433;
        resync-rate 400M;
        disk-barrier no;
        disk-flushes no;
        disk-drain yes;
    net {
        sndbuf-size 1024k;
        rcvbuf-size 1024k;
        max-buffers 8192; # x PAGE_SIZE
        max-epoch-size 8192; # x PAGE_SIZE
        unplug-watermark 8192;
        timeout 100;
        ping-int 15;
        ping-timeout 60; # x 0.1sec
        connect-int 15;
        timeout 50; # x 0.1sec
        verify-alg sha1;
        csums-alg sha1;
        data-integrity-alg crc32c;
        cram-hmac-alg sha1;
        shared-secret "ultrasuperdupermegatopsecretpassword";

Here the most interesting parameter is the protocol, compare them.

Recording is considered successful if the block was recorded ...
  • A - to the local disk and hit the local send buffer
  • B - to the local disk and hit the remote receive buffer
  • C - to local and remote disk

The slowest (read - high latency) and, at the same time, reliable is C , and I chose the middle ground.

Next is the definition of the resources that DRBD and the nodes involved in their replication operate on.

resource VM_STORAGE_1 {
    device /dev/drbd0;
    disk /dev/mapper/VM_STORAGE_1;
    meta-disk internal;
    on vmstor1 {
    on vmstor2 {

resource VM_STORAGE_2 {
    device /dev/drbd1;
    disk /dev/mapper/VM_STORAGE_2;
    meta-disk internal;
    on vmstor1 {
    on vmstor2 {

Each resource has its own port.

Now we initialize the DRBD resource metadata and activate it, this must be done on each server:
# drbdadm create-md VM_STORAGE_1
# drbdadm create-md VM_STORAGE_2
# drbdadm up VM_STORAGE_1
# drbdadm up VM_STORAGE_2

Next, you need to select one server (you can have one for each resource) and determine that it is the main one and primary synchronization will go from it to another:
# drbdadm primary --force VM_STORAGE_1
# drbdadm primary --force VM_STORAGE_2

Everything, let's go, synchronization has begun.
Depending on the size of the arrays and the speed of the network, it will take a long or very long time.

Its progress can be observed with the watch -n0.1 cat / proc / drbd command , it is very pacifying and philosophical.
In principle, devices can already be used in the synchronization process, but I advise you to relax :)

End of the first part

For one piece, I think that's enough. And so much information is already absorbed.

In the second part, I will talk about setting up the cluster manager and ESXi hosts to work with this share.

Also popular now: