iSCSI storage for the poor

Good day, dear community!

In this article, I would like to share the experience of creating a disk storage, which resulted in many experiments, trials, errors, finds, seasoned with bitter disappointments. And finally, it ended in some interesting, relatively budget and fast storage.

If you have a similar task or you are just interested in the headline, then welcome to habrakat.

Prologue


So, recently our department faced the task of providing a cluster of VMware ESXi 5.1 hypervisors with a large volume of storage. On it, we planned to place encrypted maildir for dovecot and “cloud” file storage. A prerequisite for budget allocation was to provide a place to store information critical for the company, and this section should be encrypted.

Iron


Unfortunately, or perhaps fortunately, we were not burdened with a large budget for such ambitious tasks. Therefore, we, as true admirers of maximalism, could not afford any branded storehouse, and within the allocated material resources we chose the following hardware:

  • Server chassis Supermicro CSE-836BE16-R920B
    There was a lot of reasoning: we chose the number of units, the size of the hard drives, their speed, the case or the platform at once, reviewed many options, smoked the Internet and eventually settled on this option as the optimal one for our tasks.
  • Motherboard Supermicro MBD-X9DRI-FO
    The main condition was the presence of 4 PCI-E x8 ports.
  • Processors Intel Xeon E5-2603
    The choice was simple - for which there was enough money. In addition, I had to install 2 processors at once, and not first one, then, if necessary, I would buy it, because only 3 PCI-E works with one occupied slot, but we needed 4.
  • Drives the Seagate of Constellation ES.3 ST3000NM0033
    the SATA because it is cheaper, and the same money we received multiple greater amount of space than when using SAS.
  • RAID controller Adaptec ASR-7805Q
    Since this is a storage system, they didn’t start trifling with the controller. This series has SSD caching, which would be very useful to us, and there is BBU immediately in the kit, which is also a very useful option.
  • SSDs Intel SSDSC2CW240A310
    They were needed exclusively for MaxCache to work (it is also an SSD cache).
  • Intel X520 DA2 Network Cards
    To avoid a bottleneck on network interfaces, it was necessary to provide a 10Gb link between ESXi nodes and storage. After studying the market offers, we may have come to not the most elegant one, but to the option that is suitable in price and speed using 10 gigabit network cards.

All this cost us about 200 thousand rubles.

Implementation


We decided to give out targets, that is, to allocate storage resources to consumers, using iSCSI and NFS. The most reasonable and quickest solution, of course, would be to use FCoE so as not to get into TCP with the corresponding overhead, which, in general, could be done with our network cards, but, unfortunately, we do not have an SFP switch with FCoE support, it was not possible to buy it, since it would cost us 500 tr from above.
Having smoked the Internet again, they found a way out of this in the vn2vn technology, but ESXi will learn to work with vn2vn only in the 6.x version, therefore, without thinking further, they started to work on what is.

Our corporate standard for Linux servers is CentOS, but in the current kernel (2.6.32-358) encryption is very slow, so I had to use Fedora as the OS. Of course, this is a Red Hat training ground, but in the latest Linux kernels, data is encrypted almost on the fly, and we don’t seem to need the rest.
In addition, the current version 19 will be used as the basis for RHEL 7, and therefore will allow us to seamlessly switch to CentOS 7 in the future.

Targeting


In order not to inflate the article and not move away from the topic, I omit all uninteresting types of assembly of iron, butting with a controller, installing the OS and other things. I will also try to describe the target as little as possible and limit myself only to its work with the ESXi initiator.

From the target we wanted to get the following:
  • properly working caching - disks are rather slow, they can only squeeze out 2,000 iops;
  • Read the highest possible speed of the direct disk subsystem as a whole (give as many iops as possible).

Meet, here they are.

Lio
linux-iscsi.org
With the Linux kernel 3.10.10, it showed me 300 MB / s write and 600 MB / s read in blockio mode. He showed the same numbers with fileio and also with a RAM disk. The graphs showed that the write speed jumps very much, probably due to the fact that the ESXi initiator requires write synchronization. For the same reason, the number of IOPS per record was the same with fileio and blockio.
In the mailing lists, it was recommended to disable emulate_fua_write, but this did not lead to any changes. Moreover, with the 3.9.5 kernel it shows the best results, which also makes us think about its future.
LIO, judging by the description, can do much more, but most features are available only in the commercial version. The site, which, in my opinion, should be primarily a source of information, is full of advertisements, which causes a negative. As a result, they decided to refuse.

istgt
www.peach.ne.jp/archives/istgt
Used by FreeBSD.
The target works well enough, except for a few but.
Firstly, he does not know how to blockio, and secondly, he cannot use different MaxRec and MaxXtran, at least I did not succeed. At low MaxRec values, sequential write speed did not exceed 250 MB / s, and reading was at a very high level - 700 MB / s. I got about 40K iops from a random 4k recording with a queue depth of 32. With an increase in MaxRec, the write speed increases to 700 MB / s, reading drops to 600 MB / s. Iops fall to 30K and 20K to write.
That is, somehow it would be possible to find a middle ground by changing the settings, but somehow it seemed not trues.

STGT
stgt.sourceforge.net
There was a problem with this target setting up the interface with the hypervisor. ESXi constantly confused LUN - accepted one for another, or stopped seeing at all. There was a suspicion of a problem in incorrect binding of serial numbers, but registering them in configs did not help.
The speed is also not pleased. He succeeded in getting more than 500 MB / s neither for reading nor for writing. The number of IOPS for reading is 20K, for writing about 15K.
As a result - problems with the config and low performance in speed. Refuse.

IET
iscsitarget.sourceforge.net
Worked almost flawlessly. Read and write 700MB / s. IOPS for reading about 30K, for writing 2000.
The initiator of ESXi forced the target to write data to disk immediately, without using the system cache. Also somewhat scared reviews about him in the maillists - many reported unstable work under load.

SCST
scst.sourceforge.net
And finally got to the leader of our race.
After rebuilding the kernel and minimizing the target itself, we got 750MB / s of reading and 950MB / s of writing. IOPS in fileio mode - 44K for reading and 37K for writing. Immediately, almost without a tambourine.
This target seemed like the perfect choice.

iSCSI for VMWare ESXi 5.1 on SCST and Fedora


And now, in fact, for what we all gathered here.
A small instruction on setting the target and initiator of ESXi. I didn’t immediately decide to try to write an article on Habr, so the instruction will not be step-by-step - I’ll restore it from memory, but it will contain the main points of the settings that allowed to achieve the desired results.

ESXi 5.1 Preparation

The following settings are made in the hypervisor:
  • in the iSCSI initiator settings, the Delayed ACK is disabled for all targets. Made in accordance with: kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1002598
  • initiator parameters are changed in accordance with the target parameters:
    InitialR2T = No
    ImmediateData = Yes
    MaxConnections = 1
    MaxRecvDataSegmentLength = 1048576
    MaxBurstLength = 1048576
    FirstBurstLength = 65536
    DefaultTime2Wait = 0
    DefaultTime2Retain = 0
    MaxOutstandingR2T = 32
    DataPDUInOrder = No
    DataSequenceInOrder = No
    ErrorRecoveryLevel = 0
    HeaderDigest = None
    DataDigest = None
    OFMarker = No
    IFMarker = No
    OFMarkInt = Reject
    IFMarkInt = Reject


You will need to disable Interrupt Moderation and LRO for network adapters. You can do this with the commands:

ethtool -C vmnicX rx-usecs 0 rx-frames 1 rx-usecs-irq 0 rx-framesirq 0
esxcfg-advcfg -s 0 /Net/TcpipDefLROEnabled
esxcli system module parameters set -m ixgbe -p "InterruptThrottleRate=0"


Reasons to do this are:
www.odbms.org/download/vmw-vfabric-gemFire-best-practices-guide.pdf
www.vmware.com/files/pdf/techpaper/VMW-Tuning-Latency-Sensitive-Workloads. pdf

In order not to set these values ​​again, you can add them to this script:
/etc/rc.local.d/local.sh


Fedora Preparation

Download and install in the minimum version the latest version of Fedora.

Update the system and reboot:

[root@nas ~]$ yum -y update && reboot


The system will only work on the local network, so I turned off the firewall and SELinux:

[root@nas ~]$ systemctl stop firewalld.service
[root@nas ~]$ systemctl disable firewalld.service 
[root@nas ~]$ cat /etc/sysconfig/selinux
SELINUX=disabled
SELINUXTYPE=targeted


Set up network interfaces and disable the NetworkManager.service service. It is not compatible with BRIDGE interfaces, and this was necessary for NFS.

[root@nas ~]$ systemctl disable NetworkManager.service 
[root@nas ~]$ chkconfig network on 


Disabled LRO on network cards.

[root@nas ~]$ cat /etc/rc.d/rc.local
#!/bin/bash
ethtool -K ethX lro off


Following Intel recommendations, the following system parameters were changed:

[root@nas ~]$ cat /etc/sysctl.d/ixgbe.conf
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_rmem = 10000000 10000000 10000000
net.ipv4.tcp_wmem = 10000000 10000000 10000000
net.ipv4.tcp_mem = 10000000 10000000 10000000
net.core.rmem_max = 524287
net.core.wmem_max = 524287
net.core.rmem_default = 524287
net.core.wmem_default = 524287
net.core.optmem_max = 524287
net.core.netdev_max_backlog = 300000


Target preparation

To use SCST, it is recommended to add patches to the kernel. This is optional, but performance is higher with them.
During the creation of the repository, the latest version of the kernel was 3.10.10-200. By the time you read the article, the kernel may already be updated, but I do not think that this will greatly affect the process.

Creating an rpm package with a modified kernel is described in detail here:
fedoraproject.org/wiki/Building_a_custom_kernel/en

But in order to avoid difficulties I will describe the preparation in detail.

Create a user:
[root@nas ~]$ useradd mockbuild


Let's move on to his environment:
[root@nas ~]$ su mockbuild
[mockbuild@nas root]$ cd


Install the packages for the assembly and prepare the kernel sources:
[mockbuild@nas ~]$ su -c 'yum install yum-utils rpmdevtools'
[mockbuild@nas ~]$ rpmdev-setuptree
[mockbuild@nas ~]$ yumdownloader --source kernel
[mockbuild@nas ~]$ su -c 'yum-builddep kernel-3.10.10-200.fc19.src.rpm'
[mockbuild@nas ~]$ rpm -Uvh kernel-3.10.10-200.fc19.src.rpm
[mockbuild@nas ~]$ cd ~/rpmbuild/SPECS
[mockbuild@nas ~]$ rpmbuild -bp --target=`uname -m` kernel.spec


Now the patches themselves will be required. Download SCST from the svn repository:
[mockbuild@nas ~]$ svn co https://scst.svn.sourceforge.net/svnroot/scst/trunk scst-svn


Copy the necessary patches to ~ / rpmbuild / SOURCES /
[mockbuild@nas ~]$ cp scst-svn/iscsi-scst/kernel/patches/put_page_callback-3.10.patch ~/rpmbuild/SOURCES/
[mockbuild@nas ~]$ cp scst-svn/scst/kernel/scst_exec_req_fifo-3.10.patch ~/rpmbuild/SOURCES/


Add a line to the kernel config:
[mockbuild@nas ~]$ vim ~/rpmbuild/SOURCES/config-generic
...
CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION=y
...


Let's edit kernel.spec.
[mockbuild@nas ~]$ cd ~/rpmbuild/SPECS
[mockbuild@nas ~]$ vim kernel.spec


We change:
#% define buildid .local

On the:
%define buildid .scst


Add our patches, preferably after all the others:
Patch25091: put_page_callback-3.10.patch
Patch25092: scst_exec_req_fifo-3.10.patch

We add the patch application command, it is recommended to add after the remaining entries:
ApplyPatch put_page_callback-3.10.patch
ApplyPatch scst_exec_req_fifo-3.10.patch


After all the actions, we start the assembly of rpm kernel packages with the included firmware files:
[mockbuild@nas ~]$ rpmbuild -bb --with baseonly --with firmware --without debuginfo --target=`uname -m` kernel.spec


After the assembly is complete, install the firmware kernel and kernel header files:
[mockbuild@nas ~]$ cd ~/rpmbuild/RPMS/x86_64/
[mockbuild@nas ~]$ su -c 'rpm -ivh kernel-firmware-3.10.10-200.scst.fc19.x86_64.rpm kernel-3.10.10-200.scst.fc19.x86_64.rpm kernel-devel-3.10.10-200.scst.fc19.x86_64.rpm kernel-headers-3.10.10-200.scst.fc19.x86_64.rpm'


Reboot.

After a successful, I hope the download, go to the directory with the SCST sources and already root, collect the target itself:
[root@nas ~]$ make scst scst_install iscsi iscsi_install scstadm scstadm_install


After the assembly, add the service to autorun:
[root@nas ~]$ systemctl enable "scst.service"


And configure the config in /etc/scst.conf. For example, mine:
[root@nas ~]$ cat /etc/scst.conf
HANDLER vdisk_fileio {
                DEVICE mail {
                filename /dev/mapper/mail
                nv_cache 1
        }
                DEVICE cloud {
                filename /dev/sdb3
                nv_cache 1
        }
                DEVICE vmstore {
                filename /dev/sdb4
                nv_cache 1
        }
}
TARGET_DRIVER iscsi {
	enabled 1
	TARGET iqn.2013-09.local.nas:raid10-ssdcache {
                LUN 0 mail
                LUN 1 cloud
                LUN 2 vmstore
		enabled 1
	}
}


Create files that allow or deny target connections from specific addresses, if you need it:
[root@nas ~]$ cat /etc/initiators.allow
ALL 10.0.0.0/24
[root@nas ~]$ cat /etc/initiators.deny
ALL ALL


After configuring the configuration files, run SCST:
[root@nas ~]$ /etc/init.d/scst start


If everything was done correctly, then the corresponding target will appear in ESXi.

Thank you for reading to the end!

Also popular now: