Ubuntu 18.04 Root on ZFS

  • Tutorial

Last year I needed to put together instructions for installing the Ubuntu 18.04 operating system. There is nothing difficult about installing Ubuntu as such, but there is a nuance: I wanted to use the ZFS file system as the base one. On the one hand, Ubuntu supports ZFS at the kernel level; on the other, there is still no installer for it, but there is a guide:


https://github.com/zfsonlinux/zfs/wiki/Ubuntu-18.04-Root-on-ZFS


The sequence of actions in that guide is generally correct, but some points require adjustment. So what follows is not a literal translation but a free one, taking into account corrections, my experience with ZFS, and so on. I also do not cover disk encryption, and I use the MBR bootloader. My installation instructions are available here.



0. Server Preparation


The first thing that is missing from the guide and not considered at all: ZFS does not work well on top of hardware RAID arrays, in particular because of the controller's write cache. This is understandable: ZFS is a copy-on-write file system and requires full control over write operations. Also, when using a ready-made hardware RAID array, you lose ZFS capabilities such as cache and spare devices. Therefore, all disks should be switched to HBA mode; if that is impossible, create a separate RAID volume for each disk and disable the controller's write cache.
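As a quick optional check (my addition, not in the original guide), you can confirm that the controller presents each drive as an individual device rather than a single RAID volume:

lsblk -d -o NAME,SIZE,ROTA,MODEL   # one line per physical drive is what we want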


Also, if you use network port aggregation (bonding), it is worth disabling it for the installation stage so as not to complicate things (I perform all further operations without bonding).


1. Preparing the installation environment


1.1. Livecd


As mentioned earlier, there is unfortunately no ready-made Ubuntu installer for root on ZFS, so the installation is done from a LiveCD image:


Download it from here: http://releases.ubuntu.com/18.04/ubuntu-18.04.1-desktop-amd64.iso


My colleagues and I did try other disk images, since I did not really want a graphical shell, but nothing good came of it.

Boot from the LiveCD, select Try Ubuntu and open the terminal (Ctrl + Alt + T).


1.2. Add repositories and update the package index

sudo apt-add-repository universe
sudo apt update

Here we hit the first snag: if the server's network settings are not assigned via DHCP, updating the repositories will not work, so we configure the network first.

List the network interfaces and find the one we will connect through:


sudo ip a

Configure the network interface:


sudo echo"auto {{ NAME }}" >> /etc/network/interfaces
sudo echo"iface {{ NAME }} inet static" >> /etc/network/interfaces
sudo echo"	address {{ IP }}" >> /etc/network/interfaces
sudo echo"	netmask {{ NETMASK }}" >> /etc/network/interfaces
sudo echo"	gateway {{ GATEWAY }}" >> /etc/network/interfaces
sudo service networking restart

And DNS resolver:


echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolv.conf

Updating repositories:


sudo apt update

1.3. SSH server (optional)


For convenience, you can bring up an OpenSSH server and perform all further operations through an SSH client.


Set the password for the ubuntu user:


passwd

This is important! Otherwise SSH access will work without a password, with sudo rights. And do not set a simple password either.

Install and run OpenSSH:


sudo apt install openssh-server
sudo service ssh start

Then, from the terminal on your workstation:


ssh ubuntu@{{ ip server }}

1.4. We become root


sudo -s

1.5. Installing ZFS Support in a LiveCD Environment


apt install --yes debootstrap gdisk zfs-initramfs

2. Partitioning and formatting the hard drives


2.0. Deciding on the disk arrays


The original guide misses an important point: how to decide on the disk array layout.


Usually the number of disks in a server is:


  • 2 disks;
  • 4 disks;
  • many disks;

A single disk is not considered, since it is generally an anomaly.


2.0.1. 2 disks


Everything is simple here: one MIRROR array (RAID1). If there happens to be a third disk, you can make it a hot spare (SPARE) or build a RAIDZ array (RAID5). But three disks in a server is a great rarity. A topology sketch is shown below.
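For illustration only (my addition; partition names are placeholders, and the full set of pool options actually used is shown in section 2.3), the two-disk topology with an optional hot spare looks roughly like this:

zpool create tank mirror /dev/sda2 /dev/sdb2   # two-disk mirror (RAID1)
zpool add tank spare /dev/sdc2                 # optional: attach a third disk as a hot spare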


2.0.2. 4 disks


If all our drives are identical, there are really only three options (I do not consider the fourth, RAID0):


  • MIRROR + MIRROR - an analogue of RAID10 (in ZFS this is a stripe of mirrors). 50% of usable disk space;
  • RAIDZ - an analogue of RAID5. 75% of usable disk space;
  • RAIDZ2 - an analogue of RAID6. 50% of usable disk space;

In practice I use the MIRROR + MIRROR array, even though RAIDZ is obviously the most attractive option in terms of usable space; there are nuances, though (topology sketches for all three layouts follow below).
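For reference (my addition; partition names are placeholders, only one of these commands would actually be run, and the real pool options are covered in section 2.3), the three layouts are assembled like this:

# MIRROR + MIRROR (stripe of two mirrors)
zpool create tank mirror /dev/sda2 /dev/sdb2 mirror /dev/sdc2 /dev/sdd2
# RAIDZ (single parity)
zpool create tank raidz /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2
# RAIDZ2 (double parity)
zpool create tank raidz2 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2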


In terms of fault tolerance, the arrays are arranged in this order (from best to worst):


  • RAIDZ2 - two disks can be lost without data loss;
  • MIRROR + MIRROR - one disk can be lost without data loss, and with 66% probability a second disk can also be lost without data loss;
  • RAIDZ - only one disk can be lost without data loss;

In terms of speed, arrays are arranged in this order:


  • MIRROR + MIRROR - fastest both for writes and for reads;
  • RAIDZ - slower for writes, since parity has to be computed in addition to the write itself;
  • RAIDZ2 - even slower for writes, since it has to compute more complex (double) parity;

In terms of array speed when one disk has failed (degraded mode):


  • MIRROR + MIRROR - if a single disk fails, essentially only parallel reads from one mirror are lost; the other mirror keeps working with no performance degradation;
  • RAIDZ2 - degradation is noticeable, since blocks have to be reconstructed from parity for 1/4 of the data, plus the block lookup;
  • RAIDZ - degradation is much greater, since blocks have to be reconstructed from parity for 1/3 of the data, plus the block lookup;

The comparison is subjective, but it reflects my choice of a middle ground well enough.


It should also be understood that "slower" and "even slower" does not mean several times slower, but only 10-20% in the worst case, so if your database or application is not specifically optimized for disk I/O, you will basically not notice it. The write-speed factor should only be considered when you really need it; if in doubt, just measure it (a sketch follows below).
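As a minimal way to measure it (my addition, not from the original guide; assumes the fio package is installed and that the target directory is on the pool you want to test):

mkdir -p /home/fio-test
# Sequential write test, roughly comparable between layouts
fio --name=seqwrite --directory=/home/fio-test --rw=write --bs=1M --size=2G \
    --numjobs=1 --ioengine=psync --end_fsync=1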


2.0.3. Many disks


The main problem is that if we have many disks and want one common array for everything, we would have to put a boot partition on every disk or resort to some trickery. In practice, for multi-disk platforms I try to build this configuration:


  • 2 SSD drives - made into a mirror used as the main boot array with the operating system, and also as a ZFS cache for the second disk array;
  • The remaining bays are filled with SATA or SAS drives, from which a ZFS array is assembled without any partitioning;

This also applies to 4-disk servers if we want a reasonably universal platform.


If all the disks are identical and it makes no sense to dedicate two of them to a separate array (for example, 6 disks of 8 TB each), you can make the disks of the first group of the array bootable. That is, if you are going to build an array like MIRROR + MIRROR + MIRROR or RAIDZ + RAIDZ, you put a boot partition only on the first group. In principle, you can partition just one disk even for MIRROR and RAIDZ and pass the rest in "raw" form; ZFS will size the array by the smallest element itself. But in that case, if the first disk fails, you lose your only boot disk, so it is not worth doing that. A sketch of this idea is shown below.
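A topology-only sketch of that idea (my addition; device names are placeholders): the first mirror is built from boot-capable partitions, the second from whole disks, which ZFS will partition by itself:

zpool create tank \
    mirror /dev/sda2 /dev/sdb2 \
    mirror /dev/sdc /dev/sdd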


It is important to understand that in ZFS a stripe is not quite RAID0: it works a bit differently and does not require identical disk sizes. So giving up a small amount of space to the boot partition makes practically no difference; the main thing is to specify the correct disk to boot from.


2.1. Preparing for partitioning and disk cleanup


For disk preparation we will need the mdadm package; install it:


apt install --yes mdadm

Check which disks we have available:


lsblk

And clean them:


sgdisk --zap-all /dev/{{ disk name }}

2.2. Disk partitioning


Actually, the boot partition:


sgdisk -a1 -n1:34:2047 -t1:EF02 /dev/{{ disk name }}

The main partition.


There can be variations here: if you need to carve out an additional partition on the SSD disks, for example for a ZFS cache or for Aerospike, make the main partition a limited size:

sgdisk -n2:0:+100GB -t2:BF01 /dev/{{ disk name }}
sgdisk -n3:0:0      -t3:BF01 /dev/{{ disk name }}

If we use all the space, we simply create a partition spanning the rest of the disk:


sgdisk -n2:0:0 -t2:BF01 /dev/{{ disk name }}

Do not forget to check the result:


lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0  1.8T  0 disk 
├─sda1   8:1    0 1007K  0 part 
└─sda2   8:2    0  1.8T  0 part 
sdb      8:16   0  1.8T  0 disk 
├─sdb1   8:17   0 1007K  0 part 
└─sdb2   8:18   0  1.8T  0 part 
...

2.3. Create ZFS Array


zpool create                	\
    -o ashift=12            	\
    -O atime=off            	\
    -O canmount=off         	\
    -O compression=lz4      	\
    -O checksum=fletcher4   	\
    -O normalization=formD  	\
    -m legacy               	\
    -R /mnt                     \
    -f                      	\
  tank                      	\
    mirror                  	\
      /dev/{{ disk a part 2}}   \
      /dev/{{ disk b part 2}}

The very first pitfall, which a friend of mine (an admin) promptly stepped on: when creating the ZFS array you specify not the whole disk but the partition on it, if one was created specifically for this purpose.

Next, in order:


  • ashift=12 - use a 4K block size; frankly, I still do not understand why operating systems often default to a 512-byte block size when such disks practically no longer exist;
  • atime=off - turn off updating of file access times; I always turn it off, since I have never really needed this information and there is no reason to make the kernel maintain it;
  • canmount=off - disable mounting of the pool's root dataset;
  • compression=lz4 - enable data compression with the LZ4 algorithm. It is recommended not only to save disk space, but also to reduce the number of I/O operations; at the same time, CPU utilization for this compression algorithm is extremely low;
  • checksum=fletcher4 - fletcher4 is the default checksum algorithm anyway; it is specified just to be explicit;
  • normalization=formD - used to improve handling of UTF-8, effectively restricting file names to anything but UTF-8. Here everyone decides for themselves; we always use only UTF-8 encoding in our work;
  • xattr=sa - speeds up work with extended attributes. I do not use this option because it breaks compatibility with other OpenZFS implementations (for example, FreeBSD), and I need compatibility with Windows and others. Besides, it can always be enabled later on the specific datasets that need it;
  • -m legacy - the mount point leads nowhere; the pool's root dataset does not need to be mounted;
  • -R /mnt - a temporary mount prefix for installing the system;
  • -f - force pool creation. If a ZFS array previously existed on these disks, create will refuse to run; you never know, maybe you made a mistake and are about to erase important data;
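A quick optional check (my addition, not in the original guide) that the pool came up with the intended settings:

zpool get ashift tank
zfs get atime,compression,checksum,normalization,canmount tank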

Out of habit I name the root system pool tank, although nowadays in the Linux world the name rpool (root pool) is preferred. In my practice I generally use the following pool naming:


  • tank - the main system array;
  • store - an additional array with large disks for data storage;
  • cache - an additional array of SSD drives, if the primary partition is not on them;

In general, I strongly recommend developing a consistent naming convention right away, so that nothing gets mixed up.


3. System installation


3.1. and 3.2. Creating a root file system


I deliberately merged sections 3.1. and 3.2., since I believe that placing the root dataset at the third level of the hierarchy is completely unnecessary. Honestly, in several years of working with ZFS I have never needed to do anything special with the root dataset. Moreover, there are snapshots with which you can make checkpoints. So for me the root dataset is tank/root:

zfs create -o mountpoint=/ tank/root

And here the original guide contains its first fatal mistake: it never sets the boot file system for the pool:

zpool set bootfs=tank/root tank
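You can verify the property right away (an optional check, my addition):

zpool get bootfs tank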

3.3. Creating additional datasets


This part of the original guide can simply be thrown away and forgotten. The authors clearly overdid it with splitting datasets and with options, so they had to patch things up along the way; it did not help much, problems appear again later, and in the end it still does not work, which is why it gets fixed once more in section 4.11.


Allocating a separate dataset for /var/games looks quite epic. I do not mind, but it is clearly overkill.


The fact that ZFS datasets are easy to create and support a hierarchy does not mean you should abandon classic directories. A simple example: I once had more than 4K ZFS datasets on a group of servers; it was necessary, but server reboots slowed down by several minutes because of mounting all those datasets.


Let's start with a clean slate.


There are static and dynamic file partitions.


Static partitions hold programs and their settings; they are populated once and do not change during operation. Static partitions used to be split further into system and user ones (/usr), but in modern Linux systems these are mixed together, so there is no point in separating them, and it would not work anyway.


Dynamic partitions are those that store:


  • temporary data - e.g. /tmp, swap;
  • logs - e.g. /var/log;
  • user data - e.g. /home;
  • application data - e.g. /var/db and wherever else it ends up;
  • other program output in the form of files;

In the Linux family the dynamic partitions are /tmp and /var, but that is not exact, since programs and libraries also end up in /var/lib; in short, everything is mixed up, but nevertheless...


First you need to decide whether /tmp should live on disk or in memory as tmpfs. If on disk, we create a separate dataset for it:


zfs create -o mountpoint=legacy tank/tmp

The options com.sun:auto-snapshot=false and setuid=off do not really change anything here; no need to overcomplicate. We will deal with SWAP later, in section 7. If you would rather keep /tmp in memory, see the sketch below.
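If you prefer /tmp in RAM instead, the upstream HOWTO does it roughly like this (a sketch, to be run later inside the chroot from section 4.4; the unit file path is as shipped by Ubuntu's systemd package):

cp /usr/share/systemd/tmp.mount /etc/systemd/system/
systemctl enable tmp.mount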

Create a separate dataset for /var:


zfs create -o mountpoint=legacy tank/var

And the user datasets:


zfs create -o mountpoint=/home tank/home
zfs create -o mountpoint=legacy tank/home/root

It makes sense to separate out the user datasets, because in practice they periodically get cluttered with all sorts of artifacts, and having separate datasets makes them easier to monitor. The same goes for the root user's home directory (especially for those who like to work as root). Quotas on user directories not only fail to keep disk space tidy, they actually get in the way: with quotas, users start leaving artifacts in random places, and finding them later is quite hard. That cannot be cured, so you can only monitor usage and slap wrists; a simple way to monitor is shown below.
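Per-dataset usage is easy to see once each user has their own dataset (an optional check, my addition):

zfs list -r -o name,used,avail tank/home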

The mount point for tank/home/root is set to legacy rather than /root. This is correct: this dataset is mounted in section 4.11.


Now we need to temporarily mount our dynamic datasets under /mnt:


cd /mnt/
mkdir var tmp root
mount -t zfs tank/var /mnt/var/
mount -t zfs tank/tmp /mnt/tmp/
mount -t zfs tank/home/root /mnt/root/

3.4 Install the base system


The original guide has a couple more unnecessary commands here; ignore them, they are apparently leftovers from experiments:

debootstrap bionic /mnt

As a result, you should get something like this:


zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
tank             213M  1.76T    96K  legacy
tank/home        208K  1.76T    96K  /mnt/home
tank/home/root   112K  1.76T   112K  legacy
tank/root        147M  1.76T   147M  /mnt
tank/tmp          96K  1.76T    96K  legacy
tank/var        64.6M  1.76T  64.6M  legacy

An empty dataset shows 96K; accordingly, only tank/tmp remained empty, while the rest had data written to them during the base system installation, which means the datasets were mounted correctly.


4. System configuration


4.1. Configure hosts and hostname


echo HOSTNAME > /mnt/etc/hostname
echo "127.0.0.1 localhost" > /mnt/etc/hosts
echo "127.0.0.1 HOSTNAME" >> /mnt/etc/hosts

4.2. Configure the network interface


Yes, we already have netplan here:

nano /mnt/etc/netplan/setup.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    eno2:
      dhcp4: no
      dhcp6: no
      addresses: [ {{ IP }}/{{ netmask }}, ]
      gateway4: {{ gateway IP }}
      nameservers:
        addresses: [8.8.8.8]

4.3. Configuring apt repositories


nano /mnt/etc/apt/sources.list
deb http://archive.ubuntu.com/ubuntu/ bionic main restricted universe
deb http://security.ubuntu.com/ubuntu/ bionic-security main restricted universe
deb http://archive.ubuntu.com/ubuntu/ bionic-updates main restricted universe

The deb-src lines are mostly not needed.

4.4. Mount the LiveCD virtual file systems and "enter" the new system


mount --rbind /dev  /mnt/dev
mount --rbind /proc /mnt/proc
mount --rbind /sys  /mnt/sys
chroot /mnt /bin/bash --login

It is important to use exactly --rbind here, not --bind.

We are now inside the new system...


4.5. Customize the base environment


ln -s /proc/self/mounts /etc/mtab
chmod 1777 /tmp
apt update

Locale and time:


dpkg-reconfigure locales
  * en_US.UTF-8
  * ru_RU.UTF-8
dpkg-reconfigure tzdata

And any additional editors, whichever you prefer:


apt install --yes vim nano

4.6. Installing ZFS support


apt install --yes --no-install-recommends linux-image-generic
apt install --yes zfs-initramfs

4.8. Install the bootloader


As stated earlier, I use the legacy MBR boot scheme:


apt install --yes grub-pc

During the bootloader installation you need to select all the disks we designated as bootable; the installer will complain about every disk except the first one - agree, and carry on with step 5 (it is not clear why the original guide postponed the rest until later):

4.8.1. (5.1) Verify that the root file system is recognized:


grub-probe /
zfs

4.8.2. (5.2) Update the initrd


update-initramfs -u -k all

4.8.3. (5.3) Simplify debugging GRUB


vi /etc/default/grub
...
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="console"
...

4.8.4. (5.4.) Update Boot Loader Configuration


update-grub

4.8.5. (5.5.) Install the bootloader on each disk that is marked as bootable


grub-install /dev/sda
grub-install /dev/sdb
...

It is important that these commands complete successfully. To be honest, I have never managed to get them to fail even once, so I cannot say what to do in that case; but most likely, if you do get an error, you did something wrong when partitioning the disks (section 2.2.).

4.8.6. (5.6.) Verify that the ZFS module is installed


ls /boot/grub/*/zfs.mod
/boot/grub/i386-pc/zfs.mod

4.10. Set the root password (a complex one!)


passwd

And yes, install openssh right away, otherwise we are in for a surprise after the reboot if we are working remotely:

apt install --yes openssh-server

Do not forget to correct the sshd configuration:


vi /etc/ssh/sshd_config
...
PermitRootLogin yes
...
PasswordAuthentication yes
...

4.11. Fix file system mount


Now we get to the most interesting part. The thing is that ZFS datasets are mounted after some daemons have already started (we also tried fiddling with ZFS_INITRD_ADDITIONAL_DATASETS in /etc/default/zfs, without success), and those daemons already create their own logs. By the time the ZFS datasets are to be mounted, the mount points are no longer empty, nothing can be mounted, the data gets scattered, and everything is bad. Therefore, you need to list the mount points in /etc/fstab, because systemd looks there first when accessing a directory:

vi /etc/fstab
tank/var        /var   zfs  noatime,nodev 0 0
tank/tmp        /tmp   zfs  noatime,nodev 0 0
tank/home/root  /root  zfs  noatime,nodev 0 0

Everything else up to section 6 has already been done.

6. First reboot


6.1. Take a snapshot of the root partition


zfs snapshot tank/root@setup

There is little point in it: in practice I have never once rolled back the system's root dataset and have never used snapshots of it, but let it sit there anyway; it may come in handy. Basic snapshot commands are sketched below.
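For reference (my addition), this is roughly how you would inspect and, if ever needed, roll back to that checkpoint; note that a rollback discards everything written to the dataset after the snapshot:

zfs list -t snapshot            # list available snapshots
zfs rollback tank/root@setup    # revert tank/root to the checkpoint (destroys later changes)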

6.2. We leave from chroot


exit

6.3. Unmount the LiveCD partitions and export the ZFS array


cd
mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | xargs -i{} umount -lf {}
umount /mnt/root
umount /mnt/var
umount /mnt/tmp
zpool export tank

Exporting a disk array is required to clear the zfs cache

6.4 Reboot


It is better to reboot from the LiveCD terminal: if you are working through an SSH client, rebooting through it may leave the server hanging.

reboot

If something goes wrong and the server does not come back up, you can reboot it by any means: the ZFS pool has been exported, so it is hard to damage it.

6.5. Wait for the reboot and log in as root


6.6. Create your user account


zfs create tank/home/{{ LOGIN }}
useradd -u {{ UID }} -G adm,sudo -d /home/{{ LOGIN }}/ -s /bin/bash {{ LOGIN }}
cp -a /etc/skel/.[!.]* /home/{{ LOGIN }}
chown -R {{ LOGIN }}:{{ LOGIN }} /home/{{ LOGIN }}

Add a public ssh key to the user and set the password for it:


su - {{ LOGIN }}
mkdir .ssh
chmod 0700 .ssh
vi .ssh/authorized_keys
exit
passwd {{ LOGIN }}

In OpenSSH, disable root login and password authentication:

vi /etc/ssh/sshd_config
...
PermitRootLogin no
...
PubkeyAuthentication yes
...
PasswordAuthentication no
...
service ssh restart

6.7. and 6.8. are no longer required


7. Configure swap


7.1. Create a ZFS partition


zfs create \
    -V 32G \
    -b $(getconf PAGESIZE) \
    -o compression=zle \
    -o logbias=throughput \
    -o sync=always \
    -o primarycache=metadata \
    -o secondarycache=none \
  tank/swap

  • -V 32G - the size of our swap; set whatever you actually need;
  • -b $(getconf PAGESIZE) - block size (4K with ashift=12);
  • compression=zle - pick the least resource-hungry compression algorithm; since the block size is 4K, compression as such will not reduce I/O, but zero blocks can still be saved;
  • logbias=throughput - optimize synchronous operations for throughput;
  • sync=always - always write synchronously. This reduces performance somewhat, but fully guarantees data reliability;
  • primarycache=metadata - cache only metadata, since a swap block will not be read multiple times;
  • secondarycache=none - disable the secondary cache entirely, for the same reason;

7.2. Customize the swap partition


mkswap -f /dev/zvol/tank/swap
echo /dev/zvol/tank/swap none swap defaults 0 0 >> /etc/fstab
echo RESUME=none > /etc/initramfs-tools/conf.d/resume

7.3. Turn on swap


swapon -av
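To confirm swap is active (an optional check, my addition):

swapon --show
free -h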

Beyond this point the original guide contains little of interest, since everything depends heavily on the preferences of the particular administrator and on the server's tasks as a whole, except for one thing, namely "Emergency boot".

And do not forget to set up a firewall; a minimal sketch follows.
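A minimal sketch using ufw (my addition, not from the original guide; adjust the allowed ports to your services):

apt install --yes ufw
ufw allow 22/tcp   # keep SSH reachable before enabling the firewall
ufw enable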


R. Emergency boot


Prepare the installation environment as in section 1.


During that preparation the ZFS pool gets imported, so you need to re-import it, but with the correct mount prefix:


zpool export -a
zpool import -N -R /mnt tank
zfs mount -a

Here, of course, the original guide forgets that some of our datasets are mounted via fstab, so we fix that ourselves:

mount -t zfs tank/var /mnt/var/
mount -t zfs tank/tmp /mnt/tmp/
mount -t zfs tank/home/root /mnt/root/

After that, if needed, you can chroot in as in section 4.4., but do not forget to unmount everything afterwards as in section 6.3.


D. Dynamic datasets


In section 3.3. we covered the additional datasets and simplified working with them compared to the original guide. This is mainly because my practice with dynamic datasets is different: for example, I keep application logs in a /spool dataset and data in a /data dataset, and if there is a second ZFS disk array, those datasets are created there (a sketch follows below).
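A sketch of that layout (my naming; assumes a second pool called store, following the naming convention from section 2.3):

zfs create -o mountpoint=/spool store/spool
zfs create -o mountpoint=/data  store/data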


Summary


  • If you want to use ZFS right now, nothing is stopping you; you just have to do a bit of work by hand;
  • If you have little experience with ZFS, I would not recommend over-complicating things with options and needless splitting of datasets; do not go to extremes. ZFS features are not a replacement for the classic approaches, but an addition to them;
  • My installation instructions are available here.
