1cloud June 5, 2019 at 19:28

A bug in Linux 5.1 led to data loss, and a fix has already been released

A couple of weeks ago, a bug that led to data loss on SSDs was discovered in version 5.1 of the Linux kernel. Developers recently released Linux 5.1.5, which closes the "gap".

We look at what caused it.

/ Unsplash / Glen Carrie

What the bug is

At the beginning of the year, developers made a number of changes to the Linux 5.1 kernel. After that, on systems with Samsung SSDs that use dm-crypt/LUKS encryption with device-mapper/LVM, an error began to appear that led to data loss. The problem only became widely known in mid-May, when it started to be actively discussed on specialist forums.

At least two people are known to have run into the bug: LKML mailing list participant Michael Laß, who first reported the problem, and an Arch Linux user.

Michael ran the fstrim command, which tells the drive which data blocks are no longer in use, on a mounted btrfs volume. He then received the following system messages:

attempt to access beyond end of device
sda1: rw=16387, want=252755893, limit=250067632
BTRFS warning (device dm-5): failed to trim 1 device(s), last error -5
BTRFS warning (device dm-5): csum failed root 257 ino 16634085 off 21504884736 csum 0xd47cc2a2 expected csum 0xcebd791b mirror 1

After that, he discovered that the btrfs volume was corrupted and the remaining logical volumes on the physical device were destroyed.

In the case of the Arch Linux user, the problem affected LUKS encryption. After rebooting the operating system and running fstrim, the LUKS headers (which are used to locate volumes) turned out to be unreadable, so the encrypted data could no longer be decrypted.

What caused it

The problem lay in the device mapper (DM) subsystem, whose job is to create virtual block devices. It is used to implement the LVM logical volume manager, software RAID, and the dm-crypt disk encryption system.

“The fstrim command marked too many blocks at a time, without taking the max_io_len_target_boundary limit into account. As a result, segments that were still in use were freed,” comments Sergey Belkin, head of the development department at 1cloud.ru. “Since the error was in device mapper, in theory data loss could occur on any file system.”
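The overrun described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: `struct target`, `len_to_target_boundary`, `chunk_len_buggy`, `chunk_len_fixed` and `overrun` are hypothetical names introduced here, and a device-mapper target is reduced to a plain range of sectors.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Hypothetical, simplified stand-in for one device-mapper target:
 * a contiguous run of sectors on the virtual block device. */
struct target {
    sector_t begin; /* first sector covered by the target */
    sector_t len;   /* number of sectors the target covers */
};

/* Sectors remaining between `sector` and the end of target `t`. */
sector_t len_to_target_boundary(sector_t sector, const struct target *t)
{
    assert(sector >= t->begin && sector < t->begin + t->len);
    return t->begin + t->len - sector;
}

/* Behaviour described in the article: the whole request length is
 * handed to one target, regardless of where that target ends. */
sector_t chunk_len_buggy(sector_t sector, sector_t count, const struct target *t)
{
    (void)sector;
    (void)t;
    return count;
}

/* Clamped behaviour: never hand a target more sectors than it has
 * left after `sector`. */
sector_t chunk_len_fixed(sector_t sector, sector_t count, const struct target *t)
{
    sector_t room = len_to_target_boundary(sector, t);
    return count < room ? count : room;
}

/* Number of sectors by which a chunk starting at `sector` runs past
 * the end of target `t` (0 if it fits). */
sector_t overrun(sector_t sector, sector_t chunk, const struct target *t)
{
    sector_t room = len_to_target_boundary(sector, t);
    return chunk > room ? chunk - room : 0;
}
```

For a target covering sectors 1000 to 1499 and a discard of 300 sectors starting at sector 1400, the unclamped length runs 200 sectors past the end of the target, while the clamped length stops at the boundary after 100 sectors and leaves the remainder for the next target.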

Patch

Kernel developers released a patch for the bug at the end of May. Only four lines in the drivers/md/dm.c file were changed. The same changes were also made to the upcoming Linux 5.2 kernel (added and deleted lines are marked with “+” and “-”, respectively):

@@ -1467,7 +1467,7 @@ static unsigned get_num_write_zeroes_bios(struct dm_target *ti)
 static int __send_changing_extent_only(struct clone_info *ci, struct dm_target *ti,
 				       unsigned num_bios)
 {
-	unsigned len = ci->sector_count;
+	unsigned len;
@@ -1478,6 +1478,8 @@ static int __send_changing_extent_only(struct clone_info *ci, struct dm_target *
 	if (!num_bios)
 		return -EOPNOTSUPP;
+	len = min((sector_t)ci->sector_count, max_io_len_target_boundary(ci->sector, ti));
+
 	__send_duplicate_bios(ci, ti, num_bios, &len);
 	ci->sector += len;

The patch has already been applied by the developers of the Arch Linux, Manjaro and Fedora distributions. Ubuntu was not affected by the bug, since it had not moved to the Linux 5.1 kernel.

/ Flickr / Andy Melton / CC BY-SA

Data loss can also be avoided without installing the patch. It is enough to disable the fstrim.timer unit, which periodically starts fstrim.service, using the commands:

systemctl disable fstrim.timer
systemctl stop fstrim.timer

Another option is to rename the fstrim executable or remove the discard option from the mount entries in fstab. You can also turn off allow-discards in LUKS via dmsetup. However, all of these methods are only temporary workarounds and do not fix the underlying problem.
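To decide whether these workarounds are needed at all, one can check whether the running kernel falls into the 5.1 series before 5.1.5. The sketch below is a rough heuristic, not an authoritative test: `release_in_affected_range` and `running_kernel_in_affected_range` are hypothetical helper names, and the check assumes that the version string follows the usual `major.minor.patch` pattern. Distribution kernels may carry the fix under a different version number, so the result is only a hint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Returns 1 if `release` (for example "5.1.4-arch1-1") names a 5.1.x
 * kernel older than 5.1.5, and 0 otherwise or if it cannot be parsed. */
int release_in_affected_range(const char *release)
{
    int major = 0, minor = 0, patch = 0;
    int fields;

    assert(release != NULL);
    fields = sscanf(release, "%d.%d.%d", &major, &minor, &patch);
    if (fields < 2)
        return 0;  /* not a recognisable version string */
    if (fields == 2)
        patch = 0; /* e.g. "5.1-rc3": treat as 5.1.0 */
    return major == 5 && minor == 1 && patch < 5;
}

/* Applies the check to the kernel this process is running on. */
int running_kernel_in_affected_range(void)
{
    struct utsname info;

    if (uname(&info) != 0)
        return 0; /* could not query the kernel; report "not known to be affected" */
    return release_in_affected_range(info.release);
}
```

The string-based helper is kept separate from the `uname` call so that it can also be applied to version strings collected from other machines.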

Not the first time

This is not the first time that a commit in the Linux kernel has led to data corruption. A similar story happened with Linux 4.19, where the blk-mq I/O schedulers were to blame. The problem showed up when the kernel was built with the CONFIG_SCSI_MQ_DEFAULT=y option, which was set by default. In some cases, the data on the volume was corrupted.

sed: error while loading shared libraries: /lib/x86_64-linux-gnu/libattr.so.1: unexpected PLT reloc type 0x00000107
sed: error while loading shared libraries: /lib/x86_64-linux-gnu/libattr.so.1: unexpected PLT reloc type 0x00000107

Most often, the problem manifested itself with EXT4, but in theory it could affect other file systems.

One of the kernel maintainers then prepared a small fix that solved the problem. However, the same bug was later discovered in Linux 4.20. It was finally eliminated at the end of December 2018 with a new update.

