
Linux Kernel 5.0: Writing a Simple Block Device Driver under blk-mq
Good News, Everyone!
Linux kernel 5.0 is already here, appearing in experimental distributions such as Arch, openSUSE Tumbleweed, and Fedora.

And if you look at the RC builds of Ubuntu Disco Dingo and Red Hat 8, it becomes clear: soon kernel 5.0 will move from enthusiasts' desktops to serious servers.
Some will say: so what? Just another release, nothing special. Even Linus Torvalds himself said:
I'd like to point out (yet again) that we don't do feature-based releases, and that “5.0” doesn't mean anything more than that the 4.x numbers started getting big enough that I ran out of fingers and toes.
However, the floppy disk module (for those who don't know: these are disks the size of a shirt's breast pocket, with a capacity of 1.44 MB) has been fixed...
And here is why:
It is all about the multi-queue block layer (blk-mq). There are plenty of introductory articles about it on the Internet, so let's get straight to the point. The transition to blk-mq started long ago and has been slowly moving forward: multi-queue SCSI appeared (the scsi_mod.use_blk_mq kernel parameter), then the new schedulers mq-deadline, bfq, and so on...
[root@fedora-29 sblkdev]# cat /sys/block/sda/queue/scheduler
[mq-deadline] none
By the way, which one do you use?
The number of block device drivers that work the old-fashioned way has been shrinking. And in 5.0 the blk_init_queue() function was removed as unnecessary. So the glorious old code from lwn.net/Articles/58720 from 2003 not only no longer builds, it has lost its relevance. Moreover, the new distributions being prepared for release this year use the multi-queue block layer in their default configuration. For example, on Manjaro 18 the kernel, although version 4.19, uses blk-mq by default.
Therefore, we can consider the transition to blk-mq complete as of kernel 5.0. For me this is an important event that will require rewriting code and additional testing, which in itself promises bugs large and small, as well as a few crashed servers ("It is necessary, Fedya, it is necessary!" (c)).
By the way, if you think this turning point has not come for RHEL 8, since its kernel is "frozen" at version 4.18, you are mistaken: in the fresh RHEL 8 RC, new features from 5.0 have already migrated in, and the blk_init_queue() function has also been cut out (probably while dragging yet another changeset from github.com/torvalds/linux into its sources).
In general, a "frozen" kernel version for Linux distributors such as SUSE and Red Hat has long been a marketing concept: the system reports that the version is, say, 4.4, while the actual functionality comes from a fresh vanilla 4.8. Meanwhile the official website flaunts something like: "In the new distribution, we have kept a stable 4.4 kernel for you."
But we digress...
So: to make it clearer how all of this works, we need a new simple block device driver.
The source is at github.com/CodeImp/sblkdev. I invite you to discuss it, send pull requests, and open issues; I will fix things. QA has not tested it yet.
Further in the article I will try to explain what is there and why, hence the large amount of code.
I apologize in advance that the Linux kernel coding style is not fully respected, and yes, I do not like goto.
So, let's start with the entry points.
static int __init sblkdev_init(void)
{
    int ret = SUCCESS;

    _sblkdev_major = register_blkdev(_sblkdev_major, _sblkdev_name);
    if (_sblkdev_major <= 0) {
        printk(KERN_WARNING "sblkdev: unable to get major number\n");
        return -EBUSY;
    }

    ret = sblkdev_add_device();
    if (ret)
        unregister_blkdev(_sblkdev_major, _sblkdev_name);

    return ret;
}

static void __exit sblkdev_exit(void)
{
    sblkdev_remove_device();

    if (_sblkdev_major > 0)
        unregister_blkdev(_sblkdev_major, _sblkdev_name);
}

module_init(sblkdev_init);
module_exit(sblkdev_exit);
Obviously, sblkdev_init() runs when the module is loaded, and sblkdev_exit() when it is unloaded.
The register_blkdev() function registers the block device and allocates a major number for it; unregister_blkdev() frees that number.
The key structure of our module is sblkdev_device_t.
// The internal representation of our device
typedef struct sblkdev_device_s
{
    sector_t capacity;            // Device size in sectors
    u8 *data;                     // The data array; u8 is an unsigned 8-bit integer
    atomic_t open_counter;        // How many times the device is opened
    struct blk_mq_tag_set tag_set;
    struct request_queue *queue;  // The device request queue
    struct gendisk *disk;         // The gendisk structure
} sblkdev_device_t;
It contains all the information about the device that the kernel module needs, in particular: the capacity of the block device, the data itself (this is a simple device, after all), and pointers to the disk and the queue.
All block device initialization is performed in the sblkdev_add_device () function.
static int sblkdev_add_device(void)
{
    int ret = SUCCESS;

    sblkdev_device_t *dev = kzalloc(sizeof(sblkdev_device_t), GFP_KERNEL);
    if (dev == NULL) {
        printk(KERN_WARNING "sblkdev: unable to allocate %zu bytes\n",
               sizeof(sblkdev_device_t));
        return -ENOMEM;
    }
    _sblkdev_device = dev;

    do {
        ret = sblkdev_allocate_buffer(dev);
        if (ret)
            break;

#if 0 // simple variant with the helper function blk_mq_init_sq_queue. It's available since kernel 4.20 (vanilla).
        { // configure tag_set
            struct request_queue *queue;

            dev->tag_set.cmd_size = sizeof(sblkdev_cmd_t);
            dev->tag_set.driver_data = dev;

            queue = blk_mq_init_sq_queue(&dev->tag_set, &_mq_ops, 128,
                                         BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE);
            if (IS_ERR(queue)) {
                ret = PTR_ERR(queue);
                printk(KERN_WARNING "sblkdev: unable to allocate and initialize tag set\n");
                break;
            }
            dev->queue = queue;
        }
#else // more flexible variant
        { // configure tag_set
            dev->tag_set.ops = &_mq_ops;
            dev->tag_set.nr_hw_queues = 1;
            dev->tag_set.queue_depth = 128;
            dev->tag_set.numa_node = NUMA_NO_NODE;
            dev->tag_set.cmd_size = sizeof(sblkdev_cmd_t);
            dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
            dev->tag_set.driver_data = dev;

            ret = blk_mq_alloc_tag_set(&dev->tag_set);
            if (ret) {
                printk(KERN_WARNING "sblkdev: unable to allocate tag set\n");
                break;
            }
        }

        { // configure queue
            struct request_queue *queue = blk_mq_init_queue(&dev->tag_set);
            if (IS_ERR(queue)) {
                ret = PTR_ERR(queue);
                printk(KERN_WARNING "sblkdev: Failed to allocate queue\n");
                break;
            }
            dev->queue = queue;
        }
#endif
        dev->queue->queuedata = dev;

        { // configure disk
            struct gendisk *disk = alloc_disk(1); // only one partition
            if (disk == NULL) {
                printk(KERN_WARNING "sblkdev: Failed to allocate disk\n");
                ret = -ENOMEM;
                break;
            }

            disk->flags |= GENHD_FL_NO_PART_SCAN; // only one partition
            //disk->flags |= GENHD_FL_EXT_DEVT;
            disk->flags |= GENHD_FL_REMOVABLE;

            disk->major = _sblkdev_major;
            disk->first_minor = 0;
            disk->fops = &_fops;
            disk->private_data = dev;
            disk->queue = dev->queue;
            sprintf(disk->disk_name, "sblkdev%d", 0);
            set_capacity(disk, dev->capacity);

            dev->disk = disk;
            add_disk(disk);
        }

        printk(KERN_WARNING "sblkdev: simple block device was created\n");
    } while (false);

    if (ret) {
        sblkdev_remove_device();
        printk(KERN_WARNING "sblkdev: Failed to add block device\n");
    }

    return ret;
}
We allocate memory for the structure and allocate a buffer for storing the data. Nothing special here.
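The sblkdev_allocate_buffer() and sblkdev_free_buffer() helpers are not shown in the article. Since the buffer is a plain kfree()-able region (and, as discussed at the end, vmalloc() memory would not do here), a plausible sketch is below; the capacity value is my assumption, not taken from the source:

```c
static int sblkdev_allocate_buffer(sblkdev_device_t *dev)
{
    dev->capacity = 0x10000; /* assumed capacity in sectors (32 MiB with 512-byte sectors) */

    /* physically contiguous, non-swappable memory */
    dev->data = kmalloc(dev->capacity << SECTOR_SHIFT, GFP_KERNEL);
    if (dev->data == NULL) {
        printk(KERN_WARNING "sblkdev: unable to allocate the data buffer\n");
        return -ENOMEM;
    }
    return SUCCESS;
}

static void sblkdev_free_buffer(sblkdev_device_t *dev)
{
    if (dev->data) {
        kfree(dev->data);
        dev->data = NULL;
        dev->capacity = 0;
    }
}
```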
Next, we initialize the request queue, either with the single blk_mq_init_sq_queue() function or with two calls: blk_mq_alloc_tag_set() + blk_mq_init_queue().
By the way, if you look at the source code of blk_mq_init_sq_queue(), you will see that it is just a wrapper over blk_mq_alloc_tag_set() and blk_mq_init_queue(); it appeared in kernel 4.20. It also hides many of the queue parameters from us, but it looks much simpler. Which option is better is your choice, but I prefer the more explicit one.
The key piece in this code is the global variable _mq_ops.
static struct blk_mq_ops _mq_ops = {
    .queue_rq = queue_rq,
};
That is where the function that handles requests lives, but more on it a little later. The main thing is that we have designated the entry point for the request handler.
Now that we have created the queue, we can create an instance of the disk.
There are no major changes here: the disk is allocated, parameters are set, and the disk is added to the system. I do want to explain the disk->flags parameter. It lets you tell the system that the disk is removable or, for example, that it contains no partitions and there is no need to scan for them.
For managing the disk there is the _fops structure.
static const struct block_device_operations _fops = {
    .owner = THIS_MODULE,
    .open = _open,
    .release = _release,
    .ioctl = _ioctl,
#ifdef CONFIG_COMPAT
    .compat_ioctl = _compat_ioctl,
#endif
};
The _open and _release entry points are not very interesting for our simple block device module yet: apart from atomically incrementing and decrementing a counter, nothing happens there. I also left compat_ioctl unimplemented, since systems with a 64-bit kernel and a 32-bit user-space environment do not look promising to me.
_ioctl, however, lets you handle system requests to this drive. When a disk appears, the system tries to learn more about it. You are free to answer some of the queries as you see fit (for example, to pretend to be a brand-new CD), but the general rule is this: for queries you do not care to answer, just return the error code -ENOTTY. By the way, if needed, this is also where you can add your own request handlers for this particular drive.
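These entry points are not listed in the article. Based on the description above (an atomic open counter and -ENOTTY for unwanted queries), and using the block_device_operations signatures of the 5.0 era, a minimal sketch might look like this:

```c
static int _open(struct block_device *bdev, fmode_t mode)
{
    sblkdev_device_t *dev = bdev->bd_disk->private_data;
    if (dev == NULL)
        return -ENXIO;

    atomic_inc(&dev->open_counter); // count the openers
    return SUCCESS;
}

static void _release(struct gendisk *disk, fmode_t mode)
{
    sblkdev_device_t *dev = disk->private_data;
    if (dev)
        atomic_dec(&dev->open_counter);
}

static int _ioctl(struct block_device *bdev, fmode_t mode,
                  unsigned int cmd, unsigned long arg)
{
    return -ENOTTY; // we do not answer queries that do not interest us
}
```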
So, we have added the device; now we need to take care of releasing the resources. Rust is not here for you.
static void sblkdev_remove_device(void)
{
    sblkdev_device_t *dev = _sblkdev_device;
    if (dev) {
        if (dev->disk)
            del_gendisk(dev->disk);

        if (dev->queue) {
            blk_cleanup_queue(dev->queue);
            dev->queue = NULL;
        }

        if (dev->tag_set.tags)
            blk_mq_free_tag_set(&dev->tag_set);

        if (dev->disk) {
            put_disk(dev->disk);
            dev->disk = NULL;
        }

        sblkdev_free_buffer(dev);

        kfree(dev);
        _sblkdev_device = NULL;

        printk(KERN_WARNING "sblkdev: simple block device was removed\n");
    }
}
In principle, everything is obvious: we remove the disk object from the system and free the queue, and then we free our buffers (the data areas).
And now the most important part: request processing in the queue_rq() function.
static blk_status_t queue_rq(struct blk_mq_hw_ctx *hctx, const struct blk_mq_queue_data *bd)
{
    blk_status_t status = BLK_STS_OK;
    struct request *rq = bd->rq;

    blk_mq_start_request(rq);

    // we cannot use any locks that make the thread sleep
    {
        unsigned int nr_bytes = 0;

        if (do_simple_request(rq, &nr_bytes) != SUCCESS)
            status = BLK_STS_IOERR;

        printk(KERN_WARNING "sblkdev: request processed %u bytes\n", nr_bytes);

#if 0 // simple, and can be called from a proprietary module
        blk_mq_end_request(rq, status);
#else // lets us report the real number of processed bytes
        if (blk_update_request(rq, status, nr_bytes)) // GPL-only symbol
            BUG();
        __blk_mq_end_request(rq, status);
#endif
    }

    return BLK_STS_OK; // always return ok
}
First, consider the parameters. The first, struct blk_mq_hw_ctx *hctx, is the state of the hardware queue. In our case we do without a hardware queue, so it is unused.
The second parameter, const struct blk_mq_queue_data *bd, has such a laconic structure that I am not afraid to present it to you in its entirety:
struct blk_mq_queue_data {
    struct request *rq;
    bool last;
};
So in essence this is the same old request, which came to us from times the chronicler elixir.bootlin.com no longer remembers. We take the request and start processing it, notifying the kernel by calling blk_mq_start_request(). On completing the request, we inform the kernel by calling blk_mq_end_request().
A small note here: blk_mq_end_request() is essentially a wrapper over blk_update_request() + __blk_mq_end_request(). When using blk_mq_end_request() you cannot specify how many bytes were actually processed; it assumes everything was.
The alternative has a feature of its own: blk_update_request() is exported for GPL-only modules. That is, if you want to create a proprietary kernel module (may PM save you from this thorny path), you cannot use blk_update_request(). So the choice is yours.
The actual moving of bytes between the request and the buffer I put into the do_simple_request() function.
static int do_simple_request(struct request *rq, unsigned int *nr_bytes)
{
    int ret = SUCCESS;
    struct bio_vec bvec;
    struct req_iterator iter;
    sblkdev_device_t *dev = rq->q->queuedata;
    loff_t pos = blk_rq_pos(rq) << SECTOR_SHIFT;
    loff_t dev_size = (loff_t)(dev->capacity << SECTOR_SHIFT);

    printk(KERN_WARNING "sblkdev: request start from sector %llu\n",
           (unsigned long long)blk_rq_pos(rq));

    rq_for_each_segment(bvec, rq, iter) {
        unsigned long b_len = bvec.bv_len;
        void *b_buf = page_address(bvec.bv_page) + bvec.bv_offset;

        if ((pos + b_len) > dev_size) // truncate the last segment at the device boundary
            b_len = (unsigned long)(dev_size - pos);

        if (rq_data_dir(rq)) // WRITE
            memcpy(dev->data + pos, b_buf, b_len);
        else // READ
            memcpy(b_buf, dev->data + pos, b_len);

        pos += b_len;
        *nr_bytes += b_len;
    }

    return ret;
}
There is nothing new here: rq_for_each_segment iterates over all the bios, and over the bio_vec structures in each of them, letting us get to the pages holding the request data.
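To see the per-segment copy logic in isolation, here is a userspace model of one loop iteration. The copy_segment() name and the flat buffers are mine, for illustration only; the kernel code above operates on real bio_vec pages:

```c
#include <stddef.h>
#include <string.h>

/* Userspace model of one iteration of the rq_for_each_segment loop:
   clamp the segment to the device size, then memcpy in the right
   direction and report how many bytes were actually processed. */
static size_t copy_segment(unsigned char *dev_data, size_t dev_size,
                           size_t pos, unsigned char *seg, size_t seg_len,
                           int is_write)
{
    size_t len = seg_len;

    if (pos + len > dev_size)   /* do not run past the end of the device */
        len = dev_size - pos;

    if (is_write)               /* WRITE: request page -> device buffer */
        memcpy(dev_data + pos, seg, len);
    else                        /* READ: device buffer -> request page */
        memcpy(seg, dev_data + pos, len);

    return len;                 /* bytes actually processed */
}
```

Note how a segment that straddles the end of the device is silently truncated, exactly as in do_simple_request() above.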
What are your impressions? It all seems simple? Request processing is, in general, just copying data between the request's pages and the internal buffer. Quite worthy of a simple block device driver, right?
But there is a problem: this is not for real use!
The essence of the problem is that the request-processing function queue_rq() is called in a loop that handles requests from a list. I do not know which lock protects that list, a spinlock or RCU (I do not want to lie; whoever knows, correct me), but when you try to take, say, a mutex in the request-processing function, the debug kernel swears and warns: sleeping here is not allowed. That is, you cannot use conventional synchronization tools, or virtually contiguous memory (the kind allocated with vmalloc, which can end up in swap with everything that implies), because the process cannot enter a wait state.
Therefore: either only spinlocks or RCU locks, with a buffer in the form of an array of pages, a list, or a tree, as implemented in linux/drivers/block/brd.c, or deferred processing in another thread, as implemented in linux/drivers/block/loop.c.
I think there is no need to describe how to build the module, how to load it into the system, and how to unload it. Nothing new on that front, thankfully :) So if anyone wants to try it, I am sure you will figure it out. Just do not do it right away on your favorite laptop! Spin up a VM, or at least make a backup to a network share first.
By the way, Veeam Backup for Linux 3.0.1.1046 is already available. Just do not try to run VAL 3.0.1.1046 on kernel 5.0 or later: veeamsnap will not build. And some of the multi-queue innovations are still at the testing stage.