pfactum November 23, 2010 at 00:51

Interview with reiser4 developer Eduard Shishkin

Because Edward is a busy man, this interview took a very long time to complete. Nevertheless, the reiser4 developer found the time to answer questions from the respected communities of Habr and LOR. Read the results below the cut.

- How is the effort to get reiser4 into the kernel going?

I no longer see any technical obstacles: all the problems from the famous “list for inclusion” have been resolved. What remains is to clarify the relationship with VFS, and the corresponding article is not yet ready for publication.

In general, getting reiser4 into the Linux kernel is a low priority right now. The reason is simple: once it is in, I would have to respond immediately to every change in the VFS and block layer, and I do not always have that opportunity. In the -mm branch nobody requires this of me. If something breaks, Andrew Morton just sends a notification, and I fix it when I find the time.

As for the popular prediction that “reiser4 will not be included in the kernel and will die,” I do not understand the obsession with the “ticket to life” supposedly granted by inclusion in the mainline Linux kernel. Reiser4 is the result of 18 years of research in data storage, research that is not tied to a specific operating system and on which many scientists worked. If it is not included in Linux, it will be included in another OS where our ideas seem interesting. Linux is not the only game in town ...

- Would it make sense to run something like an advertising campaign for reiser4 to improve its image?

The best advertising campaign is to explain to people how it works. As it is, everyone looks at its code and no one understands anything. So much for Open Source. How can it be explained? Only through articles in reputable publications. And, of course, Wikipedia is out of the question. Wikipedia is good for covering the works of Renaissance artists, whereas the page of your project there risks becoming a latrine for competitors.

For my part, I am going to publish a couple of articles. The first will be about the modular architecture itself; the second will be devoted entirely to the dancing tree module (the only TREE interface plugin available so far). It will be very interesting: the technique used in this module is one of the most sophisticated. Beyond that, it would be nice to explain how the transaction manager and the other plugins work, but that can also be understood from the code (which cannot be said of the tree).

- What do you personally think about BFS (the CPU scheduler) and BFQ (the I/O scheduler)?

To be honest, I have not followed schedulers for a long time: apart from a text editor and a browser, I run very little on my laptop. I only remember that the appearance of BFS was preceded by a rather unpleasant story (characteristic, by the way, of the Linux kernel development model). Hans, on the other hand, used to be very interested in elevators and constantly gave instructions to implement his various ideas. About ten years ago I also modified some elevator on his instructions. True, it did not work any better as a result. Maybe because I was not interested in elevators ...

- Why do you think ext4, a rawer and less reliable FS, was accepted into the kernel almost immediately?

Well, it is a logical extension of ext3, the de facto standard Linux file system. It would have been surprising if it had been given a red light.

- How do you feel about the huge number of file systems in the kernel? Is it justified?

Of course it is not justified. To a large extent this zoo is encouraged by the outdated VFS concept, which treats a file system as an opaque monolithic module. There simply were not so many file systems before. By now it is obvious to anyone that many of them do the same thing. It is time to draw some conclusions. I have a number of suggestions for improving the situation (everything will be in the article).

- At what exhibitions and forums are you going to speak this year and next?

Nobody has invited me anywhere yet, and I never take the initiative myself.

- What motivates you to develop reiser4? After all, there are many other FS.

There is no strong motivation. At first I simply wanted to finish transparent compression at last. Then, after it was announced in 2007, I worked on Reiser4 for lack of anything else to do: I could not find a job for a long time. Now I remain interested in some aspects of the data storage science on which Reiser4 is based. The other local file systems do not interest me.

- Do you still communicate with Hans informally, “behind the scenes”? Does he take any part in the development?

There is no other way left to communicate. He has moved away from computer science completely. He tries to help however he can, but full participation would require a computer, which he is not allowed to have. And since Hans cannot sit idle, he buried himself in books and took up his old hobby, physics. He has found some inconsistencies in the special theory of relativity and is asking me to find a Russian scientist who would review his new article. He curses America, “the country of lawyers, where science counts for nothing.” He recalls Russia with warmth. He follows with interest the initiatives of Minister Andrei Fursenko, who, in his opinion, is trying to revive the former prestige of Soviet science and education. He believes there will be a place for foreigners in that project, and says that he is generally ready to move to Russia and finally learn Russian.

- What is your main job?

I work in the Red Hat file system division.

- Do you have enough time to work on reiser4?

Relatively speaking, yes, but only for support: I usually keep up with all the changes made in the VFS and block layer, and a weekend is usually enough to adapt reiser4 to them. Development means programming new plugins, and that implies full-time employment, no less. That is, it is only possible for a salary, and so far no one is going to pay for it.

- How many people are involved in support? Do you have a successor in case you suddenly had to give up supporting reiser4?

I am alone for now. All the previous developers left for paid work, and no new ones have appeared. Getting into this area is not easy: you have to sit poring over the monitor all day long. Young people usually have other things on their minds, and once a person is over thirty he wants a stable job and money. Where would I get those for him?

- Is it possible to license the reiser4 code for use in a proprietary OS?

I stay away from such questions. You can ask Hans if you are really interested.

- Can reiser4 become the default FS in one of the next releases of RHEL?

That is a question for my manager, rather: I cannot discuss Red Hat's plans. I can only say that so far I have not proposed anything to anyone, and no one has asked me for anything.

- Do you plan to port reiser4 to FreeBSD? Perhaps you should consider creating a port using FUSE? What do you think of the policy for accepting changes into the kernel?

In general, porting as such has never interested me. But I have heard that FreeBSD is an operating system with academic roots (University of California, Berkeley). That means we are very likely to find a common language with its developers. At least they will not give you a blank look at the word “algorithm”. In Linux the key concept is the patch. There is a committee of certain people who decide (probably based on their own intuition, and heavily on the patch author's ability to “get along” with the kernel development team) whether to accept the patch into the kernel or not. I do not like this approach: I graduated from Moscow State University, not MGIMO.

- What pitfalls might await a person who wants to try reiser4 in everyday work? How do you rate its stability?

General comment: in the past four years I do not remember anyone losing data on a reiser4 partition with properly functioning hardware. Several people approached me complaining about how fsck worked. In the end they all got their data back, along with a working fsck.

The most unpleasant thing is that after an upgrade you may have to roll back to the previous kernel version (I do not test the patches for each new version very thoroughly). The next nuisance is the lack of a defragmentation utility. There is also an old, hard-to-reproduce bug still alive that leads to “key inconsistency” reports. In any case, if you decide to use reiser4, you definitely need to be patient. If you have problems, send a bug report to the mailing list, or directly to me (if you do not know English). Do not expect me to solve them instantly: I have time for reiser4 only after work and on weekends. If I stop responding to your letters, do not hesitate to remind me. And complaining on forums is the least efficient way to solve problems.

- Do you plan to create a defragmentation utility? For example, when reiser4 was used on a partition with torrents, a 700-megabyte file ended up with more than 11000 fragments, and no amount of copying could bring this figure down to even a few hundred. This had a tangible negative effect on performance.

Yes, it is planned. Having such a utility is important. The Reiser4 transaction manager uses a mix of logging and copy-on-write techniques, and the latter by itself already implies fragmentation. A single copy may not be enough to get rid of it: after all, free disk space can be fragmented too. In general, a defragmentation utility will significantly improve the situation over several passes through the tree. External fragmentation can be fought; it is not a death sentence for a file system.
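To see why copy-on-write by itself implies fragmentation, here is a toy model in Python. It is only a sketch under simplifying assumptions, not reiser4 code: a file is a list of physical block numbers in logical order, and the helper names `count_extents` and `cow_overwrite` are invented for this illustration.

```python
def count_extents(physical_blocks):
    """Count runs of physically consecutive blocks, walking in logical order."""
    if not physical_blocks:
        return 0
    extents = 1
    for prev, cur in zip(physical_blocks, physical_blocks[1:]):
        if cur != prev + 1:
            extents += 1
    return extents


def cow_overwrite(physical_blocks, logical_index, next_free):
    """Copy-on-write update: the modified block moves to a new physical location.

    Returns the next free physical block number after the relocation.
    """
    physical_blocks[logical_index] = next_free
    return next_free + 1


# A 16-block file laid out contiguously at physical blocks 100..115.
layout = list(range(100, 116))
extents_before = count_extents(layout)

# Overwrite every fourth logical block; each write relocates one block
# (assumption: new blocks are handed out sequentially from block 200).
next_free = 200
for index in range(0, 16, 4):
    next_free = cow_overwrite(layout, index, next_free)
extents_after = count_extents(layout)

print(extents_before, extents_after)  # prints: 1 8
```

Four small in-place updates turn one extent into eight. A real allocator is far smarter than this, but the direction of the effect is the same.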

About the torrents. About three years ago Linux gained the fallocate(2) system call, which is designed to prevent fragmentation in such cases. The application specifies in advance the offset and size of a piece of the file, and the file system should allocate disk space for this piece (as little fragmented as possible). However, reiser4 does not support this system call yet. Adding such support is not difficult, but I most likely will not get to it in the near future.
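From the application side, preallocation might look like the sketch below. It uses Python's `os.posix_fallocate`, which wraps the POSIX `posix_fallocate` call rather than invoking fallocate(2) directly; whether the C library forwards it to fallocate(2) depends on the library and the file system, so treat that part as an assumption. The helper name `preallocate_piece` and the file name are hypothetical.

```python
import os
import tempfile


def preallocate_piece(path, offset, length):
    """Ask the file system to reserve disk space for [offset, offset + length)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Raises OSError if the file system cannot reserve the space.
        os.posix_fallocate(fd, offset, length)
    finally:
        os.close(fd)


# Hypothetical usage: reserve a 64 KiB piece at offset 64 KiB before
# any data for that piece has been downloaded.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "download.part")
    preallocate_piece(demo_path, 64 * 1024, 64 * 1024)
    demo_size = os.path.getsize(demo_path)
```

After the call the file is at least `offset + length` bytes long, so a torrent client can write pieces in any order into space that was already reserved.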

- Are there any problems with specific hardware when using reiser4?

I have not heard of any. It seems to be omnivorous.

- Will reiser4 support for grub2 be implemented by reiser4 developers themselves?

I hope so. It is painstaking work, but it is guaranteed to succeed. There is a patch for grub-0.97, and it can serve perfectly well as the basis for reiser4 support in grub2. The disadvantage of the existing patch is that booting cannot go through stage1_5, because the corresponding binary is too large and does not fit into the 62 sectors allocated to it. And being unable to boot through stage1_5 means that every time a defragmenter runs on your partition you have to reinstall grub. In the reiser4 support for grub2 everything should be done properly. I managed to fit the module that boots btrfs from multiple devices into 62 sectors, so why would reiser4 not fit there?

- Is it possible to move plugins to userspace in the future? Are there any such plans at all? Do you plan to create an infrastructure that can load plugins both in kernel space and in user space?

Moving individual plugins to userspace does not make much sense. How would they interact with each other? Each plugin provides a service and, in turn, requests services from other plugins. Imagine that plugin X running in the kernel urgently needs some service, and plugin Y that provides it runs in userspace. Nothing good will come of that. Dynamic loading of plugins as kernel modules is useful, but it is neither an interesting nor a burning issue. Well, let's make them dynamically loadable ...

- Is it worth thinking about writing a set of tests that would stress the FS in various ways and expose problems? For example, it could be a set of perl scripts that perform aggressive parallel writes, reads and deletes, verify that the data read matches the data written, and also check the file system structure for problems.

Many people dream of having such tests, so that after a half-hour run one could confidently make the next release. I can only say that everything is very difficult here. Writing comprehensive tests that identify the problem areas of a software product is a very hard task. Besides, such tests will mostly run into regressions in other kernel subsystems. And then you either fix those yourself or wait for someone else to fix them.

- How did zfs / btrfs affect reiser4?

Not at all. Reiser4 was partially influenced by the development of xfs (the “delayed allocation” technique). For the most part we used our own developments.

- Are you directly involved in the development of btrfs?

Partly, on behalf of my employer. I implemented btrfs support in grub-0.97 (our distributions do not work with grub2). I do not know what else I will be assigned. Possibly the trendy “data deduplication” feature.

- What is your opinion on the current state of affairs with btrfs following the recent sensational correspondence with Mason?

Why “sensational”? It is a normal working environment. I was assigned to investigate btrfs for its applicability in enterprise systems, and I found strong internal fragmentation on models where other file systems work flawlessly. Accordingly, I started to find out whether this is a bug or a “feature”. Half a year has passed, though, and I still have not heard anything intelligible about the btrfs algorithms. What opinion can there be? I only realized that they wanted the tail packing feature of the reiserfs file system without at all understanding how the latter's algorithms and data structures work. I can only say that in B-trees the concept of “tail packing” is completely meaningless. Moreover, an attempt to place variable-size items in such trees leads to unbounded internal fragmentation. Reiserfs does not use B-trees or their well-known modifications. It uses completely different algorithms (the invention of Russian scientists, by the way); the history of Namesys began with them in the early 90s. And modifying them for top-down balancing, as the btrfs design requires, is not a trivial task, unlike with classic B-trees.

Very often I hear that btrfs maintainer Chris Mason, having once worked at Namesys, took all the positive experience from there, like Duncan MacLeod. So far I see only the opposite. For some reason he economized on keys (a key in btrfs is 136 bits, in reiser4 it is 192 bits), while unsuccessful balancing wastes terabytes of users' disk space (and RAM). Additional key fields give the ability to group data and metadata in different ways. And none of this is needed??? Top-down balancing is, in my opinion, an utter compromise: the squeeze phase of balancing, as well as data compression and encryption, cannot then be deferred the way delayed allocation defers allocation. Besides, it seems to me that these guys will run into scalability problems because a decent locking scheme cannot be organized on such a tree. I can only say that it is much more profitable to distribute the “tree work” among a large number of processes and let some of them run in the opposite direction (bottom to top), rather than have them all break into the tree from above through a common root.

In general, I do not know ... I will, of course, help as much as I can, but the fact is that if a project is based on bad ideas, it is hard to make something good out of it. By the way, the whole history of Namesys is one of continuous contact with academic institutions (Moscow State University, Institute of Software Systems of the Russian Academy of Sciences in Pereslavl-Zalessky). XFS, too, is a whole school at Silicon Graphics. And Btrfs is the history of what? A couple of low-level workshops? What else would you call events at which non-existent features are announced? I stopped believing in miracles long ago ...

- How do you see the future of FS? What functionality will they have?

A file system is a subsystem that manages the disk space resource, and all its “features” should be aimed at managing this resource effectively. This means that the future of file systems belongs to more advanced algorithms, i.e. those that do the job better. However, new physical media and read-write technologies keep appearing, and some features are moving from userspace into the kernel (atomicity, transparent compression, encryption, etc.). Existing file systems become obsolete: it is cheaper to rewrite them than to adapt them to the innovations. File systems should be able to “meet” such features halfway, rather than be rewritten from scratch every time ... And for that they need the appropriate technical base.

An attempt to create such a base was made in reiser4: unlike its predecessors, it has a fully modular structure. In reiser4, all implementations of the file_operations, inode_operations and address_space_operations methods are just thin layers, dispatchers that decide which plugin (module) to pass control to next. Each module implements some abstract class (interface) from a particular scheme of interfaces, which reflects some concept of storing (meta)data.
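The idea of a thin dispatching layer can be modelled in a few lines of Python. This is only an illustration of the pattern, not reiser4's actual code: `FilePlugin`, `PlainFilePlugin`, `CompressedFilePlugin`, `Inode` and `vfs_write` are all names made up for this sketch.

```python
import zlib


class FilePlugin:
    """Hypothetical FILE-interface plugin: one way of storing a file body."""

    def write(self, inode, data):
        raise NotImplementedError


class PlainFilePlugin(FilePlugin):
    def write(self, inode, data):
        inode.body += data
        return len(data)


class CompressedFilePlugin(FilePlugin):
    def write(self, inode, data):
        # Toy approach: recompress the whole body on every append.
        old = zlib.decompress(inode.body) if inode.body else b""
        inode.body = zlib.compress(old + data)
        return len(data)


class Inode:
    def __init__(self, file_plugin):
        self.file_plugin = file_plugin  # plugin is chosen per object
        self.body = b""


def vfs_write(inode, data):
    """Thin dispatcher standing in for a file_operations write method."""
    return inode.file_plugin.write(inode, data)


plain = Inode(PlainFilePlugin())
packed = Inode(CompressedFilePlugin())
payload = b"a" * 1000
for target in (plain, packed):
    vfs_write(target, payload)
```

The caller of `vfs_write` never learns which storage concept is behind the object; adding a new concept means adding a class, not touching the dispatcher.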

I will try to explain in simple terms how this all works. Suppose you want to implement btrfs functionality (snapshots, etc.). As you know, that FS uses a transactional copy-on-write model implemented on top of top-down tree balancing. This is its main difference from what Reiser4 currently offers. Therefore we need to create a new “cow” plugin of the “TMGR” interface (transaction manager), as well as a new “multi-root-tree” plugin of the “TREE” interface for a storage tree that has a family of roots (a “history”) and is balanced from the top down. The latter must be equipped with its own locking scheme. As for TMGR, it is an abstract class for managing objects that are called “particles” in the forthcoming article (a concept dual to the transcrash primitive, the article is here).

If you look closely at the transaction managers of different file systems (currently there are three types of such managers), it is easy to notice some features they share. Namely, the following main methods can be distinguished in the TMGR interface:

enter_context();
try_capture();
exit_context().

The first and the last are called when a process enters and leaves the file system proper, respectively. The second is called in every place where data (pages or buffers) is modified. Right now reiser4 runs a single TMGR plugin, let us call it “jcow” (a symbiosis of the journalling and copy-on-write techniques, without saving history), whose ->try_capture() method adds a block to a so-called “atom” (the special name for “particles” in reiser4). In our newly made “cow” plugin this method will instead bud off a new root of the storage tree (in the btrfs code the corresponding function is called btrfs_cow_block).
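The three-method interface can be sketched as an abstract class with two toy implementations. Only the method names come from the interview; the class names, the `modify_blocks` helper, committing the atom on every exit, and modelling “cow” as simple block relocation rather than the creation of a new tree root are all assumptions made to keep the sketch short.

```python
from abc import ABC, abstractmethod


class TransactionManager(ABC):
    """Toy model of the TMGR interface."""

    @abstractmethod
    def enter_context(self):
        """Called when a process enters the file system."""

    @abstractmethod
    def try_capture(self, block):
        """Called wherever a block is about to be modified."""

    @abstractmethod
    def exit_context(self):
        """Called when the process leaves the file system."""


class JcowManager(TransactionManager):
    """Stand-in for "jcow": captured blocks join the current atom."""

    def __init__(self):
        self.atom = None
        self.committed = []

    def enter_context(self):
        if self.atom is None:
            self.atom = set()

    def try_capture(self, block):
        self.atom.add(block)

    def exit_context(self):
        # Simplification for this sketch: commit the atom on every exit.
        self.committed.append(frozenset(self.atom))
        self.atom = None


class CowManager(TransactionManager):
    """Stand-in for "cow": capturing a block shadows it at a new location."""

    def __init__(self, first_free):
        self.next_free = first_free
        self.relocated = {}

    def enter_context(self):
        pass

    def try_capture(self, block):
        if block not in self.relocated:
            self.relocated[block] = self.next_free
            self.next_free += 1
        return self.relocated[block]

    def exit_context(self):
        pass


def modify_blocks(tmgr, blocks):
    """Every modification is bracketed by enter/exit and passes try_capture."""
    tmgr.enter_context()
    try:
        for block in blocks:
            tmgr.try_capture(block)
    finally:
        tmgr.exit_context()


jcow = JcowManager()
modify_blocks(jcow, [1, 2, 2, 3])
cow = CowManager(first_free=100)
modify_blocks(cow, [7, 8, 7])
```

The calling code in `modify_blocks` is identical for both managers, which is the point of putting the transaction model behind an interface.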

As a home exercise, I suggest working out what exactly the “atoms” will be in this case (i.e. what must reach the disk in its entirety). For background, you can refer to the article by Ohad Rodeh, “B-trees, Shadowing, and Clones”.

You need to be able to store these new roots somewhere and retrieve them: if you want the “writable snapshots” feature, they must form a doubly indexed set. But that is not a problem. For example, btrfs uses a separate “root tree” for this purpose.

In total, we need only two new plugins to get btrfs functionality. And really, why would we need anything else? Plugins of the FILE interface select items from the tree through the methods of the TREE interface and do not need to know how a particular tree is balanced. Plugins of the other interfaces (NODE, ITEM, etc.) also stay in service: why would we change the format of tree nodes in order to organize snapshots? Our “multi-root” tree will simply contain different internal nodes that refer to the same blocks.
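A tree with several roots whose internal nodes refer to the same blocks can be shown with path copying on a tiny binary search tree. This is a deliberately simplified stand-in: a real storage tree is not a binary tree, and `Node`, `cow_insert` and `lookup` are names invented for this sketch.

```python
class Node:
    """Tree node treated as immutable; stands in for a block of the tree."""
    __slots__ = ("key", "value", "left", "right")

    def __init__(self, key, value, left=None, right=None):
        self.key = key
        self.value = value
        self.left = left
        self.right = right


def cow_insert(root, key, value):
    """Return a new root; only the nodes on the search path are copied."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value,
                    cow_insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value,
                    root.left, cow_insert(root.right, key, value))
    return Node(key, value, root.left, root.right)


def lookup(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


# Build one version, then modify it while keeping the old root alive.
v1 = None
for k in (50, 30, 70, 20, 40):
    v1 = cow_insert(v1, k, "old")
v2 = cow_insert(v1, 20, "new")  # copies only the path 50 -> 30 -> 20

# A family of roots: each root is a snapshot of the whole tree.
roots = {"snapshot-1": v1, "current": v2}
```

Here `v2.right is v1.right` holds: the untouched subtree is shared between both roots, while the lookup code does not need to know that snapshots exist at all.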

I am not saying that programming new TREE and TMGR plugins is a job for the lazy, but believe me, it is much easier than creating a file system and a complicated fsck utility from scratch (fsck is also modular for reiser4). The most interesting point here is that the existing plugins of other interfaces do not need to be taught to work with the new family members, which means there is no need to write and debug that code, and the share of reused code will tend to 100% (with a well-organized interface scheme you can successfully implement more than one piece of functionality).

In the same way, with the help of plugins, you can organize and manage logical volumes as in ZFS or btrfs. However, I must warn you: this would be a so-called layering violation. In Linux, volume management is handled by a separate subsystem (lvm), and trying to mix it with the file system can end badly: you will be asked to remove this functionality and not to do it again. There is an inexplicable policy of double standards here: some are allowed to mix layers (e.g. btrfs), but in reiser4 this is not welcome. In any case, remembering the flurry of accusations of layering violation against reiser4, I would not risk the effort.

Details and other equally interesting applications of the modular architecture can be found in my article (not yet published; it will be announced on the reiserfs-devel mailing list).

So, I would describe the future of local file systems in particular as the “polishing” of such “internal” interfaces. In fact, if you look closely, you will notice that they are not internal at all. It is like in algebra: if a linear space V splits into an (internal) direct sum of subspaces, you can go the opposite way and use the external direct sum construction to build a space isomorphic to V. And since they are not internal, they are the property of all file systems. There are no problems with VFS here (more about this in the article). In general, I see many analogies between software systems (data storage systems most of all) and such concepts of homological algebra as module, grading, filtration, etc., which seem very useful to me.
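For readers who want the algebraic analogy spelled out, the standard statement can be written as follows (the notation V_1, ..., V_n, W and \varphi is introduced here and does not come from the interview):

```latex
% Internal direct sum: V decomposes into subspaces V_1, ..., V_n of V
V = V_1 \oplus \dots \oplus V_n
\iff
\text{every } v \in V \text{ can be written uniquely as }
v = v_1 + \dots + v_n,\quad v_i \in V_i .

% External direct sum: built from the V_i regarded as standalone spaces
W = \{ (v_1, \dots, v_n) : v_i \in V_i \},
\qquad
\varphi : W \to V,\quad \varphi(v_1, \dots, v_n) = v_1 + \dots + v_n .

% \varphi is linear and, by uniqueness of the decomposition, bijective,
% hence W \cong V .
```

In the analogy, the components play the role of interfaces: they can be described on their own, outside any particular file system, and the whole can be reassembled from them.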

And finally, about “features”. I am often asked how to write a plugin for reiser4. Yet the counter-question, “and what will it implement?”, often stumps the asker. I do not like the idea of mass-producing “features” for a file system with a modular architecture. This is a discipline, not a mass entertainment industry. Nobody mass-produces proofs of mathematical theorems ...

I believe that first there should be a useful idea from the field of information storage (for example, snapshots). I do not think there can be too many such ideas. With such ideas, you are welcome. We will think about how to express the idea in the language of plugins, add new interfaces to the general scheme if necessary, and write the corresponding plugins.

Unfortunately, “features” are often plucked out of thin air: the law of the market (that the consumer must be wowed with features) has not bypassed this “holy” sector either. For lack of any ideas in the field of data storage, file systems begin to “boil the ocean” and deal with other paid-for but completely unnecessary things.
