Binary (file) storage: a scary tale with a gloomy ending



    Daniil Podolsky (Git in Sky)


    My talk is called "Binary storage, also known as file storage", but in fact we are dealing with a scary tale. The problem is (and this is the thesis of my talk) that right now there is not only no good file storage in existence, there is not even an acceptable one.

    What is a file? A file is a piece of data with a name. What matters here? Why isn't a file just a row in a database?

    Because a file is too large to be handled as a single piece. Why? Since this is a HighLoad conference, say you have a service that holds 100 thousand simultaneous connections. That is not all that many, but if for each connection we are serving a 1 MB file, we need about 100 GB of memory just for the buffers.



    We cannot afford that. Usually, anyway. So we have to split these pieces of data into smaller pieces, most often called chunks, sometimes blocks, and handle the blocks rather than the whole file. That gives us yet another level of addressing, and a little later it will become clear why this affects the quality of binary storage so much.
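    As a minimal sketch of the chunked approach, here is what serving a file block by block looks like; the chunk size and the send_to_client() callback are hypothetical, not anything from the talk's project.

```python
CHUNK_SIZE = 64 * 1024  # an illustrative block size: 64 KB per buffer, not 1 MB per connection

def stream_file(path, send_to_client):
    """Read a file block by block so memory per connection stays around CHUNK_SIZE."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            send_to_client(chunk)  # only one chunk is held in memory at any moment
```

    The price of this is exactly the extra addressing level mentioned above: somebody now has to know which chunk belongs to which file and in what order.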


    Nevertheless, even though there is no good binary storage in the world, file exchange is the cornerstone of modern mass data exchange. That is, anything transmitted over the Internet takes the form of a file, from HTML pages to streaming video, because on the far side the streaming video is a file, and usually on this side too. Often on this side it is not, but it is still a file nonetheless.

    Why is that? Why do we base mass data exchange on file exchange? Because 10 years ago the channels were slow. 10 years ago we could not afford JSON interfaces, we could not query the server all the time, the latency was too high. We had to upload all the data that needed to be shown to the user first, and only then let the user interact with that data, because otherwise everything lagged a little.



    File storage. In fact, the term "storage" is extremely unfortunate, because these systems are really there to serve files out, not to keep them.

    Some 20-30 years ago these really were storage systems: you processed the data, put it on tape, and the tape went into the archive. That is storage. Today nobody needs that. Today, if you have 450 million files, all of them have to be hot and available. If you have 20 terabytes of data, it means that any single byte of those 20 terabytes may be requested by any of your users in the very near future. Unless you work in a corporate environment, but the word HighLoad is rarely applied to corporate environments anyway.

    Business requirements never really say "store files", even when that is literally what is written. What is called a backup system is not a backup system either, it is a disaster recovery system. Nobody needs to store files; everyone needs to read files. This is important.

    Why do files still have to be stored? Because they have to be served, and for them to be served they have to be on your side. And I have to say this is not always the case: many projects, for example, do not store HTML pages but generate them on the fly, because storing a huge number of pages is a problem, while generating a huge number of HTML pages is not a problem; it scales well.



    File systems come in two kinds: classic and journaling. What is a classic file system? You write data to it, and it more or less immediately puts that data on disk in the right place.

    What is a journaling file system? It is a file system that writes data to disk in two stages: first the data goes to a special area of the disk called the journal, and then, when there is free time or when the journal fills up, the data is moved from the journal to where it belongs on the file system. This was invented to speed up startup: we know the file system is always consistent, so after an unclean restart of, say, a server we do not need to check the whole file system, only the small journal. This will matter later.
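    To make the two-stage idea concrete, here is a toy write-ahead log in the spirit of what a journal does; the file names, record format and single-writer assumption are all made up for the sketch, a real journal works at the block layer.

```python
import os

JOURNAL = "journal.log"   # illustrative name for the journal area
DATA = "data.bin"         # illustrative name for the main data area

def journaled_write(offset, payload):
    # Stage 1: append the intent to the journal and force it to disk.
    with open(JOURNAL, "ab") as j:
        j.write(f"{offset}:{len(payload)}\n".encode() + payload)
        j.flush()
        os.fsync(j.fileno())
    # Stage 2: apply the write to its real location. After a crash,
    # replaying the journal restores consistency without scanning
    # the whole data area.
    fd = os.open(DATA, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.pwrite(fd, payload, offset)
        os.fsync(fd)
    finally:
        os.close(fd)
```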

    File systems have been flat and hierarchical.

    • Flat, such as FAT12. You have probably never run into it. It had no directories; all files lived in the root directory and were reachable directly by offset in the FAT table.

    • Hierarchical. In fact, organizing directories on top of a flat file system is not that hard; for example, in the project where we are solving this problem now, we did exactly that. However, all the modern file systems you have encountered, from NTFS to something like ZFS, are hierarchical. They all store directories as files, and those files hold the list of the directory's contents, so to reach a file at the 10th level of nesting you have to open 10 files in turn, and the 11th one is yours. A SATA drive gives you about 100 IOPS, i.e. roughly 100 operations per second, and you have already spent 10 of them. So a file at the 10th level of nesting, if none of those directories is in the cache, will not open in less than about 0.1 seconds, even if your system is doing nothing else (see the sketch below).
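    The back-of-the-envelope arithmetic behind that 0.1 second figure, using the talk's own numbers, looks like this; the IOPS value is the talk's estimate for a cheap SATA drive, not a measurement.

```python
IOPS = 100                 # random operations per second on a cheap SATA drive
SEEK_SECONDS = 1 / IOPS    # ~10 ms per uncached lookup

def cold_lookup_seconds(depth):
    """One directory read per path component, plus the file itself."""
    return (depth + 1) * SEEK_SECONDS

print(cold_lookup_seconds(10))  # ~0.11 s before a single data byte is read
```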

    All modern file systems except FAT support access control and extended attributes. FAT, oddly enough, is still in use and supports neither access control nor extended attributes. Why do I bring this up? Because it is a very personal point for me. The scary tale of file systems began for me in 1996, when I looked closely at how access control works in traditional UNIX. Remember the permission mask? Owner, owner group, everyone else. I had a task: I needed one group that could read and write a file, another group that could only read it, and everyone else should not be able to do anything with the file at all. And then I realized that the traditional UNIX permission mask cannot express that pattern.



    Just a little more theory. POSIX is the de facto standard, supported by all the operating systems we use. In reality it is simply a list of calls that a file system should support. What matters to us here? Open. The thing is, in POSIX you work with a file not by its name but by a handle, which you request from the file system using the name. After that you use this handle for all operations. The operations themselves look simple: read, write, and seek, and it is seek that makes it impossible to build a distributed file system that conforms to POSIX.

    Why? Because seek is a jump to an arbitrary position in the file, i.e. in reality we do not know what we are reading and we do not know where we are writing. To understand what we are doing, we need the handle that open returned to us and the current position in the file: that is the second level of addressing. The fact that a POSIX file system must support this random second-level addressing, and not only sequential patterns like "open the file and read it from beginning to end" or "open the file and append every newly written block to the end", is what makes it impossible (more on this later) to build a good distributed POSIX file system.
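    Here is the statefulness in miniature, using the standard os module; the file name is a placeholder. Every read only makes sense relative to the handle returned by open and the position set by the previous seek or read.

```python
import os

fd = os.open("example.dat", os.O_RDONLY)   # name -> handle (file descriptor)
os.lseek(fd, 4096, os.SEEK_SET)            # second level of addressing: the position
first = os.read(fd, 512)                   # reads bytes 4096..4607
second = os.read(fd, 512)                  # meaning depends entirely on the call before it
os.close(fd)
```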

    What else do POSIX file systems have? Not all of them support exactly the same set of atomic operations, but in any case a certain number of operations must be atomic. An atomic operation is one that either happens completely or does not happen at all. It resembles transactions in databases, but it only resembles them. For example, on ext4, which everyone at this conference should be familiar with, creating a directory is an atomic operation.
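    Because mkdir is atomic, it can serve as a crude cross-process lock: exactly one caller succeeds and everyone else gets EEXIST. This is only a sketch of why atomicity is useful; the lock directory path is an arbitrary example.

```python
import errno
import os

LOCK_DIR = "/tmp/myservice.lock"  # illustrative path

def try_lock(path=LOCK_DIR):
    try:
        os.mkdir(path)        # happens entirely or not at all
        return True
    except OSError as e:
        if e.errno == errno.EEXIST:
            return False      # someone else already holds the lock
        raise

def unlock(path=LOCK_DIR):
    os.rmdir(path)
```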



    One last bit of theory: various features that are not actually required for a file system to function, but are sometimes useful.

    • Online compression: blocks are compressed as they are written. It is supported, for example, on NTFS.

    • Encryption. Also supported on NTFS; ext4 supports neither compression nor encryption, and there both are arranged underneath, using block devices that support them. In fact, neither feature is required for a file system to function normally.

    • Deduplication. A very important feature for today's file systems. For example, we have a project with 450 million files but only 200 million chunks, which means roughly half of the files are identical, just under different names (see the sketch after this list).

    • Snapshots. Why do they matter? Because if you have a 5 TB file system, a consistent copy of it cannot be made just like that, unless you stop all your processes and start reading the whole file system. By my estimate, reading 5 terabytes off a cheap SATA drive takes about 5 hours, roughly a terabyte per hour. Can you stop your services for 5 hours? I guess not. So if you need a consistent copy of the file system, you need snapshots.

      Today snapshots are supported at the block device level in LVM, and there they are awful. Look: we create an LVM snapshot, and our linear reads turn into random ones, because we have to read both from the snapshot and from the underlying block device. Writing is even worse: we have to read the old data from the base volume, write it into the snapshot, and only then write the new data to the base volume. In practice, LVM snapshots are useless.

      ZFS has good snapshots: there can be many of them, and they can be sent over the network, so if you need a copy of the file system you can simply transfer a snapshot. In general, snapshots are not required for file storage to function, but they are very useful, and a little later it will turn out they are effectively mandatory.
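    Here is the deduplication idea from the list above in miniature: a content-addressed chunk store where identical chunks are kept once and a file becomes a list of chunk hashes. The in-memory dicts are stand-ins for whatever real backend holds the chunks; the chunk size is arbitrary.

```python
import hashlib

CHUNK_SIZE = 64 * 1024
chunk_store = {}   # sha256 hex digest -> chunk bytes
file_index = {}    # file name -> ordered list of chunk digests

def put_file(name, data):
    digests = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)   # a duplicate chunk is stored only once
        digests.append(digest)
    file_index[name] = digests

def get_file(name):
    return b"".join(chunk_store[d] for d in file_index[name])
```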

    The last, but perhaps the most important, piece of this whole theory.

    Read caching. Back when the NT4 installer was launched from MS-DOS, installing NT4 without SMARTDrive running (that is the read cache in MS-DOS) took 6 hours, and with SMARTDrive it took 40 minutes. Why? Because if we do not cache directory contents in memory, we are forced to take those same 10 steps every single time.

    Write caching. Until recently it was considered bad form: the only case where you were allowed to enable write caching on a device was a RAID controller with a battery backup. Why? Because the battery preserves in memory the data that never made it to the disks when the server switched off at some random moment; when the server comes back up, that data can be flushed to disk.

    Obviously the operating system cannot provide anything like that on its own; it has to be done at the controller level. Today the relevance of this approach has dropped sharply, because we now have very cheap "RAM" in the form of SSD drives. Today, caching writes on an SSD is one of the simplest and most effective ways to improve local file system performance.

    All of that was about file systems local to your machine.



    HighLoad means the network. It is the network in the sense that your visitors come to you over the network, and that means you need horizontal scaling. Network file access protocols fall into two groups. Stateless: NFS, WebDAV and a number of other protocols.

    Stateless means that each operation is independent, in the sense that its result does not depend on the result of the previous one. That is not how a POSIX file system works: the results of read and write depend on the result of seek, which in turn depends on the result of open. Yet NFS, a stateless protocol, sits on top of POSIX file systems, and that is its main problem. Why is NFS such garbage? Because it is stateless on top of stateful.

    Stateful. Today stateful protocols are used more and more in network communication. That is a pity: a stateful protocol is poorly suited to the Internet, because delays are unpredictable. Nevertheless, more and more often some JavaScript interface remembers how its previous exchange with the server ended and requests the next JSON based on that result. Among network file systems with a stateful protocol there is CIFS, also known as Samba.
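    The contrast in one sketch: a stateless read names the object and the byte range it wants in every request, so the server keeps no memory between requests. The host and path below are placeholders; only the standard http.client module is used.

```python
import http.client

# Each request is self-contained: object name plus explicit byte range.
conn = http.client.HTTPConnection("files.example.com")
conn.request("GET", "/video.mp4",
             headers={"Range": "bytes=1048576-2097151"})
resp = conn.getresponse()
chunk = resp.read()
conn.close()

# Contrast with POSIX: there the same read only makes sense relative to the
# open() handle and the current seek position the server has to remember.
```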

    Fault tolerance, that two-faced bitch. The thing is, traditional file systems focus on data integrity because their creators were fascinated by the word "storage". They thought the most important thing was to store the data and protect it. Today we know this is not so. At RootConf we heard a talk by someone who handles data protection in data centers, and he told us firmly that they are abandoning not only hardware disk arrays but software ones as well. They use each disk separately, plus a system that tracks where the data sits on those disks and how it is replicated. Why? Because if you have an array of, say, five 4 TB disks, it holds 20 TB. If one drive fails and has to be rebuilt, you effectively have to read all 16 remaining terabytes. At a terabyte per hour, that is 16 hours. So only one drive died, but to get the array healthy again we need 16 hours, and in today's conditions that is unacceptable.

    Today, fault tolerance for binary storage means, above all, uninterrupted reading and, oddly enough, uninterrupted writing. That is, your storage must not turn into a pumpkin at the first sneeze, busy only with saving the data tucked away inside it. If some data is lost, well, God bless it, it is lost; the main thing is that the show goes on.

    What else is important to say about networked binary storage? The same old CAP theorem, i.e. pick any two of the three:

    1. your data is always consistent and always available, but then it all lives on a single server;
    2. your data is always consistent and distributed across several servers, but then access to it will be limited from time to time;
    3. your data is always available and distributed across servers, but then there is no guarantee that what you read from one server matches what you read from another.

    The CAP theorem is just a theorem, nobody has proven it to us, but in practice it holds. Attempts to disprove it fail, and they are made constantly: OCFS2 (Oracle Cluster Filesystem version 2), which I will mention a little later, is one such attempt to show the CAP theorem is wrong.

    So much for file systems. Now, what is the problem with binary storage? Let's figure it out.

    The simplest idea, the one that occurs to every system administrator who needs to store terabytes of data and millions of files, is to just buy a big storage system (a dedicated storage appliance).



    Why is a big storage system not the answer? Well, if you have a big storage system and a single server talking to it, or you managed to split your data into pieces with one server handling the files of each piece, then you have no problem.

    But if you scale horizontally, if you keep adding servers that have to serve these files or, God forbid, process them first and only then serve them, you run into the fact that you cannot simply put an ordinary file system on top of a big shared storage system.

    When DRBD first fell into my hands, I thought: great, I will have two servers with DRBD replication between them, and other servers reading from one or the other. It quickly became clear that every server caches reads, which means that if something was quietly changed on the block device, a machine that did not make that change itself, and so does not know which cache entries to invalidate, will never find out about it and will keep reading data from places where it no longer actually lives.

    To overcome this problem there are various cluster file systems that handle cache invalidation, i.e. they invalidate the caches on every machine that has the shared storage mounted.

    OCFS2 also has another problem: it slows to a crawl under concurrent writes. Remember, we talked about atomic operations, the ones that happen as a single piece. In a distributed file system, even if all the data sits on one single big storage system, an atomic operation requires that all the readers and writers involved reach a consensus.

    Consensus over a network means network delays, i.e. writing to OCFS2 is genuinely painful. To be fair, Oracle are no idiots: they made a good file system, they just made it for a completely different task, namely sharing the same database files between several of their Oracle servers. Oracle database files have an access pattern that works perfectly on OCFS2. For file storage it turned out to be unsuitable; we tried it back in 2008. OCFS2 was also unpleasant because of its time checks: since the clock drifts slightly across the virtual machines we run even on the same host, OCFS2 would not behave for us. Sooner or later it would decide that time on one of the nodes had gone backwards, fall over at that point, and so on.

    Besides, a really big storage system is hard to get your hands on at all; at Hetzner, for example, nobody will give you one. I suspect that the reputation of big storage systems as very reliable, very good, very fast comes simply from the fact that the required resources get calculated properly. You cannot just buy a big storage system; they are not sold in a shop. You have to go to an authorized vendor and talk to him. He nods and says, "It will cost you." He quotes 50-100 thousand for a single chassis, which still has to be filled with disks, but he will size it correctly. That storage will then run at 5-10 percent load, and if it turns out your load has grown, they will advise you to buy another one. That is how it works in real life.

    Okay, so be it. A big storage system is not an option; we learned that later, the hard way.



    So we take some cluster file system. We tried several: Ceph / Lustre / LeoFS.

    First question: why is it so slow? That one is clear: because of synchronous operations across the cluster. Second: what about rebalancing? Why does HDFS have no automatic rebalancing for data already written? Look at Ceph and you see why: while rebalancing is running, we effectively lose the ability to work with the cluster. Rebalancing is a heavyweight procedure that eats roughly 100% of the disk bandwidth, i.e. disk saturation hits 100%. Sometimes a rebalance, and Ceph kicks one off at every sneeze, lasts 10 hours, so the first thing people working with Ceph learn is how to throttle the rebalancing intensity.

    In any case, on the cluster we are currently using in the project with lots of files and lots of data, we had to crank the rebalancing all the way up, and there the disks really do sit at 100% saturation. Disks die very quickly under that kind of load.

    Why is rebalancing like this? Why does it kick in at every sneeze? All of these "whys" remain unanswered so far.

    And then there is that same problem: atomic operations that have to pass through the entire cluster synchronously. While you have two machines in the cluster, everything is fine; once you have 40 machines, you discover that all 40 of them have to agree, and the number of network packets you have to send grows roughly as 40 squared. The rebalancing and consistency protocols try to cope with this, but so far not very successfully. In this respect, systems with a single coordinating node, i.e. a single point of failure, are slightly ahead of the fully multi-node ones, but not by much.



    Why not just put all the files in a database? From my point of view, that is exactly what should be done, because once the files are in a database we have a large toolbox of good instruments for working with them. We know how to work with databases of a billion rows and petabytes of data, even a few billion rows and a few dozen petabytes, and we know how to do it well. Take Oracle if you like, take DB2, take some NoSQL. So why isn't it done everywhere? Because a file is a file: it cannot be handled as an atomic entity, which is why distributed file systems barely exist, while distributed databases exist just fine. A sketch of the idea follows.
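    A minimal sketch of "one small file = one row", using SQLite only because it ships with Python; the talk's project used a different backend (first PostgreSQL, then NoSQL), and the schema here is purely illustrative.

```python
import hashlib
import sqlite3

db = sqlite3.connect("files.db")
db.execute("""CREATE TABLE IF NOT EXISTS files (
                name   TEXT PRIMARY KEY,
                sha256 TEXT NOT NULL,
                body   BLOB NOT NULL)""")

def put(name, data: bytes):
    # The whole file is one atomic row, so ordinary database tooling applies.
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
               (name, hashlib.sha256(data).hexdigest(), data))
    db.commit()

def get(name):
    row = db.execute("SELECT body FROM files WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None
```

    This only works while a file can honestly be treated as one piece of data, which is exactly the "small files" condition discussed further down.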



    And what finally buries all these ACFS, Lustre and the like is that files have to be backed up. How do you picture a 20 TB backup? At, I remind you, a terabyte per hour. And more importantly: where to, how often, and how do you guarantee consistency at that volume if you do not have a single file system and cannot take a snapshot? The only way out I personally see is file systems with versioning: you write a new file, the old one does not go anywhere, and you can get at it by specifying the point in time at which you want to see the state of the file system. There also has to be some kind of garbage collection (a sketch follows below).
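    Here is the versioning-plus-garbage-collection idea in miniature: writes never overwrite, reads can ask for the state "as of" a moment in time, and GC drops versions outside a retention window. The in-memory dict is a stand-in for a real store; nothing here is from the talk's actual implementation.

```python
import time

versions = {}   # path -> list of (timestamp, data), oldest first

def put(path, data):
    # Never overwrite: every write adds a new version.
    versions.setdefault(path, []).append((time.time(), data))

def get(path, as_of=None):
    # Read the newest version that existed at the requested moment.
    as_of = time.time() if as_of is None else as_of
    candidates = [(t, d) for t, d in versions.get(path, []) if t <= as_of]
    return candidates[-1][1] if candidates else None

def gc(retain_seconds):
    # Drop versions older than the retention window, but keep the last
    # pre-cutoff version so get(path, as_of=cutoff) still has an answer.
    cutoff = time.time() - retain_seconds
    for path, items in versions.items():
        old = [x for x in items if x[0] < cutoff]
        new = [x for x in items if x[0] >= cutoff]
        versions[path] = old[-1:] + new
```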

    Microsoft promised us such a file system back in the 90s, but never delivered. Yes, there was supposed to be such a file system for Windows, they even announced it for Longhorn, but in the end neither Longhorn nor that file system ever materialized.

    Why does backup matter? Backup is not about fault tolerance; it is protection against operator error. I once managed to mix up source and destination in an rsync command and ended up (a magical story!) with a server running 16 virtual machines whose image files no longer existed, because I had deleted them. I had to recover them with dd from inside the still-running virtual machines themselves. It worked out in the end. Nevertheless, we are obliged to provide versioning in our binary storage, and there is no file system that provides versioning properly except ZFS, which is not clustered and therefore does not suit us.

    So what do we do? For a start, study your own task.

    • Do you need to distribute at all? If you can fit all the files on a single storage system and process them on a single heavy-duty server, do that. These days you can get a server with 2 TB of memory and several hundred cores. If the budget allows it and what you need is file storage, go that way; it may end up cheaper for the business as a whole.

    • POSIX. If you do not need random reads or random writes, that is a big plus: you may get by with the existing options, for example the HDFS mentioned earlier, or CephFS, or Lustre. Lustre is an excellent file system for a compute cluster, but for a cluster that serves files it is no good.

    • Large files: do you need them? If all your files can be considered small (and small, I remind you, is a situation, not a property of the file), if you can afford to treat a file as a single piece of data, then you have no problem: put it in the database and you are fine. Why did it work out on the project I keep mentioning but not naming? Because 95% of the files there are under 64 KB, so each one is always a single row in the database, and in that situation everything works fine.

    • Versioning: do you need it? There are indeed situations where versioning is not required, but then backup is not required either; these are the situations where all your data is generated by your own robots. In effect your file storage is a cache: there is no room for operator error and nothing to lose.

    • How big does the storage have to be? If a single file system is enough to cover your needs, excellent, very good.

    • Are we going to delete files? Oddly enough, this matters. There is a legend (actually not a legend) that VKontakte never deletes anything: once you upload a picture or a piece of music there, it stays there forever; only the links to it are removed, and the space occupied by files is never reused. So they say; I heard a talk to that effect. Why? Because as soon as you try to reuse space, you immediately get serious consistency problems. Why did OCFS2 suit the Oracle database? Because Oracle does not reuse space: when you write new data to the database, it is simply appended to the end of the file, and that is it. If you want to reclaim space, you run a compaction; I don't know how it is in modern Oracle, but that is how it was back in 2001. Compaction is an offline operation, and it guarantees consistency by owning the file it processes exclusively. So: are we going to reuse disk space? VKontakte just keeps plugging in new disks, and that is fine; I think that is the way to do it.

    • What will the load profile be? Reads, writes? For many distributed file systems write performance sags badly. Why? Because of consistency guarantees, because of atomic operations, because of synchronous operations across the cluster. NoSQL databases have only one synchronous cluster-wide operation, usually incrementing a record's version. The data itself may not be there yet, it may arrive later, but every node must think the same thing about the version of a given record. And even that is not true of all NoSQL systems: Cassandra, for example, does not bother with this; Cassandra has no synchronous cluster-wide operations at all. If you only read, try one of the clustered file systems; it may well work for you. These are the success stories where people come up and say, "Why did you bother with all this, just take Lustre." Yes, sometimes that really is enough.

    For some combinations of requirements, for some tasks, a ready-made solution does exist; for other combinations it does not, and it really does not. If you searched and did not find one, it means it is not there.

    So what do you do after all? Here is where to start:

    • you can spend a couple of months begging management for those 200 thousand euros and, once they are finally granted, do things properly. Just don't beg for a flat 200 thousand: first go to the vendor and work out with him how much you actually need, then ask for about one and a half times that;

    • or put all the files in a database after all; that is the path I took. We put our 450 million files into a database, but the trick worked only because we do not need POSIX and 95% of our files are small;

    • or you can write your own file system. Plenty of algorithms exist; we wrote ours on top of a NoSQL database, but you can pick something else. We wrote our first version on top of the PostgreSQL RDBMS; it gave us problems, not right away, only after about two years, but still. Writing a file system is actually not that hard. Even writing a POSIX file system is not that hard: take FUSE and go, there are not that many calls and they can all be implemented (a sketch follows below). What is genuinely hard is writing a file system that works well.
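    To show how few calls get you something mountable, here is a minimal read-only FUSE file system serving one in-memory file, assuming the third-party fusepy package. This is only a toy; a real storage backend, caching and write support are exactly the hard part warned about above.

```python
import errno
import stat
import sys
import time

from fuse import FUSE, FuseOSError, Operations  # the "fusepy" package

class HelloFS(Operations):
    FILES = {"/hello.txt": b"hello from userspace\n"}

    def getattr(self, path, fh=None):
        now = time.time()
        if path == "/":
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2,
                        st_ctime=now, st_mtime=now, st_atime=now)
        if path in self.FILES:
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                        st_size=len(self.FILES[path]),
                        st_ctime=now, st_mtime=now, st_atime=now)
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", ".."] + [name[1:] for name in self.FILES]

    def read(self, path, size, offset, fh):
        return self.FILES[path][offset:offset + size]

if __name__ == "__main__":
    # Usage: python hellofs.py /mnt/hello
    FUSE(HelloFS(), sys.argv[1], foreground=True)
```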

    This report is a transcript of one of the best talks at HighLoad++ Junior, the training conference for developers of high-load systems. We are now actively preparing its grown-up sibling, the 2016 edition: this year HighLoad++ will take place in Skolkovo on November 7 and 8.




    Some of these materials are also used in HighLoad.Guide, our online training course on building high-load systems: a chain of specially selected emails, articles, materials and videos. There are already more than 30 unique pieces in the course. Join in!

    By the way, tomorrow our blog continues with a transcript of another talk by Daniil Podolsky, "Experience in building and operating a large file storage". Subscribe so you don't miss it :)
