Data storage algorithm: log-structured storage

As a rule, if you are developing a storage system, such as a file system or a database, one of the main problems is how to store data on disk. You have to take care of a number of tasks: allocating space for the objects you are going to store, indexing the data, handling what happens when you want to extend an existing object (for example, when appending data to a file), and dealing with the fragmentation that occurs when old objects are deleted and new ones take their place. All of this adds up to a lot of complexity, and the solutions are frequently buggy or inefficient.

Log-structured storage takes care of all these issues. It appeared in the 1980s, but recently we have seen it used more and more as a way to structure storage in database engines. In its original application to file systems it suffers from some drawbacks that prevented widespread adoption, but, as we will see, these are not so important for a DBMS, and log-structured storage brings additional advantages to a database.

The basis for organizing a log-structured storage system, as the name implies, is a log, that is, everything is written by appending data sequentially. Whenever you have new data to write, instead of finding a place for it on disk, you simply add it to the end of the log. The data is indexed by treating the metadata the same way: metadata updates are also appended to the log. This may seem inefficient, but on-disk index structures such as B-trees are, as a rule, very wide, so the number of index nodes that have to be updated with each write is generally very small. Let's look at a simple example. We start with a log containing only one data item, as well as an index node (call it A) that references it:
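Since the figure is not reproduced here, a minimal Python sketch can stand in for it. The log is modeled as a plain list, and all the names (append_record, the "kind" field, and so on) are illustrative rather than part of any real storage engine:

```python
# A toy model of the initial log state: one data item followed by the
# index node that references it. Positions in the list stand in for
# on-disk offsets.
log = []

def append_record(record):
    """Append a record to the end of the log and return its 'offset'."""
    log.append(record)
    return len(log) - 1

foo_pos = append_record({"kind": "data", "key": "Foo", "value": "foo v1"})
index_a = append_record({"kind": "index", "entries": {"Foo": foo_pos}})

print(log)
# [{'kind': 'data', 'key': 'Foo', 'value': 'foo v1'},
#  {'kind': 'index', 'entries': {'Foo': 0}}]
```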



So far so good. Now suppose we want to add a second element. We write the new element at the end of the log, followed by an updated version of the index node:



The original index entry (A) is still in the log file, but it is no longer used: it has been superseded by the new entry A', which references the original, unmodified copy of Foo as well as the new record Bar. When we want to read from our file system, we find the root index node and use it just as we would on any other storage system.
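In the same toy model, the second write and the read path might look roughly like this (again, the helper names and record layout are only illustrative):

```python
# Appending a second item: write the new data record, then a *new* copy
# of the index node (A') referencing both items. The old index node (A)
# stays in the log but is never consulted again.
log = [
    {"kind": "data", "key": "Foo", "value": "foo v1"},   # offset 0
    {"kind": "index", "entries": {"Foo": 0}},            # offset 1: old node A
]

def append_record(record):
    log.append(record)
    return len(log) - 1

bar_pos = append_record({"kind": "data", "key": "Bar", "value": "bar v1"})
root_pos = append_record({"kind": "index",
                          "entries": {"Foo": 0, "Bar": bar_pos}})  # node A'

def read(key, root):
    """Follow the current root index node to the data it references."""
    offset = log[root]["entries"][key]
    return log[offset]["value"]

print(read("Bar", root_pos))  # 'bar v1'
print(read("Foo", root_pos))  # 'foo v1' (still the original copy)
```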

A quick aside about finding the root index node. A naive approach would be to just look at the last block in the log, since the last thing we write is always the root index node. However, this is not ideal, since it is entirely possible that, at the moment you try to read the index, another process is partway through appending to the log. We can avoid this by having a single block, say at the beginning of the log, that contains a pointer to the current root node. Whenever we update the log, we rewrite this first block to make sure it points to the new root node.
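A minimal sketch of such a fixed root pointer block, assuming that slot 0 of the log is reserved for it:

```python
# Reserve a fixed slot (block 0 here) that always holds the offset of the
# current root index node. Readers look only at this slot; writers rewrite
# it after appending a new root.
log = [{"kind": "root-pointer", "root": None}]  # block 0, rewritten in place

def append_record(record):
    log.append(record)
    return len(log) - 1

def publish_root(root_offset):
    """Rewrite block 0 so readers always find the newest root index node."""
    log[0] = {"kind": "root-pointer", "root": root_offset}

foo_pos = append_record({"kind": "data", "key": "Foo", "value": "foo v1"})
root = append_record({"kind": "index", "entries": {"Foo": foo_pos}})
publish_root(root)

print(log[0])  # {'kind': 'root-pointer', 'root': 2}
```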

Next, let's see what happens when we update an item. Let's say we change Foo:



We started by writing a brand new copy of Foo at the end of the log. Then we updated the index node (just A in this example) and wrote it to the end of the log as well. Once again, the old copy of Foo remains in the log; the updated index simply does not reference it anymore.
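Sketched in the same toy model, an update appends a new copy of the value and a new index node, leaving the old copy behind as garbage (names and layout are illustrative):

```python
# Updating Foo: append a brand-new copy of the value, then a new index
# node that points at it. Nothing is modified in place; the old copy of
# Foo at offset 0 becomes garbage.
log = [
    {"kind": "data", "key": "Foo", "value": "foo v1"},    # offset 0
    {"kind": "data", "key": "Bar", "value": "bar v1"},    # offset 1
    {"kind": "index", "entries": {"Foo": 0, "Bar": 1}},   # offset 2
]

def append_record(record):
    log.append(record)
    return len(log) - 1

new_foo = append_record({"kind": "data", "key": "Foo", "value": "foo v2"})
new_root = append_record({"kind": "index",
                          "entries": {"Foo": new_foo, "Bar": 1}})

print(log[new_root]["entries"])  # {'Foo': 3, 'Bar': 1} -- offset 0 is now unreferenced
```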

You have probably noticed that this system cannot run indefinitely: at some point the old, stale data will simply use up all the space. In a file system, the disk is treated as a ring buffer and old data is reclaimed: any of that data that is still valid is appended to the log again, as if it had just been written, while the old copies that are no longer needed are overwritten.
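A rough sketch of that reclamation step, under the simplifying assumption that the oldest part of the log is always reclaimed first:

```python
# Reclaiming the oldest part of the log: any record there that the index
# still references is re-appended to the tail, as if freshly written; old
# copies that are no longer referenced are simply dropped (overwritten on
# a real disk).
log = [
    {"kind": "data", "key": "Foo", "value": "foo v1"},   # stale copy
    {"kind": "data", "key": "Bar", "value": "bar v1"},   # still live
    {"kind": "data", "key": "Foo", "value": "foo v2"},   # live
]
index = {"Foo": 2, "Bar": 1}

def reclaim_head(n):
    """Free the first n log positions, copying still-live records forward."""
    for offset in range(n):
        record = log[offset]
        if index.get(record["key"]) == offset:   # still referenced?
            log.append(record)                   # rewrite it at the tail
            index[record["key"]] = len(log) - 1
        log[offset] = None                       # space is now reusable

reclaim_head(2)
print(index)  # {'Foo': 2, 'Bar': 3} -- Bar was carried forward, stale Foo dropped
```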

In a file system, this is where one of the drawbacks I mentioned earlier shows up: as the disk fills, the file system has to spend more and more of its time collecting garbage and writing data back to the head of the log. By the time you reach about 80% full, the file system practically grinds to a halt.

If you use log-structured storage for a database, though, this is not a problem! If we divide the database into several segments of a fixed length, then when we need to reclaim some space we can pick a segment, rewrite any data from it that is still current to the end of the log, and then delete the segment. The first segment in our example is starting to look a little sparse:



All we did here was take the existing copy of Bar and write it at the end of the log, after which we updated the index node(s) as described above. Once we have done this, the first log segment can be safely deleted.
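A sketch of segment-based reclamation in the same spirit, assuming fixed-length segments and the made-up names below:

```python
# The log is divided into fixed-length segments. To reclaim a segment, we
# copy any still-live records out of it into the current tail segment,
# update the index, and then drop the whole segment.
SEGMENT_SIZE = 4

segments = [
    [("Foo", "foo v1"), ("Bar", "bar v1"), ("Foo", "foo v2")],  # mostly stale
    [("Foo", "foo v3")],                                        # current tail
]
# The index maps key -> (segment, slot); only Bar still lives in segment 0.
index = {"Foo": (1, 0), "Bar": (0, 1)}

def reclaim_segment(seg_no):
    """Copy live records out of a segment, then delete it."""
    for slot, (key, value) in enumerate(segments[seg_no]):
        if index.get(key) == (seg_no, slot):       # record still current?
            tail = segments[-1]
            if len(tail) >= SEGMENT_SIZE:          # start a new segment if full
                segments.append([])
                tail = segments[-1]
            tail.append((key, value))
            index[key] = (len(segments) - 1, len(tail) - 1)
    segments[seg_no] = None                        # segment can be deleted

reclaim_segment(0)
print(index)     # {'Foo': (1, 0), 'Bar': (1, 1)}
print(segments)  # [None, [('Foo', 'foo v3'), ('Bar', 'bar v1')]]
```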

This approach has several advantages over the file system case. To begin with, we are not restricted to garbage collecting the oldest segments first: if an intermediate segment is mostly garbage, we can collect that one instead. This is especially useful for databases, where some data is kept for a long time while other data is overwritten many times; we do not want to waste too much time rewriting the same unmodified data. We also have greater flexibility in deciding when to garbage collect.

The benefits of this approach for databases do not end there. In order to ensure transactional consistency, databases typically use a Write Ahead Log, or WAL. When a database wants to persist a change to disk, it first writes the intended changes to the WAL, flushes them to disk, and only then updates the actual database files. This allows it to recover after a crash by replaying the changes recorded in the WAL. If we use log-structured storage, however, the write-ahead log is the database, so we only need to write the data once. On recovery, we simply open the database, start from the last recorded index node, and scan forward, reconstructing any missing index entries from the data that follows.
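A sketch of what such a recovery scan might look like, assuming every data record carries its own key and the index is simply a mapping from keys to log offsets:

```python
# Crash recovery: start from the last index node that made it to disk and
# scan forward; any data records after it were written but not yet indexed,
# so their index entries are reconstructed from the records themselves.
log = [
    {"kind": "data", "key": "Foo", "value": "foo v1"},   # 0
    {"kind": "index", "entries": {"Foo": 0}},            # 1: last durable index
    {"kind": "data", "key": "Bar", "value": "bar v1"},   # 2: written after the last index
    {"kind": "data", "key": "Foo", "value": "foo v2"},   # 3
]

def recover(log):
    """Rebuild the in-memory index from the last index node plus the log tail."""
    last_index_pos = max(i for i, rec in enumerate(log) if rec["kind"] == "index")
    index = dict(log[last_index_pos]["entries"])
    for offset in range(last_index_pos + 1, len(log)):
        record = log[offset]
        if record["kind"] == "data":
            index[record["key"]] = offset
    return index

print(recover(log))  # {'Foo': 3, 'Bar': 2}
```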

Using our recovery scheme from above, we can optimize our writes further. Instead of writing updated index nodes with every write, we can cache them in memory and only write them to disk periodically. Our recovery mechanism will take care of reconstructing them, provided we give it a way to distinguish complete writes from incomplete ones.
We can also back up our database incrementally by copying each new log segment to backup media as it is finalized.
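A sketch of deferring those index writes with an in-memory cache that is flushed every few writes (the flush interval and names are made up for illustration):

```python
# Index updates are cached in memory and only appended to the log every
# FLUSH_EVERY writes; the recovery scan above fills in whatever was lost.
FLUSH_EVERY = 4

log = []
cached_index = {}        # in-memory view, always up to date
writes_since_flush = 0

def write(key, value):
    global writes_since_flush
    log.append({"kind": "data", "key": key, "value": value})
    cached_index[key] = len(log) - 1
    writes_since_flush += 1
    if writes_since_flush >= FLUSH_EVERY:
        log.append({"kind": "index", "entries": dict(cached_index)})
        writes_since_flush = 0

for i in range(6):
    write("key%d" % i, "value%d" % i)

# Only one index node has hit the log; the last two writes rely on recovery.
print(sum(1 for rec in log if rec["kind"] == "index"))  # 1
```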

The last major advantage of this system relates to concurrency and transaction semantics in databases. In order to ensure transactional consistency, most databases use sophisticated locking systems to control which processes can update data and when. Depending on the level of consistency required, this can involve readers taking out locks to make sure data is not modified while they read it, as well as writers locking data for writing, and it can cause a significant drop in performance even at relatively low write rates if there are enough simultaneous reads.

When it comes to writes, we can use optimistic concurrency. In a typical read-modify-write cycle, we first perform our reads as described above. Then, to write our changes, we take a write lock on the database and verify that none of the data we read in the first phase has been modified in the meantime. We can do this quickly by looking at the index and checking whether the address of each piece of data we care about is the same as it was when we last read it. If it is, no conflicting write has occurred and we can proceed with our modification. If it differs, a conflicting transaction has taken place, so we simply roll back and start again from the read phase.
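A sketch of that optimistic read-modify-write cycle, taking a lock only for the short validate-and-commit step (the threading primitives and names are illustrative, not a prescribed design):

```python
import threading

# Optimistic concurrency: read without locks, remember the offset you saw,
# then take a short write lock, check that the offset in the index is
# unchanged, and either commit or retry.
log = [{"kind": "data", "key": "Foo", "value": "foo v1"}]
index = {"Foo": 0}
write_lock = threading.Lock()

def read(key):
    offset = index[key]
    return log[offset]["value"], offset          # value plus where we found it

def update(key, transform):
    """Read-modify-write with optimistic validation; retries on conflict."""
    while True:
        value, seen_offset = read(key)           # phase 1: lock-free read
        new_value = transform(value)             # phase 2: compute the change
        with write_lock:                         # phase 3: validate and commit
            if index[key] == seen_offset:        # data unchanged since our read
                log.append({"kind": "data", "key": key, "value": new_value})
                index[key] = len(log) - 1
                return
        # A conflicting write got in first; go back and read again.

update("Foo", lambda v: v + " (updated)")
print(log[index["Foo"]]["value"])  # 'foo v1 (updated)'
```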
