S3 metadata in PostgreSQL. Yandex lecture

    This is the second lecture from J.Subbotnik on databases - we published the first one a couple of weeks ago.

    Dmitry Sarafannikov, the head of the general-purpose DBMS group, spoke about the evolution of data storage in Yandex: how we decided to make an S3-compatible interface, why we chose PostgreSQL, what pitfalls we stepped on and how we dealt with them.


    - Hello! My name is Dima, and at Yandex I work on databases. I'll tell you how we built S3, how we came to the point of building exactly S3, and what kinds of storage we had before it. The first one is Elliptics; it is open source and available on GitHub. Many of you may have come across it.

    It is, in essence, a distributed hash table with a 512-bit key, the result of SHA-512. It forms a key ring that is randomly divided between machines. If you want to add machines, the keys are redistributed and rebalancing takes place. This storage has its own problems, related in particular to rebalancing. If you have a large enough number of keys, then with ever-growing volumes you need to keep adding machines, and on a very large number of keys the rebalancing may simply never converge. That was a big enough problem.

    At the same time, this storage is great for more or less static data: you upload a large amount of data once and then serve it with a read-only load. For such scenarios it works great.

    Let's move on. The problems with rebalancing were quite serious, so the next storage appeared.

    What is its essence? It is not a key-value storage, it is a value storage. When you upload an object or file there, it answers you with a key, which you can later use to fetch that file. What does this give? Theoretically, one hundred percent write availability, as long as there is free space in the storage. If some machines are down, you simply write to other machines that are up and have free space, get different keys, and calmly retrieve your data.

    This storage is very easy to scale, you can just throw hardware at it, and it will work. It is very simple and reliable. Its only drawback: the client does not control the key, and all clients have to store the keys somewhere, to keep a mapping of their keys. That is inconvenient. In essence, it is the same task for all clients, and everyone solves it in their own metabase, and so on. That is inconvenient. But at the same time, we don't want to lose the reliability and simplicity of this storage; it effectively works at the speed of the network.

    Then we started looking at S3. It is a key-value storage where the client controls the key, and the whole storage is divided into so-called buckets. In each bucket, the key space goes from minus infinity to plus infinity, and a key is some text string. We settled on this option. Why S3?

    It is quite simple. By that point, many ready-made clients had already been written for various programming languages, and many ready-made tools already existed for storing things in S3, say, database backups. Andrei talked about one such example. There is an already well-thought-out API that has been polished on clients for years, so there is nothing to invent. The API has many convenient features such as listings, multipart uploads, and so on. Therefore, we decided to go with it.

    How do we make S3 out of our storage? What comes to mind? Since the clients keep the key mapping themselves anyway, we'll simply take a number of databases and store the key mapping there. On reads, we'll simply find the key and the storage location in our database and give the client what they want. If you sketch this out, how does an upload work?

    There is an entity, here called Proxy, the so-called backend. It takes the file, uploads it to the storage, receives the key from there and saves it in the database. Quite simple.

    How does a read work? The proxy finds the required key in the database, goes to the storage with that key, downloads the object from there and gives it to the client. Also simple.

    How does deletion work? The proxy does not work with the storage directly, since it is hard to keep the database and the storage consistent, so it simply goes to the database and tells it that the object is deleted; the object is moved to a deletion queue, and then, in the background, a specially trained robot takes these keys and removes them from the storage and from the database. This is all quite simple too.

    We have selected PostgreSQL as the database for this metabase.

    You already know that we love it very much. During the migration of Yandex.Mail we gained a lot of PostgreSQL expertise, and while various mail services were moving, we developed several so-called sharding patterns. One of them, with slight modifications, suited S3 well.

    What are the sharding options? This is a large storage; at Yandex scale you have to assume right away that there will be a lot of objects and think through how to shard all of it. You can shard by the hash of the object name, which is the most reliable way, but it won't work here, because S3 has, for example, listings, which must show a list of keys in sorted order; once you hash, all your ordering is gone, and you would have to fetch all the objects to make the output conform to the API specification.

    The next option is to shard by the hash of the bucket name or bucket id. Then one bucket lives entirely inside one DB shard.

    Another option is to shard by key ranges. Inside a bucket, the key space goes from minus infinity to plus infinity; we can divide it into any number of ranges. We call such a range a chunk, and a chunk can live in only one shard.

    We chose the third option, sharding by chunks, because, in theory, a single bucket can contain an infinite number of objects, and that simply will not fit on one machine. That would cause big problems, so we cut buckets into chunks and spread them across shards however we like.

    What did we end up with? The whole database consists of three components. S3Proxy is a group of hosts, also databases, with PL/Proxy; they stand behind a load balancer, and requests from the backend go there. Next is S3Meta, a group of databases that stores information about buckets and chunks. And S3DB is the shards where the objects and the deletion queue are stored. If you sketch it, it looks like this.

    A request comes to S3Proxy, which goes to S3Meta and to S3DB and returns the result upward.

    Let's look closer. S3Proxy contains functions written in the procedural language PL/Proxy, a language that lets you execute stored procedures or queries remotely. This is what the code of the ObjectInfo function, essentially a GET request, looks like.
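
    Roughly, such a function might look like the sketch below. This is not the exact production code: the v1_code schema, the composite return type and the argument list are assumptions; only the CLUSTER / RUN ON / TARGET structure is standard PL/Proxy.

        CREATE OR REPLACE FUNCTION v1_code.object_info(i_bucket text, i_key text)
        RETURNS SETOF v1_code.object AS $$   -- v1_code.object: assumed composite type with object metadata
            -- run on the read-only cluster of S3DB shards
            CLUSTER 'db_ro';
            -- the shard number is computed by another function, discussed below
            RUN ON v1_code.get_object_shard(i_bucket, i_key);
            -- call the function of the same name on the chosen shard
            TARGET v1_code.object_info;
        $$ LANGUAGE plproxy;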

    A PL/Proxy function has a CLUSTER statement, in this case db_ro. What does that mean?

    A typical DB shard configuration is one master and two replicas. The master is included in the db_rw cluster, and all three hosts are included in db_ro; read-only requests can be sent to db_ro, while write requests go to db_rw. The db_rw cluster includes the masters of all shards.

    The next statement is RUN ON; it takes either the value all, which means execute on all shards, or an array, or a single shard number. In this case it takes the result of the get_object_shard function, which is the number of the shard where the object lives.

    And TARGET specifies which function to call on the remote shard. It will call that function and substitute the arguments that came into this one.

    The get_object_shard function is also written in PL/Proxy, this time on the meta_ro cluster; the request goes to the S3Meta shard returned by the get_bucket_meta_shard function.

    S3Meta can also be sharded; we provided for that too. For now it is not needed, but the possibility is there. And on S3Meta it will call the get_object_shard function.

    get_bucket_meta_shard is simply a hash of the bucket name as text; we shard S3Meta by the hash of the bucket name.
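
    A hedged sketch of how these two functions might fit together; the schema and the exact signatures are assumptions, not the production code:

        -- get_object_shard goes to S3Meta (meta_ro cluster) and asks which S3DB
        -- shard holds the chunk for this key
        CREATE OR REPLACE FUNCTION v1_code.get_object_shard(i_bucket text, i_key text)
        RETURNS int AS $$
            CLUSTER 'meta_ro';
            RUN ON v1_code.get_bucket_meta_shard(i_bucket);
            TARGET v1_code.get_object_shard;   -- the lookup function on S3Meta
        $$ LANGUAGE plproxy;

        -- get_bucket_meta_shard is computed locally: just a hash of the bucket name
        CREATE OR REPLACE FUNCTION v1_code.get_bucket_meta_shard(i_bucket text)
        RETURNS int AS $$
            SELECT hashtext(i_bucket);   -- PL/Proxy maps this value onto the meta cluster's partitions
        $$ LANGUAGE sql IMMUTABLE;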

    Let's look at S3Meta and what happens in it. The most important thing there is the table of chunks. I cut out some irrelevant columns; the most important ones left are bucket_id, the start key, the end key, and the shard where the chunk lives.
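
    A minimal sketch of what that table might look like (column names and types are assumed from the description):

        CREATE SCHEMA IF NOT EXISTS s3;

        CREATE TABLE s3.chunks (
            bucket_id  uuid NOT NULL,
            start_key  text,            -- NULL means minus infinity
            end_key    text,            -- NULL means plus infinity
            shard_id   int  NOT NULL    -- the S3DB shard this chunk lives in
        );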

    What would a query against such a table look like that returns the chunk containing, for example, the object named test? Something like this. We represent minus infinity in text form as a NULL value, so there are subtle points here: you have to check both start_key and end_key for IS NULL.
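
    The query itself is only on the slide; a reconstruction under the assumed column names above would be roughly:

        SELECT shard_id
          FROM s3.chunks
         WHERE bucket_id = '00000000-0000-0000-0000-000000000000'   -- some bucket
           AND (start_key <= 'test' OR start_key IS NULL)
           AND (end_key   >  'test' OR end_key   IS NULL);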

    The query doesn't look great, and the plan looks even worse: one possible plan for such a query uses BitmapOr and comes in at about 6,000 in cost.

    How can it be done differently? PostgreSQL has a great thing called a GiST index, which can index a range type, and a range is exactly what we need. We made such a type; the s3.to_keyrange function returns, in effect, a range, and with the containment operator we can find the chunk that contains our key. On top of this an exclusion constraint is built, which guarantees that chunks do not intersect. We want to enforce, preferably at the database level, a constraint that chunks cannot overlap, so that the query returns exactly one row; otherwise it would not be what we wanted. The plan for such a query is an ordinary index scan: the whole condition goes into the index condition, and the plan costs only about 700, roughly ten times less.
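
    A hedged sketch of how this can be put together. The talk only names s3.to_keyrange, so the type and function definitions here are assumptions; btree_gist is needed for the equality part of the constraint:

        CREATE EXTENSION IF NOT EXISTS btree_gist;   -- for "bucket_id WITH =" inside a GiST constraint

        -- a range type over text and a helper that treats NULL bounds as infinities
        CREATE TYPE s3.keyrange AS RANGE (subtype = text);

        CREATE OR REPLACE FUNCTION s3.to_keyrange(start_key text, end_key text)
        RETURNS s3.keyrange AS $$
            SELECT s3.keyrange(start_key, end_key);   -- NULL bound = unbounded
        $$ LANGUAGE sql IMMUTABLE;

        -- chunks of one bucket must not overlap; this also gives us the GiST index
        ALTER TABLE s3.chunks
          ADD CONSTRAINT chunks_do_not_overlap
          EXCLUDE USING gist (bucket_id WITH =,
                              s3.to_keyrange(start_key, end_key) WITH &&);

        -- the lookup becomes a plain index scan with the containment operator
        SELECT shard_id
          FROM s3.chunks
         WHERE bucket_id = '00000000-0000-0000-0000-000000000000'
           AND s3.to_keyrange(start_key, end_key) @> 'test'::text;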

    What is an EXCLUDE constraint?

    Let's create a test table with two columns and put two constraints on it: an ordinary UNIQUE constraint, which everyone knows, and an EXCLUDE constraint with the equality operator for both columns.

    Then we try to insert two identical rows and get a uniqueness violation error on the first constraint. If we drop that constraint, the same insert violates the exclusion constraint. This is the ordinary case of a unique constraint.

    In fact, a unique constraint is the same as an exclude constraint with equality operators, but with an exclude constraint you can build more general cases.
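
    A small self-contained illustration of that experiment (the table is made up; btree_gist is needed so GiST understands equality on plain integers):

        CREATE EXTENSION IF NOT EXISTS btree_gist;

        CREATE TABLE t (
            a int,
            b int,
            CONSTRAINT t_a_b_key  UNIQUE (a, b),
            CONSTRAINT t_a_b_excl EXCLUDE USING gist (a WITH =, b WITH =)
        );

        INSERT INTO t VALUES (1, 1);
        INSERT INTO t VALUES (1, 1);  -- fails with a unique violation on t_a_b_key;
                                      -- drop t_a_b_key and the same insert now fails
                                      -- with a violation of t_a_b_excl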

    Here are our indexes. If you look closely, you can see that both are GiST indexes, and in general they are the same. You might ask why we duplicate this thing at all. I'll explain.

    Indexes, especially GiST indexes, are such that as the table lives its own life, with updates, page splits and so on, the index degrades and stops being optimal. So there is a common practice, for example with the pg_repack extension, of rebuilding indexes periodically.

    How do you rebuild an index backing a unique constraint? You run CREATE INDEX CONCURRENTLY to quietly build the same index alongside, without locking, and then use ALTER TABLE to attach the constraint to the new index. And that's it, everything is clean and nice, it works.
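
    For example, for the unique constraint on the test table above, the non-blocking rebuild looks roughly like this:

        -- build an identical index next to the old one, without locking the table
        CREATE UNIQUE INDEX CONCURRENTLY t_a_b_new ON t (a, b);

        -- swap the constraint over to the freshly built index
        ALTER TABLE t
            DROP CONSTRAINT t_a_b_key,
            ADD  CONSTRAINT t_a_b_key UNIQUE USING INDEX t_a_b_new;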

    In the case of an exclude constraint, you can rebuild it only through a blocking REINDEX; more precisely, your index gets locked exclusively, and with it effectively all your queries. That is unacceptable, since a GiST index can take quite a while to build. So we keep a second index next to it, which is smaller and takes up less space; the planner uses it, and we can rebuild that index concurrently, without locking.

    Here is a CPU consumption graph. The green line is CPU consumption in user space, jumping between 50% and 60%. At this point consumption drops sharply: that is the moment the index is rebuilt. We rebuilt the index, deleted the old one, and CPU consumption plummeted. This is a real GiST index problem, and this is a clear example of it.

    When we built all of this, we started S3DB on version 9.5; according to the plan, we expected to put 10 billion objects into each shard. As you know, once you get past 1 billion rows, and even earlier, problems begin: with many rows in a table everything becomes much worse. The standard practice is partitioning. At the time there were two options: either the standard approach through inheritance, which does not work very well, since partition selection takes linear time, and judging by the number of objects we would need a lot of partitions; or the pg_pathman extension, which the folks from Postgres Pro were actively developing at the time.

    We chose pg_pathman; we had no real alternative. It was still version 1.4 back then. And, as you can see, we use 256 partitions: we split the entire objects table into 256 partitions.

    What does pg_pathman do? With an expression like this you can create 256 partitions, hash-partitioned by the bid column.
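
    With pg_pathman that is a single call; the table shape below is an assumption, the bid column and the 256 partitions follow the talk, and create_hash_partitions is the extension's own function:

        -- assumed minimal shape of the objects table
        CREATE TABLE objects (
            bid  uuid NOT NULL,   -- bucket id
            key  text NOT NULL,
            -- ... other columns ...
            PRIMARY KEY (bid, key)
        );

        -- split the objects table into 256 partitions by hash of the bid column
        SELECT create_hash_partitions('objects', 'bid', 256);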

    How does pg_pathman work?

    It registers its hooks in the planner and then rewrites the plans of incoming queries. We can see that for an ordinary lookup of the object named test it did not scan all 256 partitions, but determined right away that it needed to go into the objects_54 table. Still, things were not all smooth: pg_pathman has its own issues. First, there were quite a few bugs early on while it was being developed, but thanks to the folks from Postgres Pro, they fixed them promptly.

    The first problem is the complexity of updating it. The second problem is prepared statements.

    Let's look closer, in particular at updates. What does pg_pathman consist of?

    It consists, essentially, of C code packaged into a library, and of a SQL part: all the functions for creating partitions and so on, plus interfaces to the functions in the library. These two parts cannot be updated at the same time.

    This leads to difficulties. The algorithm for updating to a new pg_pathman version looks roughly like this: first we roll out the package with the new version, but the old version of the library is still loaded into PostgreSQL's memory and is still in use, so in any case the database has to be restarted.

    Next we call the set_enable_parent function, which enables the use of the parent table (it is disabled by default). Then we disable pathman, restart the database, and run ALTER EXTENSION UPDATE; during that time everything lands in the parent table.

    Then we enable pathman again and run a function from the extension that moves the objects that landed in the parent table during that short window back into the partitions where they belong. After that we disable the use of, and searches in, the parent table.
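
    Put together, the upgrade sequence looks roughly like the sketch below. set_enable_parent and partition_table_concurrently are pg_pathman's own functions; the table name and the exact placement of the restart are an approximation of what was described, not a verified recipe.

        -- 1. roll out the new pg_pathman package (the old library is still loaded)
        SELECT set_enable_parent('objects'::regclass, true);   -- allow falling back to the parent table
        SET pg_pathman.enable = off;                           -- disable pathman
        -- 2. restart PostgreSQL so the new library version gets loaded
        ALTER EXTENSION pg_pathman UPDATE;                     -- update the SQL part
        SET pg_pathman.enable = on;
        -- 3. move rows that landed in the parent table back into their partitions
        SELECT partition_table_concurrently('objects'::regclass);
        SELECT set_enable_parent('objects'::regclass, false);  -- stop using the parent table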

    The next problem is prepared statements.

    If we prepare the same ordinary query, a lookup by bid and key, and try to execute it, the first five times everything is fine. On the sixth execution we see a plan like this, and in it all 256 partitions. If you look closely at the conditions, you see $1 and $2: this is the so-called generic plan. The first five queries were planned individually, with custom plans built for the specific parameter values, so pg_pathman, knowing the parameters in advance, could immediately determine which table to go to. With a generic plan it cannot do that, so the plan has to include all 256 partitions, and when the executor runs it, it takes a shared lock on all 256 partitions. Such a solution immediately loses all its performance advantages.
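
    That behavior is easy to reproduce by hand; a sketch, assuming the objects table with bid and key columns from the partitioning example above:

        PREPARE get_obj (uuid, text) AS
            SELECT * FROM objects WHERE bid = $1 AND key = $2;

        -- the first five executions get custom plans, pruned to a single partition
        EXECUTE get_obj('00000000-0000-0000-0000-000000000000', 'test');
        -- ... four more executions ...
        -- from the sixth execution on, the planner switches to a generic plan that
        -- includes all 256 partitions and takes a shared lock on each of them
        EXECUTE get_obj('00000000-0000-0000-0000-000000000000', 'test');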

    How did we get out of this? We had to wrap everything inside the stored procedures in EXECUTE, that is, in dynamic SQL, so that prepared statements are not used and the plan is built on every call. That is how it works.
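
    A minimal sketch of what that wrapping looks like (the function name is made up):

        CREATE OR REPLACE FUNCTION object_info(i_bid uuid, i_key text)
        RETURNS SETOF objects AS $$
        BEGIN
            -- dynamic SQL: the statement is planned on every call, so pg_pathman
            -- can prune the 256 partitions down to the single one it needs
            RETURN QUERY EXECUTE
                'SELECT * FROM objects WHERE bid = $1 AND key = $2'
                USING i_bid, i_key;
        END;
        $$ LANGUAGE plpgsql;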

    The downside is that all the code touching these tables has to be wrapped in such constructs, which makes it harder to read.

    How are objects distributed? Each S3DB shard stores chunk counters: information about which chunks live in that shard and counters for each of them. Every mutating operation on an object (add, delete, modify, overwrite) changes these counters for the chunk. To avoid updating the same row while a chunk is being actively written to, we use a fairly standard technique: we insert a counter delta into a separate table, and once a minute a special robot aggregates all of it and updates the chunk counters.
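
    A hedged sketch of that scheme; the table and column names are invented for illustration:

        CREATE TABLE chunks_counters       (chunk_id bigint PRIMARY KEY,
                                            objects_count bigint NOT NULL DEFAULT 0,
                                            total_size    bigint NOT NULL DEFAULT 0);
        CREATE TABLE chunks_counters_queue (chunk_id bigint,
                                            objects_delta bigint,
                                            size_delta    bigint);

        -- every mutating operation inserts a delta instead of updating the chunk row
        INSERT INTO chunks_counters_queue (chunk_id, objects_delta, size_delta)
        VALUES (42, +1, 1048576);

        -- once a minute a robot drains the queue and folds the deltas into the counters
        WITH deltas AS (
            DELETE FROM chunks_counters_queue
            RETURNING chunk_id, objects_delta, size_delta
        )
        UPDATE chunks_counters c
           SET objects_count = c.objects_count + d.objects_delta,
               total_size    = c.total_size    + d.size_delta
          FROM (SELECT chunk_id,
                       sum(objects_delta) AS objects_delta,
                       sum(size_delta)    AS size_delta
                  FROM deltas
                 GROUP BY chunk_id) d
         WHERE c.chunk_id = d.chunk_id;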

    These counters are then delivered to S3Meta with some delay, so S3Meta has the full picture of how many objects each chunk contains; from that it can see the distribution of objects across shards and decide where a new chunk should go. When you create a bucket, a single chunk from minus infinity to plus infinity is created for it by default, and, depending on the current distribution of objects known to S3Meta, it is placed in some shard.

    When you upload data to this bucket, it all goes into this chunk; once it reaches a certain size, a special robot comes along and splits the chunk.

    We keep chunks small so that, if necessary, a small chunk can be moved to another shard. How does a chunk split work? A regular robot splits the chunk in S3DB and updates the information in S3Meta using a two-phase commit.

    Moving a chunk is a slightly more complicated operation: it is a two-phase commit across three databases, S3Meta and two S3DB shards, from one shard to the other.
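
    A very rough sketch of what a two-phase commit across the three databases looks like at the SQL level; the identifiers are made up, the real procedure also has to copy the object rows themselves, and max_prepared_transactions must be greater than zero on all participants:

        -- on the source S3DB shard
        BEGIN;
        -- ... move the chunk's rows out ...
        PREPARE TRANSACTION 'move_chunk_42_src';

        -- on the destination S3DB shard
        BEGIN;
        -- ... insert the chunk's rows ...
        PREPARE TRANSACTION 'move_chunk_42_dst';

        -- on S3Meta: point the chunk at the new shard
        BEGIN;
        UPDATE s3.chunks
           SET shard_id = 7
         WHERE bucket_id = '00000000-0000-0000-0000-000000000000'
           AND start_key = 'm';
        PREPARE TRANSACTION 'move_chunk_42_meta';

        -- if all three PREPAREs succeeded, commit everywhere:
        COMMIT PREPARED 'move_chunk_42_src';    -- and likewise _dst and _meta
        -- if anything failed: ROLLBACK PREPARED on whichever nodes managed to prepare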

    S3 has a feature called listings; it is the trickiest part, and it has its own problems. In essence, with a listing you tell S3: show me the objects I have. The parameter highlighted in red currently has the value NULL; it is the delimiter, and you can request listings with any delimiter you like.

    What does that mean? If the delimiter is not specified, we simply get a list of files. If we set a delimiter, S3 should, in effect, show us folders. It has to figure out that such folders exist and show all the folders and files in the current folder. The current folder is given by the prefix parameter, which here is NULL. We see that there are 10 folders.

    Keys are not stored in a hierarchical tree structure like in a file system. Each object is stored as a separate row, and they simply share a common prefix. S3 has to figure out that such a prefix is a folder.
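
    To make "folders from common prefixes" concrete: for delimiter '/' and an empty prefix, the naive, full-scan way to compute the first-level folders would be something like the sketch below (table and column names assumed).

        SELECT DISTINCT split_part(key, '/', 1) || '/' AS common_prefix
          FROM objects
         WHERE bid = '00000000-0000-0000-0000-000000000000'
           AND position('/' in key) > 0;   -- keys without the delimiter are plain files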

    This kind of logic maps poorly onto declarative SQL, but it is easy enough to describe in imperative code. The first version was done exactly that way: stored procedures in PL/pgSQL that processed this logic imperatively in a loop and required the REPEATABLE READ isolation level. We need to work with a single snapshot and execute all queries against it; otherwise, if someone uploads something between queries, we get inconsistent listings.

    Later we managed to rewrite all of this with a recursive CTE. It turned out very cumbersome, with logic so convoluted you can hardly make sense of it without a stiff drink, and it is all wrapped in EXECUTE inside PL/pgSQL. But we got a speedup of up to a hundred times in some cases. Here, for example, are percentile graphs of the response timings of the list objects function, before and after.

    The effect is visually noticeable, and the same goes for the load.

    We carried out the optimization in several stages. Here is a graph from another optimization, where our high quantiles dropped down to the low ones.

    For testing we use Docker; about Behave and testing with Behave there is a wonderful talk by Alexander Klyuev. Be sure to watch it; everything is very convenient, and writing tests is now a pleasure.

    We still have things to optimize. The most acute problem, as I showed you, is CPU consumption on S3Meta. The GiST index eats a lot of CPU, especially once it becomes suboptimal after numerous updates and deletes. There is clearly not enough CPU on S3Meta. We could stamp out more replicas, but that would be inefficient use of hardware. We have a group of hosts with PL/Proxy behind the balancer that remotely call functions on S3Meta and S3DB; in fact, we could make the proxies' CPUs do that work. To do that, we need to set up logical replication of the chunks from S3Meta to all the proxies. In principle, we plan to do this.

    Logical replication has a number of problems that we will have to solve, and we will try to push the fixes upstream. The second option is to abandon GiST and try to fit this text range into a btree. A range is not a one-dimensional type, and btree only works with one-dimensional types. But the constraint that chunks must not intersect allows our case to fit into a btree. Just yesterday we made a working prototype, implemented with PL/pgSQL functions; we got a noticeable speedup and will keep optimizing in this direction.
