Yandex.Mail success story with PostgreSQL
Vladimir Borodin (dev1ant on Habr), head of the storage systems operations group at Yandex.Mail, describes the difficulties of migrating a large project from Oracle Database to PostgreSQL. This is a transcript of his talk at the HighLoad++ 2016 conference.
Hello everyone! My name is Vova, and today I will talk about the Yandex.Mail databases.
First, a few facts that will matter later. Yandex.Mail is a fairly old service: it was launched in 2000, so we have accumulated a lot of legacy. It is, as it is now fashionable to say, a true highload service: more than 10 million users per day and some hundreds of millions in total. The backend receives more than 200 thousand requests per second at peak. We add more than 150 million emails per day that have passed spam and virus checks. The total volume of letters over all 16 years is more than 20 petabytes.
What is this talk about? How we moved the metadata from Oracle to PostgreSQL. The metadata does not amount to petabytes, only a little more than three hundred terabytes, but more than 250 thousand requests per second hit these databases. Keep in mind that these are small OLTP queries, mostly reads (about 80%).
This was not our first attempt to get rid of Oracle. In the early 2000s there was an attempt to move to MySQL; it failed. In 2007 or 2008 there was an attempt to write something of our own; it also failed. In both cases the failure was due not so much to technical reasons as to organizational ones.
What is metadata? On the slide it is highlighted with arrows: folders (a hierarchy with counters), labels (in essence, also lists with counters), collectors, threads and, of course, the letters themselves.
We do not store the letter bodies in the metadata databases; they live in separate storage. What we store there are envelopes, that is, some of the mail headers: sender, recipient, subject, date and similar things. We also store information about attachments and conversations.
Back to 2012
All of this lived in Oracle. We had a lot of logic in stored procedures in the database itself. The Oracle databases were the most efficient at utilizing their hardware: we packed a lot of data onto each shard, more than 10 terabytes, and, relatively speaking, with 30 cores a load average of 100 was normal working mode for us. That is not when everything is falling apart, but during normal operation.
There were few databases, so a lot was done by hand, without automation; there were many manual operations. To save money we divided the databases into "warm" (75%) and "cold" (25%): "warm" ones for active users, on SSDs, and "cold" ones for inactive users, on SATA drives.
Sharding and fault tolerance are an important topic at Yandex. Sharding, because you cannot cram everything into one shard, and fault tolerance, because we regularly take one of our data centers offline to check that everything keeps working.
How was this implemented? We have an internal service called BlackBox. When a request arrives at one of our backends, the backend extracts the authentication data: login, password, cookie, token or the like. It takes these to BlackBox, which on success returns the user ID and the name of the shard.
The backend then fed this shard name to the Oracle OCCI driver, and all the fault tolerance logic was implemented inside that driver. Roughly speaking, a special file, /etc/tnsnames.ora, mapped the shard name to the list of hosts serving that shard, and Oracle itself decided which of them was the master, which was a replica, which was alive, which was dead, and so on. In total, sharding was implemented by an external service and fault tolerance by the Oracle driver.
Most of the backends were written in C++. To avoid reinventing wheels, they have long shared a common abstraction called macs (meta access). It is just an abstraction for accessing the database, and for almost its entire life it had a single implementation, macs_ora, that went directly to Oracle. At the very bottom, of course, sits OCCI. There was also a small dbpool layer that implemented the connection pool.
That is how it was once conceived and designed, and it worked that way for a long time. Over time the abstractions leaked: backends started using methods from the macs_ora implementation directly, or, even worse, from dbpool. Then Java and various other backends appeared that could not use this library at all. All this spaghetti later had to be painfully untangled.
Oracle is a great database, but it had its problems. For example, deploying PL/SQL code is painful because of the library cache: if the database is under load, you cannot simply update the code of a function that is currently being used by some sessions.
The remaining problems were connected not so much with Oracle as with the approach we used. Again: lots of manual operations. Switching masters, deploying new databases, starting user transfers: everything was done by hand, because there were few databases.
From the development point of view, there is the drawback that the C++ Oracle driver has only a synchronous interface, so you cannot write a normal asynchronous backend on top of it. That caused some pain in development. The second pain was that spinning up a test database is problematic: first, because it is done by hand, and second, because it costs money.
Whatever anyone says, Oracle does have support, although enterprise support is often far from ideal. But the main reason for the move was money: Oracle is expensive.
In October 2012, more than 4 years ago, it was decided that we should get rid of Oracle. Not a word was said about PostgreSQL, no technical details were discussed; it was a purely political decision: get rid of it, with a deadline of 3 years.
Six months later we started the first experiments. What those six months were spent on, I will tell a little later. Those six months were important.
In April 2013 we experimented with PostgreSQL. At the time there was a very fashionable trend toward all sorts of NoSQL solutions, and we tried many different things. We also remembered that all the metadata is already stored in the mail search backend, so maybe that could be used, and we tried that option too.
The first successful experiment was with the collectors; I talked about it at a meetup at Yandex in 2014.
We took a small (2 terabytes) but rather heavily loaded (40 thousand queries per second) piece of mail metadata, one not very tightly coupled to the core metadata, and moved it from Oracle to PostgreSQL. We did it, we liked it, and we decided that PostgreSQL was our choice.
Next we drew up a prototype of the mail schema for PostgreSQL and began writing the entire flow of letters into it. We did this asynchronously: we inserted all 150 million letters a day into PostgreSQL, and if that write path failed, we did not care. It was a pure experiment and did not touch production.
This allowed us to test our initial hypotheses about the schema. When you have data you can afford to throw away, it is very convenient: you make a schema, stuff letters into it, see that it does not work, drop everything and start again. Data that can be dropped is excellent; we love it.
Thanks to this we also managed, to some extent, to do load testing directly under live load rather than with synthetics on separate stands, and to make initial estimates of the hardware needed for PostgreSQL. And, of course, experience: the main goal of the experiment and the prototype was experience.
Then the main work began. Development took about a year of calendar time. Long before it ended, we moved our own mailboxes from Oracle to PostgreSQL. We always understood that there would never be a night when we show everyone a "sorry, maintenance in progress" page, move 300 terabytes and come back up on PostgreSQL. That does not happen: we would certainly break something, roll back, and everything would be bad. We understood there would be a rather long period when some mailboxes lived in Oracle and some in PostgreSQL, a slow migration.
In the summer of 2015 we moved our own mailboxes: the Mail team itself, the people who write, test and administer the service, moved first. This greatly accelerated development. Some abstract Vasya suffering is one thing; you suffering yourself, while being able to fix it, is quite another.
Even before we had implemented all the features, we started moving inactive users. We call a user inactive if mail arrives and we keep adding letters, but he does not read them: not through the web, not mobile, not IMAP; he is simply not interested. Such users exist, unfortunately. We started moving these inactive users when, say, IMAP was not yet fully implemented, or half the endpoints of the mobile API did not yet work.
But this was not because we were so brave as to risk breaking everyone's mailboxes; it was because we had a plan B in the form of a reverse transfer, and it helped us a lot. It was even automated: if we had moved a user and he suddenly, say, opened the web interface, that is, woke up and became active, we moved him back to Oracle so that no features would break for him. This let us fix a pile of bugs in the transfer code.
Then came the migration itself. Here are some interesting facts: we spent 10 man-years rewriting all the spaghetti we had accumulated over 12 to 15 years.
The migration itself went very fast. This is a chart covering 4 months. Each line is the percentage of load that a given service serves from PostgreSQL, broken down by service: IMAP, web, POP3, tab, mobile and so on.
Unfortunately, you cannot jump a chasm 95% of the way. We could not move everyone by April, because registration remained in Oracle; it is a rather complicated mechanism. It turned out that we registered new users in Oracle and immediately moved them to PostgreSQL at night. In May we cut registration over as well, and in July we shut down all the Oracle databases.
A second implementation, macs_pg, appeared under our abstraction, and we untangled all the spaghetti: all the abstractions that had leaked had to be carefully rewritten. At the bottom sits libpq, and on top of it we made another small layer, apq, which implements the connection pool, timeouts and error handling, all of it asynchronous.
Sharding and fault tolerance work much the same way. The backend receives authentication data from the user and exchanges it at BlackBox for the user ID and the shard name. If the shard name has a pg prefix, it makes another request to a new service we called Sharpei. The backend passes it the user's identifier and the mode in which it wants the database: for example, "I want the master", "I want a synchronous replica" or "I want the nearest host". Sharpei returns the connection strings, and the backend opens a connection, holds it and uses it.
To know who is the master, which replica is alive or dead, and which is lagging behind, Sharpei polls the underlying databases once a second and asks for their status. At this point a single component appeared that took on both functions: sharding and fault tolerance.
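The talk does not show the actual checks Sharpei runs, but as a rough sketch, a health poller like it can ask each host in a shard standard built-in PostgreSQL questions:

```sql
-- Run on every host in the shard about once a second.
-- Returns false on the master and true on replicas:
SELECT pg_is_in_recovery();

-- On a replica, estimate how far behind it is, to decide
-- whether it should still receive read traffic:
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
```

Both functions are part of core PostgreSQL; how Sharpei actually combines these answers into host statuses is not described in the talk.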
In terms of hardware we made several changes. Since Oracle is licensed per processor core, we were forced to scale vertically: we packed a lot of memory and a lot of SSDs around few processor cores. We had a small number of databases with a small number of cores but with huge amounts of memory and disk. For fault tolerance we always had strictly one replica, because every additional one is money.
With PostgreSQL we changed the approach: we began making smaller databases with two replicas in each shard, which allowed us to direct the read load to the replicas. In Oracle everything was served from the master; in PostgreSQL we have three smaller machines instead of two big ones, and reads go to the replicas. Where with Oracle we scaled vertically, with PostgreSQL we scale horizontally.
In addition to the "warm" and "cold" databases, "hot" ones also appeared. Why? Because we suddenly discovered that 2% of active users create 50% of the load. There are such heavy users who hammer us, and we made separate databases for them. These are not much different from the warm ones, they also have SSDs, but they carry less load per processor core, because the CPU is used much more actively there.
Of course, we built automation for moving users between shards. For example, if a user is inactive and currently lives in a SATA database, but suddenly starts using IMAP, we move him to a "warm" database. Or, if he has not stirred for six months in a warm database, we move him to a "cold" one.
Moving only the old letters of an active user from SSD to SATA is something we would really like to do but still cannot. If you are an active user living on SSD and you have 10 million letters, they all sit on SSD, which is not very efficient. But so far PostgreSQL has no proper built-in partitioning.
We changed all identifiers. In Oracle they were all globally unique. We had a separate database that recorded which ranges belonged to which shard. Of course, we once had a major incident when, due to an error, identifier ranges overlapped, and about half of everything relied on their uniqueness. It was painful.
In PostgreSQL we decided to switch to a new scheme in which identifiers are unique within a single user. Where previously the letter identifier mid was unique on its own, now the pair (uid, mid) is unique. Every table has uid as its first field; everything is prefixed by it, and it is part of every key.
Besides taking less space, this has another non-obvious plus. Since all of these identifiers come from sequences, we have less contention for the last page of the index. In Oracle we added reverse key indexes to solve this problem. In PostgreSQL, since inserts go to different pages of the index, we use ordinary B-Trees, we get range scans, and all of one user's data sits next to each other in the index. It is very convenient.
We introduced revisions for all objects. This made it possible, first, to read non-stale data from replicas and, second, to serve incremental updates for IMAP and mobile: answering the question "what has changed in this folder since such and such revision" became much simpler.
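As a hedged sketch of the idea (the table and column names here are assumptions, not the actual Yandex.Mail schema), a per-user monotonically increasing revision turns incremental sync into a simple range predicate:

```sql
-- A client that last synced folder 5 of user 123 at revision 41
-- asks only for what changed afterwards:
SELECT mid, revision
  FROM mail.box
 WHERE uid = 123
   AND fid = 5
   AND revision > 41
 ORDER BY revision;
```

Because every table is prefixed by uid, such a query stays a cheap range scan over one user's adjacent index entries.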
PostgreSQL is very good with arrays and composite types, so we denormalized part of the data. Here is one example:
This is our main mail.box table. It contains a row for each letter, and its primary key is the (uid, mid) pair. There is also an array of label identifiers, lids, because one letter can carry more than one label. At the same time there is the task of answering the question "give me all letters with such and such a label". Obviously this needs some kind of index, but a B-Tree index on an array cannot answer such a query. Instead we use a tricky functional GIN index over the uid and lids fields. It lets us answer the question "give me all letters of such and such a user with such and such a label or labels".
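The talk does not show the index definition itself, so the following is only an illustrative reconstruction; the ulids helper function and the exact types are assumptions:

```sql
-- Fold uid into each label id, so that "user X's letters with label Y"
-- becomes a single lookup in one GIN index.
CREATE FUNCTION ulids(uid bigint, lids integer[]) RETURNS text[]
LANGUAGE sql IMMUTABLE AS $$
    SELECT coalesce(array_agg(uid::text || ':' || lid::text), '{}')
      FROM unnest(lids) AS lid;
$$;

CREATE INDEX i_box_ulids ON mail.box USING gin (ulids(uid, lids));

-- "give me all letters of user 123 with label 7":
SELECT mid
  FROM mail.box
 WHERE ulids(uid, lids) && ARRAY['123:7'];
```

The point of the trick is that a plain GIN index on lids alone would mix all users together, while indexing the combined expression keeps each lookup scoped to one user.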
- Since stored logic caused a lot of pain in Oracle, we swore there would be no stored logic in PostgreSQL at all. But in the course of our experiments and prototypes we realized that PL/pgSQL is quite good: it does not suffer from the library cache problem, and we found no other critical problems with it.
- At the same time, the amount of stored logic was greatly reduced, leaving only what is needed for logical data integrity. For example, if you store a letter, you must increment the counters on its folder and labels.
- Since there is no undo, the price of a mistake became much higher. We had to dig into undo a couple of times after deploying bad code; my colleague Alexander gave a separate talk about that at our meetup.
- Thanks to the absence of a library cache, deployment is much easier. We now deploy a couple of times a week instead of once a quarter, as before.
- Since we changed the hardware and began to scale horizontally, we changed our approach to managing the databases. We now manage them with SaltStack. Its main killer feature for us is the ability to see a detailed diff between what is currently on a database and what we expect of it. If what we see looks right, a person presses the "roll out" button and it rolls out.
- We now change the schema and code through migrations. We gave a separate talk about that as well.
- We moved away from manual operations and automated everything we could. Switching masters, transferring users, deploying new shards and so on: all of it is a button press and very simple.
- Since deploying a new database is a single button press, we got representative test environments for development. Every developer has a database or two, as many as he wants, which is very convenient.
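The minimal stored logic mentioned above, such as storing a letter and bumping its folder's counter in one transaction, might look roughly like this in PL/pgSQL. This is an illustrative sketch, not Yandex's actual code; the mail.folders table and the column names are assumptions:

```sql
CREATE FUNCTION store_letter(i_uid bigint, i_mid bigint, i_fid integer)
RETURNS void
LANGUAGE plpgsql AS $$
BEGIN
    -- Both changes happen in the caller's transaction, so the folder
    -- counter cannot drift away from the actual number of letters.
    INSERT INTO mail.box (uid, mid, fid)
    VALUES (i_uid, i_mid, i_fid);

    UPDATE mail.folders
       SET message_count = message_count + 1
     WHERE uid = i_uid
       AND fid = i_fid;
END;
$$;
```

Keeping only this kind of integrity-preserving logic in the database, while everything else lives in the backends, is the compromise the talk describes.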
Such things never go smoothly.
This is a list of threads on the community mailing lists about problems we could not solve on our own:
- Problem with ExclusiveLock on inserts
- Checkpoint distribution
- ExclusiveLock on extension of relation with huge shared_buffers
- Hanging startup process on the replica after vacuuming on master
- Replication slots and isolation levels
- Segfault in BackendIdGetTransactions
That is, we went to the community, and it helped us. It was a test of what happens when you have no enterprise support: there is a community, and it works, and that is very cool. Of course, we solved many more problems ourselves.
For example, we had a very popular joke here: "In any unclear situation, blame autovacuum." We solved those problems too.
We really lacked ways to diagnose PostgreSQL, so the guys from Postgres Pro built a wait interface for us. I already talked about this at PG Day 2015 in St. Petersburg; you can read there how it works. With the help of the people from Postgres Pro and EnterpriseDB it made it into the 9.6 core. Not all of it, but some of that work was included in 9.6, and this functionality will keep improving. In 9.6, columns appeared that make it much easier to understand what is happening in the database.
Surprise: we ran into a problem with backups. We have a recovery window of 7 days, that is, we must be able to recover to any point in time within the past 7 days. In Oracle, the space needed for all backups and archive logs was roughly the size of the database: a 15-terabyte database needed about 15 terabytes for 7 days of backups.
In PostgreSQL we use barman, and with it backups need at least 5 times more space than the database itself. WALs are compressed but base backups are not, the file-level increments do not really work, and everything is single-threaded and very slow. If we backed up our 300 terabytes of metadata as is, we would need about 2 petabytes for backups. Remember, the entire mail storage is 20 petabytes. That is, we would have to carve off 10% of it just for 7 days of metadata backups, which is a pretty bad plan.
We could not come up with anything better than patching barman; here is the pull request. Almost a year has passed since we asked them to take this killer feature, and they ask us for money to merge it. Very arrogant guys. My colleague Eugene, who implemented it all, talked about it at PGday 2016. It really compresses backups much better, speeds them up, and gives honest increments.
From the experience of the experiment, the prototype and the other databases that had appeared on PostgreSQL by then, we expected a pile of rakes to step on during the transfer. But there were none. There were many problems, yet they were not connected with PostgreSQL, which surprised us. There were plenty of data problems, because a lot of legacy had accumulated over more than 10 years: it suddenly turned out that in some databases the data was encoded in KOI8-R, or other strange things. Of course, there were also errors in the transfer logic, so the data had to be repaired as well.
There are things we really miss in PostgreSQL.
For example, partitioning, to move old data from SSD to SATA. We also lack a good built-in recovery manager, so that we would not have to use our fork of barman, which will probably never be merged into barman upstream. We are already tired of it: we have been nudging them for almost a year, and they are in no hurry. It seems this should not live off to the side of PostgreSQL but in the core itself.
The wait interface will be developed further. I think a quorum commit will land in version 10; the patch is in good shape. We would also really like proper work with the disk: in terms of disk I/O, PostgreSQL loses a lot to Oracle.
What is the result? Counting the replicas, we have more than 1 petabyte in PostgreSQL. I recently counted just over 500 billion rows, with 250 thousand requests per second coming in. In total it took us 3 calendar years, but we spent more than 10 person-years, so the effort of the whole team is quite impressive.
What did we get? Deployment became faster even though the databases became much larger, and the number of DBAs decreased: this project now needs fewer DBAs than it did on Oracle.
Whether we wanted it or not, we had to refactor all the backend code, and all the legacy that had accumulated over the years was cut out. Our code is cleaner now, which is very good.
There is always a fly in the ointment, though. We now have 3 times more hardware under PostgreSQL, but that is nothing compared to the cost of Oracle. And so far we have had no large outages.
A small comment from me. In Mail we use many open source libraries, projects and ready-made solutions. To the three chairs on which we have sat firmly almost everywhere (Linux, nginx, postfix), PostgreSQL has now been added. We now use it for many databases in other projects. We liked it. The fourth chair is a good, reliable one. I think this is a success story.
That's all. Thanks!