How we migrated the database from Redis and Riak KV to PostgreSQL. Part 1: the process
This is the first part of the article in which I will talk about how we built the process of working on a large database migration project: about safe experiments, team planning, and cross-team interaction. In the following articles I’ll talk in more detail about the technical problems that we solved: about scaling and fault tolerance of PostgreSQL and load testing.
For a long time, the main database in Miro (ex-RealtimeBoard) was Redis. We stored in it all the basic information: data about users, accounts, boards, etc. Everything worked quickly, but we ran into a number of problems.
Problems with Redis
These problems, together with an increase in the amount of data on the servers, served as the reason for the database migration.
The decision on migration has been made. The next step is to understand which database is suitable for our data model.
We conducted a study to select the optimal database for us, and settled on PostgreSQL. Our data model fits well with a relational database: PostgreSQL has built-in tools to ensure data consistency, there is a JSONB type and the ability to index certain fields in JSONB. It suits us.
The simplified architecture of our application looked like this: there are Application Servers that access Redis and RiakKV through the data layer.
Our Application Server is a monolithic Java application. The business logic is written in a framework that is adapted for NoSQL. The application has its own transactional system, which allows you to provide multiple users on any of our boards.
We used RiakKV to store data from archive boards that did not open for 7 days.
Add PostgreSQL to this schema. We make Application servers work with the new database. Copy data from Redis and RiakKV to PostgreSQL. The problem is solved!
Nothing complicated, but there are nuances:
We faced a choice of two options for data migration:
Stopping the development of a service is a loss of time that we could use for growth, which means a loss of users and market share. This is critical for us, so we chose the option with smooth migration. Despite the fact that in complexity this process can be compared to replacing wheels on a car while driving.
When evaluating the work, we divided our product into main blocks: users, accounts, boards, and so on. Separately, work was carried out to create the PostgreSQL infrastructure. And they put risks in the assessment in case something goes wrong (the way it happened).
The next step is to build a team of five people so that everyone moves at the right speed to a common goal.
We have two points: the beginning of work on the task and the final goal. Ideal when we are moving towards the goal in a direct way. But it often happens that we want to go the straight way, but it turns out like this:
For example, due to difficulties and problems that could not be foreseen in advance.
A situation is possible in which we will not reach the goal at all. For example, if we go into deep refactoring or rewriting the entire application.
We split the task into weekly sprints to minimize the difficulties described above. If the team suddenly leaves to the side, it can quickly go back with minimal losses for the project, since short iterations do not allow you to go too far "wrong way".
Each iteration has its own goal, which moves the team to the final big result.
If a new task appears during the sprint, we evaluate whether its implementation brings us closer to the goal. Yes - take the next sprint or change priorities in the current one, if not - do not take it. If errors appear, we give them high priority and quickly fix them.
It happens that developers within a sprint must perform tasks in a strictly defined sequence. Or, for example, the developer hands over the finished task to the QA engineer for urgent testing. At the planning stage, we try to build similar relationships between tasks for each team member. This allows the whole team to see who will do what, and when, not forgetting about dependence on others.
The team has daily and weekly syncs. Every morning, we discuss who, what, and in what priority will do today. After each sprint, we synchronize with each other to be sure that everyone is moving in the right direction. Be sure to plan for large or complex releases. We appoint on-duty developers who, if necessary, are present during the release and monitor that everything is in order.
Planning and synchronization within the team allow involving all participants in all stages of the project. Plans and evaluations do not come to us from above, we make them ourselves. This increases the responsibility and interest of the team in completing tasks.
This is one of our sprints. We carry everything on the Miro board:
During the migration, we had to guarantee the stable operation of the service in combat conditions. To do this, you need to be sure that everything is tested and there are no errors anywhere. To achieve this goal, we decided to make our smooth migration even smoother.
The idea was to gradually switch product blocks to a new database. To do this, we came up with a sequence of modes.
In the first “Redis Read / Write” mode, only the old database, Redis, works.
In the second “PostgreSQL Passive Write” mode, we can make sure that writing to the new database is correct and the databases are consistent.
Third mode “PostgreSQL Read / Write, Redis Passive Write”allows you to verify the correctness of reading data from PostgreSQL and see how the new database behaves in combat conditions. At the same time, Redis remains the main base, which enabled us to find specific cases of working with boards that could lead to errors.
In the last “PostgreSQL Read / Write” mode, only the new database is running.
Migration work could affect the main functions of the product, so we had to be 100% sure that we won’t break anything and that the new database works at least as slowly as the old one. Therefore, we began to conduct safe experiments with switching modes.
We started switching modes on our corporate account, which we use daily in work. After we made sure that there were no errors in it, we started switching modes on a small selection of external users.
Timeline of launching experiments with the modes is as follows:
If errors occurred, we had the opportunity to quickly fix them, because we ourselves could make releases on servers where the users participating in the experiment worked. We did not depend on the main release, so we fixed errors quickly and at any time.
During the migration, we often intersected with teams that released new features. We have a single code base, and as part of their work, teams could change existing structures in a new database or create new ones. At the same time, intersections of teams for the development and withdrawal of new features could occur. For example, one of the product teams promised the marketing team to release a new feature by a specific date; the marketing team has planned an advertising campaign for this period; A sales team is waiting for a feature and campaign to start communicating with new customers. It turns out that everyone depends on each other, and delaying the deadlines by one team disrupts the plans of the other.
To avoid such situations, we, together with other teams, compiled a single grocery roadmap, which was synchronized several times a quarter, and with some teams weekly.
What we learned during this project:
In the following articles I will talk more about the technical problems that we solved during the migration.
For a long time, the main database in Miro (ex-RealtimeBoard) was Redis. We stored in it all the basic information: data about users, accounts, boards, etc. Everything worked quickly, but we ran into a number of problems.
Problems with Redis
- Dependence on network latency. Now in our cloud it is about 20 Moscow time, but when you increase it, the application will start to work very slowly.
- The lack of indexes that we need at the level of business logic. Their independent implementation can complicate the business logic and lead to data inconsistency.
- Code complexity also makes it difficult to maintain data consistency.
- Resource intensity of queries with selections.
These problems, together with an increase in the amount of data on the servers, served as the reason for the database migration.
Formulation of the problem
The decision on migration has been made. The next step is to understand which database is suitable for our data model.
We conducted a study to select the optimal database for us, and settled on PostgreSQL. Our data model fits well with a relational database: PostgreSQL has built-in tools to ensure data consistency, there is a JSONB type and the ability to index certain fields in JSONB. It suits us.
The simplified architecture of our application looked like this: there are Application Servers that access Redis and RiakKV through the data layer.
Our Application Server is a monolithic Java application. The business logic is written in a framework that is adapted for NoSQL. The application has its own transactional system, which allows you to provide multiple users on any of our boards.
We used RiakKV to store data from archive boards that did not open for 7 days.
Add PostgreSQL to this schema. We make Application servers work with the new database. Copy data from Redis and RiakKV to PostgreSQL. The problem is solved!
Nothing complicated, but there are nuances:
- We have 2.2 million registered users. Every day, Miro employs 50 thousand users, the peak load is up to 14 thousand at the same time. Users should not encounter errors due to our work, they generally should not notice the moment of moving to a new base.
- 1 TB of data in the database or 410 million objects.
- Continuous release of new features by other teams, whose work we should not interfere.
Options for solving the problem
We faced a choice of two options for data migration:
- Stop the development of the service → rewrite the code on the server → test the functionality → launch a new version.
- Perform a smooth migration: gradually transfer parts of the product to a new database, supporting both PostgreSQL and Redis and not interrupting the development of new features.
Stopping the development of a service is a loss of time that we could use for growth, which means a loss of users and market share. This is critical for us, so we chose the option with smooth migration. Despite the fact that in complexity this process can be compared to replacing wheels on a car while driving.
When evaluating the work, we divided our product into main blocks: users, accounts, boards, and so on. Separately, work was carried out to create the PostgreSQL infrastructure. And they put risks in the assessment in case something goes wrong (the way it happened).
Sprints and Goals
The next step is to build a team of five people so that everyone moves at the right speed to a common goal.
We have two points: the beginning of work on the task and the final goal. Ideal when we are moving towards the goal in a direct way. But it often happens that we want to go the straight way, but it turns out like this:
For example, due to difficulties and problems that could not be foreseen in advance.
A situation is possible in which we will not reach the goal at all. For example, if we go into deep refactoring or rewriting the entire application.
We split the task into weekly sprints to minimize the difficulties described above. If the team suddenly leaves to the side, it can quickly go back with minimal losses for the project, since short iterations do not allow you to go too far "wrong way".
Each iteration has its own goal, which moves the team to the final big result.
If a new task appears during the sprint, we evaluate whether its implementation brings us closer to the goal. Yes - take the next sprint or change priorities in the current one, if not - do not take it. If errors appear, we give them high priority and quickly fix them.
It happens that developers within a sprint must perform tasks in a strictly defined sequence. Or, for example, the developer hands over the finished task to the QA engineer for urgent testing. At the planning stage, we try to build similar relationships between tasks for each team member. This allows the whole team to see who will do what, and when, not forgetting about dependence on others.
The team has daily and weekly syncs. Every morning, we discuss who, what, and in what priority will do today. After each sprint, we synchronize with each other to be sure that everyone is moving in the right direction. Be sure to plan for large or complex releases. We appoint on-duty developers who, if necessary, are present during the release and monitor that everything is in order.
Planning and synchronization within the team allow involving all participants in all stages of the project. Plans and evaluations do not come to us from above, we make them ourselves. This increases the responsibility and interest of the team in completing tasks.
This is one of our sprints. We carry everything on the Miro board:
Modes and Safe Experiments
During the migration, we had to guarantee the stable operation of the service in combat conditions. To do this, you need to be sure that everything is tested and there are no errors anywhere. To achieve this goal, we decided to make our smooth migration even smoother.
The idea was to gradually switch product blocks to a new database. To do this, we came up with a sequence of modes.
In the first “Redis Read / Write” mode, only the old database, Redis, works.
In the second “PostgreSQL Passive Write” mode, we can make sure that writing to the new database is correct and the databases are consistent.
Third mode “PostgreSQL Read / Write, Redis Passive Write”allows you to verify the correctness of reading data from PostgreSQL and see how the new database behaves in combat conditions. At the same time, Redis remains the main base, which enabled us to find specific cases of working with boards that could lead to errors.
In the last “PostgreSQL Read / Write” mode, only the new database is running.
Migration work could affect the main functions of the product, so we had to be 100% sure that we won’t break anything and that the new database works at least as slowly as the old one. Therefore, we began to conduct safe experiments with switching modes.
We started switching modes on our corporate account, which we use daily in work. After we made sure that there were no errors in it, we started switching modes on a small selection of external users.
Timeline of launching experiments with the modes is as follows:
- January-February: Redis read / write
- March-April: PostgreSQL passive write
- May-June: PostgreSQL read / write, main database - Redis
- July-August: PostgreSQL read / write
- September-December: full migration.
If errors occurred, we had the opportunity to quickly fix them, because we ourselves could make releases on servers where the users participating in the experiment worked. We did not depend on the main release, so we fixed errors quickly and at any time.
Cross-team collaboration
During the migration, we often intersected with teams that released new features. We have a single code base, and as part of their work, teams could change existing structures in a new database or create new ones. At the same time, intersections of teams for the development and withdrawal of new features could occur. For example, one of the product teams promised the marketing team to release a new feature by a specific date; the marketing team has planned an advertising campaign for this period; A sales team is waiting for a feature and campaign to start communicating with new customers. It turns out that everyone depends on each other, and delaying the deadlines by one team disrupts the plans of the other.
To avoid such situations, we, together with other teams, compiled a single grocery roadmap, which was synchronized several times a quarter, and with some teams weekly.
conclusions
What we learned during this project:
- Do not be afraid to take on complex projects. After decomposition, evaluation and development of approaches to work, complex projects cease to seem impossible.
- Do not spare time and effort on preliminary estimates, decomposition and planning. This helps to understand the problem deeper before you start working on it, and to understand the volume and complexity of the work.
- Lay risks in difficult technical and organizational projects. In the process of work, you will surely encounter a problem that was not taken into account when planning.
- Do not migrate unless necessary.
In the following articles I will talk more about the technical problems that we solved during the migration.