
How a team of Airbnb engineers "split" the project's main database in a couple of weeks
(translation)
On our Habr blog, we like to break down interesting cases from the practical side of running virtual infrastructure for startups. We also pay attention to foreign experience, analyzing everything related to operating complex IT systems, infrastructure, and hardware.
For example, we recently covered:
- how Spotify scales Apache Storm,
- hardware for deep learning,
- and, briefly, an example of optimizing bandwidth in Ethernet networks.
Today we came across the Airbnb engineering blog and decided to discuss this well-known company's experience. According to its engineers, the service's traffic grows 3.5 times every year, with the peak falling in the summer. That certainly pleases the business side, but it also poses new challenges for the engineers.

Photo: OuiShare, CC

Airbnb runs an online platform for listing, finding, and short-term renting of private housing around the world. At first glance it seems a fairly simple service. Why would it need cloud technologies and performance optimization?

The answer is obvious: the service's audience of many millions of users, plus new regions constantly coming online, which means the IT infrastructure has to scale on the fly. All of this experience is collected in the company's engineering blog.
One task we particularly liked was scaling the database. Willy Yao, one of the engineers, described how the company prepared for the summer load peak (which makes sense: summer is the most popular season for traveling).
As often happens in creative, "live" teams, they found a solution that could in theory save several weeks of staff work: use MySQL replication to keep the data consistent. The goal in such a situation is always to avoid creating extra work for the programmers and to avoid wasting time on data migration.
It is worth noting that the Airbnb blog has repeatedly mentioned that the team uses vertical partitioning by function to distribute load and contain possible failures. Each independent Java and Rails service has its own dedicated database, each running on its own RDS instance.
Photo: Sebastiaan ter Burg, CC
The startup's rapid growth still took its toll on the IT side: a huge amount of data remained in the original database, left over from the days when Airbnb was a monolithic Rails application. Moreover, the database had last been partitioned three years earlier, which made repeating the procedure at current data volumes harder.
As a result, the team decided to take advantage of MySQL replication to simplify the design and keep the effort to a minimum. It is a proven technique.
The team was also helped by the fact that the MySQL database runs on the Amazon RDS service, which makes it relatively easy to create read replicas and to promote a replica to a standalone master.
So the plan was to create new replicas and block writes to certain tables in order to preserve data integrity.
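The replica-plus-write-freeze approach described above can be sketched with the standard AWS CLI and MySQL commands. This is only an illustrative runbook: the instance identifiers, hostnames, table names, and user names are hypothetical, not taken from the original post.

```shell
# 1) Create a read replica of the main RDS instance
#    (instance names here are made up for illustration).
aws rds create-db-instance-read-replica \
    --db-instance-identifier message-db-replica \
    --source-db-instance-identifier main-db

# 2) On the main master, revoke write privileges on the message tables
#    so their contents stay frozen while the replica catches up.
mysql -h main-db.example.com -u admin -p -e \
  "REVOKE INSERT, UPDATE, DELETE ON maindb.messages FROM 'app_user'@'%';"

# 3) Once replication has fully caught up, promote the replica
#    to a standalone master.
aws rds promote-read-replica --db-instance-identifier message-db-replica
```

Promotion is one-way: after `promote-read-replica`, the instance stops replicating from the source, which is exactly what the split needs.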
A query analyzer was used to prepare for the move. Its main job was to catch existing queries with cross-links between tables headed for different databases, so that their integrity and correctness would survive the split.
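The idea of such an analyzer can be shown with a minimal sketch: flag any query that touches tables on both sides of the split, since a join across them would break once the tables live in separate databases. The table names and matching logic here are assumptions for illustration, not Airbnb's actual tooling.

```python
import re

# Tables that will move to the new messages database (hypothetical names).
MESSAGE_TABLES = {"messages", "message_threads"}
# Tables staying in the main database (also hypothetical).
MAIN_TABLES = {"users", "listings", "reservations"}

def crosses_split(query: str) -> bool:
    """Return True if a query references tables on both sides of the
    split, i.e. it would break after the tables are separated."""
    words = set(re.findall(r"[a-z_]+", query.lower()))
    return bool(words & MESSAGE_TABLES) and bool(words & MAIN_TABLES)

queries = [
    "SELECT * FROM messages WHERE thread_id = 7",
    "SELECT u.name FROM users u JOIN messages m ON m.user_id = u.id",
]
flagged = [q for q in queries if crosses_split(q)]
print(len(flagged))  # only the cross-database JOIN is flagged -> 1
```

A real analyzer would of course parse SQL properly rather than match words, but the principle is the same: find the queries that must be rewritten before the tables part ways.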
Renaming the database would have meant updating its name in every data pipeline that referenced it, so in the end the team decided not to rename it: the old and new databases kept the same name.
Next, the team had to understand how a short outage of the inbound messaging service (up to 10 minutes) would affect customer support. For such maneuvers, the least busy time had to be chosen. The general plan was roughly as follows:
1) change the message queries so that switching the database host in the next step requires no code changes (there are tools for updating the settings);
2) redirect all message write traffic to the message master;
3) kill all connections to the message database on the main master;
4) check that replication has fully caught up;
5) promote the message master (about 3.5 minutes);
6) deploy against the updated message master, followed by an automatic RDS backup;
7) drop the now-unneeded tables from the corresponding databases.
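The critical gate in the plan above is step 4: the replica must be fully caught up before promotion, or writes would be lost. A minimal sketch of such a check, with the polling interface and timeout chosen by us for illustration (the real check would read `Seconds_Behind_Master` from `SHOW SLAVE STATUS`):

```python
import time

def wait_for_catchup(get_lag, timeout_s=210, poll_s=5):
    """Poll replication lag until the replica has fully caught up, or
    give up after ~3.5 minutes, matching step 5's time budget.
    `get_lag` is any callable returning the current lag in seconds."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_lag() == 0:
            return True          # safe to promote the replica
        time.sleep(poll_s)
    return False                 # abort the cutover and investigate

# Simulated lag readings: the replica catches up on the third poll.
readings = iter([12, 3, 0])
print(wait_for_catchup(lambda: next(readings), poll_s=0))  # True
```

Because writes to the message tables were already frozen on the main master (step 3), the lag can only shrink while this check runs.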
As a result, the Airbnb engineers saw a noticeable drop in the number of writes to the main database on the master. The project itself took about two weeks to carry out. During that time there were no more than seven 30-second outages of the inbound messaging service, and the main database shrank by 20%.
An even more important outcome of the project was the improved stability of the main database, achieved by cutting the number of write queries by 33%.