Big data backup

    When designing and operating our data warehouse, the question of how to do backups or replication came up several times. I always gave the same answer: no way. Let me explain why.

    Backups of large databases (from hundreds of gigabytes and up) are a rather pointless exercise for one simple reason: restoring from a backup can take days. If the database is in constant business use and data is loaded into it in a continuous stream, this is unacceptable. The situation is somewhat better with incremental backup to a standby system that can be brought online directly on top of the backed-up files. However, this method is not suitable for every database, only for those that never modify files once they have been written to disk. For example, it does not work for MySQL, where all tables live either in a single tablespace (InnoDB) or in separate per-table files (MyISAM) that are modified in place. For Vertica it is a workable option, since data is written to immutable files that never change after being written and are only deleted. However, in the case of clustered systems, the primary and standby systems must have identical topology. There may also be data integrity problems if the primary system fails.
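    To illustrate the file-level incremental approach, here is a minimal sketch (not a description of any particular vendor's tooling) that copies to the backup only those data files that are not there yet; this is safe precisely because, for such storage engines, files never change after being written. The paths are placeholders.

import os
import shutil

def incremental_backup(data_dir: str, backup_dir: str) -> None:
    """Copy only data files that are not yet present in the backup.

    Safe only for storage engines whose data files are immutable once
    written; files that are modified in place (e.g. InnoDB tablespaces)
    cannot be backed up this way.
    """
    for root, _dirs, files in os.walk(data_dir):
        rel = os.path.relpath(root, data_dir)
        target_root = os.path.join(backup_dir, rel)
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_root, name)
            if not os.path.exists(dst):      # a new immutable file: copy it
                shutil.copy2(src, dst)
            # files already in the backup are skipped: they never change

if __name__ == "__main__":
    # placeholder paths for illustration
    incremental_backup("/var/lib/warehouse/data", "/mnt/backup/warehouse")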

    Sometimes replication is used to maintain a standby system. But you have to understand that replication costs a fair amount of performance, since it requires writing a binary log, and if the replication is synchronous, waiting for the replica to acknowledge. In analytical applications with a large data flow, where thousands or tens of thousands of records per second must be loaded into the database continuously, this may be unacceptable.

    What to do? Quite a long time ago we came up with and implemented a mechanism for cloning, or multiplexing, systems that works not at the database level but at the source level. We maintain several “almost” identical systems that are not connected to each other but load the same data in the same way. Since users never write anything directly into the analytical databases, this is possible. Such cloning has another important advantage: you can have one or more test systems with real production data and load. Yet another advantage is staged deployment and QA: the behavior of a system with new functionality can be compared with the current production one, and errors caught in time.
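    As a rough sketch of the idea (not our actual loader), multiplexing at the source level can look like an ETL step that fans the same prepared batch out to several independent, unconnected warehouses. The target list and the load_batch helper below are hypothetical.

from typing import Iterable, Mapping

# Hypothetical targets; the clones know nothing about each other.
TARGETS = {
    "production-a": "dwh-a.internal/warehouse",
    "production-b": "dwh-b.internal/warehouse",
    "staging":      "dwh-stage.internal/warehouse",
}

def load_batch(target: str, table: str, rows: Iterable[Mapping]) -> None:
    """Hypothetical loader: bulk-insert the prepared rows into one target."""
    ...

def multiplex_load(table: str, rows: list) -> dict:
    """Load the same batch into every target independently; a failure on
    one clone does not stop the load into the others."""
    results = {}
    for name, target in TARGETS.items():
        try:
            load_batch(target, table, rows)
            results[name] = True
        except Exception:
            # The failed clone catches up later from the same source files.
            results[name] = False
    return results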

    So, cloning makes it possible to:
    • Have one or more permanently ready live backup systems
    • Have identical systems for different purposes or for load distribution
    • Have systems with the same data but different settings (for example, optimized for light or for heavy queries)
    • Have a test system with production data and load
    • Roll out new functionality gradually, reducing the risk of errors
    • Restore one system from another's data (by copying)
    • Manage all of this transparently

    And all this with no performance penalty and minimal risk. However, there are difficulties worth mentioning:
    • Monitoring data integrity between systems
    • Starting a new system from scratch

    Both of these problems are quite difficult to solve with 100% accuracy, but we did not need that. It is enough that the financially significant statistics match; detailed data may differ slightly or even be missing. In both cases, data can be synchronized by copying the meaningful data from a live system. Thus we always had full control and room to choose between an exact copy that takes a few days and a less exact one that is ready faster.
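    A minimal sketch of such an integrity check, assuming the financially significant aggregate is daily revenue and that a 0.1% relative difference is tolerable (both assumptions are illustrative): each clone is queried for the aggregate, and the results are compared day by day.

def compare_clones(revenue_a: dict, revenue_b: dict,
                   tolerance: float = 0.001) -> list:
    """Return the days on which two clones disagree on a financially
    significant aggregate (e.g. daily revenue) beyond the tolerance."""
    mismatched = []
    for day in sorted(set(revenue_a) | set(revenue_b)):
        va = revenue_a.get(day, 0.0)
        vb = revenue_b.get(day, 0.0)
        base = max(abs(va), abs(vb), 1.0)   # avoid division by zero
        if abs(va - vb) / base > tolerance:
            mismatched.append(day)
    return mismatched

if __name__ == "__main__":
    # made-up numbers for illustration
    a = {"2012-05-01": 1000.0, "2012-05-02": 2000.0}
    b = {"2012-05-01": 1000.5, "2012-05-02": 2100.0}
    print(compare_clones(a, b))  # -> ['2012-05-02']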

    The described approach has helped us many times. Moreover, it allowed us to run systems on databases from different vendors while using the same algorithms, which simplified migration from one database to another.

    Update: after receiving comments, some clarification turned out to be necessary.


    It was probably worth starting by describing what kind of data we process. I am sure the proposed approach works in other cases as well, but the specifics will not hurt.

    We process and load data into the warehouse from several types of sources:
    • Logs from runtime servers, recording the statistics and context of ad campaign impressions
    • An ontology and descriptions that allow the logs to be interpreted correctly
    • Data from our partners' sites

    All this data is loaded into the warehouse and is used by our customers, partners, our own users, and various algorithms that decide what to show, where, and how. A database failure means not only stopping the business and losing money, but also having to “catch up” on the data accumulated during the outage. Timeliness matters a great deal, so the question of a backup system is not an idle one.

    The amount of data is large. Raw logs take up a few terabytes per day, although the processed form is much smaller. The production database is growing steadily and currently occupies several terabytes.

    Cloning, or multiplexing, at the source level is, in our opinion, a good, simple and relatively inexpensive way to solve the backup problem, and it brings a number of additional advantages. The purpose of this article was to describe a working idea, not a specific implementation. The idea is quite universally applicable in cases where data gets into the warehouse only through ETL.

