Database update and zero-downtime deployment

    A lot of articles have been written about updating systems “on the fly” without stopping them (zero-downtime deployment), and many aspects of this approach are fairly obvious. In my opinion, the hardest part of such a deployment is updating the data storage when its contract (schema) has changed. That is the aspect I would like to examine in this article.

    Whatever the database, whether it has an explicit schema, as relational databases do, or an arbitrary one, as in NoSQL, a data schema is still present, at least at the application level. The data read from the database must make sense to the client even if the storage itself imposes no restrictions on its structure.

    Suppose a system with a certain data structure, and terabytes of data in the database, is already running in production. In the new version of the system, we need to change the structure slightly to implement new features or improve performance. Consider the kinds of schema changes that can occur:

    • Adding a new field
    • Deleting a field
    • Renaming a field
    • Changing a field's type
    • Moving a field to another data structure (for example, during denormalization)

    Adding a new field, like adding any other database object, is a backward-compatible change and requires no additional steps to achieve zero-downtime deployment, with one proviso: the new field or object must not depend functionally on data already stored in the database. It is enough to apply the change to the database “on the fly” and then deploy the new version of the code that uses the new objects.
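    To make the backward compatibility of an additive change concrete, here is a minimal sketch using SQLite and an in-memory database. The table and column names are hypothetical; the point is that the old code path, which names its columns explicitly, keeps working after the column is added.

```python
import sqlite3

# Hypothetical schema: a "users" table that the old application version
# reads with explicit column names, so an added column does not break it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

# Apply the additive change "on the fly" while the old code keeps running.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# Old code path: the unchanged query is unaffected by the new column.
old_read = conn.execute("SELECT id, name FROM users").fetchall()

# New code path: starts using the new column once the new version is deployed.
conn.execute("UPDATE users SET email = 'alice@example.com' WHERE name = 'alice'")
new_read = conn.execute("SELECT name, email FROM users").fetchone()
```

    Note that the old query would break if it used `SELECT *` and relied on column positions, which is one more reason to name columns explicitly.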

    Deleting a field or any other database object is not a backward-compatible change, but handling it is very simple: unnecessary database objects should be removed only after the new version of the system is fully deployed.

    The remaining three kinds of change are harder to perform with zero downtime. In general, all of them can be carried out by copying data to other fields or entities and deleting the “old” data once the migration completes: a rename copies data from the old field to a field with the new name and then drops the old field; a type change can be combined with a rename; and so on. In any case, for some period of time the database must support both the old and the new contract. There are at least two ways to make such changes on the fly.

    If the database supports triggers

    1. Create triggers that copy data from the old location to the new one on every insert or update, and install them in production.
    2. Run a data-conversion utility that does the same thing for all existing records in the database. Since the triggers are already installed, the utility need do nothing more complicated than a “fictitious” update of each record (UPDATE table SET field = field ...). A very important point: the act of reading data from the old location and writing it to the new one must be atomic and protected against lost updates. Depending on the database, you can use either pessimistic locking via SELECT FOR UPDATE (or its analogues) or optimistic locking, if the tables carry a record-version field.
    3. After the utility finishes (depending on the amount of data and the complexity of the conversion, this can take days), deploy the new version of the system that supports the new data schema. By then, every record that existed when the utility started will have been converted by it, and every record created while it ran will have been converted by the triggers.
    4. Delete the triggers and all fields (or other database objects) that are no longer needed.
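    The steps above can be sketched end to end with SQLite, which supports triggers, driven from Python. The scenario is a hypothetical rename of `fullname` to `display_name`; in a real deployment the fictitious update would be batched and protected against lost updates as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.executemany("INSERT INTO accounts (fullname) VALUES (?)",
                 [("alice",), ("bob",)])

# Step 1: add the new column and install triggers that keep it in sync
# with the old one on every insert and update.
conn.execute("ALTER TABLE accounts ADD COLUMN display_name TEXT")
conn.executescript("""
CREATE TRIGGER accounts_ins AFTER INSERT ON accounts BEGIN
    UPDATE accounts SET display_name = NEW.fullname WHERE id = NEW.id;
END;
CREATE TRIGGER accounts_upd AFTER UPDATE OF fullname ON accounts BEGIN
    UPDATE accounts SET display_name = NEW.fullname WHERE id = NEW.id;
END;
""")

# Step 2: the conversion utility performs a "fictitious" update of every
# existing record; the UPDATE trigger does the actual copying.
conn.execute("UPDATE accounts SET fullname = fullname")

# Records inserted while the utility runs are converted by the INSERT trigger.
conn.execute("INSERT INTO accounts (fullname) VALUES ('carol')")

rows = conn.execute(
    "SELECT fullname, display_name FROM accounts ORDER BY id").fetchall()
```

    After the new version of the system is deployed, step 4 would drop the two triggers and the `fullname` column.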

    If triggers are not available (as in many NoSQL solutions)

    1. Create and roll out a new version of the application (temporary version 1 in the figure) that still reads from the old field but, on write, updates both the old field and the corresponding new location (in the figure, “C” marks the old location and “H” the new one). Roll this version out to every node that runs an application instance.
    2. Run a utility that copies data from the old location to the new one. As in the trigger-based approach, measures must be taken against lost updates.
    3. Create another version of the application (temporary version 2) that reads data from the new field but still writes to both places. This step is necessary because nodes are updated one by one, so there is a window during which instances of the previous version, which read the old field, run alongside the new one.
    4. Create the final version, which no longer interacts with the old field, and roll it out once the previous version has been fully deployed.
    5. Delete the old fields.
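    A minimal sketch of the three application versions, using a plain Python dict as a stand-in for a NoSQL record. The field names and function names are hypothetical; the point is how reads and writes shift from the old field to the new one across the versions.

```python
def write_v1(record, value):
    # Temporary version 1: still reads the old field, but writes both.
    record["old_field"] = value
    record["new_field"] = value

def read_v1(record):
    return record["old_field"]

def migrate(store):
    # The copy utility: backfill every record that predates version 1.
    # (A real migration would also guard against lost updates.)
    for record in store:
        record.setdefault("new_field", record["old_field"])

def read_v2(record):
    # Temporary version 2: reads the new field, still writes both.
    return record["new_field"]

def write_final(record, value):
    # Final version: the old field is no longer touched.
    record["new_field"] = value

store = [{"old_field": "a"}, {"old_field": "b"}]
write_v1(store[0], "a2")   # dual write keeps both fields in sync
migrate(store)             # backfill records untouched since version 1
```
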

    The second approach requires building and installing three different versions of the application, which can be quite inconvenient and cumbersome. An alternative is feature toggling: build the logic of all three versions into one, and switch modes via a configuration parameter that, ideally, can be changed on the fly. Instead of installing each successive version, it is then enough to change the parameter's value (and restart the service if the configuration cannot be reloaded at runtime). Once the final version is fully in place, all code related to the data migration should be removed from the working branch, even though it will remain live in production until the next system update.
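    The feature-toggle variant might look like the following sketch: one binary whose read and write paths branch on a configuration value. The mode names and field names are illustrative, not taken from any particular toggle library.

```python
# Configuration parameter; advanced through these values during migration:
# "dual_write_read_old" -> "dual_write_read_new" -> "final"
MODE = "dual_write_read_old"

def write(record, value, mode=None):
    mode = mode or MODE
    if mode in ("dual_write_read_old", "dual_write_read_new"):
        record["old_field"] = value   # keep the old contract alive
    record["new_field"] = value

def read(record, mode=None):
    mode = mode or MODE
    if mode == "dual_write_read_old":
        return record["old_field"]
    return record["new_field"]
```

    Switching the parameter replaces a full redeploy at each step; once `MODE` reaches "final" everywhere, the branching itself is the code that should be deleted from the working branch.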

    It is easy to see that providing zero downtime during a system update is a cumbersome and fragile procedure, so it only makes sense to bother with it when there is a corresponding requirement from the business. But even if the availability requirements are quite low (for example, 99% per year, with a 24-hour window for planned updates), the data conversion required to install a new version may still take longer than that. So if you intend to store large amounts of data, you need to be prepared in advance to apply such techniques.

    An alternative approach is to deliberately avoid backward-incompatible changes to the database schema, but unfortunately, in practice this is not always achievable, since restructuring the schema is often the most effective way to improve the performance of data access.
