Zero-downtime database schema migrations for PostgreSQL, using Django as an example

    Introduction


    Hi, Habr!


    I want to share my experience of writing migrations for postgres and django. This is mainly about postgres; django is a good complement here because out of the box it generates schema migrations for model changes, that is, it has a fairly complete list of operations for changing the schema. Django can be replaced with any favorite framework or library - the approaches are likely to be similar.


    I will not describe how I came to this, but reading the documentation now I keep catching myself thinking that I should have done this with more care and awareness earlier, so I highly recommend doing so.


    Before going further, let me make the following assumptions.


    The database logic of most applications can be divided into 3 parts:


    1. Migrations - change the database schema (tables); assume we always run them in a single thread.
    2. Business logic - works directly with the data (in user tables), constantly and concurrently over the same data.
    3. Data migrations - do not change the schema and essentially work like business logic; by default, when we talk about business logic we also mean data migrations.

    Downtime is a state when part of our business logic is unavailable, failing, or loading noticeably long for the user; let's assume that means a couple of seconds.


    Avoiding downtime can be a critical business requirement that has to be met at any cost.


    Rollout process


    The main requirements during a rollout:


    1. we have one working database;
    2. we have several machines running the business logic;
    3. the business-logic machines are hidden behind a load balancer;
    4. our application works fine before, during, and after the migration (the old code works correctly with both the old and the new database schema);
    5. our application works fine before, during, and after updating the code on the machines (both the old and the new code work correctly with the current database schema).

    If there are many changes and the rollout no longer meets these conditions, it is split into the necessary number of smaller rollouts that do, otherwise we get downtime.


    Forward rollout order:


    1. apply the migration;
    2. remove one machine from the balancer, update it and restart, return it to the balancer;
    3. repeat the previous step until all machines are updated.

    The reverse rollout order is relevant for deleting tables and columns, when we generate migrations automatically from the changed schema and validate on CI that all migrations are present:


    1. remove one machine from the balancer, update it and restart, return it to the balancer;
    2. repeat the previous step until all machines are updated;
    3. apply the migration.

    Theory


    Postgres is an excellent database: we can write an application that reads and writes the same data in hundreds or thousands of threads and be reasonably sure that our data stays valid and does not get corrupted - in short, full ACID. Postgres implements several mechanisms to achieve this, one of which is locking.


    Postgres has several kinds of locks; more details can be found in the documentation, but within this topic I will only touch on table-level and row-level locks.


    Table-level locks


    At the table level postgres has several types of locks; the main feature is that they conflict, that is, two operations with conflicting locks cannot run at the same time:


    |                         | ACCESS SHARE | ROW SHARE | ROW EXCLUSIVE | SHARE UPDATE EXCLUSIVE | SHARE | SHARE ROW EXCLUSIVE | EXCLUSIVE | ACCESS EXCLUSIVE |
    |-------------------------|--------------|-----------|---------------|------------------------|-------|---------------------|-----------|------------------|
    | ACCESS SHARE            |              |           |               |                        |       |                     |           | X                |
    | ROW SHARE               |              |           |               |                        |       |                     | X         | X                |
    | ROW EXCLUSIVE           |              |           |               |                        | X     | X                   | X         | X                |
    | SHARE UPDATE EXCLUSIVE  |              |           |               | X                      | X     | X                   | X         | X                |
    | SHARE                   |              |           | X             | X                      |       | X                   | X         | X                |
    | SHARE ROW EXCLUSIVE     |              |           | X             | X                      | X     | X                   | X         | X                |
    | EXCLUSIVE               |              | X         | X             | X                      | X     | X                   | X         | X                |
    | ACCESS EXCLUSIVE        | X            | X         | X             | X                      | X     | X                   | X         | X                |

    For example, ALTER TABLE tablename ADD COLUMN newcolumn integer and SELECT COUNT(*) FROM tablename must be executed strictly one after the other, otherwise we would not know which columns COUNT(*) should see.


    Django migrations (full list below) use the following operations and the corresponding locks:


    | lock                    | operations                                                                      |
    |-------------------------|---------------------------------------------------------------------------------|
    | ACCESS EXCLUSIVE        | CREATE SEQUENCE, DROP SEQUENCE, CREATE TABLE, DROP TABLE, ALTER TABLE, DROP INDEX |
    | SHARE                   | CREATE INDEX                                                                    |
    | SHARE UPDATE EXCLUSIVE  | CREATE INDEX CONCURRENTLY, DROP INDEX CONCURRENTLY, ALTER TABLE VALIDATE CONSTRAINT |

    Not every variant of ALTER TABLE takes an ACCESS EXCLUSIVE lock; django migrations do not use CREATE INDEX CONCURRENTLY and ALTER TABLE VALIDATE CONSTRAINT, but we will need them later as safer alternatives to the standard operations.


    If migrations are run sequentially in a single thread, everything looks fine, since one migration will not conflict with another, but our business logic keeps working during the migration and will conflict with it:


    | lock          | operations             | conflicts with locks                            | conflicts with operations            |
    |---------------|------------------------|-------------------------------------------------|--------------------------------------|
    | ACCESS SHARE  | SELECT                 | ACCESS EXCLUSIVE                                | ALTER TABLE, DROP INDEX              |
    | ROW SHARE     | SELECT FOR UPDATE      | ACCESS EXCLUSIVE, EXCLUSIVE                     | ALTER TABLE, DROP INDEX              |
    | ROW EXCLUSIVE | INSERT, UPDATE, DELETE | ACCESS EXCLUSIVE, EXCLUSIVE, SHARE ROW EXCLUSIVE, SHARE | ALTER TABLE, DROP INDEX, CREATE INDEX |

    Two points can be summarized here:


    1. if there is an alternative with a lighter lock, it can be used, as with CREATE INDEX and CREATE INDEX CONCURRENTLY;
    2. most schema-change migrations conflict with business logic, and do so with ACCESS EXCLUSIVE, that is, not even a SELECT can run while this lock is held; we potentially get downtime here, unless the operation completes almost immediately and fits within our couple-of-seconds downtime threshold.

    So there is a choice: either we always avoid ACCESS EXCLUSIVE, that is, create new tables and copy data into them - reliable, but slow for a large amount of data; or we make the ACCESS EXCLUSIVE operations as fast as possible and take extra precautions against downtime - potentially dangerous, but fast.


    Row-level locks


    The row level also has its own locks https://www.postgresql.org/docs/current/static/explicit-locking.html#LOCKING-ROWS ; they also conflict with each other, but they affect only our business logic:


    |                   | FOR KEY SHARE | FOR SHARE | FOR NO KEY UPDATE | FOR UPDATE |
    |-------------------|---------------|-----------|-------------------|------------|
    | FOR KEY SHARE     |               |           |                   | X          |
    | FOR SHARE         |               |           | X                 | X          |
    | FOR NO KEY UPDATE |               | X         | X                 | X          |
    | FOR UPDATE        | X             | X         | X                 | X          |

    The main point here concerns data migrations: if a data migration does an UPDATE over the entire table, the rest of the business logic that updates the same data will wait for the lock to be released and may exceed our downtime threshold, so it is better for data migrations to perform updates in batches. It is also worth noting that for more complex sql queries in data migrations, batching can even be faster, since a more optimal plan and indexes can be used.
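    A minimal sketch of such a batched data migration in django, assuming a hypothetical shop.Order model with a status column; the batch size, app label and names are illustrative only:

```python
from django.db import migrations


def fill_status_in_batches(apps, schema_editor):
    # "shop.Order" and the "status" column are hypothetical, for illustration only.
    Order = apps.get_model('shop', 'Order')
    batch_size = 1000
    last_id = 0
    while True:
        # Grab the next slice of primary keys so each UPDATE locks only
        # a small set of rows instead of the whole table at once.
        ids = list(
            Order.objects.filter(id__gt=last_id)
            .order_by('id')
            .values_list('id', flat=True)[:batch_size]
        )
        if not ids:
            break
        Order.objects.filter(id__in=ids).update(status='new')
        last_id = ids[-1]


class Migration(migrations.Migration):
    # Run outside of a single transaction, so each batch commits on its own
    # and the row locks are released between batches.
    atomic = False

    dependencies = [('shop', '0002_order_status')]

    operations = [
        migrations.RunPython(fill_status_in_batches, migrations.RunPython.noop),
    ]
```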


    The sequence of operations


    Another important thing to know is how operations are executed, when and how they take and release locks:


    [diagram: how an operation waits for, acquires and releases its locks]


    The following points can be highlighted here:


    1. operation execution time - for a migration this is the time the lock is held; if a heavy lock is held for a long time, we get downtime, for example with CREATE INDEX or ALTER TABLE ADD COLUMN SET DEFAULT (postgres 11 improves the latter);
    2. time spent waiting for conflicting locks - the migration waits for all conflicting queries to finish, but meanwhile new queries wait for our migration; slow queries, whether simply non-optimal or analytical, are very dangerous here, so slow queries and migrations should not run together;
    3. number of requests per second - if many requests keep piling up for a long time, free connections can run out quickly and instead of one problematic spot the whole database goes down (only the superuser connection limit remains); here you need to avoid slow queries and reduce the number of requests, for example start migrations during minimum load and split critical components into separate services with their own databases;
    4. many operations in one transaction - the more operations in one transaction, the longer a heavy lock is held, so it is better to split heavy operations apart and not to put ALTER TABLE VALIDATE CONSTRAINT or data migrations into a single transaction with a heavy lock.

    Timeouts


    Postgres has settings such as lock_timeout and statement_timeout, which can protect migration runs both from a badly written migration and from bad conditions under which the migration is started. They can be set both globally and for the current connection.


    SET lock_timeout TO '2s' avoids downtime caused by waiting for slow queries/transactions before the migration: https://www.postgresql.org/docs/current/static/runtime-config-client.html#GUC-LOCK-TIMEOUT .


    SET statement_timeout TO '2s' avoids downtime when a heavy migration takes a heavy lock for too long: https://www.postgresql.org/docs/current/static/runtime-config-client.html#GUC-STATEMENT-TIMEOUT .
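    A sketch of how these timeouts might be set for the migration's own connection via django's RunSQL, so they apply only to the session running the migration; the app label and migration name are made up:

```python
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [('shop', '0003_auto')]

    operations = [
        # Both settings apply only to the current connection, so the
        # business-logic connections keep their usual (or no) timeouts.
        migrations.RunSQL("SET lock_timeout TO '2s'"),
        migrations.RunSQL("SET statement_timeout TO '2s'"),
        # ...the actual schema-changing operations go here...
    ]
```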


    Deadlocks


    Deadlocks in migrations are not about downtime, but it is unpleasant enough when a migration is written, works fine on the test environment, and then catches a deadlock when rolled out to production. The main sources of problems are a large number of operations in one transaction and FOREIGN KEY, since it takes locks in both tables, so it is better to split migration operations apart: the more atomic, the better.


    How values are stored


    Postgres stores different types of values in different ways: if two types are stored differently, converting between them requires a full rewrite of all values; fortunately, some types are stored the same way and do not require rewriting on change. For example, strings are stored the same way regardless of size, and increasing or decreasing the varchar length does not require rewriting, although decreasing it requires checking that all existing values fit into the smaller size. Other types may be stored in a similar way and have similar properties.


    Multiversion Concurrency Control (MVCC)


    According to the documentation, consistency in postgres is based on multiversioning of data, that is, each transaction and operation sees its own version of the data. This feature copes with concurrent access and also has an interesting effect on schema changes: adding or deleting a column changes only the schema, as long as there are no additional operations that change data, indexes or constraints; subsequent low-level inserts and updates create records with all the necessary values, and deletion merely marks the corresponding record as deleted. VACUUM or AUTO VACUUM takes care of cleaning up the remaining garbage.


    Django example


    We now have an idea of what downtime can depend on and how it can be avoided, but before applying this knowledge let's look at what django gives out of the box ( https://github.com/django/django/blob/2.1.2/django/db/backends/base/schema.py and https://github.com/django/django/blob/2.1.2/django/db/backends/postgresql/schema.py ):


    1. CREATE SEQUENCE
    2. DROP SEQUENCE
    3. CREATE TABLE
    4. DROP TABLE
    5. ALTER TABLE RENAME TO
    6. ALTER TABLE SET TABLESPACE
    7. ALTER TABLE ADD COLUMN [SET DEFAULT] [SET NOT NULL] [PRIMARY KEY] [UNIQUE]
    8. ALTER TABLE ALTER COLUMN [TYPE] [SET NOT NULL|DROP NOT NULL] [SET DEFAULT|DROP DEFAULT]
    9. ALTER TABLE DROP COLUMN
    10. ALTER TABLE RENAME COLUMN
    11. ALTER TABLE ADD CONSTRAINT CHECK
    12. ALTER TABLE DROP CONSTRAINT CHECK
    13. ALTER TABLE ADD CONSTRAINT FOREIGN KEY
    14. ALTER TABLE DROP CONSTRAINT FOREIGN KEY
    15. ALTER TABLE ADD CONSTRAINT PRIMARY KEY
    16. ALTER TABLE DROP CONSTRAINT PRIMARY KEY
    17. ALTER TABLE ADD CONSTRAINT UNIQUE
    18. ALTER TABLE DROP CONSTRAINT UNIQUE
    19. CREATE INDEX
    20. DROP INDEX

    Django migrations cover my needs very well; now, armed with this knowledge, we can discuss which operations are safe and which are dangerous for migrations without downtime.


    We will call a migration safe if it takes a SHARE UPDATE EXCLUSIVE lock, or an ACCESS EXCLUSIVE lock that is released almost instantly.
    We will call a migration dangerous if it takes a SHARE or ACCESS EXCLUSIVE lock for a considerable time.


    I will leave in advance a useful link to the documentation with great examples.


    Creating and deleting a table


    CREATE SEQUENCE, DROP SEQUENCE, CREATE TABLE, DROP TABLE can be called safe, because business logic either does not yet or no longer works with the table being migrated; the behavior of dropping a table with FOREIGN KEY is covered a little later.


    Hard-to-support operations on tables


    ALTER TABLE RENAME TO - I cannot call it safe, because it is hard to write logic that works with such a table both before and after the migration.


    ALTER TABLE SET TABLESPACE - unsafe, since it physically moves the table, and on a large volume this can take a long time.


    On the other hand, these operations are rather rare; as an alternative, you can create a new table and copy the data into it.


    Creating and deleting a column


    ALTER TABLE ADD COLUMN, ALTER TABLE DROP COLUMN - can be called safe (creation without DEFAULT/NOT NULL/PRIMARY KEY/UNIQUE), since business logic either does not yet or no longer works with the column being migrated; the behavior of dropping a column with FOREIGN KEY, other constraints and indexes is covered later.


    ALTER TABLE ADD COLUMN SET DEFAULT, ALTER TABLE ADD COLUMN SET NOT NULL, ALTER TABLE ADD COLUMN PRIMARY KEY, ALTER TABLE ADD COLUMN UNIQUE - unsafe operations, because between adding the column and releasing the lock the data is updated with defaults or constraints are created; as an alternative, create a nullable column and change it in further steps.


    It is worth mentioning the faster SET DEFAULT in postgres 11: it can be viewed as safe, but it is not very useful with django, because django uses SET DEFAULT only to fill the column and then does DROP DEFAULT, and in the gap between the migration and the code update on the business-logic machines, records can be created for which the default will be absent, so a data migration is still needed anyway.
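    A sketch of the safe alternative split into steps, again with the hypothetical shop.Order model and status field; the first migration only adds the nullable column, everything else happens in separate deploys and migrations:

```python
from django.db import migrations, models


class Migration(migrations.Migration):
    # Step 1 of the safe alternative: add the column as nullable and without
    # DEFAULT, so only the schema changes and the ACCESS EXCLUSIVE lock is
    # released almost immediately.
    dependencies = [('shop', '0003_auto')]

    operations = [
        migrations.AddField(
            model_name='order',
            name='status',
            field=models.CharField(max_length=20, null=True),
        ),
    ]

# Step 2: roll out code that always writes "status" for new records.
# Step 3: backfill old rows with a batched data migration (as shown earlier).
# Step 4: if NOT NULL is required, add it as a CHECK (...) NOT VALID
#         constraint and validate it separately (as shown further below).
```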


    Hard-to-support operations on columns


    ALTER TABLE RENAME COLUMN - I also cannot call it safe, since it is difficult to write logic that works with such a column both before and after the migration. This operation should be rare too; as an alternative, you can create a new column and copy the data into it.


    Column change


    ALTER TABLE ALTER COLUMN TYPE - the operation can be either dangerous or safe. It is safe if postgres only changes the schema, the data is already stored in the correct format and no additional type checks are needed, for example:


    • changing the type from varchar(LESS) to varchar(MORE);
    • changing the type from varchar(ANY) to text;
    • changing the type from numeric(LESS, SAME) to numeric(MORE, SAME).

    ALTER TABLE ALTER COLUMN SET NOT NULL - dangerous, because it scans all the data and checks for NULL; fortunately it can be replaced with a CHECK (column IS NOT NULL) constraint. It is worth noting that this replacement leads to a different schema, but with identical properties.
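    A sketch of such a replacement via RunSQL, assuming a hypothetical shop_order table and status column; the constraint name is made up:

```python
from django.db import migrations


class Migration(migrations.Migration):
    # Keep the two statements in separate transactions so the ACCESS EXCLUSIVE
    # lock from ADD CONSTRAINT is not held while the table is being scanned.
    atomic = False

    dependencies = [('shop', '0005_auto')]

    operations = [
        # NOT VALID only changes the schema; existing rows are not checked yet.
        migrations.RunSQL(
            'ALTER TABLE shop_order ADD CONSTRAINT shop_order_status_not_null '
            'CHECK (status IS NOT NULL) NOT VALID',
            reverse_sql='ALTER TABLE shop_order DROP CONSTRAINT shop_order_status_not_null',
        ),
        # Validation scans the table under SHARE UPDATE EXCLUSIVE, which does
        # not conflict with reads and writes from business logic.
        migrations.RunSQL(
            'ALTER TABLE shop_order VALIDATE CONSTRAINT shop_order_status_not_null',
            reverse_sql=migrations.RunSQL.noop,
        ),
    ]
```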


    ALTER TABLE ALTER COLUMN DROP NOT NULL, ALTER TABLE ALTER COLUMN SET DEFAULT, ALTER TABLE ALTER COLUMN DROP DEFAULT - safe operations.


    Creating and deleting indexes and constraints


    ALTER TABLE ADD CONSTRAINT CHECK and ALTER TABLE ADD CONSTRAINT FOREIGN KEY - unsafe operations, but they can be declared as NOT VALID and then validated with ALTER TABLE VALIDATE CONSTRAINT.


    ALTER TABLE ADD CONSTRAINT PRIMARY KEY and ALTER TABLE ADD CONSTRAINT UNIQUE are unsafe, since internally they create a unique index, but you can create the unique index CONCURRENTLY and then create the corresponding constraint on top of the already built index via USING INDEX.
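    A sketch of this approach for a UNIQUE constraint, again with hypothetical table, column and index names:

```python
from django.db import migrations


class Migration(migrations.Migration):
    # CREATE/DROP INDEX CONCURRENTLY cannot run inside a transaction block.
    atomic = False

    dependencies = [('shop', '0006_auto')]

    operations = [
        # Build the unique index without blocking writes.
        migrations.RunSQL(
            'CREATE UNIQUE INDEX CONCURRENTLY shop_order_number_uniq '
            'ON shop_order (number)',
            reverse_sql='DROP INDEX CONCURRENTLY IF EXISTS shop_order_number_uniq',
        ),
        # Attach the constraint to the ready index: this ALTER TABLE takes
        # ACCESS EXCLUSIVE, but only for an instant, since the index is built.
        migrations.RunSQL(
            'ALTER TABLE shop_order ADD CONSTRAINT shop_order_number_uniq '
            'UNIQUE USING INDEX shop_order_number_uniq',
            reverse_sql='ALTER TABLE shop_order DROP CONSTRAINT shop_order_number_uniq',
        ),
    ]
```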


    CREATE INDEX is an unsafe operation, but the index can be created CONCURRENTLY.


    ALTER TABLE DROP CONSTRAINT CHECK, ALTER TABLE DROP CONSTRAINT FOREIGN KEY, ALTER TABLE DROP CONSTRAINT PRIMARY KEY, ALTER TABLE DROP CONSTRAINT UNIQUE, DROP INDEX - safe operations.


    It is worth noting that ALTER TABLE ADD CONSTRAINT FOREIGN KEY and ALTER TABLE DROP CONSTRAINT FOREIGN KEY lock both tables at once.


    Applying the knowledge in django


    Django migrations have an operation for executing arbitrary SQL: https://docs.djangoproject.com/en/2.1/ref/migration-operations/#django.db.migrations.operations.RunSQL . Through it you can set the necessary timeouts and apply alternative operations in migrations, specifying in state_operations the standard operation that we are replacing.
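    For example, a sketch of replacing a regular CREATE INDEX with CREATE INDEX CONCURRENTLY while keeping django's migration state in sync through state_operations; the model, field and index names are hypothetical:

```python
from django.db import migrations, models


class Migration(migrations.Migration):
    # CONCURRENTLY is not allowed inside a transaction block.
    atomic = False

    dependencies = [('shop', '0007_auto')]

    operations = [
        migrations.RunSQL(
            sql='CREATE INDEX CONCURRENTLY shop_order_created_idx '
                'ON shop_order (created)',
            reverse_sql='DROP INDEX CONCURRENTLY IF EXISTS shop_order_created_idx',
            # Tell django's migration state that the usual AddIndex happened,
            # so the autodetector will not try to create the index again.
            state_operations=[
                migrations.AddIndex(
                    model_name='order',
                    index=models.Index(fields=['created'], name='shop_order_created_idx'),
                ),
            ],
        ),
    ]
```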


    This works well for your own code, although it requires extra typing, but you can also leave the dirty work to a db backend: for example, https://github.com/tbicr/django-pg-zero-downtime-migrations/blob/master/django_zero_downtime_migrations_postgres_backend/schema.py collects the practices described here and replaces unsafe operations with safe counterparts, and that also works for third-party libraries.


    Finally


    These practices allowed me to get a schema identical to the one django creates out of the box, with the exception of a CHECK IS NOT NULL constraint instead of NOT NULL and some constraint names (for example for ALTER TABLE ADD COLUMN UNIQUE and its alternative). Another compromise is the lack of transactionality for the alternative migration operations, especially where CREATE INDEX CONCURRENTLY and ALTER TABLE VALIDATE CONSTRAINT are involved.


    If you do not go beyond postgres, there are many options for changing the data schema, and they can be combined in various ways for specific conditions:


    • use jsonb as a schemaless solution;
    • accept the possibility of downtime;
    • require migrations without downtime.

    In any case, I hope this material was useful either for increasing uptime or for expanding your horizons.
