MVCC-4. Data snapshots

    Having examined the issues related to isolation , and made a digression about organizing data at a low level , the last time we talked in detail about row versions and traced how service information in the version header changed during various operations.

    Today we look at how consistent versions of data are obtained from row versions.

    What is a data snapshot


    Physically, data pages may contain several versions of the same row. Moreover, each transaction should see only one (or not a single) version of each row so that together they form an ACID-consistent picture of the data at a certain point in time.

    Isolation in PostgreSQL is built on the basis of snapshots: each transaction works with its own data snapshot, which “contains” the data that was recorded before the snapshot was created and does not “contain” the data not yet fixed at that moment. We have already seen that the isolation in this case is more stringent than the standard requires, but not without anomalies.

    At the isolation level of Read Committed, a snapshot is created at the beginning of each transaction statement. Such a snapshot is active while the statement is being executed. In the figure, the moment of creating the snapshot (which, as we recall, is determined by the transaction number) is shown in blue.



    At the Repeatable Read and Serializable levels, a snapshot is created once at the beginning of the first transaction statement. Such a snapshot remains active until the very end of the transaction.



    Visibility of row versions in snapshot


    Visibility Rules


    Of course, a snapshot is not a physical copy of all required row versions. In fact, a snapshot is specified by several numbers, and the visibility of the row versions in the snapshot is determined by the rules.

    Whether or not this version of the line is visible in the snapshot depends on the two fields of its header - xmin and xmax - that is, on the numbers of the transactions that created and deleted them. Such intervals do not intersect, therefore, one line is represented in any picture by a maximum of one of its versions.

    The exact rules of visibility are quite complex and take into account many different situations and extreme cases.
    This can be easily verified by looking at src / backend / utils / time / tqual.c (in version 12, the check moved to src / backend / access / heap / heapam_visibility.c).

    To simplify, we can say that the version of the line is visible when the changes made by the xmin transaction are visible in the picture, and the changes made by the xmax transaction are not visible (in other words, it is already visible that the version of the line appeared, but it is not yet visible that it was deleted).

    In turn, transaction changes are visible in the snapshot, if either this is the same transaction that created the snapshot (she sees her own changes), or the transaction was committed before the snapshot was created.

    You can depict transactions graphically in the form of segments (from the start to the time of commit):



    Here:

    • changes to transaction 2 will be visible because it completed before the snapshot was created,
    • changes to transaction 1 will not be visible because it was active at the time the snapshot was taken,
    • changes to transaction 3 will not be visible because it started after the snapshot was taken (it doesn’t matter if it ended or not).

    Unfortunately, the moment of committing transactions is unknown to the system. Only the moment of its beginning is known (it is determined by the transaction number and is indicated by a dashed line in the figures above), but the fact of completion is not recorded anywhere.

    All we can do is find out the current status of transactions when creating a snapshot. This information is in the server’s shared memory in the ProcArray structure, which contains a list of all active sessions and their transactions.

    And after the fact, we will no longer be able to understand whether any transaction was active at the time the snapshot was created or not. Therefore, the list of all currently active transactions must be remembered in the picture.

    It follows from the above that in PostgreSQL you cannot create a snapshot showing consistent data as of an arbitrary time ago, even ifall the rows necessary for this version exist in the tabular pages. One often hears the question why there are no retrospective (or temporal; in Oracle, this is called flashback query) queries in PostgreSQL - this is one of the reasons.
    It's funny that initially there was such functionality, but later it was removed from the DBMS. You can read about this in an article by Joseph Hellerstein .
    So, a data snapshot is determined by several parameters:

    • the moment of creating the snapshot, namely, the number of the next transaction that does not exist in the system ( snapshot.xmax );
    • a list of active transactions at the time the snapshot was taken ( snapshot.xip ).

    For convenience and optimization, the number of the earliest active transaction ( snapshot.xmin ) is also stored separately . This value has an important meaning, which we will discuss below.

    Also, a few more parameters are saved in the picture, but they are not important for us.



    Example


    To see how visibility is determined by the snapshot, we reproduce the situation with the three transactions discussed above. The table will have three rows, with:

    • the first one was added by a transaction that started before the snapshot was created, but ended later,
    • the second is added by a transaction that started and ended before the snapshot was created,
    • the third was added after taking the picture.

    => TRUNCATE TABLE accounts;
    

    First transaction (not yet completed):

    => BEGIN;
    => INSERT INTO accounts VALUES (1, '1001', 'alice', 1000.00);
    => SELECT txid_current();
    
    => SELECT txid_current();
     txid_current 
    --------------
             3695
    (1 row)
    

    The second transaction (completed before the snapshot was created):

    |  => BEGIN;
    |  => INSERT INTO accounts VALUES (2, '2001', 'bob', 100.00);
    |  => SELECT txid_current();
    
    |   txid_current 
    |  --------------
    |           3696
    |  (1 row)
    
    |  => COMMIT;
    

    Create a snapshot in a transaction in another session.

    ||    => BEGIN ISOLATION LEVEL REPEATABLE READ;
    ||    => SELECT xmin, xmax, * FROM accounts;
    
    ||     xmin | xmax | id | number | client | amount 
    ||    ------+------+----+--------+--------+--------
    ||     3696 |    0 |  2 | 2001   | bob    | 100.00
    ||    (1 row)
    

    We complete the first transaction after the snapshot is created:

    => COMMIT;
    

    And the third transaction (appeared later on the snapshot):

    |  => BEGIN;
    |  => INSERT INTO accounts VALUES (3, '2002', 'bob', 900.00);
    |  => SELECT txid_current();
    
    |   txid_current 
    |  --------------
    |           3697
    |  (1 row)
    
    |  => COMMIT;
    

    Obviously, one line is still visible in our image:

    ||    => SELECT xmin, xmax, * FROM accounts;
    
    ||     xmin | xmax | id | number | client | amount 
    ||    ------+------+----+--------+--------+--------
    ||     3696 |    0 |  2 | 2001   | bob    | 100.00
    ||    (1 row)
    

    The question is how PostgreSQL understands this.

    Everything is determined by the picture. Let's look at it:

    ||    => SELECT txid_current_snapshot();
    
    ||     txid_current_snapshot 
    ||    -----------------------
    ||     3695:3697:3695
    ||    (1 row)
    

    Here, the colon lists snapshot.xmin, snapshot.xmax and snapshot.xip (in this case, one number, but in general - a list).

    According to the rules stated above, the image should show the changes made by transactions with the numbers snapshot.xmin <= xid <snapshot.xmax, with the exception of snapshot.xip. Let's look at all the rows of the table (in a new picture):

    => SELECT xmin, xmax, * FROM accounts ORDER BY id;
    
     xmin | xmax | id | number | client | amount  
    ------+------+----+--------+--------+---------
     3695 |    0 |  1 | 1001   | alice  | 1000.00
     3696 |    0 |  2 | 2001   | bob    |  100.00
     3697 |    0 |  3 | 2002   | bob    |  900.00
    (3 rows)
    

    The first line is not visible - it was created by a transaction, which is included in the list of active (xip).
    The second line is visible - it is created by a transaction that falls within the range of the image.
    The third row is not visible - it was created by a transaction that is not in the range of the snapshot.

    ||    => COMMIT;
    

    Own changes


    Somewhat complicates the picture is the case of determining the visibility of your own transaction changes. Here you may need to see only a part of such changes. For example, a cursor open at a particular moment should not see changes made after that moment at any isolation level.

    To do this, in the header of the row version there is a special field (which is displayed in the pseudo-columns cmin and cmax), showing the sequence number of the operation inside the transaction. Cmin represents the number to insert, cmax represents the number to delete, but to save space in the line header, this is actually one field, not two different. It is believed that inserting and deleting the same row in a single transaction is rare.

    If this still happens, then a special “combo” number is inserted into the same field, about which the service process remembers the real cmin and cmax. But this is completely exotic.

    A simple example. We start the transaction and add the line to the table:

    => BEGIN;
    => SELECT txid_current();
    
     txid_current 
    --------------
             3698
    (1 row)
    
    INSERT INTO accounts(id, number, client, amount) VALUES (4, 3001, 'charlie', 100.00);
    

    We will display the contents of the table along with the cmin field (but only for the rows added by our transaction - for others it does not make sense):

    => SELECT xmin, CASE WHEN xmin = 3698 THEN cmin END cmin, * FROM accounts;
    
     xmin | cmin | id | number | client  | amount  
    ------+------+----+--------+---------+---------
     3695 |      |  1 | 1001   | alice   | 1000.00
     3696 |      |  2 | 2001   | bob     |  100.00
     3697 |      |  3 | 2002   | bob     |  900.00
     3698 |    0 |  4 | 3001   | charlie |  100.00
    (4 rows)
    

    Now open the cursor for the query that returns the number of rows in the table.

    => DECLARE c CURSOR FOR SELECT count(*) FROM accounts;
    

    And after that add another line:

    => INSERT INTO accounts(id, number, client, amount) VALUES (5, 3002, 'charlie', 200.00);
    

    The request will return 4 - the line added after the cursor was opened will not fall into the data snapshot:

    => FETCH c;
    
     count 
    -------
         4
    (1 row)
    

    Why? Because in the snapshot only line versions with cmin <1 are taken into account.

    => SELECT xmin, CASE WHEN xmin = 3698 THEN cmin END cmin, * FROM accounts;
    
     xmin | cmin | id | number | client  | amount  
    ------+------+----+--------+---------+---------
     3695 |      |  1 | 1001   | alice   | 1000.00
     3696 |      |  2 | 2001   | bob     |  100.00
     3697 |      |  3 | 2002   | bob     |  900.00
     3698 |    0 |  4 | 3001   | charlie |  100.00
     3698 |    1 |  5 | 3002   | charlie |  200.00
    (5 rows)
    
    => ROLLBACK;
    

    Event horizon


    The number of the earliest active transaction (snapshot.xmin) has an important meaning - it defines the "event horizon" of the transaction. Namely, beyond its horizon, a transaction always sees only current versions of rows.

    Indeed, an irrelevant version needs to be seen only if the current one was created by a transaction that has not yet been completed, and therefore is not yet visible. But beyond the "horizon" all transactions are already guaranteed to be completed.



    The "event horizon" of the transaction can be seen in the system directory:

    => BEGIN;
    => SELECT backend_xmin FROM pg_stat_activity WHERE pid = pg_backend_pid();
    
     backend_xmin 
    --------------
             3699
    (1 row)
    

    You can also define an “event horizon” at the database level. To do this, take all the active snapshots and among them find the oldest xmin. It will determine the horizon beyond which irrelevant versions of rows in this database will never be visible to any transactions. Such versions of strings can be cleared - which is why the concept of horizon is so important from a practical point of view.

    If a transaction holds a snapshot for a long time, it will also hold the database event horizon. Moreover, an unfinished transaction will hold the horizon by the very fact of its existence, even if a snapshot is not held in it.

    And this means that irrelevant versions of rows in this database cannot be cleared. At the same time, a “long-playing” transaction may not overlap with other transactions in the data - this is absolutely not important, the database horizon is the same for everyone.

    If, now, snapshots (from snapshot.xmin to snapshot.xmax) are not represented as a segment, then the situation can be imagined as follows:



    In this figure, the bottommost snapshot refers to an incomplete transaction, and snapshot.xmin cannot be in the rest of the snapshots more than her number.

    In our example, a transaction with Read Committed isolation level was started. Even though there is no active data snapshot in it, it continues to hold the horizon:

    |  => BEGIN;
    |  => UPDATE accounts SET amount = amount + 1.00;
    |  => COMMIT;
    
    => SELECT backend_xmin FROM pg_stat_activity WHERE pid = pg_backend_pid();
    
     backend_xmin 
    --------------
             3699
    (1 row)
    

    And only after the transaction is completed, the horizon moves forward, allowing you to clear irrelevant versions of the rows:

    => COMMIT;
    => SELECT backend_xmin FROM pg_stat_activity WHERE pid = pg_backend_pid();
    
     backend_xmin 
    --------------
             3700
    (1 row)
    

    If the described situation really creates problems and there is no way to avoid it at the application level, then, starting with version 9.6, two options are available:

    • old_snapshot_threshold defines the maximum lifetime of a snapshot. After this time, the server gains the right to delete irrelevant versions of the rows, and if they need a “long-playing” transaction, but it will receive a snapshot too old error.
    • idle_in_transaction_session_timeout defines the maximum lifetime of an inactive transaction. After this time, the transaction is aborted.

    Export data snapshot


    There are situations when several concurrent transactions must be guaranteed to see the same data picture. As an example, we can use the pg_dump utility, which can work in parallel mode: all worker processes must see the database in the same state so that the backup copy is consistent.

    Of course, you cannot rely on the fact that the data patterns coincide simply because the transactions are launched “simultaneously”. There is a mechanism for exporting and importing a snapshot for this.

    The pg_export_snapshot function returns the identifier of a snapshot that can be transferred (by external means to the DBMS) to another transaction.

    => BEGIN ISOLATION LEVEL REPEATABLE READ;
    => SELECT count(*) FROM accounts; -- любой запрос
    
     count 
    -------
         3
    (1 row)
    
    => SELECT pg_export_snapshot();
    
     pg_export_snapshot  
    ---------------------
     00000004-00000E7B-1
    (1 row)
    

    Another transaction can import the snapshot using the SET TRANSACTION SNAPSHOT command before executing the first request in it. You must first set the isolation level as Repeatable Read or Serializable, because at the Read Committed level, operators will use their own snapshots.

    |  => DELETE FROM accounts;
    |  => BEGIN ISOLATION LEVEL REPEATABLE READ;
    |  => SET TRANSACTION SNAPSHOT '00000004-00000E7B-1';
    

    Now the second transaction will work with a snapshot of the first and, accordingly, see three rows (and not zero):

    |  => SELECT count(*) FROM accounts;
    
    |   count 
    |  -------
    |       3
    |  (1 row)
    

    The lifetime of the exported snapshot is the lifetime of the exporting transaction.

    |    => COMMIT;
    => COMMIT;
    

    To be continued .

    Also popular now: