erogov April 6, 2019 at 00:24

MVCC-3. Row versions

So, we looked at issues related to isolation and made a digression about organizing data at a low level . And finally got to the most interesting - to the version of the lines.

Headline

As we have already said, each row can simultaneously be present in the database in several versions. One version must be distinguished from the other somehow. For this purpose, each version has two marks that determine the "time" of the action of this version (xmin and xmax). In quotation marks - because it is not time that is used as such, but a special incrementing counter. And this counter is the transaction number.

(As usual, it’s actually more complicated: the transaction number cannot increase all the time due to the limited bit capacity of the counter. But we will consider these details in detail when we get to the freeze.)

When the line is created, xmin is set to the number of the transaction that executed the command INSERT, and xmax is not populated.

When a row is deleted, the xmax value of the current version is marked with the transaction number that performed DELETE.

When a line is modified with the UPDATE command, two operations are actually performed: DELETE and INSERT. In the current version of the line, xmax is set equal to the number of the transaction that performed UPDATE. Then a new version of the same line is created; its xmin value matches the xmax value of the previous version.

The xmin and xmax fields are included in the row version header. In addition to these fields, the header contains others, for example:

infomask - a series of bits defining the properties of this version. There are quite a lot of them; the main ones we will gradually consider.
ctid - a link to the next, newer version of the same line. In the newest, most current version of the string, ctid refers to this version itself. The number has the form (x, y), where x is the page number, y is the serial number of the pointer in the array.
bitmap of undefined values - marks those columns of this version that contain an undefined value (NULL). NULL is not one of the usual values of data types, so the attribute must be stored separately.

As a result, the header is quite large - at least 23 bytes per version of the string, and usually more due to the NULL bitmap. If the table is “narrow” (that is, it contains few columns), overhead may take more than useful information.

Insert

Let's take a closer look at how string operations are performed at a low level, and start with the insert.

For experiments, create a new table with two columns and an index on one of them:

=> CREATE TABLE t(
  id serial,
  s text
);
=> CREATE INDEX ON t(s);

Insert one line, after starting the transaction.

=> BEGIN;
=> INSERT INTO t(s) VALUES ('FOO');

Here is the number of our current transaction:

=> SELECT txid_current();

 txid_current 
--------------
         3664
(1 row)

Take a look at the contents of the page. The heap_page_items function of the pageinspect extension provides information about pointers and row versions:

=> SELECT * FROM heap_page_items(get_raw_page('t',0)) \gx

-[ RECORD 1 ]-------------------
lp          | 1
lp_off      | 8160
lp_flags    | 1
lp_len      | 32
t_xmin      | 3664
t_xmax      | 0
t_field3    | 0
t_ctid      | (0,1)
t_infomask2 | 2
t_infomask  | 2050
t_hoff      | 24
t_bits      | 
t_oid       | 
t_data      | \x0100000009464f4f

Note that the word heap (heap) in PostgreSQL refers to tables. This is another strange use of the term - heap is a well - known data structure that has nothing to do with a table. Here, this word is used in the sense of "everything is piled in a heap," in contrast to ordered indexes.

The function displays data “as is” in a format that is difficult to read. To understand, we will leave only a part of the information and decrypt it:

=> SELECT '(0,'||lp||')' AS ctid,
       CASE lp_flags
         WHEN 0 THEN 'unused'
         WHEN 1 THEN 'normal'
         WHEN 2 THEN 'redirect to '||lp_off
         WHEN 3 THEN 'dead'
       END AS state,
       t_xmin as xmin,
       t_xmax as xmax,
       (t_infomask & 256) > 0  AS xmin_commited,
       (t_infomask & 512) > 0  AS xmin_aborted,
       (t_infomask & 1024) > 0 AS xmax_commited,
       (t_infomask & 2048) > 0 AS xmax_aborted,
       t_ctid
FROM heap_page_items(get_raw_page('t',0)) \gx

-[ RECORD 1 ]-+-------
ctid          | (0,1)
state         | normal
xmin          | 3664
xmax          | 0
xmin_commited | f
xmin_aborted  | f
xmax_commited | f
xmax_aborted  | t
t_ctid        | (0,1)

Here's what we did:

We added a zero to the index number to bring it to the same form as t_ctid: (page number, index number).
Decrypted the state of the lp_flags pointer. Here it is “normal” - this means that the pointer really refers to the version of the string. Other values will be considered later.
Of all the information bits, so far only two pairs have been allocated. The xmin_committed and xmin_aborted bits indicate whether the transaction with the xmin number is committed (canceled). Two similar bits refer to transaction number xmax.

What do we see? When you insert a row in the table page, a pointer appears with the number 1, referring to the first and only version of the row.

In the version of the line, the xmin field is filled with the number of the current transaction. The transaction is still active, so both bits xmin_committed and xmin_aborted are not set.

The ctid field of the row version refers to the same row. This means that a newer version does not exist.

The xmax field is filled with a dummy number 0, because this version of the line is not deleted and is relevant. Transactions will not pay attention to this number, because the xmax_aborted bit is set.

Let's take one more step to improve readability by adding information bits to transaction numbers. And we will create a function, since we will need the request more than once:

=> CREATE FUNCTION heap_page(relname text, pageno integer)
RETURNS TABLE(ctid tid, state text, xmin text, xmax text, t_ctid tid)
AS $$
SELECT (pageno,lp)::text::tid AS ctid,
       CASE lp_flags
         WHEN 0 THEN 'unused'
         WHEN 1 THEN 'normal'
         WHEN 2 THEN 'redirect to '||lp_off
         WHEN 3 THEN 'dead'
       END AS state,
       t_xmin || CASE
         WHEN (t_infomask & 256) > 0 THEN ' (c)'
         WHEN (t_infomask & 512) > 0 THEN ' (a)'
         ELSE ''
       END AS xmin,
       t_xmax || CASE
         WHEN (t_infomask & 1024) > 0 THEN ' (c)'
         WHEN (t_infomask & 2048) > 0 THEN ' (a)'
         ELSE ''
       END AS xmax,
       t_ctid
FROM heap_page_items(get_raw_page(relname,pageno))
ORDER BY lp;
$$ LANGUAGE SQL;

In this form, it’s much clearer what is happening in the header of the version of the string:

=> SELECT * FROM heap_page('t',0);

 ctid  | state  | xmin | xmax  | t_ctid 
-------+--------+------+-------+--------
 (0,1) | normal | 3664 | 0 (a) | (0,1)
(1 row)

Similar, but substantially less detailed, information can be obtained from the table itself, using the pseudo-columns xmin and xmax:

=> SELECT xmin, xmax, * FROM t;

 xmin | xmax | id |  s  
------+------+----+-----
 3664 |    0 |  1 | FOO
(1 row)

Fixation

Upon successful completion of the transaction, you need to remember its status - note that it is fixed. To do this, use a structure called XACT (and before version 10 it was called CLOG (commit log) and this name can still be found in different places).

XACT is not a system catalog table; These are the files in the PGDATA / pg_xact directory. In them, for each transaction, two bits are allocated: committed and aborted - exactly the same as in the header of the version of the line. This information is divided into several files solely for convenience, we will return to this issue when we consider freezing. And work with these files is carried out page by page, as with all others.

So, when committing a transaction in XACT, the committed bit is set for this transaction. And that’s all that happens during the commit (though we are not talking about the prerecord journal yet).

When any other transaction accesses the table page we just looked at, she will have to answer a few questions.

Has transaction xmin completed? If not, the generated version of the string should not be visible.
This check is performed by looking at yet another structure that is located in the shared memory of the instance and is called ProcArray. It contains a list of all active processes, and for each the number of its current (active) transaction is indicated.
If completed, then how - by fixation or cancellation? If canceled, then the version of the string should not be visible either.
This is exactly what XACT is for. But, although the last XACT pages are stored in buffers in RAM, it is unnecessary to check the XACT each time. Therefore, the status of a transaction once clarified is recorded in the xmin_committed and xmin_aborted bits of the row version. If one of these bits is set, then the state of transaction xmin is considered known and the next transaction will no longer have to access XACT.

Why are these bits not set by the transaction itself that performs the insertion? When an insert occurs, the transaction does not yet know whether it will complete successfully. And at the moment of fixing it is already not clear which lines in which pages were changed. There may be many such pages, and memorizing them is disadvantageous. In addition, part of the pages can be pushed out of the buffer cache to disk; reading them again to change the bits would mean significantly slowing down the commit.

The downside of the savings is that after the changes, any transaction (even performing a simple read - SELECT) can start changing data pages in the buffer cache.

So, fix the change.

=> COMMIT;

Nothing has changed in the page (but we know that the transaction status is already recorded in XACT):

=> SELECT * FROM heap_page('t',0);

 ctid  | state  | xmin | xmax  | t_ctid 
-------+--------+------+-------+--------
 (0,1) | normal | 3664 | 0 (a) | (0,1)
(1 row)

Now the transaction that first accesses the page will have to determine the transaction status xmin and write it to the information bits:

=> SELECT * FROM t;

 id |  s  
----+-----
  1 | FOO
(1 row)

=> SELECT * FROM heap_page('t',0);

 ctid  | state  |   xmin   | xmax  | t_ctid 
-------+--------+----------+-------+--------
 (0,1) | normal | 3664 (c) | 0 (a) | (0,1)
(1 row)

Delete

When a line is deleted, the number of the current deleting transaction is recorded in the xmax field of the current version, and the xmax_aborted bit is reset.

Note that the set xmax value corresponding to the active transaction acts as a row lock. If another transaction is about to update or delete this row, it will be forced to wait for the xmax transaction to complete. We will talk more about locks later. For now, we only note that the number of row locks is unlimited. They do not occupy a place in RAM and system performance does not suffer from their quantity. True, “long” transactions have other disadvantages, but more on that later.

Delete the line.

=> BEGIN;
=> DELETE FROM t;
=> SELECT txid_current();

 txid_current 
--------------
         3665
(1 row)

We see that the transaction number is recorded in the xmax field, but the information bits are not set:

=> SELECT * FROM heap_page('t',0);

 ctid  | state  |   xmin   | xmax | t_ctid 
-------+--------+----------+------+--------
 (0,1) | normal | 3664 (c) | 3665 | (0,1)
(1 row)

Cancel

Reverting changes works similarly to commit, only in XACT for the transaction the aborted bit is set. Cancellation is as fast as commit. Although the command is called ROLLBACK, the change is not rolled back: everything that the transaction managed to change in the data pages remains unchanged.

=> ROLLBACK;
=> SELECT * FROM heap_page('t',0);

 ctid  | state  |   xmin   | xmax | t_ctid 
-------+--------+----------+------+--------
 (0,1) | normal | 3664 (c) | 3665 | (0,1)
(1 row)

When accessing the page, the status will be checked and the xmax_aborted hint bit will be set in the version of the line. The xmax number itself remains in the page, but no one will look at it.

=> SELECT * FROM t;

 id |  s  
----+-----
  1 | FOO
(1 row)

=> SELECT * FROM heap_page('t',0);

 ctid  | state  |   xmin   |   xmax   | t_ctid 
-------+--------+----------+----------+--------
 (0,1) | normal | 3664 (c) | 3665 (a) | (0,1)
(1 row)

Update

The update works as if it was first deleting the current version of the row, and then inserting a new one.

=> BEGIN;
=> UPDATE t SET s = 'BAR';
=> SELECT txid_current();

 txid_current 
--------------
         3666
(1 row)

The request produces one line (new version):

=> SELECT * FROM t;

 id |  s  
----+-----
  1 | BAR
(1 row)

But in the page we see both versions:

=> SELECT * FROM heap_page('t',0);

 ctid  | state  |   xmin   | xmax  | t_ctid 
-------+--------+----------+-------+--------
 (0,1) | normal | 3664 (c) | 3666  | (0,2)
 (0,2) | normal | 3666     | 0 (a) | (0,2)
(2 rows)

The remote version is marked with the current transaction number in the xmax field. Moreover, this value is written over the old one, since the previous transaction was canceled. And the xmax_aborted bit is reset, because the status of the current transaction is still unknown.

The first version of the line now refers to the second (t_ctid field), as a newer one.

A second pointer and a second line appear in the index page, linking to the second version in the table page.

As with deletion, the xmax value in the first version of the string is a sign that the string is locked.

Well, complete the transaction.

=> COMMIT;

Indices

So far, we have only talked about tabular pages. And what happens inside the indices?

The information in the index pages is highly dependent on the particular type of index. And even one type of index has different types of pages. For example, the B-tree has a page with metadata and “regular” pages.

However, a page usually has an array of pointers to the lines and the lines themselves (just like in a table page). In addition, at the end of the page there is a place for special data.

Rows in indexes can also have a very different structure depending on the type of index. For example, for a B-tree, rows related to leaf pages contain the value of the index key and a link (ctid) to the corresponding row of the table. In general, an index can be arranged in a completely different way.

The most important point is that there are no row versions in any type of index. Well, or we can assume that each line is represented by exactly one version. In other words, there are no xmin and xmax fields in the header of the index row. We can assume that the links from the index lead to all tabular versions of the rows - so you can only figure out which version the transaction will see if you look at the table. (As usual, this is not the whole truth. In some cases, the visibility map allows you to optimize the process, but we will examine it in more detail later.)

At the same time, in the index page we find pointers to both versions, both current and old:

=> SELECT itemoffset, ctid FROM bt_page_items('t_s_idx',1);

 itemoffset | ctid  
------------+-------
          1 | (0,2)
          2 | (0,1)
(2 rows)

Virtual transactions

In practice, PostgreSQL uses optimizations to “save” transaction numbers.

If a transaction only reads data, then it does not affect the visibility of row versions. Therefore, at first, the serving process issues a virtual number (virtual xid) transaction. The number consists of a process identifier and a sequential number.

The issuance of this number does not require synchronization between all processes and therefore is very fast. We will get to know another reason for using virtual numbers when we talk about freezing.

Virtual numbers are not taken into account in data snapshots.

At different points in time, virtual transactions with numbers that have already been used may well appear in the system, and this is normal. But such a number cannot be written to data pages, because the next time you access the page, it may lose all meaning.

=> BEGIN;
=> SELECT txid_current_if_assigned();

 txid_current_if_assigned 
--------------------------
(1 row)

If the transaction begins to change data, it is given a real, unique transaction number.

=> UPDATE accounts SET amount = amount - 1.00;
=> SELECT txid_current_if_assigned();

 txid_current_if_assigned 
--------------------------
                     3667
(1 row)

=> COMMIT;

Nested Transactions

Save points

In SQL are defined in terms of conservation (savepoint), that allow you to cancel the operation of the transaction, without interrupting it completely. But this does not fit into the above scheme, since the status of a transaction is one for all its changes, and physically no data is rolled back.

To implement such functionality, a transaction with a save point is divided into several separate nested transactions (subtransaction), the status of which can be controlled separately.

Nested transactions have their own number (higher than the main transaction number). The status of nested transactions is recorded in the usual way in XACT, however, the final status depends on the status of the main transaction: if it is canceled, then all nested transactions are also canceled.

Information on transaction nesting is stored in files in the PGDATA / pg_subtrans directory. Files are accessed through buffers in the shared memory of the instance, organized in the same way as XACT buffers.

Do not confuse nested transactions and autonomous transactions. Autonomous transactions are in no way dependent on each other, and nested ones are dependent. There are no autonomous transactions in the usual PostgreSQL, and, perhaps, for the better: in the case they are needed very, very rarely, and their presence in other DBMSs provokes abuse, from which everyone then suffers.

Clear the table, start the transaction and insert the line:

=> TRUNCATE TABLE t;
=> BEGIN;
=> INSERT INTO t(s) VALUES ('FOO');
=> SELECT txid_current();

 txid_current 
--------------
         3669
(1 row)

=> SELECT xmin, xmax, * FROM t;

 xmin | xmax | id |  s  
------+------+----+-----
 3669 |    0 |  2 | FOO
(1 row)

=> SELECT * FROM heap_page('t',0);

 ctid  | state  | xmin | xmax  | t_ctid 
-------+--------+------+-------+--------
 (0,1) | normal | 3669 | 0 (a) | (0,1)
(1 row)

Now put a savepoint and insert another row.

=> SAVEPOINT sp;
=> INSERT INTO t(s) VALUES ('XYZ');
=> SELECT txid_current();

 txid_current 
--------------
         3669
(1 row)

Note that the txid_current () function returns the number of the main, not nested, transaction.

=> SELECT xmin, xmax, * FROM t;

 xmin | xmax | id |  s  
------+------+----+-----
 3669 |    0 |  2 | FOO
 3670 |    0 |  3 | XYZ
(2 rows)

=> SELECT * FROM heap_page('t',0);

 ctid  | state  | xmin | xmax  | t_ctid 
-------+--------+------+-------+--------
 (0,1) | normal | 3669 | 0 (a) | (0,1)
 (0,2) | normal | 3670 | 0 (a) | (0,2)
(2 rows)

We roll back to the save point and insert the third row.

=> ROLLBACK TO sp;
=> INSERT INTO t VALUES ('BAR');
=> SELECT xmin, xmax, * FROM t;

 xmin | xmax | id |  s  
------+------+----+-----
 3669 |    0 |  2 | FOO
 3671 |    0 |  4 | BAR
(2 rows)

=> SELECT * FROM heap_page('t',0);

 ctid  | state  |   xmin   | xmax  | t_ctid 
-------+--------+----------+-------+--------
 (0,1) | normal | 3669     | 0 (a) | (0,1)
 (0,2) | normal | 3670 (a) | 0 (a) | (0,2)
 (0,3) | normal | 3671     | 0 (a) | (0,3)
(3 rows)

In the page, we continue to see the row added by the canceled nested transaction.

We fix the changes.

=> COMMIT;
=> SELECT xmin, xmax, * FROM t;

 xmin | xmax | id |  s  
------+------+----+-----
 3669 |    0 |  2 | FOO
 3671 |    0 |  4 | BAR
(2 rows)

=> SELECT * FROM heap_page('t',0);

 ctid  | state  |   xmin   | xmax  | t_ctid 
-------+--------+----------+-------+--------
 (0,1) | normal | 3669 (c) | 0 (a) | (0,1)
 (0,2) | normal | 3670 (a) | 0 (a) | (0,2)
 (0,3) | normal | 3671 (c) | 0 (a) | (0,3)
(3 rows)

Now you can clearly see that each nested transaction has its own status.

Note that nested transactions cannot be used explicitly in SQL, that is, you cannot start a new transaction without completing the current one. This mechanism is used implicitly when using savepoints, and also when handling PL / pgSQL exceptions and in a number of other, more exotic cases.

=> BEGIN;

BEGIN

=> BEGIN;

WARNING:  there is already a transaction in progress
BEGIN

=> COMMIT;

COMMIT

=> COMMIT;

WARNING:  there is no transaction in progress
COMMIT

Errors and atomicity of operations

What happens if an error occurs during the operation? For example, like this:

=> BEGIN;
=> SELECT * FROM t;

 id |  s  
----+-----
  2 | FOO
  4 | BAR
(2 rows)

=> UPDATE t SET s = repeat('X', 1/(id-4));

ERROR:  division by zero

An error has occurred. Now the transaction is considered aborted and not a single operation is allowed in it:

=> SELECT * FROM t;

ERROR:  current transaction is aborted, commands ignored until end of transaction block

And even if you try to commit the changes, PostgreSQL will report the cancellation:

=> COMMIT;

ROLLBACK

Why can't I continue the transaction after a failure? The fact is that an error could occur so that we would gain access to part of the changes - the atomicity of not even the transaction, but the operator would be violated. As in our example, where the operator managed to update one line before the error:

=> SELECT * FROM heap_page('t',0);

 ctid  | state  |   xmin   | xmax  | t_ctid 
-------+--------+----------+-------+--------
 (0,1) | normal | 3669 (c) | 3672  | (0,4)
 (0,2) | normal | 3670 (a) | 0 (a) | (0,2)
 (0,3) | normal | 3671 (c) | 0 (a) | (0,3)
 (0,4) | normal | 3672     | 0 (a) | (0,4)
(4 rows)

I must say that in psql there is a mode that still allows you to continue the transaction after a failure, as if the actions of the erroneous operator are rolled back.

=> \set ON_ERROR_ROLLBACK on
=> BEGIN;
=> SELECT * FROM t;

 id |  s  
----+-----
  2 | FOO
  4 | BAR
(2 rows)

=> UPDATE t SET s = repeat('X', 1/(id-4));

ERROR:  division by zero

=> SELECT * FROM t;

 id |  s  
----+-----
  2 | FOO
  4 | BAR
(2 rows)

=> COMMIT;

It is easy to guess that in this mode psql actually sets an implicit save point in front of each command, and in case of failure it initiates a rollback to it. This mode is not used by default, since setting save points (even without rolling back to them) is associated with significant overhead.

To be continued.

Tags: