erogov May 14, 2019 at 12:30

MVCC-5. In-page cleaning and HOT

Let me remind you that we examined issues related to isolation , made a digression about organizing data at a low level , and then talked in detail about row versions and how snapshots are obtained from versions .

Today we will deal with two rather closely related issues: intra-page cleaning and HOT-updates . Both mechanisms can be classified as optimizations; they are important, but are almost not covered in user documentation.

In-page cleaning with regular updates

When accessing the page — both during updating and when reading — quick intra-page cleaning can occur if PostgreSQL understands that the page is running out of space. This occurs in two cases.

An update previously performed on this page (UPDATE) did not find enough space to place a new version of the row on the same page. This situation is remembered in the page title, and the next time the page is cleared.
The page is filled more than on fillfactor. In this case, cleaning occurs immediately, without delaying the next time.

Fillfactor is a storage parameter that can be defined for the table (and for the index). PostgreSQL inserts a new row (INSERT) on a page only if this page is less than fill percentor full. The remaining space is reserved for new versions of strings that result from updates (UPDATE). The default value for tables is 100, that is, space is not reserved (and the value for indexes is 90).

In-page cleaning removes versions of lines that are not visible in any image (located beyond the "event horizon" of the database, we talked about this last time), but it works strictly within the same tabular page. Pointers to scrubbed versions of strings are not freed, because they can be referenced from indexes, and the index is another page. In-page cleaning never goes beyond one tabular page, but it is very fast.

For the same reasons, the free space map is not updated; it also saves space for updates, not for inserts. The visibility map is not updated either.

The fact that a page can be cleared when reading means that a read request (SELECT) can cause pages to change. This is another such case, in addition to the previously deferred change of hint bits.

Let's see how this works, using an example. Create a table and indexes on both columns.

=> CREATE TABLE hot(id integer, s char(2000)) WITH (fillfactor = 75);
=> CREATE INDEX hot_id ON hot(id);
=> CREATE INDEX hot_s ON hot(s);

If only latin letters are stored in column s, then each version of the row will occupy 2004 bytes plus 24 bytes of the header. We set the fillfactor storage parameter to 75% - there will be enough space for three lines.

For convenience, we recreate an already familiar function, supplementing the output with two fields:

=> CREATE FUNCTION heap_page(relname text, pageno integer)
RETURNS TABLE(ctid tid, state text, xmin text, xmax text, hhu text, hot text, t_ctid tid)
AS $$
SELECT (pageno,lp)::text::tid AS ctid,
       CASE lp_flags
         WHEN 0 THEN 'unused'
         WHEN 1 THEN 'normal'
         WHEN 2 THEN 'redirect to '||lp_off
         WHEN 3 THEN 'dead'
       END AS state,
       t_xmin || CASE
         WHEN (t_infomask & 256) > 0 THEN ' (c)'
         WHEN (t_infomask & 512) > 0 THEN ' (a)'
         ELSE ''
       END AS xmin,
       t_xmax || CASE
         WHEN (t_infomask & 1024) > 0 THEN ' (c)'
         WHEN (t_infomask & 2048) > 0 THEN ' (a)'
         ELSE ''
       END AS xmax,
       CASE WHEN (t_infomask2 & 16384) > 0 THEN 't' END AS hhu,
       CASE WHEN (t_infomask2 & 32768) > 0 THEN 't' END AS hot,
       t_ctid
FROM heap_page_items(get_raw_page(relname,pageno))
ORDER BY lp;
$$ LANGUAGE SQL;

And let's create a function to look inside the index page:

=> CREATE FUNCTION index_page(relname text, pageno integer)
RETURNS TABLE(itemoffset smallint, ctid tid)
AS $$
SELECT itemoffset,
       ctid
FROM bt_page_items(relname,pageno);
$$ LANGUAGE SQL;

We’ll check how intra-page cleaning works. To do this, insert one line and change it several times:

=> INSERT INTO hot VALUES (1, 'A');
=> UPDATE hot SET s = 'B';
=> UPDATE hot SET s = 'C';
=> UPDATE hot SET s = 'D';

There are four versions of the line in the page:

=> SELECT * FROM heap_page('hot',0);

 ctid  | state  |   xmin   |   xmax   | hhu | hot | t_ctid 
-------+--------+----------+----------+-----+-----+--------
 (0,1) | normal | 3979 (c) | 3980 (c) |     |     | (0,2)
 (0,2) | normal | 3980 (c) | 3981 (c) |     |     | (0,3)
 (0,3) | normal | 3981 (c) | 3982     |     |     | (0,4)
 (0,4) | normal | 3982     | 0 (a)    |     |     | (0,4)
(4 rows)

As expected, we just exceeded the fillfactor threshold. This is indicated by the difference between the pagesize and upper values: it exceeds the threshold of 75% of the page size, which is 6144 bytes.

=> SELECT lower, upper, pagesize FROM page_header(get_raw_page('hot',0));

 lower | upper | pagesize 
-------+-------+----------
    40 |    64 |     8192
(1 row)

So, the next time you access the page, an in-page cleaning should occur. Check this out.

=> UPDATE hot SET s = 'E';
=> SELECT * FROM heap_page('hot',0);

 ctid  | state  |   xmin   | xmax  | hhu | hot | t_ctid 
-------+--------+----------+-------+-----+-----+--------
 (0,1) | dead   |          |       |     |     | 
 (0,2) | dead   |          |       |     |     | 
 (0,3) | dead   |          |       |     |     | 
 (0,4) | normal | 3982 (c) | 3983  |     |     | (0,5)
 (0,5) | normal | 3983     | 0 (a) |     |     | (0,5)
(5 rows)

All irrelevant versions of lines (0,1), (0,2) and (0,3) are cleared; after that, a new version of the line (0.5) is added to the vacated space.

The versions of lines remaining after cleaning are physically shifted to the side of the senior page addresses so that all free space is represented by one continuous fragment. The values of the pointers change accordingly. Thanks to this, there are no problems with fragmentation of free space in the page.

Pointers to deleted versions of strings cannot be freed because they are referenced from an index page. Let's look at the first page of the hot_s index (because the zero is busy with meta-information):

=> SELECT * FROM index_page('hot_s',1);

 itemoffset | ctid  
------------+-------
          1 | (0,1)
          2 | (0,2)
          3 | (0,3)
          4 | (0,4)
          5 | (0,5)
(5 rows)

We will see the same picture in another index:

=> SELECT * FROM index_page('hot_id',1);

 itemoffset | ctid  
------------+-------
          1 | (0,5)
          2 | (0,4)
          3 | (0,3)
          4 | (0,2)
          5 | (0,1)
(5 rows)

You can notice that the pointers to the table rows go here “backwards”, but it doesn’t matter, because in all versions of the rows the same value is id = 1. But in the previous index the pointers are ordered by s values, and this substantially.

With index access, PostgreSQL can get (0,1), (0,2), or (0,3) as the row version identifier. Then he will try to get the corresponding row from the table page, but thanks to the dead status of the pointer, he will find that such a version no longer exists and will ignore it. (In fact, the first time it detects a lack of a version of a table row, PostgreSQL will also change the status of the pointer in the index page so that it does not access the table page again.)

It’s important that intra-page cleaning works only within one tabular page and does not clear index pages.

HOT updates

Why is it bad to keep links to all versions of a string in the index?

Firstly, with any row change, you have to update all indexes created for the table: since a new version has appeared, you must have links to it. And you need to do this in any case, even if fields that are not included in the index change. Obviously, this is not very effective.

Secondly, the indexes accumulate links to historical versions of the string, which then have to be cleared along with the versions themselves (we will look at this a bit later).

Moreover, there is a feature of the implementation of the B-tree in PostgreSQL. If there is not enough space on the index page to insert a new row, the page is divided into two and all data is redistributed between them. This is called a split page. However, when deleting rows, two index pages no longer “stick together” into one. Because of this, the size of the index may not decrease even if a substantial part of the data is deleted.

Naturally, the more indexes are created on the table, the greater difficulties you have to face.

However, if the value of a column that does not belong to any index changes, then there is no point in creating an additional record in the B-tree containing the same key value. This is how optimization works, called the HOT update - the Heap-Only Tuple Update.

With this update, there is only one entry in the index page that refers to the very first version of the row in the table page. And already inside this tabular page a chain of versions is organized:

strings that are changed and included in the chain are marked with the Heap Hot Updated bit;
rows that are not referenced from the index are marked with the Heap Only Tuple bit (that is, “only the tabular version of the row”);
regular linking of string versions through the ctid field is supported.

If, when scanning an index, PostgreSQL gets into a table page and discovers a version marked as Heap Hot Updated, it understands that it does not need to stop and goes further along the entire update chain. Of course, for all versions of strings obtained in this way, visibility is checked before they are returned to the client.

To look at the operation of a HOT update, delete one index and clear the table.

=> DROP INDEX hot_s;
=> TRUNCATE TABLE hot;

Repeat the insert and update the row.

=> INSERT INTO hot VALUES (1, 'A');
=> UPDATE hot SET s = 'B';

Here is what we see in the table page:

=> SELECT * FROM heap_page('hot',0);

 ctid  | state  |   xmin   | xmax  | hhu | hot | t_ctid 
-------+--------+----------+-------+-----+-----+--------
 (0,1) | normal | 3986 (c) | 3987  | t   |     | (0,2)
 (0,2) | normal | 3987     | 0 (a) |     | t   | (0,2)
(2 rows)

In the page there is a chain of changes:

the Heap Hot Updated flag indicates that you need to go along the ctid chain,
the Heap Only Tuple flag indicates that there are no index links to this version of the row.

With further changes, the chain will grow (within the page):

=> UPDATE hot SET s = 'C';
=> UPDATE hot SET s = 'D';
=> SELECT * FROM heap_page('hot',0);

 ctid  | state  |   xmin   |   xmax   | hhu | hot | t_ctid 
-------+--------+----------+----------+-----+-----+--------
 (0,1) | normal | 3986 (c) | 3987 (c) | t   |     | (0,2)
 (0,2) | normal | 3987 (c) | 3988 (c) | t   | t   | (0,3)
 (0,3) | normal | 3988 (c) | 3989     | t   | t   | (0,4)
 (0,4) | normal | 3989     | 0 (a)    |     | t   | (0,4)
(4 rows)

Moreover, in the index there is one single reference to the “head” of the chain:

=> SELECT * FROM index_page('hot_id',1);

 itemoffset | ctid  
------------+-------
          1 | (0,1)
(1 row)

We emphasize that HOT-updates work if the updated fields are not included in any index. Otherwise, in some index there would be a link directly to the new version of the string, which contradicts the idea of this optimization.

Optimization works only within the limits of one page; therefore, an additional bypass of the chain does not require access to other pages and does not impair performance.

In-page cleaning with HOT updates

A special but important case of intra-page cleaning is cleaning during HOT updates.

Like last time, we already exceeded the fillfactor threshold, so the next update should lead to an in-page cleaning. But this time in the page is a chain of updates. The "head" of this HOT chain should always remain in its place, since the index refers to it, and the rest of the pointers can be freed: it is known that they are not referenced from the outside.

In order not to touch the "head", double addressing is used: the pointer to which the index refers - in this case (0,1) - receives the status of "redirect", redirecting to the desired version of the string.

=> UPDATE hot SET s = 'E';
=> SELECT * FROM heap_page('hot',0);

 ctid  |     state     |   xmin   | xmax  | hhu | hot | t_ctid 
-------+---------------+----------+-------+-----+-----+--------
 (0,1) | redirect to 4 |          |       |     |     | 
 (0,2) | normal        | 3990     | 0 (a) |     | t   | (0,2)
 (0,3) | unused        |          |       |     |     | 
 (0,4) | normal        | 3989 (c) | 3990  | t   | t   | (0,2)
(4 rows)

Note that:

versions (0,1), (0,2) and (0,3) were cleared,
The head pointer (0,1) remained, but received the redirect status,
a new version of the line is written in place (0.2), since this version was guaranteed to have no links from indexes and the pointer was freed (unused).

Perform the update a few more times:

=> UPDATE hot SET s = 'F';
=> UPDATE hot SET s = 'G';
=> SELECT * FROM heap_page('hot',0);

 ctid  |     state     |   xmin   |   xmax   | hhu | hot | t_ctid 
-------+---------------+----------+----------+-----+-----+--------
 (0,1) | redirect to 4 |          |          |     |     | 
 (0,2) | normal        | 3990 (c) | 3991 (c) | t   | t   | (0,3)
 (0,3) | normal        | 3991 (c) | 3992     | t   | t   | (0,5)
 (0,4) | normal        | 3989 (c) | 3990 (c) | t   | t   | (0,2)
 (0,5) | normal        | 3992     | 0 (a)    |     | t   | (0,5)
(5 rows)

The following update again causes intra-page cleanup:

=> UPDATE hot SET s = 'H';
=> SELECT * FROM heap_page('hot',0);

 ctid  |     state     |   xmin   | xmax  | hhu | hot | t_ctid 
-------+---------------+----------+-------+-----+-----+--------
 (0,1) | redirect to 5 |          |       |     |     | 
 (0,2) | normal        | 3993     | 0 (a) |     | t   | (0,2)
 (0,3) | unused        |          |       |     |     | 
 (0,4) | unused        |          |       |     |     | 
 (0,5) | normal        | 3992 (c) | 3993  | t   | t   | (0,2)
(5 rows)

Again, some versions are cleared, and the pointer to the "head" is accordingly shifted.

Conclusion: with frequent updates to columns outside the indexes, it might make sense to reduce the fillfactor parameter to reserve some space on the page for updates. Of course, we must take into account that the lower the fillfactor, the more unallocated space remains on the page and, accordingly, the physical size of the table increases.

HOT chain break

If there is not enough free space on the page to post a new version of a line, the chain will break. The version of the line posted on another page will have to make a separate link from the index.

To get this situation, we start a parallel transaction and build a data snapshot in it.

|  => BEGIN ISOLATION LEVEL REPEATABLE READ;
|  => SELECT count(*) FROM hot;

|   count 
|  -------
|       1
|  (1 row)

A snapshot will not clear the version of the lines on the page. Now we perform the update in the first session:

=> UPDATE hot SET s = 'I';
=> UPDATE hot SET s = 'J';
=> UPDATE hot SET s = 'K';
=> SELECT * FROM heap_page('hot',0);

 ctid  |     state     |   xmin   |   xmax   | hhu | hot | t_ctid 
-------+---------------+----------+----------+-----+-----+--------
 (0,1) | redirect to 2 |          |          |     |     | 
 (0,2) | normal        | 3993 (c) | 3994 (c) | t   | t   | (0,3)
 (0,3) | normal        | 3994 (c) | 3995 (c) | t   | t   | (0,4)
 (0,4) | normal        | 3995 (c) | 3996     | t   | t   | (0,5)
 (0,5) | normal        | 3996     | 0 (a)    |     | t   | (0,5)
(5 rows)

The next time the page is refreshed, there will not be enough space on the page, but in-page cleaning will not be able to free anything:

=> UPDATE hot SET s = 'L';

|  => COMMIT; -- снимок больше не нужен

=> SELECT * FROM heap_page('hot',0);

 ctid  |     state     |   xmin   |   xmax   | hhu | hot | t_ctid 
-------+---------------+----------+----------+-----+-----+--------
 (0,1) | redirect to 2 |          |          |     |     | 
 (0,2) | normal        | 3993 (c) | 3994 (c) | t   | t   | (0,3)
 (0,3) | normal        | 3994 (c) | 3995 (c) | t   | t   | (0,4)
 (0,4) | normal        | 3995 (c) | 3996 (c) | t   | t   | (0,5)
 (0,5) | normal        | 3996 (c) | 3997     |     | t   | (1,1)
(5 rows)

In version (0.5) we see a link to (1.1) leading to page 1.

=> SELECT * FROM heap_page('hot',1);

 ctid  | state  | xmin | xmax  | hhu | hot | t_ctid 
-------+--------+------+-------+-----+-----+--------
 (1,1) | normal | 3997 | 0 (a) |     |     | (1,1)
(1 row)

Now there are two rows in the index, each of which points to the beginning of its HOT chain:

=> SELECT * FROM index_page('hot_id',1);

 itemoffset | ctid  
------------+-------
          1 | (1,1)
          2 | (0,1)
(2 rows)

Unfortunately, information about in-page cleaning and HOT-updates is practically absent in the documentation, and the truth must be sought in the source code. I recommend starting with README.HOT .

To be continued .

Tags:

MVCC-5. In-page cleaning and HOT

In-page cleaning with regular updates

HOT updates

In-page cleaning with HOT updates

HOT chain break

Also popular now: