
MVCC-6. Cleaning
We started with issues related to isolation, made a digression about low-level data organization, then discussed row versions in detail and how data snapshots are obtained from them.
Last time we talked about HOT updates and intra-page cleaning, and today we'll take a look at the well-known ordinary cleaning, vacuum vulgaris. Yes, so much has already been written about it that I'm unlikely to say anything new, but completeness of the picture demands sacrifices. Be patient.
Normal cleaning (vacuum)
What cleaning does
Intra-page cleaning is fast, but it frees only part of the space: it operates within a single table page and does not touch the indexes.
The main, “normal” cleaning is performed by the VACUUM command, and we will call it simply cleaning (auto-cleaning will be discussed separately).
So, cleaning processes the table in full. It removes not only unnecessary row versions, but also the references to them from all the indexes.
Processing happens concurrently with other activity in the system: the table and its indexes can be used as usual, both for reading and for modification (although concurrent execution of commands such as CREATE INDEX, ALTER TABLE, and a few others will not be possible).
Only the pages where some activity took place are examined. The visibility map is used for this (a reminder: it marks pages that contain only row versions old enough to be guaranteed visible in all data snapshots). Only pages not marked in the map are processed, and the map itself gets updated along the way.
In the process, the free space map is updated to reflect the free space that appears in the pages.
As usual, let's create a table:
=> CREATE TABLE vac(
id serial,
s char(100)
) WITH (autovacuum_enabled = off);
=> CREATE INDEX vac_s ON vac(s);
=> INSERT INTO vac(s) VALUES ('A');
=> UPDATE vac SET s = 'B';
=> UPDATE vac SET s = 'C';
With the autovacuum_enabled storage parameter we turn automatic cleaning off. We will talk about it next time; for now it is important for the experiments that we control cleaning manually.
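For reference, such a per-table storage parameter can later be changed or dropped with ALTER TABLE; a quick sketch (not executed now, since the experiments need automatic cleaning to stay off):

=> ALTER TABLE vac SET (autovacuum_enabled = on);
=> ALTER TABLE vac RESET (autovacuum_enabled);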
The table now contains three versions of the row, and each is referenced from the index:
=> SELECT * FROM heap_page('vac',0);
ctid | state | xmin | xmax | hhu | hot | t_ctid
-------+--------+----------+----------+-----+-----+--------
(0,1) | normal | 4000 (c) | 4001 (c) | | | (0,2)
(0,2) | normal | 4001 (c) | 4002 | | | (0,3)
(0,3) | normal | 4002 | 0 (a) | | | (0,3)
(3 rows)
=> SELECT * FROM index_page('vac_s',1);
itemoffset | ctid
------------+-------
1 | (0,1)
2 | (0,2)
3 | (0,3)
(3 rows)
After cleaning, the "dead" versions disappear, and only the single current one remains. The index, too, keeps just one reference:
=> VACUUM vac;
=> SELECT * FROM heap_page('vac',0);
ctid | state | xmin | xmax | hhu | hot | t_ctid
-------+--------+----------+-------+-----+-----+--------
(0,1) | unused | | | | |
(0,2) | unused | | | | |
(0,3) | normal | 4002 (c) | 0 (a) | | | (0,3)
(3 rows)
=> SELECT * FROM index_page('vac_s',1);
itemoffset | ctid
------------+-------
1 | (0,3)
(1 row)
Note that the first two pointers got the status unused rather than dead, as they would have with intra-page cleaning.
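Incidentally, we can also check that cleaning has updated the visibility map. The pg_visibility extension (available since version 9.6) lets us look at the map directly; a small sketch, assuming the extension is available:

=> CREATE EXTENSION pg_visibility;
-- all_visible should now be true for the fully cleaned page 0
=> SELECT * FROM pg_visibility_map('vac',0);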
And again about the transaction horizon
How does PostgreSQL determine which row versions can be considered “dead”? We already considered the concept of a transaction horizon when we talked about data snapshots, but this is such an important topic that it’s not a sin to repeat it.
Let's repeat the previous experiment.
=> TRUNCATE vac;
=> INSERT INTO vac(s) VALUES ('A');
=> UPDATE vac SET s = 'B';
But before updating the row one more time, let another transaction begin (and not complete). In this example it runs at the Read Committed level, but it has to get a real (non-virtual) transaction number. For instance, it can change, or even simply lock, some rows in any table, not necessarily vac:
| => BEGIN;
| => SELECT s FROM t FOR UPDATE;
| s
| -----
| FOO
| BAR
| (2 rows)
=> UPDATE vac SET s = 'C';
Again there are three row versions in the table and three references in the index. What happens after cleaning?
=> VACUUM vac;
=> SELECT * FROM heap_page('vac',0);
ctid | state | xmin | xmax | hhu | hot | t_ctid
-------+--------+----------+----------+-----+-----+--------
(0,1) | unused | | | | |
(0,2) | normal | 4005 (c) | 4007 (c) | | | (0,3)
(0,3) | normal | 4007 (c) | 0 (a) | | | (0,3)
(3 rows)
=> SELECT * FROM index_page('vac_s',1);
itemoffset | ctid
------------+-------
1 | (0,2)
2 | (0,3)
(2 rows)
Two versions of the row remain in the table: cleaning decided that version (0,2) cannot be removed yet. The reason, of course, is the database transaction horizon, which in our example is determined by the uncompleted transaction:
| => SELECT backend_xmin FROM pg_stat_activity WHERE pid = pg_backend_pid();
| backend_xmin
| --------------
| 4006
| (1 row)
We can ask cleaning to report on what it is doing:
=> VACUUM VERBOSE vac;
INFO: vacuuming "public.vac"
INFO: index "vac_s" now contains 2 row versions in 2 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "vac": found 0 removable, 2 nonremovable row versions in 1 out of 1 pages
DETAIL: 1 dead row versions cannot be removed yet, oldest xmin: 4006
There were 1 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
Note:
- 2 nonremovable row versions - 2 versions were found in the table that cannot be deleted,
- 1 dead row versions cannot be removed yet - including 1 “dead”,
- oldest xmin shows the current horizon.
Let's repeat the conclusion once more: long-lived transactions in the database (still open, or genuinely long-running) can lead to table bloat regardless of how often cleaning is performed. That is why OLTP and OLAP workloads combine poorly in a single PostgreSQL database: reports running for hours will prevent frequently updated tables from being cleaned in time. A possible solution is to set up a separate “reporting” replica.
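In practice, the transactions holding back the horizon can be spotted in pg_stat_activity. A rough monitoring query (a sketch, not part of the original experiment):

-- the greater age(backend_xmin), the more dead row versions that backend may be retaining
=> SELECT pid, age(backend_xmin), xact_start, query
   FROM pg_stat_activity
   WHERE backend_xmin IS NOT NULL
   ORDER BY age(backend_xmin) DESC;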
Once the open transaction completes, the horizon advances and the situation corrects itself:
| => COMMIT;
=> VACUUM VERBOSE vac;
INFO: vacuuming "public.vac"
INFO: scanned index "vac_s" to remove 1 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: "vac": removed 1 row versions in 1 pages
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: index "vac_s" now contains 1 row versions in 2 pages
DETAIL: 1 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "vac": found 1 removable, 1 nonremovable row versions in 1 out of 1 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 4008
There were 1 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
Now the page contains only the single current version of the row:
=> SELECT * FROM heap_page('vac',0);
ctid | state | xmin | xmax | hhu | hot | t_ctid
-------+--------+----------+-------+-----+-----+--------
(0,1) | unused | | | | |
(0,2) | unused | | | | |
(0,3) | normal | 4007 (c) | 0 (a) | | | (0,3)
(3 rows)
There is also only one entry in the index:
=> SELECT * FROM index_page('vac_s',1);
itemoffset | ctid
------------+-------
1 | (0,3)
(1 row)
What's going on inside
Cleaning has to process the table and the indexes at the same time, and do it without blocking other processes. How does it manage that?
It all starts with a table scan (taking the visibility map into account, as noted above). In the pages read, unnecessary row versions are identified, and their identifiers (tids) are written into a special array. The array lives in the local memory of the cleaning process, in a chunk of maintenance_work_mem size; the default value of this parameter is 64 MB. Note that this memory is allocated in full right away rather than on demand (although for a small table a smaller chunk is allocated).
Then one of two things happens: either we reach the end of the table, or the memory allocated for the array runs out. In either case, the index cleanup phase begins. Each index created on the table is scanned in full, looking for entries that reference the remembered row versions; the entries found are removed from the index pages.
At this point we have the following picture: the indexes no longer hold references to the unnecessary row versions, but the versions themselves are still in the table. This contradicts nothing: a query either never reaches the dead row versions at all (with index access) or weeds them out during the visibility check (when scanning the table).
After that, the table cleanup phase begins. The table is scanned again, the required pages are read, the remembered row versions are removed from them, and the pointers are freed. This can be done because there are no more references from the indexes.
If on the first pass the table was not completely read, then the array is cleared and everything is repeated from the place where we left off.
Thus:
- the table is always scanned twice;
- if cleaning has to remove so many row versions that they do not all fit into maintenance_work_mem at once, all the indexes will be scanned in full as many times as needed (a rough estimate of the number of passes is sketched below).
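The number of index passes is easy to estimate in advance: the number of dead row versions divided by the capacity of the tid array, rounded up. For the experiment we will run below (500,000 dead row versions, 1 MB of memory, 6 bytes per tid), a back-of-the-envelope check:

=> SELECT ceil(500000 / (1024 * 1024 / 6)::numeric) AS index_passes;
 index_passes
--------------
            3
(1 row)

As we will see in the monitoring section, cleaning indeed makes three passes over the index.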
On large tables this can take considerable time and put a noticeable load on the system. Queries will not be blocked, of course, but the extra I/O is unpleasant all the same.
To speed things up, it makes sense either to run cleaning more often (so that each run removes fewer row versions) or to allocate more memory.
I'll note in passing that starting with version 11 PostgreSQL can skip scanning the indexes when it is not strictly necessary. This should make life easier for owners of large tables whose rows are only ever added (never changed).
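As far as I can tell, this behavior is governed by the vacuum_cleanup_index_scale_factor parameter, which can also be set as a storage parameter of an individual B-tree index; a hedged sketch:

=> SHOW vacuum_cleanup_index_scale_factor;
=> ALTER INDEX vac_s SET (vacuum_cleanup_index_scale_factor = 10);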
Monitoring
How can we tell that cleaning cannot finish its work in a single pass?
We have already seen the first way: call the VACUUM command with VERBOSE, and information about the work phases is printed to the console.
Second, starting with version 9.6 there is the pg_stat_progress_vacuum view, which also contains all the necessary information.
(There is also a third way: sending the information to the server message log, but that works only for auto-cleaning, which will be discussed next time.)
Let's insert enough rows into the table for cleaning to take a noticeable amount of time, and update them all so that cleaning has something to do:
=> TRUNCATE vac;
=> INSERT INTO vac(s) SELECT 'A' FROM generate_series(1,500000);
=> UPDATE vac SET s = 'B';
Let's reduce the amount of memory allocated for the identifier array:
=> ALTER SYSTEM SET maintenance_work_mem = '1MB';
=> SELECT pg_reload_conf();
We start cleaning and, while it runs, query the pg_stat_progress_vacuum view a few times:
=> VACUUM VERBOSE vac;
| => SELECT * FROM pg_stat_progress_vacuum \gx
| -[ RECORD 1 ]------+------------------
| pid | 6715
| datid | 41493
| datname | test
| relid | 57383
| phase | vacuuming indexes
| heap_blks_total | 16667
| heap_blks_scanned | 2908
| heap_blks_vacuumed | 0
| index_vacuum_count | 0
| max_dead_tuples | 174762
| num_dead_tuples | 174480
| => SELECT * FROM pg_stat_progress_vacuum \gx
| -[ RECORD 1 ]------+------------------
| pid | 6715
| datid | 41493
| datname | test
| relid | 57383
| phase | vacuuming indexes
| heap_blks_total | 16667
| heap_blks_scanned | 5816
| heap_blks_vacuumed | 2907
| index_vacuum_count | 1
| max_dead_tuples | 174762
| num_dead_tuples | 174480
Here we can see, in particular:
- the name of the current phase (phase) - we discussed the three main phases, but in general there are more of them;
- the total number of table pages (heap_blks_total);
- the number of pages scanned (heap_blks_scanned);
- the number of pages already cleaned (heap_blks_vacuumed);
- the number of index cleanup passes (index_vacuum_count).
Overall progress is determined by the ratio of heap_blks_vacuumed to heap_blks_total, but keep in mind that this value changes not smoothly but in jumps because of the index scans. The main thing to watch, though, is the number of cleanup cycles: a value greater than 1 means the allocated memory was not enough to finish cleaning in one pass.
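The ratio can be computed right in the query, along these lines (a sketch):

=> SELECT relid::regclass,
          round(100.0 * heap_blks_vacuumed / nullif(heap_blks_total, 0), 2) AS "% vacuumed",
          index_vacuum_count
   FROM pg_stat_progress_vacuum;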
The output of the VACUUM VERBOSE command, which has finished by this point, shows the overall picture:
INFO: vacuuming "public.vac"
INFO: scanned index "vac_s" to remove 174480 row versions
DETAIL: CPU: user: 0.50 s, system: 0.07 s, elapsed: 1.36 s
INFO: "vac": removed 174480 row versions in 2908 pages
DETAIL: CPU: user: 0.02 s, system: 0.02 s, elapsed: 0.13 s
INFO: scanned index "vac_s" to remove 174480 row versions
DETAIL: CPU: user: 0.26 s, system: 0.07 s, elapsed: 0.81 s
INFO: "vac": removed 174480 row versions in 2908 pages
DETAIL: CPU: user: 0.01 s, system: 0.02 s, elapsed: 0.10 s
INFO: scanned index "vac_s" to remove 151040 row versions
DETAIL: CPU: user: 0.13 s, system: 0.04 s, elapsed: 0.47 s
INFO: "vac": removed 151040 row versions in 2518 pages
DETAIL: CPU: user: 0.01 s, system: 0.02 s, elapsed: 0.08 s
INFO: index "vac_s" now contains 500000 row versions in 17821 pages
DETAIL: 500000 index row versions were removed.
8778 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "vac": found 500000 removable, 500000 nonremovable row versions in 16667 out of 16667 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 4011
There were 0 unused item pointers.
0 pages are entirely empty.
CPU: user: 1.10 s, system: 0.37 s, elapsed: 3.71 s.
VACUUM
Here we can see that there were three passes over the index in total, and each of the first two cleaned out 174,480 pointers to dead row versions. Where does this number come from? One reference (tid) takes 6 bytes, and 1024 * 1024 / 6 = 174762, which is exactly the number we saw in pg_stat_progress_vacuum.max_dead_tuples. In practice slightly less may be used: this guarantees that, when the next page is read, all of its pointers to dead row versions are certain to fit in memory.
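The arithmetic is easy to verify right in psql; note also that the three passes together account for exactly all of the 500,000 dead versions:

=> SELECT 1024 * 1024 / 6 AS max_dead_tuples,
          174480 * 2 + 151040 AS total_removed;
 max_dead_tuples | total_removed
-----------------+---------------
          174762 |        500000
(1 row)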
Analysis
Analysis, or, in other words, gathering statistics for the query planner, is not formally related to cleaning. Nevertheless, we can perform it not only with the ANALYZE command, but can also combine cleaning with analysis: VACUUM ANALYZE. In this case cleaning runs first and analysis after it, so there is no saving.
But, as we will see later, automatic cleaning and automatic analysis are performed by a single process and are managed in a similar way.
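For completeness, the combined call is a one-liner, equivalent to running the two commands one after the other:

=> VACUUM ANALYZE vac;
-- the same as:
=> VACUUM vac;
=> ANALYZE vac;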
Full cleaning (vacuum full)
As we have seen, ordinary cleaning frees up more space than intra-page cleaning, but even it does not always solve the problem completely.
If a table or an index has for some reason grown significantly in size, ordinary cleaning frees space inside the existing pages: “holes” appear in them, which are later used for inserting new row versions. But the number of pages does not change, so from the operating system's point of view the files occupy exactly as much space as before cleaning. And that is bad, because:
- a full scan of the table (or index) becomes slower;
- a larger buffer cache may be needed (since it is whole pages that are cached, and the density of useful information in them drops);
- an “extra” level may appear in the index tree, slowing down index access;
- the files take up extra space on disk and in backups.
(The only exception is completely emptied pages at the very end of a file: they are “bitten off” the file and returned to the operating system.)
If the share of useful information in the files drops below some reasonable limit, the administrator can perform a full cleaning of the table. In that case, the table and all its indexes are rebuilt from scratch, with the data packed as compactly as possible (taking the fillfactor parameter into account, of course). PostgreSQL rebuilds the table first and then each of its indexes, one after another. New files are created for every object, and the old files are deleted when the rebuild finishes. Keep in mind that extra disk space is required while this is in progress.
To illustrate, let's insert a number of rows into the table again:
=> TRUNCATE vac;
=> INSERT INTO vac(s) SELECT 'A' FROM generate_series(1,500000);
How can we estimate the density of the information? A special extension comes in handy:
=> CREATE EXTENSION pgstattuple;
=> SELECT * FROM pgstattuple('vac') \gx
-[ RECORD 1 ]------+---------
table_len | 68272128
tuple_count | 500000
tuple_len | 64500000
tuple_percent | 94.47
dead_tuple_count | 0
dead_tuple_len | 0
dead_tuple_percent | 0
free_space | 38776
free_percent | 0.06
The function reads the entire table and shows statistics on how the space in its files is used. The main field of interest right now is tuple_percent: the percentage of space occupied by useful data. It is below 100 because of the inevitable overhead of service information inside each page, but it is still quite high.
For the index, other information is displayed, but the avg_leaf_density field has the same meaning: the percentage of useful information (in leaf pages).
=> SELECT * FROM pgstatindex('vac_s') \gx
-[ RECORD 1 ]------+---------
version | 3
tree_level | 3
index_size | 72802304
root_block_no | 2722
internal_pages | 241
leaf_pages | 8645
empty_pages | 0
deleted_pages | 0
avg_leaf_density | 83.77
leaf_fragmentation | 64.25
And here is the size of the table and index:
=> SELECT pg_size_pretty(pg_table_size('vac')) table_size,
pg_size_pretty(pg_indexes_size('vac')) index_size;
table_size | index_size
------------+------------
65 MB | 69 MB
(1 row)
Now let's delete 90% of all the rows. We pick the rows to delete at random, so that with high probability at least one row survives in every page:
=> DELETE FROM vac WHERE random() < 0.9;
DELETE 450189
What size will the objects be after ordinary cleaning?
=> VACUUM vac;
=> SELECT pg_size_pretty(pg_table_size('vac')) table_size,
pg_size_pretty(pg_indexes_size('vac')) index_size;
table_size | index_size
------------+------------
65 MB | 69 MB
(1 row)
We can see that the sizes have not changed: ordinary cleaning cannot shrink the files in any way, even though the density of the information has obviously dropped by about a factor of 10:
=> SELECT vac.tuple_percent, vac_s.avg_leaf_density
FROM pgstattuple('vac') vac, pgstatindex('vac_s') vac_s;
tuple_percent | avg_leaf_density
---------------+------------------
9.41 | 9.73
(1 row)
Now let's check what happens after full cleaning. These are the files currently used by the table and the index:
=> SELECT pg_relation_filepath('vac'), pg_relation_filepath('vac_s');
pg_relation_filepath | pg_relation_filepath
----------------------+----------------------
base/41493/57392 | base/41493/57393
(1 row)
=> VACUUM FULL vac;
=> SELECT pg_relation_filepath('vac'), pg_relation_filepath('vac_s');
pg_relation_filepath | pg_relation_filepath
----------------------+----------------------
base/41493/57404 | base/41493/57407
(1 row)
The files have been replaced with new ones. The sizes of the table and the index have shrunk considerably, and the density of the information has grown accordingly:
=> SELECT pg_size_pretty(pg_table_size('vac')) table_size,
pg_size_pretty(pg_indexes_size('vac')) index_size;
table_size | index_size
------------+------------
6648 kB | 6480 kB
(1 row)
=> SELECT vac.tuple_percent, vac_s.avg_leaf_density
FROM pgstattuple('vac') vac, pgstatindex('vac_s') vac_s;
tuple_percent | avg_leaf_density
---------------+------------------
94.39 | 91.08
(1 row)
Note that the density of information in the index has become even higher than it was originally: rebuilding a B-tree index from the available data is more efficient than inserting the data into an existing index row by row.
The pgstattuple extension functions we used read the entire table. For a large table this is inconvenient, so the extension also provides the pgstattuple_approx function, which skips pages marked in the visibility map and reports approximate figures.
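The call looks just like the exact one; we won't show the output here, but the fields are approximate counterparts of those above (a sketch):

=> SELECT * FROM pgstattuple_approx('vac') \gx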
An even faster, but even less accurate, way is to estimate the ratio of the data volume to the file size from the system catalog. Variants of such queries can be found on the wiki.
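A primitive variant of the same idea, assuming reasonably fresh statistics in pg_class (a sketch of my own, not one of the wiki queries):

=> SELECT relpages,
          reltuples::bigint,
          round(reltuples::numeric / nullif(relpages, 0), 1) AS tuples_per_page
   FROM pg_class
   WHERE relname = 'vac';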
Full cleaning is not meant for regular use, since it completely blocks all access to the table (queries included) for its entire duration. On an actively used system this is usually unacceptable. Locks will be considered separately; for now we will just mention the pg_repack extension, which locks the table only briefly, at the end of its work.
Similar commands
There are several more commands that, like full cleaning, completely rebuild tables and indexes. All of them fully block access to the table, and all of them delete the old data files and create new ones.
The CLUSTER command is similar to VACUUM FULL in every way, but it additionally orders the row versions physically according to one of the available indexes. In some cases this lets the planner use index access more efficiently. Keep in mind, though, that the clustering is not maintained: subsequent changes to the table will break the physical ordering of row versions.
The REINDEX command rebuilds an individual index on the table. VACUUM FULL and CLUSTER actually use this command internally to rebuild the indexes.
The TRUNCATE command is logically equivalent to DELETE without a condition: it removes all the rows in the table. But DELETE, as discussed earlier, only marks row versions as deleted, leaving them to be cleaned later, while TRUNCATE simply creates a new, empty file. As a rule this works faster, but keep in mind that TRUNCATE fully blocks access to the table until the end of the transaction.
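For reference, here is what these commands look like for our test objects (sketches only; we do not run them here):

=> CLUSTER vac USING vac_s;
-- rebuild a single index, or all indexes of the table
=> REINDEX INDEX vac_s;
=> REINDEX TABLE vac;
=> TRUNCATE vac;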
To be continued.