erogov June 30, 2019 at 18:18

WAL in PostgreSQL: 1. Buffer cache

The previous series was devoted to the isolation and multi-version of PostgreSQL, and today we are starting a new one - about the write-ahead logging mechanism . Let me remind you that the material is based on administrative training courses that Pavel pluzanov and I do , but do not repeat them verbatim and are intended for thoughtful reading and independent experimentation.

This cycle will consist of four parts:

Buffer cache (this article);
Prerecord journal - how it is arranged and how it is used during recovery;
Checkpoint and background recording - why are they needed and how are they configured;
Log tuning - levels and tasks to be solved, reliability and performance.

Why is journaling necessary?

In the process, part of the data that the DBMS deals with is stored in RAM and written to disk (or other non-volatile media) in a deferred manner. The less often this happens, the less input-output and the faster the system works.

But what will happen in the event of a failure, for example, when the power is turned off or if an error occurs in the DBMS code or operating system? All the contents of the RAM will be lost, and only the data written to the disk will remain (with some types of failures, the disk may also suffer, but in this case only a backup copy will help). In principle, I / O can be organized in such a way that the data on the disk is always maintained in a consistent state, but this is difficult and not too efficient (as far as I know, only Firebird went this way).

Usually, including PostgreSQL, the data written to the disk is inconsistent, and when recovering from a failure, special actions are required to restore consistency. Journaling is the very mechanism that makes this possible.

Buffer cache

Strangely enough, we will start talking about journaling with a buffer cache. The buffer cache is not the only structure that is stored in RAM, but one of the most important and complex. Understanding the principle of its operation is important in itself, moreover, in this example we will get acquainted with how data is exchanged between RAM and disk.

Caching is used everywhere in modern computing systems; one processor alone can count three or four cache levels. In general, any cache is needed in order to smooth out the difference in performance between the two types of memory, one of which is relatively fast, but it is not enough for everyone, and the other is relatively slow, but abundant. So the buffer cache smooths the difference between the access time to RAM (nanoseconds) and to disk (milliseconds).

Note that the operating system also has a disk cache that solves the same problem. Therefore, DBMS usually try to avoid double caching by accessing the disk directly, bypassing the OS cache. But in the case of PostgreSQL, this is not so: all data is read and written using ordinary file operations.

In addition, disk arrays, and even the disks themselves, also have their own cache. This fact is still useful to us when we get to the issue of reliability.

But back to the DBMS buffer cache.

It is called so because it is an array of buffers . Each buffer is a place for one data page (block), plus a header. The title, among other things, contains:

location on the disk of the page in the buffer (file and block number in it);
a sign that the data on the page has changed and sooner or later should be written to disk (such a buffer is called dirty );
number of calls to the buffer (usage count);
flag of pinning the buffer (pin count).

The buffer cache is located in the server’s shared memory and is accessible to all processes. To work with data - read or modify, - processes read pages in the cache. While the page is in cache, we work with it in RAM and save on disk accesses.

Initially, the cache contains empty buffers, and all of them are linked into the list of free buffers. The meaning of the pointer to the “next victim” will become clear a bit later. To quickly find the desired page in the cache, a hash table is used.

Search page in cache

When a process needs to read a page, it first tries to find it in the buffer cache using a hash table. The hash key is the file number and page number inside the file. In the corresponding basket of the hash table, the process finds the buffer number and checks if it really contains the desired page. As with any hash table, collisions are possible here; in this case, the process will have to check multiple pages.

Using a hash table has long been criticized. This structure allows you to quickly find the buffer on the page, but it is completely useless if, for example, you need to find all the buffers occupied by a particular table. But no one has proposed a good substitute yet.

If the desired page is found in the cache, the process should “freeze” the buffer by increasing the pin count (several processes can do this simultaneously). As long as the buffer is fixed (counter value is greater than zero), it is considered that the buffer is used and its contents should not be “radically” changed. For example, a new version of the line may appear in the page - this does not bother anyone due to multi-versioning and visibility rules. But another page cannot be read into the pinned buffer.

crowding out

It may happen that the necessary page is not found in the cache. In this case, it must be read from the disk to some buffer.

If there are still free buffers in the cache, the first free one is selected. But sooner or later they will end (usually the size of the database is larger than the memory allocated for the cache) and then you will have to choose one of the occupied buffers, force out the page there and read a new one on the vacant space.

The preemption mechanism is based on the fact that each time the buffer is accessed, the processes increase the usage count in the header of the buffer. Thus, those buffers that are used less often than others have a lower counter value and are good candidates for crowding out.

The clock-sweep algorithm loops through all the buffers (using the pointer to the "next victim"), decreasing their access counts by one. For crowding out, the first buffer is selected, which:

has a zero hit counter (usage count),
and not fixed (zero pin count).

You can see that if all the buffers have a non-zero hit counter, then the algorithm will have to do more than one circle, resetting the counters, until one of them finally goes to zero. To avoid "winding circles" the maximum value of the hit counter is limited to 5. But still, with a large buffer cache size, this algorithm can cause significant overhead.

After the buffer is found, the following happens to it.

The buffer is pinned to show other processes that it is being used. In addition to fixing, other means of blocking are also used, but we will talk more about this separately.

If the buffer turned out to be dirty, that is, it contains changed data, the page cannot be simply discarded - first it needs to be saved to disk. This is not a good situation, since the process that is about to read the page has to wait for the recording of "foreign" data, but this effect is smoothed by the checkpoint and background recording processes, which will be discussed later.

Next, a new page is read from the disk into the selected buffer. The counter of the number of calls is set to one. In addition, the link to the loaded page must be registered in the hash table so that it can be found in the future.

Now the link to the “next victim” points to the next buffer, and the just loaded one has time to increase the hit counter until the pointer goes around the entire buffer cache and returns again.

With my own eyes

As is customary in PostgreSQL, there is an extension that allows you to look inside the buffer cache.

=> CREATE EXTENSION pg_buffercache;

Create a table and insert one row into it.

=> CREATE TABLE cacheme(
  id integer
) WITH (autovacuum_enabled = off);
=> INSERT INTO cacheme VALUES (1);

What will be in the buffer cache? At a minimum, a page should appear in it with a single line added. We will verify this with the following query, in which we select only the buffers belonging to our table (by the file number relfilenode), and decode the layer number (relforknumber):

=> SELECT bufferid,
  CASE relforknumber
    WHEN 0 THEN 'main'
    WHEN 1 THEN 'fsm'
    WHEN 2 THEN 'vm'
  END relfork,
  relblocknumber,
  isdirty,
  usagecount,
  pinning_backends
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('cacheme'::regclass);

 bufferid | relfork | relblocknumber | isdirty | usagecount | pinning_backends
----------+---------+----------------+---------+------------+------------------
    15735 | main    |              0 | t       |          1 |                0
(1 row)

So it is - there is one page in the buffer. It is dirty (isdirty), the hit counter is equal to one (usagecount), and it is not fixed by any process (pinning_backends).

Now add another line and repeat the query. To save letters, we insert a line in another session, and repeat the long request with a command \g.

|  => INSERT INTO cacheme VALUES (2);

=> \g

 bufferid | relfork | relblocknumber | isdirty | usagecount | pinning_backends
----------+---------+----------------+---------+------------+------------------
    15735 | main    |              0 | t       |          2 |                0
(1 row)

No new buffers were added - the second line fit on the same page. Please note that the usage counter has increased.

|  => SELECT * FROM cacheme;

|   id
|  ----
|    1
|    2
|  (2 rows)

=> \g

 bufferid | relfork | relblocknumber | isdirty | usagecount | pinning_backends
----------+---------+----------------+---------+------------+------------------
    15735 | main    |              0 | t       |          3 |                0
(1 row)

And after accessing the page for reading, the counter also increases.

And if you clean?

|  => VACUUM cacheme;

=> \g

 bufferid | relfork | relblocknumber | isdirty | usagecount | pinning_backends
----------+---------+----------------+---------+------------+------------------
    15731 | fsm     |              1 | t       |          1 |                0
    15732 | fsm     |              0 | t       |          1 |                0
    15733 | fsm     |              2 | t       |          2 |                0
    15734 | vm      |              0 | t       |          2 |                0
    15735 | main    |              0 | t       |          3 |                0
(5 rows)

The cleaning created a visibility map (one page) and a free space map (three pages - the minimum size of this map).

Well and so on.

Size setting

The cache size is set by the shared_buffers parameter . The default value is ridiculous 128 MB. This is one of the parameters that it makes sense to increase immediately after installing PostgreSQL.

=> SELECT setting, unit FROM pg_settings WHERE name = 'shared_buffers';

 setting | unit
---------+------
 16384   | 8kB
(1 row)

Keep in mind that changing a parameter requires a server restart, since all necessary cache memory is allocated at server startup.

For what reasons to choose the appropriate value?

Even the largest database has a limited set of “hot” data, with which active work is carried out at each moment in time. Ideally, this set should be placed in the buffer cache (plus some space for “one-time” data). If the cache size is smaller, then actively used pages will constantly squeeze each other, creating excessive input-output. But mindlessly increasing the cache is also wrong. With a large size, the overhead costs of maintaining it will increase, and in addition, RAM is also required for other needs.

Thus, the optimal buffer cache size will be different in different systems: it depends on the data, on the application, on the load. Unfortunately, there is no such magical meaning that will suit everyone equally well.

The standard recommendation is to take 1/4 of the RAM as a first approximation (for Windows prior to PostgreSQL 10 it was recommended to choose a size smaller).

And then you need to look at the situation. It’s best to do an experiment: increase or decrease cache size and compare system performance. Of course, for this it is necessary to have a test bench and be able to reproduce the typical load - in the production environment, such experiments look like dubious pleasure.

Be sure to look at the report by Nikolay Samokhvalov at PgConf-2019: " Industrial Approach to PostgreSQL Tuning : Database Experiments "

But some information about what is happening can be gleaned directly on a live system using the same pg_buffercache extension - most importantly, look at the right angle.

For example, you can study the distribution of buffers by their degree of use:

=> SELECT usagecount, count(*)
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;

 usagecount | count
------------+-------
          1 |   221
          2 |   869
          3 |    29
          4 |    12
          5 |   564
            | 14689
(6 rows)

In this case, a lot of empty counter values are free buffers. Not surprising for a system in which nothing happens.

You can see how much of which tables in our database are cached and how actively these data are used (by active use in this query we mean buffers with a usage counter of more than 3):

=> SELECT c.relname,
  count(*) blocks,
  round( 100.0 * 8192 * count(*) / pg_table_size(c.oid) ) "% of rel",
  round( 100.0 * 8192 * count(*) FILTER (WHERE b.usagecount > 3) / pg_table_size(c.oid) ) "% hot"
FROM pg_buffercache b
  JOIN pg_class c ON pg_relation_filenode(c.oid) = b.relfilenode
WHERE  b.reldatabase IN (
         0, (SELECT oid FROM pg_database WHERE datname = current_database())
       )
AND    b.usagecount is not null
GROUP BY c.relname, c.oid
ORDER BY 2 DESC
LIMIT 10;

          relname          | blocks | % of rel | % hot
---------------------------+--------+----------+-------
 vac                       |    833 |      100 |     0
 pg_proc                   |     71 |       85 |    37
 pg_depend                 |     57 |       98 |    19
 pg_attribute              |     55 |      100 |    64
 vac_s                     |     32 |        4 |     0
 pg_statistic              |     27 |       71 |    63
 autovac                   |     22 |      100 |    95
 pg_depend_reference_index |     19 |       48 |    35
 pg_rewrite                |     17 |       23 |     8
 pg_class                  |     16 |      100 |   100
(10 rows)

Here, for example, it can be seen that the vac table occupies the most place (we used it in one of the previous topics), but no one has addressed it for a long time and it has not yet been squeezed out simply because the free buffers have not yet run out.

You can come up with other sections that will provide useful information for thought. It is only necessary to consider that such requests:

must be repeated several times: the numbers will vary within certain limits;
it is not necessary to perform it constantly (as part of monitoring) due to the fact that the extension blocks operation with the buffer cache for a short time.

And one moment. We should not forget that PostgreSQL works with files through regular calls to the operating system and, thus, there is double caching: pages fall into both the DBMS buffer cache and the OS cache. Thus, the “miss” in the buffer cache does not always lead to the need for real I / O. But the strategy of crowding out the OS is different from the DBMS strategy: the operating system knows nothing about the meaning of the data read.

Mass displacement

In operations that perform bulk reading or writing of data, there is a danger of quickly displacing useful pages from the buffer cache with "one-time" data.

To prevent this from happening, the so-called buffer rings are used for such operations - a small part of the buffer cache is allocated for each operation. Extrusion acts only within the ring, so the rest of the buffer cache data does not suffer.

For sequential reading of large tables (the size of which exceeds a quarter of the buffer cache) 32 pages are allocated. If another process also needs this data while reading a table, it does not start reading the table first, but connects to an existing buffer ring. After scanning, he reads the “missed” beginning of the table.

Let's check. To do this, create a table so that one row occupies a whole page - it’s more convenient to count. The default buffer cache size is 128 MB = 16384 pages of 8 KB. So, you need to insert more than 4096 page-rows into the table.

=> CREATE TABLE big(
  id integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  s char(1000)
) WITH (fillfactor=10);
=> INSERT INTO big(s) SELECT 'FOO' FROM generate_series(1,4096+1);

Let's analyze the table.

=> ANALYZE big;
=> SELECT relpages FROM pg_class WHERE oid = 'big'::regclass;

 relpages
----------
     4097
(1 row)

Now we have to restart the server to clear the cache of the table data that the analysis read.

student$ sudo pg_ctlcluster 11 main restart

After the reboot, read the entire table:

=> EXPLAIN (ANALYZE, COSTS OFF) SELECT count(*) FROM big;

                             QUERY PLAN                              
---------------------------------------------------------------------
 Aggregate (actual time=14.472..14.473 rows=1 loops=1)
   ->  Seq Scan on big (actual time=0.031..13.022 rows=4097 loops=1)
 Planning Time: 0.528 ms
 Execution Time: 14.590 ms
(4 rows)

And make sure that only 32 buffers are occupied by tabular pages in the buffer cache:

=> SELECT count(*)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('big'::regclass);

 count
-------
    32
(1 row)

If sequential scanning is prohibited, the table will be read by index:

=> SET enable_seqscan = off;
=> EXPLAIN (ANALYZE, COSTS OFF) SELECT count(*) FROM big;

                                        QUERY PLAN                                         
-------------------------------------------------------------------------------------------
 Aggregate (actual time=50.300..50.301 rows=1 loops=1)
   ->  Index Only Scan using big_pkey on big (actual time=0.098..48.547 rows=4097 loops=1)
         Heap Fetches: 4097
 Planning Time: 0.067 ms
 Execution Time: 50.340 ms
(5 rows)

In this case, the buffer ring is not used and the entire table appears in the buffer cache (and almost the entire index, too):

=> SELECT count(*)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('big'::regclass);

 count
-------
  4097
(1 row)

In a similar way, buffer rings are used for the cleaning process (also 32 pages) and for bulk write operations COPY IN and CREATE TABLE AS SELECT (usually 2048 pages, but not more than 1/8 of the total buffer cache).

Temporary tables

An exception to the general rule is temporary tables. Since temporary data is visible to only one process, they have nothing to do in the shared buffer cache. Moreover, temporary data exists only within a single session, so it does not need to be protected from failure.

For temporary data, a cache is used in the local memory of the process that owns the table. Since such data is available to only one process, it does not need to be protected with locks. The local cache uses the usual preemptive algorithm.

Unlike the general buffer cache, memory for the local cache is allocated as needed, because temporary tables are not used in all sessions. The maximum amount of memory for temporary tables in one session is limited by the temp_buffers parameter .

Warming up the cache

After restarting the server, some time should pass before the cache “warms up” - accumulates actual actively used data. Sometimes it may be useful to immediately read the data of certain tables into the cache, and a special extension is designed for this:

=> CREATE EXTENSION pg_prewarm;

Previously, an extension could only read certain tables in the buffer cache (or only in the OS cache). But in PostgreSQL 11, it was able to save the current cache status to disk and restore it after a server reboot. To take advantage of this, you need to add the library to shared_preload_libraries and restart the server.

=> ALTER SYSTEM SET shared_preload_libraries = 'pg_prewarm';

student$ sudo pg_ctlcluster 11 main restart

The restart field, if the pg_prewarm.autoprewarm parameter has not changed , the autoprewarm master background process will automatically start, which once in pg_prewarm.autoprewarm_interval will flush the list of cached pages to disk (do not forget to take into account the new process when setting max_parallel_processes ).

=> SELECT name, setting, unit FROM pg_settings WHERE name LIKE 'pg_prewarm%';

              name               | setting | unit
---------------------------------+---------+------
 pg_prewarm.autoprewarm          | on      |
 pg_prewarm.autoprewarm_interval | 300     | s
(2 rows)

postgres$ ps -o pid,command --ppid `head -n 1 /var/lib/postgresql/11/main/postmaster.pid` | grep prewarm

10436 postgres: 11/main: autoprewarm master

Now there is no big table in the cache:

=> SELECT count(*)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('big'::regclass);

 count
-------
     0
(1 row)

If we assume that all its contents are very important, we can read it into the buffer cache by calling the following function:

=> SELECT pg_prewarm('big');

 pg_prewarm
------------
       4097
(1 row)

=> SELECT count(*)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('big'::regclass);

 count
-------
  4097
(1 row)

The list of pages is dumped to the autoprewarm.blocks file. To see it, you can just wait until the autoprewarm master process runs for the first time, but we manually initiate this:

=> SELECT autoprewarm_dump_now();

 autoprewarm_dump_now
----------------------
                 4340
(1 row)

The number of discarded pages is more than 4097 - this includes the pages of system catalog objects already read by the server. And here is the file:

postgres$ ls -l /var/lib/postgresql/11/main/autoprewarm.blocks

-rw------- 1 postgres postgres 102078 июн 29 15:51 /var/lib/postgresql/11/main/autoprewarm.blocks

Now restart the server again.

student$ sudo pg_ctlcluster 11 main restart

And immediately after the launch, our table again appears in the cache.

=> SELECT count(*)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('big'::regclass);

 count
-------
  4097
(1 row)

This provides the same autoprewarm master process: it reads the file, splits the pages into databases, sorts them (so that reading from the disk is as consistent as possible), and passes the autoprewarm worker to the separate workflow for processing.

To be continued .

Tags: