Parallelism in PostgreSQL: not spherical, not a horse, not in a vacuum
Scaling a DBMS is a future that keeps arriving. DBMSs are improved and scale better on their hardware platforms, while the platforms themselves grow in performance, core counts, and memory — Achilles keeps chasing the tortoise but has not yet caught it. The problem of database scaling looms large.
Postgres Professional has faced the scaling problem not only in theory but in practice: through its customers, and more than once. This article discusses one of those cases.
PostgreSQL scales well on NUMA systems when they are a single motherboard with multiple processors and multiple data buses. Some of the relevant optimizations can be read about here and here. There is, however, another class of systems: they have several motherboards that exchange data over an interconnect, run a single operating system instance, and look to the user like a single machine. Although formally such systems can also be classified as NUMA, in essence they are closer to supercomputers, since access to a node's local memory differs radically from access to a neighboring node's memory. The PostgreSQL community believes that a single Postgres instance running on such architectures is a source of problems, and there is no systematic approach to solving them.
The explanation is that a software architecture built around shared memory fundamentally assumes that the access times of different processes to their own and to remote memory are more or less comparable. When we work with many nodes, the bet on shared memory as a fast communication channel stops paying off: because of latency, it is far "cheaper" to send a request to perform some action to the node holding the data of interest than to ship that data over the bus. This is why cluster solutions are relevant for supercomputers and for multi-node systems in general.
This does not mean that the combination of multi-node systems and the typical Postgres shared-memory architecture should be written off. After all, if the postgres processes spend most of their time doing complex computations locally, this architecture can even be very efficient. In our situation, though, the client had already bought a powerful multi-node server, and we had to solve the PostgreSQL problems on it.
And the problems were serious: the simplest write queries (changing a few field values in one record) took from a few minutes to an hour. As was later confirmed, these problems showed up in all their glory precisely because of the large number of cores — and hence radical parallelism in query execution — combined with the relatively slow exchange between nodes.
The article therefore has a dual purpose:
- To share experience: what to do if the database slows down in earnest on a multi-node system. Where to start, how to diagnose, where to go.
- To show how problems of PostgreSQL itself can be solved at a high level of concurrency, including how a change to a lock algorithm affects PostgreSQL performance.
Server and DB
The system consisted of 8 blades with 2 sockets each: more than 300 cores in total (not counting hyper-threading). A fast bus (the manufacturer's proprietary interconnect) connects the blades. Not quite a supercomputer, but an impressive configuration for a single DBMS instance.
The load was also quite big: more than 1 terabyte of data, about 3000 transactions per second, over 1000 connections to postgres.
Starting to investigate the hour-long waits on writes, the first thing we ruled out was delays in writing to disk. As soon as the inexplicable delays appeared, tests were run exclusively on tmpfs. The picture did not change: the disk had nothing to do with it.
Collecting diagnoses: system views
Since the problems most likely arose from heavy contention between processes "knocking" on the same objects, the first thing to check is locks. PostgreSQL has the pg_locks view for such a check; and the second view, pg_stat_activity, gained information in version 9.6 about what a process is waiting for (Amit Kapila, Ildus Kurbangaliev) — the wait_event_type field. The possible values of this field are described here.
But first, just count:
postgres=# SELECT COUNT(*) FROM pg_locks;
 count
-------
 88453
(1 row)

postgres=# SELECT COUNT(*) FROM pg_stat_activity;
 count
-------
  1826
(1 row)

postgres=# SELECT COUNT(*) FROM pg_stat_activity WHERE state = 'active';
 count
-------
  1005
These are real numbers; at peak the count reached 200,000 locks.
Meanwhile, the following locks hung on the ill-fated query:

SELECT COUNT(mode), mode FROM pg_locks WHERE pid = 580707 GROUP BY mode;
 count |      mode
-------+-----------------
    93 | AccessShareLock
     1 | ExclusiveLock
When reading a buffer, the DBMS takes a share lock; when writing to it, an exclusive one. That is, write locks accounted for less than 1% of the total.
In the pg_locks view, the lock types do not always look the way they are described in the user documentation. Here is the mapping:

AccessShareLock     = LockTupleKeyShare
RowShareLock        = LockTupleShare
ExclusiveLock       = LockTupleNoKeyExclusive
AccessExclusiveLock = LockTupleExclusive
The SELECT mode FROM pg_locks query showed that 234 INSERTs were waiting for a CREATE INDEX (issued without the CONCURRENTLY option) and 390 INSERTs were waiting on buffer content lock. A possible solution is to "teach" INSERTs from different sessions to overlap less on buffers.
It's time to use perf
perf collects a lot of diagnostic information. In record mode it writes system event statistics to a file (./perf.data by default), and in report mode it analyzes the collected data. You can, for example, filter the events related to postgres:

$ perf record -u postgres
or
$ perf record -p 76876
and then, say,
$ perf report > ./my_results
As a result we will see something like this (the picture is omitted here). How to use perf to diagnose PostgreSQL is described, for example, here, as well as in the pg wiki.
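As a side note, the text output of perf report lends itself to quick ad-hoc aggregation. Below is a small sketch, not from the actual investigation: the sample lines and their exact layout are assumptions, since perf's output format varies by version and options.

```python
import re
from collections import defaultdict

# Typical "overhead  command  shared object  symbol" report lines.
PERF_LINE = re.compile(r"^\s*(\d+\.\d+)%\s+\S+\s+\S+\s+\[[.k]\]\s+(\S+)")

def symbol_overheads(report_lines):
    """Sum the overhead percentage reported for each symbol."""
    totals = defaultdict(float)
    for line in report_lines:
        m = PERF_LINE.match(line)
        if m:
            totals[m.group(2)] += float(m.group(1))
    return sorted(totals.items(), key=lambda kv: -kv[1])

# Invented sample lines in the spirit of what we saw: PinBuffer and
# LWLockAttemptLock near the top, plus time in kernel locks.
sample = [
    "    12.50%  postgres  postgres           [.] PinBuffer",
    "    10.20%  postgres  postgres           [.] LWLockAttemptLock",
    "     4.10%  postgres  [kernel.kallsyms]  [k] _raw_spin_lock",
    "     2.30%  postgres  postgres           [.] PinBuffer",
]
```

Sorting by summed overhead immediately surfaces the hottest functions across an otherwise noisy report.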
In our case even the simplest mode, perf top, gave important information — working, naturally, in the spirit of the operating system's top. With perf top we saw that the processor spends most of its time in kernel locks, as well as in the functions PinBuffer() and LWLockAttemptLock().

PinBuffer() is the function that increments the reference count of a buffer (the mapping of a data page into RAM), from which postgres processes know which buffers can be evicted and which cannot.

LWLockAttemptLock() is the function that takes an LWLock. An LWLock is a kind of lock with two levels, shared and exclusive, without deadlock detection; the locks are pre-allocated in shared memory, and waiting processes wait in a queue.
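To make the shape of this primitive concrete, here is a toy model in Python — a sketch under obvious simplifications (a condition variable instead of atomics and shared memory, and no queue discipline at all), not the real LWLock implementation:

```python
import threading

class ToyLWLock:
    """Two-level lock: any number of shared holders, or one exclusive."""

    def __init__(self):
        self._cond = threading.Condition()
        self._shared = 0          # number of shared holders
        self._exclusive = False   # is an exclusive holder present?

    def acquire_shared(self):
        with self._cond:
            while self._exclusive:                  # shared waits only for exclusive
                self._cond.wait()
            self._shared += 1

    def acquire_exclusive(self):
        with self._cond:
            while self._exclusive or self._shared:  # exclusive waits for everyone
                self._cond.wait()
            self._exclusive = True

    def release(self):
        with self._cond:
            if self._exclusive:
                self._exclusive = False
            else:
                self._shared -= 1
            self._cond.notify_all()
```

Note that acquire_shared never checks whether an exclusive waiter is already queued — a detail this article returns to below.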
These functions were already seriously optimized in PostgreSQL 9.5 and 9.6: the spinlocks inside them were replaced by direct use of atomic operations.
Flame graphs

It is impossible to do without them: even if they were useless, they would still be worth telling about — they are extraordinarily beautiful. But they are also useful. Here is an illustration from github, not from our case (neither we nor the client are ready to disclose the details yet).
These beautiful pictures show very clearly what the processor cycles are spent on. The data can be collected by the same perf, and the flame graph visualizes it lucidly, building trees from the collected call stacks. Details about profiling with flame graphs can be found, for example, here, and everything you need can be downloaded here.
In our case the flame graphs showed a huge amount of nestloop. Apparently, the JOINs of a large number of tables in numerous parallel read queries caused a large number of access share locks.
The statistics collected by perf show where processor cycles go. And although we saw that most of the CPU time was spent on locks, we did not see what exactly led to such long lock waits, because we could not see where the lock waits themselves occurred: waiting does not consume processor time.
In order to see the waits themselves, you can query the pg_stat_activity system view:

SELECT wait_event_type, wait_event, COUNT(*) FROM pg_stat_activity GROUP BY wait_event_type, wait_event;
LWLockTranche | buffer_content | UPDATE *************
LWLockTranche | buffer_content | INSERT INTO ********
LWLockTranche | buffer_content | \r
              |                | insert into B4_MUTEX
              |                | values (nextval('hib
              |                | returning ID
Lock          | relation       | INSERT INTO B4_*****
LWLockTranche | buffer_content | UPDATE *************
Lock          | relation       | INSERT INTO ********
LWLockTranche | buffer_mapping | INSERT INTO ********
LWLockTranche | buffer_content | \r
(the asterisks here simply replace details of the queries that we do not disclose).
The values buffer_content (locks on the contents of buffers) and buffer_mapping (locks on pieces of the buffer manager's hash table) are conspicuous.
Turning to GDB for help
But why are there so many waits on these lock types? For more detailed information about the waits we had to use the GDB debugger. With GDB we can obtain the call stacks of specific processes. By sampling — that is, collecting a certain number of random call stacks — we can get an idea of which stacks account for the longest waits.
Consider the process of gathering statistics. We will describe the "manual" collection, although in real life special scripts are used that do it automatically.
First, gdb needs to be attached to a PostgreSQL process. To do that, find the pid of a server process, say from

$ ps aux | grep postgres
Suppose we discovered:

postgres  2025  0.0  0.1 172428 1240 pts/17  S  Jul23  0:00 /usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data
and now we feed this pid to the debugger:

igor_le:~$ gdb -p 2025
Once inside the debugger, we type bt (or its synonym where) and get a pile of information like this:
(gdb) bt
#0  0x00007fbb65d01cd0 in __write_nocancel () from /lib64/libc.so.6
#1  0x00000000007c92f4 in write_pipe_chunks (
    data=0x110e6e8 "2018-06-01 15:35:38 MSK : [392-1] db=bp,user=bp,app=[unknown],client=192.168.70.163 LOG: relation 23554 new block 493: 248.389503\n2018-06-01 15:35:38 MSK : [393-1] db=bp,user=bp,app=["...,
    len=409, dest=dest@entry=1) at elog.c:3123
#2  0x00000000007cc07b in send_message_to_server_log (edata=0xc6ee60 <errordata>) at elog.c:3024
#3  EmitErrorReport () at elog.c:1479
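The automated collection mentioned earlier usually amounts to a "poor man's profiler": take many such backtraces and count identical stacks. Here is a sketch of the aggregation step; the parsing rules are simplifications of real gdb output, and the sample stacks in the test are invented.

```python
from collections import Counter

def stack_signature(bt_text):
    """Turn one `bt` output into a tuple of function names, innermost first."""
    frames = []
    for line in bt_text.splitlines():
        line = line.strip()
        if not line.startswith("#"):
            continue                       # skip wrapped argument lines
        parts = line.split()
        if "in" in parts:
            # '#1  0x007c92f4 in write_pipe_chunks (...' -> 'write_pipe_chunks'
            frames.append(parts[parts.index("in") + 1])
        elif len(parts) > 1:
            frames.append(parts[1])        # frames printed without 'in'
    return tuple(frames)

def hottest_stacks(samples, top=3):
    """Most frequent call stacks across all collected samples."""
    return Counter(stack_signature(s) for s in samples).most_common(top)
```

Since a sampled process is far more likely to be caught inside a long wait than a short one, the most frequent stacks point straight at the longest waits.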
Gathering statistics — call stacks from all postgres processes, collected repeatedly at different points in time — we saw that 3706 seconds (about an hour) were spent waiting for a buffer partition lock inside a relation extension lock: that is, for a lock on the buffer manager's hash table, needed to evict an old buffer and later replace it with a new one corresponding to the extended part of the table. A number of buffer content locks were also noticeable, corresponding to waits on B-tree index pages while performing inserts.
At first there were two explanations for such monstrous waiting times:

- Someone else took this LWLock and got stuck with it. But this is unlikely, because nothing complicated happens inside a buffer partition lock.
- We ran into some kind of pathological LWLock behavior: that is, although no one held the lock for too long, waiting for it stretched out unreasonably.
Diagnostic patches and treatment of trees
By reducing the number of simultaneous connections we would surely have thinned out the flow of lock requests. But that would have been like surrender. Instead, Alexander Korotkov, chief architect of Postgres Professional (who, of course, helped prepare this article), proposed a series of patches.
First of all, we needed a more detailed picture of the disaster. However good the off-the-shelf tools are, home-made diagnostic patches have their uses.
A patch was written adding detailed logging of the time spent on relation extension — of what happens inside the RelationAddExtraBlocks() function. Now we know how the time inside RelationAddExtraBlocks() is spent.

Another patch was written in support of it, reporting in pg_stat_activity what we are currently doing in relation extension: while a relation is being extended, application_name becomes RelationAddExtraBlocks. This process can now be conveniently analyzed in maximum detail with gdb.
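The patches themselves are C code inside PostgreSQL, but the idea of the timing patch is easy to sketch: wrap the phases of interest and accumulate where the time goes. The phase names below are invented for illustration.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(log, phase):
    # Accumulate wall-clock time spent inside the wrapped phase.
    start = time.monotonic()
    try:
        yield
    finally:
        log[phase] = log.get(phase, 0.0) + (time.monotonic() - start)

log = {}
with timed(log, "relation_extension"):        # the whole operation
    with timed(log, "victim_buffer_search"):  # a nested phase inside it
        time.sleep(0.01)                      # stand-in for real work
```

Nesting the timers is what makes the log answer "which part of relation extension eats the time", not just "relation extension is slow".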
Two actually therapeutic (rather than diagnostic) patches were written. The first changed the behavior of B-tree leaf locks: before, on an insert request, a leaf was locked as share and only afterwards got exclusive; now it takes exclusive right away. This patch has already been committed for PostgreSQL 12. Fortunately, this year Alexander Korotkov received committer status — the second PostgreSQL committer in Russia and the second in the company.
NUM_BUFFER_PARTITIONS was also increased from 128 to 512 to reduce the load on the mapping locks: the buffer manager's hash table was divided into smaller pieces, in the hope that the load on each individual piece would decrease.
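Schematically, what NUM_BUFFER_PARTITIONS changes can be shown like this. The hash function and the page-tag layout below are toy stand-ins, not PostgreSQL's actual code.

```python
from collections import Counter

NUM_BUFFER_PARTITIONS = 512   # raised from 128 by the patch

def page_tag_hash(relfilenode, block_no):
    # A toy 32-bit mix standing in for PostgreSQL's buffer-tag hash.
    h = (relfilenode * 0x9E3779B1 + block_no * 0x85EBCA77) & 0xFFFFFFFF
    return h ^ (h >> 16)

def buffer_partition(relfilenode, block_no, n=NUM_BUFFER_PARTITIONS):
    # Each partition of the mapping hash table is guarded by its own lock;
    # a page tag is hashed to pick which partition lock protects it.
    return page_tag_hash(relfilenode, block_no) % n

def max_partition_load(pages, n):
    # Worst-case number of pages sharing a single partition lock.
    return max(Counter(buffer_partition(r, b, n) for r, b in pages).values())

pages = [(23554, b) for b in range(10_000)]  # 10k blocks of one relation
```

On this toy model, raising n from 128 to 512 cuts the worst-case pages-per-lock roughly fourfold — which is exactly the hoped-for effect on contention.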
After this patch was applied, the locks on buffer contents were gone; yet despite the increase of NUM_BUFFER_PARTITIONS, buffer_mapping locks remained — that is, we recall, locks on the pieces of the buffer manager's hash table:
locks_count | active_session | buffer_content | buffer_mapping
------------+----------------+----------------+----------------
      12549 |           1218 |              0 |             15
And not many of them at that. The B-tree stopped being the bottleneck; relation extension came to the fore.
Treatment by conscience
Next, Alexander put forward the following hypothesis and solution:
- We spend a lot of time waiting on a buffer partition lock during buffer eviction. Perhaps on this buffer partition lock there sits some very popular page, for example the root of some B-tree, and into that spot flows a non-stop stream of shared lock requests from read queries.
- The wait queue in an LWLock is not fair. Since any number of shared locks can be held at the same time, if a shared lock is already taken, subsequent shared locks pass without queuing. Thus, if the stream of shared locks is intense enough that no "windows" appear between them, the wait for an exclusive lock stretches out practically to infinity.
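This hypothesis is easy to check on a napkin-level simulation (the tick counts and rates below are invented, not measured): an exclusive requester arrives at tick 0 while one shared lock is held, and new shared lockers keep arriving every `interval` ticks, each holding the lock for `hold` ticks.

```python
def exclusive_wait(interval, hold, horizon, fair):
    """Ticks an exclusive waiter spends before acquiring the lock."""
    ends = [hold]                           # one shared lock already granted at t=0
    for t in range(1, horizon):
        ends = [e for e in ends if e > t]   # shared holders release
        if not ends:
            return t                        # shared count hit zero: exclusive gets in
        if t % interval == 0 and not fair:
            ends.append(t + hold)           # unfair: a new shared joins the holders
        # fair: the new shared queues behind the exclusive instead
    return horizon                          # starved for the whole horizon
```

With dense shared traffic (an arrival every tick, each held for 3 ticks) the unfair queue starves the exclusive waiter to the end of the horizon, while the fair queue lets it in as soon as the current holders drain; with sparse traffic, "windows" appear and both behave the same.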
To fix this, a patch of "gentlemanly" lock behavior was proposed. It awakens the conscience of the shared lockers, and they honestly join the queue when an exclusive lock is already waiting there. (Interestingly, heavyweight locks, hwlocks, have no problems with conscience: they always honestly stand in the queue.)
locks_count | active_session | buffer_content | buffer_mapping | reladdextra | inserts>30sec
------------+----------------+----------------+----------------+-------------+--------------
     173985 |           1802 |              0 |            569 |           0 |             0
All is well! There are no more long inserts, though the locks on the pieces of the hash table remain. But what can you do — those are the properties of our little supercomputer's bus.
This patch was also proposed to the community. But whatever the fate of these patches in the community, nothing prevents them from getting into the next version of Postgres Pro Enterprise, which is designed precisely for customers with heavily loaded systems.
So, highly moral lightweight share locks — ones that let a waiting exclusive lock into the queue ahead of them — solved the problem of hour-long delays on a multi-node system. The buffer manager's hash table had been failing because of an excessive stream of share locks that left no chance for the locks needed to evict old buffers and load new ones. The problems with extending the buffers of the database tables were only a consequence of this. Before that, we had managed to clear the bottleneck of access to the B-tree root.
PostgreSQL was not designed for NUMA architectures and supercomputers. Adapting Postgres to such architectures is a huge job that would require (and may yet require) the coordinated efforts of many people and even companies. But the unpleasant consequences of these architectural problems can be mitigated. And we have to mitigate them: the kinds of load that led to the delays described here are quite typical, and we keep receiving similar distress signals from other places. Similar troubles showed up earlier on systems with fewer cores — the consequences were just not as monstrous, and the symptoms were treated differently, with other patches. Now one more medicine has appeared: not universal, but clearly useful.
So, when PostgreSQL treats the memory of the entire system as local, no high-speed bus between nodes can compare with local memory access times. The tasks that arise from this are hard, often urgent, but interesting — and the experience of solving them is useful not only to the solvers but to the whole community.