Reading postgresql: scandals, intrigues, investigations
I already talked about monitoring postgresql requests , at that moment it seemed to me that I completely figured out how postgresql works with various server resources.
With constant work with statistics on postgres requests, we began to notice some anomalies. I got to understand, at the same time I once again admired the comprehensibility of the source code of postgres)
Under kat a short story about the unobvious behavior of postgresql.
SELECTs "dirty" pages
That is, SELECT causes a modification of some records that postgres will write to disk.
I'll start with a brief explanation of the MVCC mechanism used by postgres to ensure transactional integrity.
All changes in the database occur during transactions, each transaction has an identification number txid (int32).
Postgres operates on table data in the form of the so-called tuple (tuple). Tuple carries directly the data of a particular line in the table, as well as metadata associated with this data:
Image: www.interdb.jp
xmin - the number of the transaction that created this tuple
xmax - the number of the transaction that marked this tuple as remote
- If we do an INSERT in a table, it creates a new tuple (xmin = txid)
- DELETE - marks dummies that match the condition as deleted (xmax = txid)
- UPDATE does conditionally DELETE + INSERT.
When we perform SELECT, in addition to directly searching and selecting data from the table, it also makes a visibility check.
Very simplified, some transaction with the number txid1 "sees" this tupl if the conditions are met:
xmin < txid1 < xmax
But changes in tuples occur immediately, and the transaction can take a long time, so during the visibility check, you need to make sure that transactions with xmin, xmax numbers have completed and if so, with what status. Postgres stores information about the current state of each transaction in CLOG (commit log).
Since checking the status of a large number of transactions in CLOG is quite expensive by resources, the developers decided to "cache" this information directly in the header of the tuple. That is, when some SELECT sees, for example, that xmin has completed, it saves it in the so-called hint bits - a structure on top of the infomask in which the transaction states xmin and xmax are written.
How does the change of dummies happen while reading, we figured out, it remains to remember what the "pages" are and why they are "dirty")
The fact is that working with data in memory and on disk is almost always more efficient than large blocks. Such a block in postgres is the "page", it contains a certain number of blueprints and meta-information about them. When we modify at least one page tuple, all of it is marked as "dirty", that is, it differs by state on the disk, and must be synchronized. Moreover, almost always the changes are also written to the WAL in order to be able to restore data integrity after the abnormal termination of the database process.
SELECT can cause synchronous write to disk
As you know, all work with data in pg is done through the buffer cache, if the necessary data is not there, the postgres will read it from the disk (using the OS page cache) and put it in the cache.
At the same time, if there is no place in the cache, the least demanded page is pushed out of it. And finally, if the preemptive candidate page is dirty, it should be written to disk at the same time.
FrozenTransactionId
At the beginning of the article, I mentioned that the transaction counter in postgres is 32-bit, that is, it is reset every ~ 2 billion transactions.
So that the visibility check does not turn into a pumpkin when resetting the transaction counter, there is a special process - wraparound vacuum.
Prior to version 9.4, this process replaced xmin in the dummy with the special value FrozenTransactionId = 2. A transaction with this number was considered older than any other transaction. With 9.4, a flag is simply put in the tuple that xmin is "frozen", and xmin itself remains unchanged.
For the most attentive: there is a special constant BootstrapTransactionId = 1, which is also older than all other transactions)
Total
Most cases of "strange" (in the philistine opinion) postgres behavior are caused by performance optimization.
While tinkering with postgres, I found a wonderful book "The Internals of PostgreSQL" , I recommend it to everyone who has not met before.