How do we catch deadlock `s on PostgreSQL and repair them

Foreword

Situation: there is a high-loaded meta-game for our tanks called the Global Map . Such a step by step nastolka for teams where battles take place in a real tank client. During peak hours on the map, several thousand clan leaders produce game actions: they attack each other, move divisions, buy, sell, ~~rob korovanov~~ . In addition, there are a dozen services that can also make changes to the game situation: they throw up money, fine, add players to the clan, and so on.

All this inevitably leads to deadlocks. So, I want to tell you a story about how we keep these periodic problems within acceptable limits.

A little about the internal structure of the backend

The main database is PostgreSQL 9.5.
The transaction isolation level is the standard default READ COMMITED.
ORM - SQLAlchemy.

Part 1: Monitoring

How Deadlock Appears

When Deadlock occurs, an exception of the following form falls: The first thing you should pay attention to is the line: HINT: See server log for query details. Indeed, if we look at the server logs, we will see the following for the same place: And then the specifics: And, finally, the query on which the error occurred:

ERROR: deadlock detected

DETAIL: Process 18293 waits for ShareLock on transaction 639; blocked by process 18254.

Process 18254 waits for ShareLock on transaction 640; blocked by process 18293.

HINT: See server log for query details.

CONTEXT: while updating tuple (0,9) in relation "users"

ERROR: deadlock detected

DETAIL: Process 18293 waits for ShareLock on transaction 639; blocked by process 18254.

      Process 18254 waits for ShareLock on transaction 640; blocked by process 18293.

      Process 18293: update users set balance = balance + 10 where id = 2;

      Process 18254: update users set balance = balance + 10 where id = 3;

HINT: See server log for query details.

CONTEXT: while updating tuple (0,9) in relation "users"

STATEMENT: update users set balance = balance + 10 where id = 2;

Query logging is not required to be enabled.

Cool. But the first global problem for any more or less serious project is that you do not have access to server logs due to security policy. Sometimes there is no access at all. And sometimes you can ask for a site, but you have to wait. Sometimes it's 30 minutes, sometimes a day.

And I would like to receive such information immediately. In particular, if you have a Sentry in the project, and the development team receives most of the errors right away.

Somehow you can not tweak the server so that it gives out such information to regular customers. Due to security policy developers base. But, if your user has normal access to the database, without any restrictions on the performance of service functions and without Row-Level security policies , you can still organize access to such information.

Hand grip

We can convert our classes to manually receive similar information. And even more. To do this, after catching the exception about deadlock, we need:

Temporarily do not roll back anything in the current connection with the database and do not touch anything there at all.
Create another connection and execute the simplest query:
```
SELECT * FROM pg_stat_activity;
```
The results are put in a repository accessible for developers. For example, send by Sentry as an error.

Example: implementation of collecting information on SQLAlchemy deadlock

db.py

from contextlib import contextmanager
from sqlalchemy import create_engine
from sqlalchemy.exc import OperationalError
engine = create_engine('postgresql+psycopg2://postgres:12345678@localhost/postgres')
deflog_pg_stat_activity():'''Log, write or send through Sentry pg_stat_activity'''
    debug_conn = engine.connect()
    for process in debug_conn.execute('''
        SELECT pid, application_name, state, query FROM pg_stat_activity;
    ''').fetchall():
        print(process)
@contextmanagerdefconnection():
    conn = engine.connect()
    try:
        yield conn
    except OperationalError as ex:
        log_pg_stat_activity()
        raise@contextmanagerdeftransaction():with connection() as conn:
        with conn.begin() as trans:
            try:
                yield conn
            except OperationalError as ex:
                if'deadlock detected'in ex.args[0]:
                    log_pg_stat_activity()
                    # Log exception
                    print(ex)
                    trans.rollback()
                else:
                    raise

deadlock_test.py

from multiprocessing import Process
from time import sleep
from threading import Thread
from sqlalchemy.orm import sessionmaker
from db import transaction
defprocess1():with transaction() as tran:
        tran.execute('UPDATE users SET balance = balance + 10 WHERE id = 3;')
        sleep(1)
        tran.execute('UPDATE users SET balance = balance + 10 WHERE id = 1 RETURNING pg_sleep(1);')
defprocess2():with transaction() as tran:
        tran.execute('UPDATE users SET balance = balance + 10 WHERE id = 1;')
        sleep(1)
        tran.execute('UPDATE users SET balance = balance + 10 WHERE id = 3 RETURNING pg_sleep(1);')
if __name__ == '__main__':
    p1 = Thread(target=process1)
    p2 = Thread(target=process2)
    p1.start()
    p2.start()
    sleep(4)

Or on github .

In this case, we have a high probability that we will see which particular query broke our transaction, calculating it from the PID and seeing the current query .

But it also happens that by calculating the connection by PID and looking at the query you can see at all the wrong query that gave us the deadlock, but some logic following it. After all, while you were catching an exception and opening the connection, the request we needed for catching could have ended. All we can do here is work through pgBouncer or its analogs to minimize connection time and use application_name .

application_name

Even if you received the request that caused the deadlock, you may still have difficulty understanding where in the logic it was called. And here comes the application_name field . By default, it is initialized with not very useful information, but it can be changed. And what if we write there the place where we started the transaction?

No sooner said than done!

Example: implementing the installation of application_name via SQLAlchemy

db.py

from traceback import extract_stack
@contextmanagerdeftransaction(application_name=None):with connection() as conn:
        if application_name isNone:
            caller = extract_stack()[-3]
            conn.execution_options(autocommit=True).execute("SET application_name = %s", '%s:%s' % (caller[0], caller[1]))
        with conn.begin() as trans:
            try:
                yield conn
            except OperationalError as ex:
                if'deadlock detected'in ex.args[0]:
                    log_pg_stat_activity()
                    # Log exception
                    print(ex)
                    trans.rollback()
                else:
                    raise

Or on github .

Voila Now you can quickly open files in the right places and watch the code.

	pid	application_name	state	query
one	8613	deadlock_test.py:10	idle in transaction (aborted)	UPDATE users SET balance = balance + 10 WHERE id = 1 RETURNING pg_sleep (1);
2	8614	deadlock_test.py:17	active	UPDATE users SET balance = balance + 10 WHERE id = 3 RETURNING pg_sleep (1);
3	8617		active	SELECT pid, application_name, state, query FROM pg_stat_activity;

We think about server logs

All this magic described above is good, but theoretically it might not work. Sometimes you still can not do without server logs, so you need to take two more steps:

Discuss a clear procedure for obtaining the necessary parts of server logs at a reasonable time with interested parties.
Do them in the format you want by changing the log_line_prefix in postgresql.conf . On the development machine for example can be as follows: log_line_prefix = 'APP:%a PID:%p TR:%x '.

Part 2: How to deal with deadlocks

Typically, these sections provide links to documentation about transaction isolation levels, types of locks, and offer to think and analyze in the context of your application. I'll do it too, but later. But first, I’ll just describe how we do it most often, for it has already happened that the deadlocks are very similar to each other.

Several practices of avoiding deadlock `s

Frequent case number 1: Classic deadlock

Our most frequent case is as follows:

A game action occurs, affecting several players at once.
In this case, players can often perform several actions at the same time. For example, we have: large clans have many units involved in the battles.
When you win or lose points are added to the clan. And in parallel such charges can be very much.

That is, we have a situation when the heels of the results of battles came, their processing was parallelized, while in parallel processes, it happens that the same players are encountered.

All that can be done here is either to build charges in a chain, but this is slow, or to let charges fall and try to charge a bit later.

Frequent case number 2: Himself angry Pinocchio (SSZB)

We have a hiking game. Once in a turn, the players' balance is recalculated, given the large number of game actions they have committed. At the time of the balance change, we blocked other changes through SELECT ... FOR UPDATE . Although we did not block everyone at once, but chunks of 100, anyway, we sometimes went into deadlock with a process that accrues bonuses for a fight that does not stop for the duration of the course calculation.

So, it turned out that we were wrong. SELECT ... FOR UPDATE is too powerful a lock, necessary only if 2 conditions are met:

The conditional id of the current table is used as a foreign key in another.
The same conditional id can be changed / deleted as a result of further actions.

Take an example:

--P1: BEGIN;
--P2: BEGIN;
--P1: SELECTidFROM clans WHEREid=1FORUPDATE;
--P2: INSERTINTOusers(clan_id, name) VALUES(1, 'Alpha');

P2 in this situation will hang, because we give the DBMS to understand that the record with id = 1 may cease to exist. However, in P1 we do not do anything like that, just want to protect the balance of the clan from changes. Therefore, when we changed FOR UPDATE to FOR NO KEY UPDATE , we stopped catching deadlocks.

Bonus number 1

SELECT ... FOR UPDATE is explicitly called in the example above. But you will get a similar effect if you affect your changes with a unique key referenced by a foreign key from other tables. And any UPDATE that does not affect such changes with such keys will cause a lock similar to SELECT ... FOR NO KEY UPDATE . I recommend that you familiarize yourself with these features in the article “Explicit locks” in the list of references below.

Bonus number 2

Let's return to another interesting detail from the original error:

CONTEXT: while updating tuple (0,9) in relation "users"

What kind of stupid you ask? This is the physical address of the line in the table that caused the conflict. The fact is that in each table there are service fields that are not selected by the SELECT * query . However, it is necessary to clearly indicate for example ctid among the fields, as we will see this very stupid:

SELECT ctid, id, nickname, balance FROM public.users;

There is little benefit from it in the case of deadlock, because an unlocked process will most likely update the conflicting line, and this ctid will change (since any UPDATE in PostgreSQL is actually INSERT, and the old line is marked as invisible and later removed by auto-vacuum). But it is worth knowing that someday it will come in handy.

What to read

Deadlocks in PostgreSQL - description of all types of locks
Explicit locking or Explicit locking (in Russian). More on blocking.
Lock Monitoring - scripts showing how to monitor
System Columns locks - about hidden service fields in tables
Introduction to PostgreSQL physical storage - about physical data storage in PostgreSQL

Tags: