Silent Ruby Exceptions: A Transactional Rails/PostgreSQL Thriller

Original author: Andrey Novikov

This is a story about why you should never silence errors inside a database transaction. Learn how to use transactions properly and what to do when using them is not an option. Spoiler: it's about advisory locks in PostgreSQL!


I worked on a project in which users can import a large number of heavy entities (let's call them products) from an external service into our application. For each product, even more related data of various kinds is loaded from an external API. It is common for a user to need hundreds of products along with all their dependencies, so importing a single product takes considerable time (30-60 seconds) and the whole process can drag on. The user may get bored waiting for the result, and they have the right to click the “Cancel” button at any time; the application should remain usable with however many products have been downloaded by that moment.


“Interruptible import” is implemented as follows: at the beginning, a temporary task record is created in a database table for each product. Then a background import job is launched for each product: it downloads the product, saves it to the database along with all its dependencies (it does all the heavy lifting), and deletes its own task record at the very end. If the task record is no longer in the database when the background job starts, the job simply exits quietly. Thus, to cancel the import, it is enough to delete all the task records.
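To make the scheme concrete, here is a minimal sketch of that design; the class and method names (ProductImportJob, import_product) are hypothetical stand-ins rather than the project's real code:

class ProductImportJob < ApplicationJob
  def perform(task_id)
    task = Import::Task.find_by(id: task_id)
    return unless task # no task record => the import was canceled, exit quietly

    import_product(task) # download the product and all of its dependencies
    task.destroy         # delete the task record at the very end
  end
end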


It does not matter whether the import was canceled by the user or completed on its own: either way, the absence of task records means that everything is over and the user can start using the application.


The design is simple and reliable, but it had one small bug. A typical bug report about it sounded like this: “After the import is canceled, the user is shown the list of their products. However, if you refresh the page, the list gains a few more entries.” The reason for this behavior is simple: when the user pressed the Cancel button, they were immediately redirected to the list of all products, but the already started imports of individual products were still “catching up” in the background.


This is, of course, a trifle, but it puzzled users somewhat, so it would be nice to fix it. I had two options: somehow find and “kill” the already running jobs, or, when the Cancel button is pressed, wait for them to finish and “die a natural death” before sending the user onward. I chose the second way: wait.


Transactional locks to the rescue


For anyone working with (relational) databases, the answer is obvious: use transactions!


It is important to remember that in most RDBMSs, records updated within a transaction are locked and unavailable for modification by other processes until that transaction completes. Records selected with SELECT FOR UPDATE are locked in the same way.
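As a rough illustration of this blocking behavior (a sketch, assuming an Import::Task model and two concurrent sessions simulated with threads):

# Session A: lock the row and hold the transaction open for a while
Thread.new do
  Import::Task.transaction do
    Import::Task.lock.find(1) # SELECT … FOR UPDATE
    sleep 5                   # the row stays locked until COMMIT
  end
end

sleep 0.1 # let session A take the lock first

# Session B: this SELECT … FOR UPDATE blocks until session A's transaction completes
Import::Task.transaction do
  Import::Task.lock.find(1)
end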


This is exactly our case! I wrapped the import of each individual product in a transaction and locked the task record at the very beginning:


ActiveRecord::Base.transaction do
  task = Import::Task.lock.find_by(id: id) # SELECT … FOR UPDATE means "hold this record for me"
  return unless task # Someone has deleted it? Then there is nothing to do!
  # Do lots of heavy work here
  task.destroy
end

Now, when the user wants to cancel the import, the cancel operation deletes the task records of the imports that have not started yet and waits for the completion of those already in progress:


user.import_tasks.delete_all # here we wait for all already running imports to finish

Simple and elegant! I ran the tests, checked the import locally and on staging, and shipped it to production.


Not so fast…


Satisfied with my work, I was quite surprised to soon discover new bug reports and tons of errors in the logs. Many products were not imported at all. In some cases, only a single product remained after the entire import had finished.


The errors in the logs did not inspire confidence either: PG::InFailedSqlTransaction with backtraces leading to code that executed innocent SELECTs. What on earth was going on?


After a day of grueling debugging, I identified three main causes of the problems:


  1. Concurrent insertion of conflicting records into the database.
  2. Automatic abortion of transactions in PostgreSQL after errors.
  3. Silencing of problems (Ruby exceptions) in the application code.

Problem one: Concurrent insertion of conflicting records


Since each import operation takes up to a minute and there are many of these jobs, we run them in parallel to save time. The dependent records of different products can overlap, to the point where all of a user's products may refer to one single record that is created once and then reused.


There are checks in the application code to find and reuse identical dependencies, but now that we use transactions, these checks have become useless: if transaction A has created a dependent record but has not committed yet, transaction B cannot know about its existence and will try to create a duplicate.
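In ActiveRecord terms the broken check looks roughly like this (a sketch; the Dep model and its attributes are illustrative):

# Executed concurrently by two import jobs, each inside its own transaction:
dep = Dep.find_or_create_by(user_id: user_id, characteristics: chars)
# Transaction B's SELECT cannot see transaction A's uncommitted INSERT,
# so B also takes the "create" branch and violates the unique index.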


Problem two: Automatic abortion of transactions in PostgreSQL after errors


Of course, we had prevented the creation of duplicate records at the database level with the following DDL:


ALTER TABLE product_deps ADD UNIQUE (user_id, characteristics);

If transaction A is still running after inserting a new record, and transaction B tries in parallel to insert a record with the same values of the user_id and characteristics fields, transaction B will get an error:


BEGIN;
INSERT INTO product_deps (user_id, characteristics) VALUES (1, '{"same": "value"}');
-- This will now block until the first transaction finishes…
ERROR:  duplicate key value violates unique constraint "product_deps_user_id_characteristics_key"
DETAIL:  Key (user_id, characteristics)=(1, {"same": "value"}) already exists.
-- …and will raise an error once the first transaction has committed and the conflict has become apparent

But there is one detail we must not forget: transaction B will be automatically aborted as soon as the error is detected, and all the work done in it will be lost. Moreover, the transaction remains open in this “failed” state, and any attempt to execute even the most innocuous query in it will only return errors:


SELECT * FROM products;
ERROR:  current transaction is aborted, commands ignored until end of transaction block

Needless to say, nothing that was written to the database in this transaction will be saved:


COMMIT;  -- Even if we try to commit everything we have done,
ROLLBACK -- the RDBMS will simply reject our request and close the transaction without any changes

Problem three: Silencing problems


By this point it had become clear that simply adding transactions had broken the application. There was no way around it: I had to dive into the import code. There, the following pattern caught my eye rather often:


def process_stuff(data)
  # Magic, lots of magic
rescue StandardError
  nil # Happy debugging, suckers
end

The author of this code is effectively telling us: "We tried, it did not work out, but never mind, we'll carry on without it." And although the reasons for this choice may be perfectly understandable (not everything can be handled at the application level), this is exactly what makes any transaction-based logic impossible: the swallowed exception cannot bubble up to the transaction block and cannot trigger a proper rollback. (ActiveRecord catches all exceptions inside this block, rolls back the transaction, and re-raises them.)
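Put together, this is roughly how a silenced exception poisons the surrounding transaction (a condensed sketch reusing process_stuff from above; the Product query is illustrative):

ActiveRecord::Base.transaction do
  process_stuff(data) # hits a unique-violation error inside, swallows it, returns nil
  # PostgreSQL has already aborted the transaction, but no Ruby exception
  # surfaced, so ActiveRecord did not roll back and execution continues…
  Product.where(user_id: user.id).count
  # => PG::InFailedSqlTransaction: current transaction is aborted,
  #    commands ignored until end of transaction block
end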


Perfect storm


And here is how all three factors came together to create the perfect storm of a bug:


  • Inside a transaction, the application tries to insert a conflicting record into the database and triggers a "duplicate key" error from PostgreSQL. However, this error does not make the application roll back the transaction, because it is silenced somewhere deep in the code.
  • The transaction becomes invalid, but the application does not know this and carries on. On its next attempt to touch the database, the application receives another error, this time "current transaction is aborted", but this error can be swallowed too…
  • You have probably guessed it already: things keep breaking inside the application, but nobody learns about it until execution reaches the first place without an overly greedy rescue, where the error can finally surface and show up in the log, in the error tracker, anywhere. But this place is very far from the root cause of the failure, and that alone turns debugging into a nightmare.

Alternative to transactional locks in PostgreSQL


Hunting down every rescue in the application code and rewriting the entire import logic was not an option: it would take too long. I needed a quick fix, and PostgreSQL had one! It has a built-in alternative to locking records in transactions. Meet session-level advisory locks. I used them as follows.


First, I removed the wrapping transaction. It is a bad idea anyway to interact with external APIs (or perform any other side effects) from application code while a transaction is open: even if the transaction is rolled back along with all the changes in our database, the changes in the external systems will remain, and the application as a whole may end up in a strange and inconsistent state. The isolator gem can help you make sure that side effects are properly isolated from transactions.


Then, in each import operation, I take a shared lock on a key that is unique to the whole import (for example, derived from the user ID and a hash of the operation class name):


SELECT pg_advisory_lock_shared(42, user.id);

Shared locks on the same key can be held simultaneously by any number of sessions.
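In the application code this might look roughly as follows (a sketch; the key 42 and the helper name are illustrative). Note that session-level advisory locks are not released automatically, hence the ensure:

def with_shared_import_lock(user)
  ActiveRecord::Base.connection.execute("SELECT pg_advisory_lock_shared(42, #{user.id})")
  yield
ensure
  ActiveRecord::Base.connection.execute("SELECT pg_advisory_unlock_shared(42, #{user.id})")
end

# Each product import job wraps its heavy work in the shared lock:
with_shared_import_lock(user) { import_product(task) }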


The import cancellation operation, in turn, deletes all task records from the database and tries to take an exclusive lock on the same key. To get it, it will have to wait until all shared locks are released:


SELECT pg_advisory_lock(42, user.id);

And that's it! Now the “cancellation” waits until all the already running imports of individual products are finished.


Moreover, now that we are not bound by a transaction, we can use a small hack to limit how long the cancellation waits (in case some import gets stuck), because it is not nice to block a web server thread for too long (and make the user wait):


begin
  transaction do
    execute("SET LOCAL lock_timeout = '30s'")
    execute("SELECT pg_advisory_lock(42, #{user.id})")
  end
rescue ActiveRecord::LockWaitTimeout
  nil # we got tired of waiting (the transaction has already been rolled back by this point)
end

It is safe to catch the error outside the transaction block, because ActiveRecord will have already rolled back the transaction.


But what about concurrent insertion of identical records?


Unfortunately, I do not know of a solution that plays well with concurrent inserts. The following approaches exist, but they all block parallel inserts until the first transaction completes:


  • INSERT … ON CONFLICT UPDATE (available since PostgreSQL 9.5): the second transaction will block until the first one completes, and will then return the record inserted by the first transaction (see the sketch after this list).
  • Locking some common record in a transaction before running the checks for a new record: this way we wait until the record inserted by the other transaction becomes visible, so the uniqueness check can do its job.
  • Taking a common advisory lock: the effect is the same as with locking a common record.
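The sketch promised above: the first approach, expressed as raw SQL from Ruby (Rails 6+ also offers Model.upsert; the DO UPDATE clause is what makes RETURNING yield the surviving row even on conflict):

ActiveRecord::Base.connection.execute(<<~SQL)
  INSERT INTO product_deps (user_id, characteristics)
  VALUES (1, '{"same": "value"}')
  ON CONFLICT (user_id, characteristics)
  DO UPDATE SET characteristics = EXCLUDED.characteristics
  RETURNING id
SQL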

However, if you are not afraid of handling database-level errors, you can simply catch the uniqueness violation:


def import_all_the_things
  # Start your transaction here, not earlier
  Dep.create(user_id: user_id, characteristics: chars)
rescue ActiveRecord::RecordNotUnique
  retry
end

Just make sure that this code is not wrapped in a transaction.


Why do conflicting inserts block?

UNIQUE and EXCLUDE constraints block potentially conflicting writes before they are committed. For example, if you have a unique constraint on an integer column and one transaction inserts a row with the value 5, other transactions that also try to insert 5 will block, while transactions that try to insert 6 or 4 will complete immediately, without waiting. Since the minimum effective transaction isolation level in PostgreSQL is READ COMMITTED, a transaction is not allowed to see uncommitted changes from other transactions. Therefore an INSERT with a conflicting value can be neither accepted nor rejected until the first transaction either commits its changes (then the second one receives a uniqueness error) or rolls back (then the insert in the second transaction succeeds). Read more about this in the article by the author of EXCLUDE constraints.
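The same timeline, condensed into a sketch (assuming a hypothetical Thing model with a unique index on the integer column n):

t1 = Thread.new do
  Thing.transaction do
    Thing.create!(n: 5)
    sleep 5 # hold the uncommitted row
  end
end

sleep 0.1
Thing.create!(n: 6) # no potential conflict: completes immediately
Thing.create!(n: 5) # blocks; raises ActiveRecord::RecordNotUnique if t1 commits,
                    # succeeds if t1 rolls back
t1.join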

Preventing a catastrophe in the future


Now you know that not all code can be wrapped in a transaction. It would be good to make sure that nobody else repeats my mistake in the future by wrapping similar code in a transaction.


To do this, you can wrap all your operations in a small helper module that checks that no transaction is open before running the wrapped operation code (it is assumed here that all your operations share the same interface: the call method).


# This is how the helper module is declared
module NoTransactionAllowed
  class InTransactionError < RuntimeError; end

  def call(*)
    return super unless in_transaction?
    raise InTransactionError,
          "#{self.class.name} doesn't work reliably within a DB transaction"
  end

  def in_transaction?
    connection = ApplicationRecord.connection
    # service transactions (tests and database_cleaner) are not joinable
    connection.transaction_open? && connection.current_transaction.joinable?
  end
end

# And this is how it is used
class Deps::Import < BaseService
  prepend NoTransactionAllowed
  def call
    do_import
  rescue ActiveRecord::RecordNotUnique
    retry
  end
end

Now, if someone tries to wrap a dangerous service in a transaction, they will immediately get an error (unless, of course, they silence that one too).


Takeaways


The main lesson to learn: be careful with exceptions. Do not rescue everything; catch only the exceptions you know how to handle and let the rest reach the logs. Never suppress exceptions unless you are 100% sure why you are doing it. The earlier an error is noticed, the easier it is to debug.


And do not overdo it with database transactions. They are not a panacea. Use our isolator and after_commit_everywhere gems: they will help make your transactions foolproof.
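For instance, after_commit_everywhere lets you register side effects from anywhere in the code so that they run only once the current transaction commits; a minimal sketch based on the gem's README (NotifyUserJob is hypothetical):

class Deps::Import < BaseService
  include AfterCommitEverywhere

  def call
    ActiveRecord::Base.transaction do
      do_import
      # Executed only after the surrounding transaction commits;
      # discarded if it rolls back.
      after_commit { NotifyUserJob.perform_later(user.id) }
    end
  end
end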


What to read


Exceptional Ruby by Avdi Grimm. This little book will teach you how to handle exceptions in Ruby and how to properly design an exception system for your application.


Using Atomic Transactions to Power an Idempotent API by @brandur. His blog has many useful articles about application reliability, Ruby, and PostgreSQL.

