Intern Vasya and his stories about API idempotency

Idempotency sounds complicated and is rarely talked about, but it concerns every application that uses an API in its work.

My name is Denis Isaev, and I lead one of the backend groups at Yandex.Taxi. Today I will share with Habr readers the problems that can arise if you do not account for idempotency in the distributed systems of your project. To do this, I chose the format of fictional stories about the intern Vasya, who is just learning to work with APIs. It is more vivid and more useful that way. Let's go.

About the API

Vasya was developing a taxi-ordering application from scratch and was given the task of building an API for ordering a car. He worked day and night and implemented an API of the form POST /v1/orders:

{
  "from": "Moscow, Sadovnicheskaya Embankment 82 bldg 2",
  "to": "Vnukovo Airport"
}

When it came time to build an API for returning active orders, Vasya wondered: might a user ever need to order several taxis at once? The managers answered that no, such a feature was not needed. Nevertheless, he made the API return a list of active orders in the general form, GET /v1/orders:

{
  "orders": [
    {
      "id": 1,
      "from": "Moscow, Sadovnicheskaya Embankment 82 bldg 2",
      "to": "Vnukovo Airport"
    }
  ]
}

In the mobile application, the programmer Fedya integrated the server API as follows:

when the application starts, we call GET /v1/orders; if we receive an active order, we draw its state in the UI;
when the user taps the "order a taxi" button, we call POST /v1/orders with the data the user entered;
if any server or network error occurs, we show an error message and do nothing more.

As expected, autotests were written for the server code and the application code, and before release the mobile application was tested manually for two days. Testing found a number of bugs, which were quickly fixed. The application was successfully launched and backed by an advertising campaign. Users left several positive reviews, thanked the developers, and asked for new features. The development team and the managers celebrated the successful launch and went home.

Button lock

At 8 a.m., Vasya was woken by a call from support: two users complained that two cars had arrived instead of one, and that they had been charged for both. After quickly making coffee, Vasya sat down at his laptop, connected via VPN, and started digging through logs, graphs, and code. In the logs, he found that each of these users had sent two identical requests a few seconds apart. On the graphs, he saw that at 7 a.m. the database had started to slow down, and write queries had begun taking seconds instead of milliseconds. By then the cause of the slow queries had already been found and eliminated, but there was no guarantee that it would not happen again someday. And then it dawned on him: the application did not lock the “order a taxi” button after sending the request, and when requests slowed down, users pressed the button again, thinking that the first press had not registered.

The application began locking the button; the fix was released within a few days. But for a few more weeks the team kept receiving such complaints and asking users to update the application.

In the underpass

Another similar complaint came in, and support answered out of habit: “update the application”. But the user replied that he already had the latest version. Vasya and Fedya were pulled off their current features and asked to figure out how this could be, since the bug had already been fixed.

After two days of digging into this isolated case, they found out what was going on. It turned out that locking the button was not enough: the user had tried to order a taxi while in an underpass. His mobile internet was barely working: when he tapped the order button, the request reached the server, but the response never came back. The application showed the message "an error occurred" and unlocked the order button. Who would have thought that such a request could have been executed successfully on the server, and that the taxi driver was already on the way?

They chose to fix it on the server, since that could be done the same day without waiting for a lengthy application rollout. Vasya picked one of several options: before creating an order, the server selects the user's orders from the database with the same from and to parameters over the past 5 minutes. If such an order is found, the server returns a 500 error. Vasya wrote autotests and, by chance, ran them in parallel: one of the tests failed. Vasya realized that with parallel requests from one user there is a race between the select and the insert. From the bugs that had already happened, Vasya knew that the network could “blink” and the database could slow down, widening the race window, so the case was quite real. How to fix it properly was not clear.
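The race between the select and the insert can be reproduced deterministically. The following is a minimal sketch, not the project's code: it assumes an in-memory SQLite table with hypothetical column names src and dst, leaves out the 5-minute window for brevity, and interleaves two requests by hand so that both checks run before either insert.

```python
import sqlite3

# Hypothetical schema for illustration; the article does not show the real one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, src TEXT, dst TEXT)"
)


def has_duplicate(user_id, src, dst):
    """Step 1 of the naive fix: look for an existing identical order."""
    row = conn.execute(
        "SELECT 1 FROM orders WHERE user_id = ? AND src = ? AND dst = ?",
        (user_id, src, dst),
    ).fetchone()
    return row is not None


def insert_order(user_id, src, dst):
    """Step 2 of the naive fix: create the order."""
    with conn:
        conn.execute(
            "INSERT INTO orders (user_id, src, dst) VALUES (?, ?, ?)",
            (user_id, src, dst),
        )


# Two parallel requests from the same user, interleaved by hand:
# both run the check before either one inserts.
request_a_sees_duplicate = has_duplicate(1, "home", "airport")
request_b_sees_duplicate = has_duplicate(1, "home", "airport")
if not request_a_sees_duplicate:
    insert_order(1, "home", "airport")
if not request_b_sees_duplicate:
    insert_order(1, "home", "airport")

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(order_count)  # 2: both requests passed the check
```

The check and the insert are two separate steps, so nothing stops a second request from slipping in between them. A slow database only makes that gap wider.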

Limits on the number of active orders

On the advice of a more experienced programmer, Vasya looked at the problem from the other side and got around the race with the following algorithm:

start a transaction;
run UPDATE active_orders SET n=1 WHERE user_id={user_id} AND n=0;
if the update changed 0 rows, return HTTP code 409;
insert the order object into another table;
commit the transaction.
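The steps above can be sketched with SQLite from the Python standard library. This is an illustration under assumptions, not the production code: the orders table layout, the src and dst column names, and the idea that every user already has an active_orders row with n = 0 are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE active_orders (user_id INTEGER PRIMARY KEY, n INTEGER NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, src TEXT, dst TEXT);
    -- Assumption: every user gets a row with n = 0 in advance.
    INSERT INTO active_orders (user_id, n) VALUES (1, 0);
    """
)


def create_order(user_id, src, dst):
    """Return 200 if the order was created, 409 if the user already has one."""
    with conn:  # one transaction: commits on normal exit, rolls back on error
        cur = conn.execute(
            "UPDATE active_orders SET n = 1 WHERE user_id = ? AND n = 0",
            (user_id,),
        )
        if cur.rowcount == 0:
            # The slot was already taken by an earlier or parallel request.
            return 409
        conn.execute(
            "INSERT INTO orders (user_id, src, dst) VALUES (?, ?, ?)",
            (user_id, src, dst),
        )
    return 200


first = create_order(1, "home", "airport")
second = create_order(1, "home", "airport")
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(first, second, order_count)
```

The conditional UPDATE does the check and the claim in a single statement, so only one of several parallel requests can flip n from 0 to 1. The others see zero changed rows and get 409.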

On receiving a 409 response code, the application re-requested the list of active orders. The server fix was released the same day, the duplicates stopped, and once the application update rolled out, users stopped seeing errors. Vasya and Fedya returned to their features.

Multi-order

A month passed, and a new manager came to Vasya: how many days would it take to build a “multi-order” feature, so that a user can order two taxis? Vasya was surprised: how so, I asked, and you told me it would not be needed?! Vasya said that it would not be quick. The manager was surprised: isn't it just raising the limit from 1 to 2? But multi-order completely broke Vasya's duplicate-protection scheme. Vasya had no idea how to solve this without reintroducing duplicates.

Idempotency key

Vasya decided to find out how others deal with such problems and stumbled upon the concept of idempotency. An API method is called idempotent if calling it repeatedly does not change the state any further. There is a subtle point here: the response of an idempotent call may change. For example, when the idempotent order-creation API is called a second time, the order is not created again, but the API may answer either 200 or 400. With either response code the API is idempotent in terms of server state (there is only one order, and nothing happens to it), but from the client's point of view the behavior differs significantly.

Vasya also learned that the HTTP methods GET, PUT, and DELETE are formally considered idempotent, while POST and PATCH are not. This does not mean that you cannot make GET non-idempotent or POST idempotent. But many programs rely on this convention: for example, proxies may decline to retry POST and PATCH requests on errors, while they may retry GET and PUT.

Vasya decided to look at examples and came across the concept of an idempotency key in some public APIs.

Yandex.Kassa allows clients to send, together with formally non-idempotent (POST) requests, an Idempotency-Key header with a unique key generated on the client. It recommends using UUID v4. Stripe likewise accepts an Idempotency-Key header with a client-generated unique key on formally non-idempotent (POST) requests; keys are stored for 24 hours. Among non-payment systems, Vasya found client tokens in AWS.

Vasya added a new required field, idempotency_key, to the POST /v1/orders request, and the request became:

{
  "from": "Moscow, Sadovnicheskaya Embankment 82 bldg 2",
  "to": "Vnukovo Airport",
  "idempotency_key": "786706b8-ed80-443a-80f6-ea1fa8cc1b51"
}

The application began generating the idempotency key as a UUID v4 and sending it to the server. On repeated attempts to create the order, the application sends the same idempotency key. On the server, the key is inserted into a database column with a unique constraint. If the constraint rejected the insert, the code detected this and returned a 409 error. On Fedya's advice, this was reworked to simplify the application: the server began returning 200 instead of 409, as if the order had just been created successfully, so clients did not need to learn to handle the 409 code.
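A minimal sketch of this scheme, assuming SQLite and a hypothetical orders table (the real schema is not shown in the article): the UNIQUE constraint rejects the duplicate insert, and the handler answers 200 with the id of the order that already exists.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        src TEXT NOT NULL,
        dst TEXT NOT NULL,
        idempotency_key TEXT NOT NULL UNIQUE
    )
    """
)


def create_order(user_id, src, dst, idempotency_key):
    """Return (status, order_id); a retry with the same key gets the same order."""
    try:
        with conn:
            cur = conn.execute(
                "INSERT INTO orders (user_id, src, dst, idempotency_key) "
                "VALUES (?, ?, ?, ?)",
                (user_id, src, dst, idempotency_key),
            )
            return 200, cur.lastrowid
    except sqlite3.IntegrityError:
        # The unique constraint fired: this key was already used.
        row = conn.execute(
            "SELECT id FROM orders WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return 200, row[0]


key = str(uuid.uuid4())  # generated once on the client, reused on every retry
first = create_order(1, "home", "airport", key)
retry = create_order(1, "home", "airport", key)
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(first == retry, order_count)
```

Leaving the duplicate detection to the database constraint avoids the select-then-insert race: the insert itself is the check.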

Test bug

After that, the limit was simply raised from 1 to 2, and the change was supported in the application. While testing the application, they found the following bug:

the user wants to create an order, the request reaches the server, and the order is created, but the testers emulate a network error and the application does not receive the response;
the user sees an error message, for some reason changes the destination, and only then taps the order button again;
the application does not change the idempotency key between the requests;
the server finds that an order with this idempotency key already exists and returns 200;
the order exists on the server with the old destination, while the user believes it was created with the new one, and ends up going to the wrong place.

First, Vasya suggested that Fedya generate a new idempotency key in this case. But Fedya explained that this could produce a duplicate: after a network error on the create-order request, the application cannot know whether the order was actually created.

Fedya noted that, although it is not a solution, for early detection of such bugs the server should check that the parameters of an incoming request match the parameters of the existing order with the same idempotency key. For example, AWS returns an IdempotentParameterMismatch error in this case.

In the end, the two of them came up with the following solution: the application does not let the user change the order parameters and keeps retrying to create the order for as long as it receives 5xx response codes or network errors. Vasya added the server-side validation proposed by Fedya.
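The server-side validation might look roughly like the sketch below. It assumes SQLite and a hypothetical orders table. The 422 status for a parameter mismatch is a choice made for this example only, since the article does not say which code the server returns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        src TEXT NOT NULL,
        dst TEXT NOT NULL,
        idempotency_key TEXT NOT NULL UNIQUE
    )
    """
)

PARAMETER_MISMATCH = 422  # hypothetical status code for this sketch


def create_order(user_id, src, dst, idempotency_key):
    """Create an order; on a retry, verify that the parameters are unchanged."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO orders (user_id, src, dst, idempotency_key) "
                "VALUES (?, ?, ?, ?)",
                (user_id, src, dst, idempotency_key),
            )
        return 200
    except sqlite3.IntegrityError:
        stored = conn.execute(
            "SELECT user_id, src, dst FROM orders WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        if stored != (user_id, src, dst):
            # Same key, different order: the client has a bug.
            return PARAMETER_MISMATCH
        return 200


created = create_order(1, "home", "airport", "key-1")
retried = create_order(1, "home", "airport", "key-1")
changed = create_order(1, "home", "train station", "key-1")
print(created, retried, changed)
```

The mismatch response does not fix the user's situation by itself. It only makes the client bug visible instead of silently answering 200 for an order with different parameters.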

Useful code review

Two problematic scenarios were found during code review of the implemented solution.

Scenario 1: two taxis

the application sends a request to create an order; for some reason the request runs for tens of seconds, and the order is created slowly;
the user cannot do anything in the application and the taxi is not ordered, so he decides to kill the application entirely;
the user reopens the application, it calls GET /v1/orders and does not receive the order that is still being created, since it has not been fully created yet;
the user thinks the application is buggy and places the order again; this time the order is created quickly;
the first request finally gets unstuck, and its order is created as well;
two taxis arrive for the passenger.

Scenario 2: a canceled taxi arrived

the application sends a request to create an order and the order is created, but the mobile network lags and the application does not receive the response about the successful creation;
the dispatcher, or the user himself via push, cancels the order for some reason; cancellation is implemented as deleting the row from the database table;
the application sends a repeated request to create the order: it succeeds and another order is created, because the idempotency key stored with the previous order no longer exists in the table.

Vasya and Fedya considered simple options for fixing both problems:

Scenario 1: the application persists all orders that are currently being created, even across restarts. Right after launch it displays them in the interface and keeps trying to create them, provided that not too much time has passed since creation started.
Scenario 2: switch from deleting rows in the orders table to setting the field deleted_at = now(), the so-called soft delete. Then the uniqueness constraint on the idempotency key would also work for canceled orders.
Scenario 2, alternative: separate the abstraction that ensures request idempotency from the resource abstraction, and store used idempotency keys separately from the resource for a limited time, for example 24 hours.
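The last option, keeping used idempotency keys apart from the orders themselves, can be sketched in a few lines. Everything here is hypothetical: the IdempotencyKeyStore class, the in-memory dictionaries standing in for real storage, and the explicit now argument that replaces the clock so the example stays deterministic. A real implementation would also need the lookup and the write to be atomic.

```python
KEY_TTL_SECONDS = 24 * 60 * 60  # keys are remembered for 24 hours


class IdempotencyKeyStore:
    """Remembers the response for each key, independently of the resource."""

    def __init__(self):
        self._seen = {}  # key -> (stored_at, response)

    def get(self, key, now):
        entry = self._seen.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if now - stored_at > KEY_TTL_SECONDS:
            del self._seen[key]  # expired: the key may be used again
            return None
        return response

    def put(self, key, response, now):
        self._seen[key] = (now, response)


orders = {}
keys = IdempotencyKeyStore()


def create_order(order_id, key, now):
    cached = keys.get(key, now)
    if cached is not None:
        return cached  # a retry: replay the stored response, create nothing
    orders[order_id] = {"id": order_id}
    response = (200, order_id)
    keys.put(key, response, now)
    return response


def cancel_order(order_id):
    orders.pop(order_id, None)  # hard delete; the key store is not touched


first = create_order(1, "key-1", now=0)
cancel_order(1)
retry_after_cancel = create_order(2, "key-1", now=60)
print(first, retry_after_cancel, orders)
```

Because the key outlives the deleted row, the late retry from the canceled-taxi scenario replays the original response instead of creating a second order.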

But senior colleagues proposed a more general solution: version the state of the order list. The GET /v1/orders API would return a version of the order list. This is a version of the user's entire order list, not of a specific order. When creating an order, the application passes the version it knows in a separate field or in the If-Match header. The server increments the version atomically with every change to the orders (create, cancel, edit). In other words, in its request the application tells the server which state of the orders it knows. If that state (version) differs from what is stored on the server, the server returns an error: “orders were changed in parallel, reload order information”. Versioning solves both problems found, and that is what Vasya and Fedya implemented. It is also worth noting that the version can be either a number (the number of the last change) or a hash of the order list: for example, the fingerprint parameter in the Google Cloud API for modifying instance tags works this way.
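A rough sketch of list versioning with a numeric version, assuming SQLite and a hypothetical order_lists table that holds one version per user. Returning 412 Precondition Failed, the standard status for a failed If-Match, is a choice of this sketch; the article only gives the error text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE order_lists (user_id INTEGER PRIMARY KEY, version INTEGER NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, src TEXT, dst TEXT);
    INSERT INTO order_lists (user_id, version) VALUES (1, 0);
    """
)


def get_orders(user_id):
    """GET /v1/orders: return the list version together with the orders."""
    version = conn.execute(
        "SELECT version FROM order_lists WHERE user_id = ?", (user_id,)
    ).fetchone()[0]
    rows = conn.execute(
        "SELECT id, src, dst FROM orders WHERE user_id = ? ORDER BY id", (user_id,)
    ).fetchall()
    return version, rows


def create_order(user_id, src, dst, known_version):
    """POST /v1/orders with the version the client saw (for example via If-Match)."""
    with conn:  # the version bump and the insert commit together
        cur = conn.execute(
            "UPDATE order_lists SET version = version + 1 "
            "WHERE user_id = ? AND version = ?",
            (user_id, known_version),
        )
        if cur.rowcount == 0:
            return 412  # orders were changed in parallel, reload them
        conn.execute(
            "INSERT INTO orders (user_id, src, dst) VALUES (?, ?, ?)",
            (user_id, src, dst),
        )
    return 200


version, _ = get_orders(1)
first = create_order(1, "home", "airport", version)
stale = create_order(1, "home", "airport", version)  # still carries the old version
new_version, current_orders = get_orders(1)
print(first, stale, new_version, len(current_orders))
```

Any create, cancel, or edit would bump the version the same way, so a request built on an outdated view of the list is rejected no matter which change made it outdated.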

Time to draw conclusions

Looking back at all the rework, Vasya realized that any API that creates resources must be idempotent. In addition, it is important to synchronize the client's and server's knowledge of the resource list by versioning that list.

Idempotency of deletion

One day Vasya received a Telegram notification that the API had returned a 404 response code. From the logs, Vasya found that it had happened in the cancellation API.

Order cancellation was done through a DELETE /v1/orders/:id request. Internally, the order row was simply deleted. There had been no need for soft delete (setting deleted_at = now()).

In this case, the application sent the first cancellation request but did not receive a response. Without notifying the user, the application immediately retried and received 404: the first request had already completed and deleted the order. The user saw the message “unknown server error”.

It turns out that not only creation but also deletion of resources should be idempotent, thought Vasya.

Vasya considered always returning 200, even if the DELETE query deleted nothing in the database. But that risked hiding real problems and letting them slip through. So he decided to implement soft delete and rework the cancellation API:

the server began selecting all orders with the given id from the database, including already canceled ones;
if the order had already been deleted within the last n minutes (that is, on ordinary retries), the server began returning 200;
in all other cases, the server returns 410 with the error "order does not exist". Along the way Vasya decided to replace 404 with 410 as the better fit: 404 does not say whether the resource is missing temporarily or permanently, so the request might be worth repeating later, whereas 410 means the resource is gone for good and retrying will produce the same result.

No more such cancellation problems came up.
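The reworked cancellation logic can be sketched as follows, assuming SQLite, a deleted_at column stored as a Unix timestamp, and a hypothetical 5-minute value for the "last n minutes" window. The current time is passed in as an argument so that the example is deterministic.

```python
import sqlite3

RETRY_WINDOW_SECONDS = 5 * 60  # hypothetical value for "n minutes"

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, deleted_at REAL);
    INSERT INTO orders (id) VALUES (1);
    """
)


def cancel_order(order_id, now):
    """DELETE /v1/orders/:id with soft delete; returns an HTTP status code."""
    with conn:
        row = conn.execute(
            "SELECT deleted_at FROM orders WHERE id = ?", (order_id,)
        ).fetchone()
        if row is None:
            return 410  # the order never existed
        deleted_at = row[0]
        if deleted_at is None:
            conn.execute(
                "UPDATE orders SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
                (now, order_id),
            )
            return 200  # first successful cancellation
        if now - deleted_at <= RETRY_WINDOW_SECONDS:
            return 200  # an ordinary retry of a recent cancellation
        return 410  # canceled long ago: report it instead of hiding it


first = cancel_order(1, now=1000.0)
retry = cancel_order(1, now=1010.0)
late = cancel_order(1, now=1000.0 + RETRY_WINDOW_SECONDS + 1)
missing = cancel_order(2, now=1000.0)
print(first, retry, late, missing)
```

The retry window keeps ordinary retries quiet while still surfacing requests that target an order canceled long ago, which was the reason for not answering 200 unconditionally.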

Idempotency of change

Changing point B

Vasya decided to check in the code whether the trip-change API was idempotent: he had already realized that absolutely every API should be idempotent.

In the application, the passenger can change point B. In this case, a PATCH /v1/orders/:id request is sent:

{
  "to": "new destination"
}

The server simply executes an update against the database:

UPDATE orders SET to={to} WHERE id={id}

It cannot get any more idempotent, Vasya thought, and he was right. He just did not take into account that parallel changes and read/modify sequences can race, but that is a completely different story.

Is it worth fixing

Vasya also checked the trip-completion API: the driver application calls it when the driver has completed the order. On the server, the API marks the order as completed and performs a number of actions, including calculating statistics. Among those statistics, Vasya's eye fell on the per-user count of completed orders. When the API was called, the counter of completed orders was incremented by a query of the form

UPDATE user_counters SET orders_finished = {orders_finished+1} WHERE user_id={user_id}

It became clear that with repeated calls to the API, the counter may increase by more than 1.

Vasya wondered: why do we need a counter at all, if the total number of such orders can be computed from the database each time? A colleague told him that, first, old orders are moved to a separate storage, and second, the counter is used in heavily loaded APIs, where it is important not to make extra database queries.

Vasya created a ticket in the task tracker to rework the counter calculation according to the following algorithm:

when an order is created, the counter does not change in any way;
a new task type is added to the task queue: it fetches all of the user's orders from both storages, calculates the number of completed orders, and saves it to the database;
the task is enqueued from the trip-completion API: if the API is called again, in the worst case the task runs several times, which is harmless.

After half an hour Vasya's manager asked him: why do this? After a short discussion, they agreed that a rare discrepancy in the counters was acceptable, and that reworking the scheme for an exact calculation of the metric was not worth it for the business at this stage.
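The idea behind this algorithm is that recomputing a value is idempotent while incrementing it is not. The sketch below illustrates that under simplifying assumptions: SQLite, a single orders table instead of two storages, a hypothetical status column, and a direct function call standing in for the task queue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT);
    CREATE TABLE user_counters (user_id INTEGER PRIMARY KEY, orders_finished INTEGER NOT NULL);
    INSERT INTO orders (id, user_id, status) VALUES (10, 1, 'active');
    INSERT INTO user_counters (user_id, orders_finished) VALUES (1, 0);
    """
)


def recalculate_orders_finished(user_id):
    """The queued task: recompute the counter from the orders themselves."""
    with conn:
        conn.execute(
            "UPDATE user_counters SET orders_finished = "
            "(SELECT COUNT(*) FROM orders WHERE user_id = ? AND status = 'finished') "
            "WHERE user_id = ?",
            (user_id, user_id),
        )


def finish_order(order_id):
    """The trip-completion API: mark the order, then schedule the recalculation."""
    with conn:
        conn.execute(
            "UPDATE orders SET status = 'finished' WHERE id = ?", (order_id,)
        )
    user_id = conn.execute(
        "SELECT user_id FROM orders WHERE id = ?", (order_id,)
    ).fetchone()[0]
    recalculate_orders_finished(user_id)  # stands in for enqueueing the task


finish_order(10)
finish_order(10)  # the driver application retries the call
orders_finished = conn.execute(
    "SELECT orders_finished FROM user_counters WHERE user_id = 1"
).fetchone()[0]
print(orders_finished)  # 1, not 2
```

Running the recalculation twice writes the same number twice, which is why repeated execution of the task is harmless.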

Checked everything

As a responsible intern developer, Vasya checked all the places where the API might not be idempotent. But did he really check everything he needed to?

Idempotency in external operations

Duplicate SMS

In the middle of the working day, a worried manager runs up to Vasya's desk: a media personality has written an angry Facebook post about our taxi application flooding him with dozens of identical SMS messages. We need to respond immediately; the post has already collected hundreds of likes.

Vasya carefully examined the SMS-sending code: first a task was put in the queue, and then, when the task ran, a request was made to the SMS gateway. Neither step retried on errors. Where could the duplicates come from? Could the problem be in the gateway or at the operator? Then Vasya discovered that at the time of the duplicates the queue consumer had crashed several times. It dawned on him: a task is taken from the queue, executed, and marked as completed only at the very end of execution.

The fix took two days: for tasks that send SMS, email, and push notifications, the logic for marking a task as completed was changed so that the mark is set at the very beginning of execution. In distributed-systems terms, Vasya switched from at-least-once delivery to at-most-once delivery. Monitoring was set up, and it was agreed with the product team that a lost notification is better than a duplicated one.
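The difference between the two delivery modes comes down to when the task is marked as completed relative to the side effect. The following toy simulation shows this; the Task class, the WorkerCrash exception, and the placeholder phone number are all hypothetical, and a real queue would be far more involved.

```python
from dataclasses import dataclass


@dataclass
class Task:
    phone: str
    done: bool = False


class WorkerCrash(Exception):
    """Simulates the queue consumer dying in the middle of a task."""


def process(task, sent, mark_first, crash):
    if mark_first:
        task.done = True  # at-most-once: mark before the side effect
    sent.append(task.phone)  # the external side effect (SMS gateway call)
    if crash:
        raise WorkerCrash()
    if not mark_first:
        task.done = True  # at-least-once: mark only after success


def deliver(mark_first):
    """Run one task where the first attempt crashes right after sending."""
    task, sent = Task("+70000000000"), []
    for crash in (True, False):
        if task.done:
            break  # the queue does not redeliver completed tasks
        try:
            process(task, sent, mark_first, crash)
        except WorkerCrash:
            pass  # the consumer restarts and polls the queue again
    return len(sent)


at_least_once = deliver(mark_first=False)
at_most_once = deliver(mark_first=True)
print(at_least_once, at_most_once)  # 2 1
```

The price of marking first is that a crash before the gateway call loses the message entirely, which is the trade-off the team accepted for notifications.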

Conclusion

Through fictional stories, I have tried to explain why it is so important for APIs to be idempotent, and to show what nuances arise in practice.

At Yandex.Taxi, we always think about the idempotency of our APIs. In a small project, it might be acceptable not to spend time on rare cases. But Yandex.Taxi handles tens of millions of trips every month. That is why we have a design review procedure for architecture and APIs. If something is not idempotent, or has races or logical problems, the API will not pass review. For developers, this means carefully considering the details and thinking through many edge cases. It is not a trivial task, and covering such edge cases with autotests is especially hard.

Do timeouts, retries, and duplicates happen when an application does not have millions of users? Unfortunately, yes. The situations described are typical for distributed systems: network errors occur regularly, hardware fails regularly, and so on. A well-designed system treats such errors as normal behavior and is able to compensate for them.
