Failsafe resource allocator over DHT

    We have a certain range of numbers from 0 to N, we need to write two functions int alloc () and free (int). The first one selects one of the free identifiers from the range [0, N), and the second, accordingly, “returns” it for reuse (we assume that the number N is small enough so that the identifiers could end if they are not returned, but more than the number allocated in each specific time point of identifiers). At the same time, at the "lower level" we have only DHT. There are no locks, and, in addition, fault tolerance is required from the algorithms - if any of the cluster nodes “develops” during the execution of the algorithm, the behavior of the system should be predictable. If the task is interesting, and it is also interesting to find out why a fail-safe service with such a signature cannot be used correctly, and how to fix the signature so that it becomes possible - welcome under cat.



    API



    In fact, let's start by clarifying the API of our “allocator”. As noted above, the current API is incorrect - it cannot be used in a distributed system, because the calling code will not always be able to understand the particular state of a particular pool element.

    A simple example.
    1) Suppose I call the allocate () function, it succeeds, but the calling code stops working when the response from allocate () is sent. Problem - the identifier stood out, but I don’t know anything about it. I physically have no way to write some kind of recovery procedure and delete it. On the other hand, the code responsible for allocating the resource has already worked, from its point of view, the identifier is allocated. And he himself will not be able to remove it, because he must wait for the call free. Which will not be.
    2) The idempotency of the free () function is required. Obviously, the calling code must be written so that in any case, call free (). Obviously, having received the response from allocate, we will put it into persistent memory (in RDBMS, key-value storage, or just print it on the printer and load a piece of paper in the scanner). And then, at a certain point in time, when some node decides to delete the resource, it will call free (). However, if at the moment the free node is running, the node turns off, our “calling party” will obviously try to call it again (how exactly this will be implemented is not important. It is important that this is implemented). This is where an unpleasant surprise awaits us. The fact is that free could still free the resource. And even more than that, some third-party alloc () could highlight it again. Then we will free the resource again, but this time it will already be someone else's resource.

    Generally speaking, there are several ways to solve these problems with the API.

    The simplest, and quite universal, is that we attach a tag to each allocation of resources, which is selected unique for each pair of alloc / free operations.

    That is, the method signature is changed to int alloc (T t) and free (int id, T t). No matter what type of T, it is important that there exists a certain equivalence relation on it, and that we can solve the two problems described above by it. First, the free method can be made idempotent. Secondly, we can introduce a relatively simple procedure for deleting “phantom” identifiers that were allocated, but which we did not receive Old information on (see problem 1). We won’t dwell on this in order not to inflate the article.

    Note that you can almost always use a service changed in this way almost always, because in most cases we allocate “identifiers” to something, and we can always use this “something” as a tag.

    Now a little more, it remains to understand exactly how alloc and free behave in the case of "partial" execution, which would, in turn, understand how we write the calling code, and whether it is possible to write it.

    Obviously, in the general case, the calling code cannot know whether alloc was executed or not. So the first rule for the calling code will read

    • The calling code must first write somewhere the identifier by which alloc is called, and only then call it. Due to this, the calling code will be able to implement the recovery procedure, where it goes in the reverse order - iterates over all the selected pairs (id, t), and calls free for all pairs that are not mentioned in its storage.
    Similarly, we cannot find out about the free method - whether it was called or not.
    Two more rules are born from here:
    • The identifier passed in free should be considered “free” in the system (that is, no longer used) BEFORE calling the free method
    • And, nevertheless, the free method should be called (validly several times) until, although one call would not succeed.

    Well, let's move on to implementation.

    Implementation



    I remind you that at the entrance we have DHT. For people writing in Java, we will assume that we have ConcurrentMap (its interface, by the way is completely trivial, will be clear to developers using other languages). At the same time, ConcurrentMap of different hosts "looks" at the same content.

    By themselves, operations in DHT are atomic (this is fully supported by a number of implementations of DHT).

    However, there is no atomicity between the challenges in DHT, and we need to take this part into account and support in its implementation.

    So, implementation.

    It is obvious that in DHT we will have some kind of card of the form id-> t (or vice versa). The problem is how to get to the first free id without going through all the busy nodes?

    Such a solution is proposed. Let's build a binary interval tree such that
    • The root of the tree is the interval [0, N)
    • Under the non-leaf node X [a, b) there are exactly two nodes [a, c) and [c, b), where c is selected so that the lengths of these intervals are equal
    • And intervals of the form [id, id + 1) are leafs of the tree.
    Obviously, such a tree has N sheets, each of which can be reached from the root in ln (N) transitions.

    Now let's put in each non-leaf node of the tree the number of reserved identifiers on the interval that this node represents. And in the leaf nodes of the tree we will put the tag if the corresponding identifiers are busy, and null if it is free.
    Now it remains to save this tree in DHT. In addition, it is important to update the information in its nodes in the correct order, because, I remind you, our algorithm should work correctly, even if it interrupts at an arbitrary step.

    Suppose we have an Interval structure with two fields - from and to, which defines a half-open interval (includes from, but does not include to). Obviously, such a structure encodes a tree node. It should also be noted a relatively trivial moment - there are no duplicate nodes in the tree, and, moreover, nothing is needed to determine the position of an interval in such a tree. In other words, we do not need to bind tree elements with “pointers”. We can always build an algorithm that can determine from the node X which nodes are “below” it and which node is “above” it.
    This is a fairly important property that simplifies our lives.

    Let's put in this structure there are the necessary constructors, obvious equals and hashcode, and also there are three more useful methods: left (), right () - to determine the child nodes and up () - to determine the parent. With designers and hash codes, I think everyone can handle it. With left () and right () too. Using the up () function will be somewhat more complicated, but it is possible to write it (we will return to it at the end of the article so as not to blur the text with unnecessary details).

    Now, having the “magical” Interval structure that can code our interval tree, the matter remains small.
    In allocate, we need to select some non-empty sheet and write a tag into it, and then go up from the selected sheet and update their values. It’s important to remember two things.
    • The tag is added to the list with the atomic operation putIfAbsent
    • Values ​​in non-leaf nodes are updated through a loop with read-modify-atomicUpdate. Namely, we read the value from the updated node, read the value in the child nodes, write the sum of the children through the atomic replace to the updated node. If atomic replace did not work, repeat the operation.
    • Even if we read from DHT that there are free leaves in the sub-tree of the node, this does not mean that they really are there.
    Therefore, our allocate will be written as a “until success” cycle from the failfast procedure for finding and updating an empty sheet. Note that the first modification of the tree that we do is updating the sheet, and only then - updating the parent nodes. This is necessary because allocate can end at any time.

    Free works the same way - it “releases” the identifier (that is, removes the node from DHT) if the tag in it matches the requested one and updates the nodes up the tree. At the same time, updating the nodes matches the way it goes to allocate. The only thing that is important to note is that free performs tree updating even if the leaf node does not contain a value (i.e., if free has already been called). This is necessary because the previous free could be partially executed, freeing the sheet, but not updating the parent nodes. Actually, just for this we need a requirement in the API that free can be called several times, and that it must be called again if there is any doubt that the last time its call completed successfully.

    Function up



    Actually, the promised up () function. It is worth noting that the spelling of left (), right () and up () is greatly simplified if we agree that the interval of the top level has a length of 2 ^ n. Then any other interval in the tree will have a length of 2 ^ n, in addition, intervals at one "level" of the tree will have the same length. Generally speaking, all three sabzh functions can be written for an upper-level interval of an arbitrary form (but then, in addition to the “current” interval, you will always need to know the upper-level interval). But this is inconvenient, and generalizing to 2 ^ n does not harm anything.

    Thus, with the length of the intervals 2 ^ n, left () and right () are written trivially, and for up () you only need to determine whether the current interval is “right” or “left” for its parent interval. This is determined by a simple rule - at the "left" intervals, the initial value is divided without a remainder by twice the length of the interval. Accordingly, knowing about the interval, whether it was obtained from the left () or right () function, it is easy to apply the opposite action and get the upper level interval.

    Also popular now: