Kronos: no time travel even in distributed systems

    In distributed systems there are a number of fundamental problems: efficient distributed transactions, exactly-once data processing, accurate synchronization of physical clocks. To solve this problem , different types of logical clocks were invented .


    However, vector clocks have unpleasant properties: they introduce a conditional relationship between events where there is none, and lose it where it actually is.


    However, you can think of something more reliable - Kronos. In this article, we will look at the causal relationship accounting algorithm and its application for building a Key-Value repository with distributed transactions.


    image


    Problems


    As already mentioned, there are a number of problems with logical clocks:


    • Non-existent dependencies arise because the logical clock introduces a complete order on events — that is, any two events about any event can be said which conditionally-earlier and which conditionally-later. The contract is conditional, since it is impossible to determine the exact relationship between events in time, including by virtue of the Special Theory of Relativity.


    • On the other hand, logical clocks consider interconnection only through messages within the system. If any two events are connected, but out of the system, for example, through a user (adding goods to the cart in one part of the system -> order payment), then the logical clock may miss this relationship.


    • The logical clock cannot be accessed from the outside, and it is also difficult to interconnect several independent components (distributed file system, query processing services, analytics).



    Decision


    In a 2014 article, Kronos: The Design and Implementation Ordering Service proposes a solution - a stand-alone service that will deal with causal relationships in events.


    The main abstraction inside Kronos is the event on which partial order is introduced. The causal relationship is transitive — that is, if, for example, we know that the creation of the file precedes its change, and the change is preceded by the deletion, we can make a logical conclusion that the creation occurred before the deletion.


    The minimum API can be defined by the following set of methods:


    MethodResultComment
    create_event()eCreates a new event in Kronos.
    query_order([(e_1, e_2), ...])[<-, concurrent, ->, ...]For each pair of requests, returns the direction of causation, or simultaneity of events.
    assign_order([(e_1, e_2, must), (e_3, e_4, prefer), ...])OK/FAILFor each pair of the query sets the direction of causation
    acquire_ref(e)OKIncreases the reference count for this event.
    release_ref(e)OKReduces the reference count for this event.

    Implementation


    It is quite logical that the system is based on an event oriented graph, with an effective wide search for checking the relationship between events, a fault tolerance mechanism and garbage collection.


    As can be seen from the API, the request assign_orderalso takes the type of causation: mustor prefer. mustmeets stringent invariants - for example создание_объекта->удаление_объекта, preferthe same can not be applied if it conflicts with mustbonds. An example of use preferis the requests that came earlier, it is better to call earlier, but this does not affect the correctness.


    Effective BFS


    In our case, the graph may be large, but the events for which verification requests will be performed, as a rule, will be located close. Therefore, it is necessary to perform BFS faster for such cases.


    In the standard implementation, the longest place is the initialization of the array of visited vertices, which always takes time equal to the number of vertices in the graph. Instead, you can use a hash table or use other tricks.


    Garbage collection


    As can be seen from the table, there are two more methods: acquire_refand release_ref.


    Inside Kronos for each event is stored reference count. While some service handles the event, or reserves the ability to add new events that occur after the current one, it stores the link. When such a need disappears, the service calls release_ref.


    Kronos will delete the event when all conditions are met:


    1. Number of links reached zero
    2. All events preceding this one have already been removed from the graph.

    This approach does not limit possible requests, but saves memory inside Kronos.


    Applications


    Consider the use of the system on the example of Key-value store with distributed transactions.


    Let there are several servers, each server is responsible for a range of keys.


    Each transaction corresponds to an event in Kronos. For each key, the server must store the last transaction number in which the key participated. The client creates an event and sends its number to all servers whose keys are affected by this transaction. The server tries to create a dependency in Kronos between the current transaction number and the previous event that is stored for this key. If it is not possible to create a dependency, then the transaction is considered unsuccessful (note that there is no data interaction yet).


    If all the operations of adding dependencies have completed successfully - this means that the transaction will take place and it can be performed. Servers learn about this from the client and begin to perform parts of the transaction.


    Note that such transactions will be ACID :


    • Atomicity : the transaction either cannot be scheduled at Kronos, or it will be scheduled for execution on all nodes.
    • Consistency : automatically in KV-storages.
    • Isolation : if two transactions overlap according to data, then they will be linked by a causal link in Kronos, which means that one will be executed before the other.
    • Durability : since Kronos is resistant to drops and it is assumed that each replica of the repository is also stable, the only thing that needs to be proved is the persistence of these uncommitted transactions. Actually, if the transaction is marked by the client as successful, but the record has not yet been executed on the server, this fact is easy to establish, since the server also keeps records of the completed parts of the transactions.

    Performance


    Implementing such a KV storage can indeed be effective. The original article provides data that the described implementation of KV-storage surpasses the transaction-based implementation 4 times as fast as transactions.


    Moreover, in comparison with MongoDB, the system over Kronos is inferior by only 6%, despite the fact that MongoDB does not use distributed transactions.


    Analysis


    However, the operation of the Kronos has several disadvantages.


    • First, there is an overhead of accessing Kronos — every request will require at least one call.
    • Kronos will also be a single point of failure - the authors of the article do not suggest ways to partition the event graph.
    • It would be nice to add a number of methods to the system. For example, in the KV-storage example, it would be nice to have a callback that informs the server about the status of the transaction — it was successfully added to the graph with all the necessary dependencies — or, conversely, failed to complete the transaction.

    However, the described system allows for the flexible management of a causal relationship between events, ensuring predictable adherence to the necessary invariants.


    Conclusion


    This is what we at GoTo School teach students and schoolchildren in the direction of Distributed Systems.


    And then there are Algorithms and Applications , Application Programming, Bioinformatics and Data Analysis


    Come to our autumn school on October 27 - November 4 or winter school in early January.


    And if you are not a student and not a schoolboy, come to teach .


    Also popular now: