Exactly once is NOT exactly the same: article analysis


    I decided to analyze an article describing some interesting details of exactly-once stream processing: Exactly once is NOT exactly the same [1]. The fact is that some authors understand terms rather strangely. Analyzing the article will clarify many details more deeply, because identifying illogical statements and oddities lets you grasp the concepts and their meaning more fully.

    Let's get started


    It all starts very well:

    Distributed event stream processing has become an increasingly hot topic. Notable Stream Processing Engines (SPEs) include Apache Storm, Apache Flink, Heron, Apache Kafka (Kafka Streams), and Apache Spark (Spark Streaming). One of the most notable and widely discussed features of SPEs is their processing semantics, with “exactly-once” being one of the most sought after.

    That is, data processing is extremely important, and exactly-once semantics in particular is the topic under discussion. So let's discuss it.

    There is a lot of misunderstanding and ambiguity surrounding it, however, even within the stream processing community.

    Indeed, it is very important to understand what it is. For that, it would be nice to give a correct definition before launching into lengthy arguments. But who am I to give such eminently sensible advice?

    This behavior can more accurately be described as “effectively-once” processing.

    Making up new terms is certainly an important task. I am fond of it myself. But it requires justification. Let's try to find one.

    I will not describe the obvious things such as directed processing graphs and so on. Readers can consult the original article on their own; moreover, those details are of little importance for this analysis. The original post illustrates them with a diagram.

    Next, the semantics are described:

    • At-most-once, i.e. no more than once. While it seems obvious, this behavior is extremely difficult to guarantee in edge cases such as crashes, loss of network connectivity, and so on. But for the author, everything is simple.

    • At-least-once, i.e. at least once. This scheme is more complicated, and there are even more pitfalls to step on.

    • Exactly once. What is exactly-once?

    Exactly-once processing guarantees that each event will be processed “exactly once”, even in the event of failures.

    That is, the exactly-once processing guarantee is when processing occurred “exactly once”.

    Feel the power of that definition? Let me paraphrase: exactly-once processing is when processing happens “once”. Well, yes, it also says that this guarantee should hold in case of failures. But for distributed systems that goes without saying. And the quotation marks hint that something is off here. Giving definitions in scare quotes, without explaining what they mean, is a sign of a deep and thoughtful approach.

    Next comes a description of how to implement such semantics. And here I would like to stay in more detail.

    Two popular mechanisms are typically used to achieve “exactly-once” processing semantics.
    1. Distributed snapshot / state checkpointing
    2. At-least-once event delivery plus message deduplication
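    The second mechanism can be sketched in a few lines. A minimal illustration (my own sketch, not the article's implementation), assuming deduplication is keyed on a unique event ID:

```python
# At-least-once delivery plus deduplication by event ID: duplicates
# introduced by redelivery are dropped before the handler runs.
def process_stream(events, handler):
    """Apply handler to each (event_id, payload) pair, skipping IDs
    that were already processed."""
    seen = set()      # in a real system this set must itself be durable
    outputs = []
    for event_id, payload in events:
        if event_id in seen:   # duplicate from a redelivery: drop it
            continue
        seen.add(event_id)
        outputs.append(handler(payload))
    return outputs

# At-least-once delivery may replay event 1 after a failure:
delivered = [(1, "a"), (2, "b"), (1, "a"), (3, "c")]
result = process_stream(delivered, str.upper)
```

    Note that the set of seen IDs must itself survive failures, which is exactly where the real complexity hides.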

    While the first mechanism, snapshots and checkpoints, raises no questions (well, apart from some details such as efficiency), the second has problems that the author kept silent about.

    For some reason, it is assumed that the handler can only be deterministic. With a non-deterministic handler, each restart will generally produce different output values and states, which means deduplication will not work: the outputs will simply differ. Thus the general mechanism would have to be much more complicated than the one described in the article. Or, frankly, such a mechanism is simply incorrect.

    However, let's move on to the juiciest part:

    Is exactly-once really exactly-once?

    Now let's reexamine what the “exactly-once” processing semantics actually guarantees to the end user. The label “exactly-once” is misleading in describing what is done exactly once.

    It is said that it is time to revisit this concept, because there are some inconsistencies.

    In reality, there is no way to guarantee that the processing itself happens exactly once. The possibility of failure is ever-present.

    It is worth reminding the dear author how modern processors work. Each processor executes a large number of stages in parallel. Moreover, there is branching: when the branch predictor guesses wrong, the processor starts executing the wrong instructions, and those actions are then rolled back. Thus the processor can execute the same piece of code twice, even when no failures have occurred at all!

    The attentive reader will immediately exclaim: it is the result that matters, not how it was carried out. Exactly! What matters is what happened as a result, not how it physically happened. If the result is as if it happened exactly once, then it happened exactly once. Don't you agree? Everything else is chaff that is irrelevant. Systems are complex, and the resulting abstractions only create the illusion of execution happening in a certain way. It seems to us that code executes sequentially, instruction by instruction: first a read, then a write, then the next instruction. But that is not so; everything is much more complicated. And the essence of good abstractions is to maintain the illusion of simple and understandable guarantees, without forcing you to dig deep every time you need to assign a value to a variable.

    And the whole problem with this article is precisely that exactly-once is an abstraction that lets you build applications without thinking about duplicates and lost values, trusting that everything will be fine even in the event of a failure. There is no need to invent new terms for this.

    The code example in the article clearly demonstrates a lack of understanding of how to write handlers:

    Map (Event event) {
        Print "Event ID: " + event.getId()
        Return event
    }
    The reader is invited to rewrite the code independently so as not to repeat the mistakes of the author of the article.
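    One possible rewrite, as a sketch in Python with stand-in Event and map_event names (the original is pseudocode): the flaw is that the print happens inside the handler, so any replay repeats the side effect; here the side effect is keyed on the event ID, making a replay harmless.

```python
# Stand-in for the article's Event type (illustrative, not from the article).
class Event:
    def __init__(self, event_id):
        self._id = event_id

    def get_id(self):
        return self._id

logged_ids = set()  # must be durable in a real system

def map_event(event):
    # Replay-safe side effect: an event ID is logged at most once,
    # no matter how many times the handler is re-executed.
    if event.get_id() not in logged_ids:
        logged_ids.add(event.get_id())
        print("Event ID:", event.get_id())
    return event
```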

    So what do SPEs guarantee when they claim “exactly-once” processing semantics, if they cannot guarantee that the processing itself happens exactly once?

    The user does not need a guarantee of physical exactly-once code execution. Knowing how the processor works, it is easy to conclude that such a guarantee is impossible. What matters is logical execution exactly once, as if no failures had occurred at all. Dragging in the concept of a commit to a data store only underscores the author's misunderstanding of the basics, since there are implementations of similar semantics that require no commit at all.
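    As a sketch of what “logical execution exactly once” can look like (illustrative names, not from either article): the state itself remembers which events it has applied, so replaying an event is a no-op.

```python
# Idempotent state updates: the state tracks the IDs of events it has
# already applied, so a replayed event leaves the state unchanged.
class Counter:
    def __init__(self):
        self.value = 0
        self.applied = set()  # IDs of events already applied

    def apply(self, event_id, delta):
        if event_id in self.applied:  # replayed event: no effect
            return self.value
        self.applied.add(event_id)
        self.value += delta
        return self.value

c = Counter()
c.apply(1, 5)
c.apply(2, 3)
c.apply(1, 5)  # replay after a failure: state unchanged
assert c.value == 8
```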

    For more details, see my article: Heterogeneous competitive data processing in real time strictly once [2].

    What is guaranteed is that updates to state managed by the SPE are committed only once to a durable backend state store.

    Whether there is a “durable backend state store” is something the user could not care less about. What matters is the effect of processing, i.e. consistent state and output values throughout the execution of stream processing. Note that for some tasks there is no need for a durable backend state store at all, yet an exactly-once guarantee would still be nice to have.

    Here at Streamlio, we’ve decided that effectively-once is the best way to describe these processing semantics.

    A typical example of a clumsy way to introduce concepts: write out an example and a whole paragraph of lengthy reasoning, and at the end add that “this is how we define this concept”. Such precision and clarity of definition evokes a truly vivid emotional response.


    A failure to understand the essence of abstractions leads to distortion of the original meaning of existing concepts and to the subsequent invention of new terms from scratch.

    [1] Exactly once is NOT exactly the same.
    [2] Heterogeneous competitive data processing in real time strictly once.
