Apache Spark as the core of the project. Part 2. Streaming, and what we ran into

    Hello, colleagues. Yes, it took nearly three years since the first article, but the project grind has only now let me go. I want to share my thoughts and concerns about Spark Streaming in conjunction with Kafka. Perhaps some of you have had successful experience with this combination, so I will be glad to discuss it in the comments.

    So, our project needs to make decisions in real time. We already use Spark successfully for batch processing, so we decided to use it for real-time processing as well. This gives us a single technological platform and a single code base.

    The workflow looks like this: all events go into a queue (Apache Kafka), and are then read and processed by consumers built on Spark Streaming. The consumers must solve two problems:

    • Data routing (redirecting data streams to various storages);
    • Real-time decision making.

    The data that arrives in Kafka should ultimately end up in HDFS as raw logs converted to Parquet, and in HBase as user profile attributes. At one time we quite successfully used Apache Flume for similar routing, but this time we decided to entrust this job to Spark Streaming. Out of the box, Spark can work with both HDFS and HBase, and in addition the developers guarantee "exactly once" semantics.
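    Roughly, the routing on the Spark side can be sketched as below (Spark 1.6 Scala API). The Event case class, the HDFS path and the routeBatch helper are made-up placeholders for illustration; the HBase part is only indicated in a comment.

        import org.apache.spark.rdd.RDD
        import org.apache.spark.sql.SQLContext
        import org.apache.spark.streaming.Time

        // Hypothetical parsed event; the real schema is project-specific.
        case class Event(userId: String, eventType: String, ts: Long)

        // Called from foreachRDD on the Kafka DStream (see the direct-stream sketch below).
        def routeBatch(rdd: RDD[Event], time: Time): Unit = {
          if (!rdd.isEmpty()) {
            val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
            import sqlContext.implicits._
            // Raw events -> Parquet files in HDFS, one directory per micro-batch.
            rdd.toDF().write.parquet(s"hdfs:///raw/events/batch-${time.milliseconds}")
            // Profile attributes -> HBase would be written here via the HBase client API
            // (foreachPartition plus an HBase connection); omitted in this sketch.
          }
        }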
    Now let's take a closer look at message delivery semantics. There are three kinds:

    • At most once - a message may be lost, but is never delivered more than once.
    • At least once - a message is never lost, but may be delivered more than once.
    • Exactly once - this is what everyone wants: a message is delivered exactly once and is never lost.

    And here lies the biggest misunderstanding. When the Spark developers talk about exactly-once semantics, they mean Spark alone. That is, once the data has made it into the Spark process, it will be delivered exactly once to all user functions involved in the processing, including those running on other hosts.

    But as you know, the complete workflow does not consist of Spark alone. Three parties are involved in our pipeline, and the semantics have to be considered for the whole chain.

    As a result, we have two problems: data delivery from Kafka to Spark, and data delivery from Spark to the storages (HDFS, HBase).

    From Kafka to Spark


    In theory, the problem of data delivery from Kafka to Spark is solved, and in two ways.

    Method One, Old (Receiver-based Approach)


    Spark implements a receiver that uses the Kafka consumer API to track offsets. The offsets themselves are stored in ZooKeeper, in the classic way. Everything would be fine, but there is a non-zero probability of a message being delivered more than once at the moment of a failure, and that is at least once.
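    For reference, the receiver-based consumer looks roughly like this (Spark 1.6 Scala API; the ZooKeeper hosts, group id and topic name are placeholders):

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        import org.apache.spark.streaming.kafka.KafkaUtils

        val conf = new SparkConf().setAppName("receiver-based-sketch")
        val ssc  = new StreamingContext(conf, Seconds(10))

        // The receiver consumes via the Kafka high-level consumer API, and the
        // consumed offsets are committed to ZooKeeper. A crash between
        // "data processed" and "offset committed" leads to re-delivery,
        // hence at-least-once.
        val stream = KafkaUtils.createStream(
          ssc,
          "zk1:2181,zk2:2181",      // ZooKeeper quorum (placeholder)
          "events-consumer-group",  // consumer group id (placeholder)
          Map("events" -> 2)        // topic -> number of receiver threads
        )

        stream.map(_._2).print()    // values only, for illustration

        ssc.start()
        ssc.awaitTermination()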

    Method Two, New (Direct Approach (No Receivers))


    The developers implemented a new direct approach, in which the Spark driver itself tracks the offsets. It stores information about what has been read in HDFS, in so-called checkpoints. This approach guarantees exactly-once semantics (within Spark), and this is what we use.
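    A minimal sketch of the direct approach with checkpointing, roughly as we use it (Spark 1.6 Scala API; the checkpoint path, broker list and topic name are placeholders):

        import kafka.serializer.StringDecoder
        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        import org.apache.spark.streaming.kafka.KafkaUtils

        val checkpointDir = "hdfs:///checkpoints/events-consumer"  // placeholder path

        def createContext(): StreamingContext = {
          val conf = new SparkConf().setAppName("direct-stream-sketch")
          val ssc  = new StreamingContext(conf, Seconds(10))
          ssc.checkpoint(checkpointDir)  // offsets and DAG metadata go to HDFS

          // No receivers: the driver itself asks Kafka for the offset ranges of each batch.
          val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
          val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
            ssc, kafkaParams, Set("events"))

          stream.map(_._2).print()  // routing/decision logic would go here
          ssc
        }

        // On restart the context is rebuilt from the checkpoint, so reading
        // resumes from the last recorded offsets.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()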

    Problem number one


    Spark sometimes corrupts its checkpoints so badly that it can no longer work with them and falls into a stupor: it stops reading data, yet keeps hanging in memory and telling everyone that everything is fine. What causes this problem is not entirely clear. So we kill the process, delete the checkpoints, restart, and read everything over again, either from the beginning or from the end. And that, too, is not exactly once. For historical reasons we are running version 1.6.0 on Cloudera. Maybe an upgrade would make the problem go away.

    Problem number two


    Kafka fails rarely, but when it does, it hits hard. There are situations when a broker goes down, and from the completely uninformative logs it is simply impossible to understand why. The loss of a single broker is not scary by itself; Kafka is designed for that. But if you miss it and do not restart the broker in time, the whole cluster eventually becomes inoperative. This does not happen within an hour, of course, but still.

    From Spark to External Storage


    Things are not so good here. The developer has to take care of the delivery guarantees from Spark to external storage on their own, which adds considerable overhead to development and architecture. If you need exactly-once semantics at this level, you will have to do quite a bit of work. By the way, we have not solved this problem in this part of the system and settle for at-most-once semantics.
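    If exactly-once results are needed on the output side, the usual recipe in the Spark Streaming documentation is to make the writes idempotent (or transactional) inside foreachRDD / foreachPartition, so that re-processing a batch does not duplicate data. A minimal sketch of the idea, with a hypothetical Sink client standing in for the real HBase/HDFS writer (not something we actually built):

        import org.apache.spark.streaming.dstream.DStream

        // Hypothetical sink interface standing in for a real HBase/HDFS client.
        trait Sink { def put(key: String, value: String): Unit; def close(): Unit }
        object Sink { def connect(): Sink = ??? }  // wiring to the real store is out of scope

        // `events` is the (key, value) DStream produced by createDirectStream above.
        def writeBatches(events: DStream[(String, String)]): Unit = {
          events.foreachRDD { rdd =>
            rdd.foreachPartition { partition =>
              // One connection per partition, not per record.
              val sink = Sink.connect()
              try {
                // If `put` is idempotent (the same key always overwrites the same
                // cell, as with an HBase put), replaying a failed batch does not
                // duplicate data, which turns at-least-once delivery into
                // exactly-once results.
                partition.foreach { case (key, value) => sink.put(key, value) }
              } finally {
                sink.close()
              }
            }
          }
        }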

    Summary:


    My feeling is that you can use Spark Streaming, but only if your data has no particular financial value and you are not afraid of losing some of it. The ideal case is when the data is guaranteed to reach the storage through some other, more reliable subsystem, and Spark Streaming is used as a rough tool that produces approximate conclusions or imprecise statistics in real time, which are then refined in batch processing mode.
