Apache Spark: what's under the hood?


    Recently, the Apache Spark project has attracted great attention, a large number of small practical articles have been written about it, it has become part of Hadoop 2.0. Plus, he quickly overgrown with additional frameworks, such as Spark Streaming, SparkML, Spark SQL, GraphX, and besides these “official” frameworks, a lot of projects appeared - various connectors, algorithms, libraries, and so on. It’s enough to quickly and confidently figure out this zoo with the lack of serious documentation, especially considering the fact that Spark contains all sorts of basic pieces of other Berkeley projects (for example, BlinkDB) - this is not an easy task. Therefore, I decided to write this article in order to make life easier for busy people.

    A little background:

    Spark is a UC Berkeley lab project that began around 2009. The founders of Spark are well-known scientists from the field of databases, and according to their philosophy, Spark is in some way an answer to MapReduce. Spark is now under the Apache “roof”, but the ideologists and core developers are the same people.

    Spoiler: Spark in 2 Words

    Spark can be described in one sentence like this - this is the inside of the engine of a massive parallel DBMS. That is, Spark does not promote its storage, but lives above others (HDFS - the distributed file system Hadoop File System, HBase, JDBC, Cassandra, ...). The truth is that the IndexedRDD project is worth mentioning right away - the key / value storage for Spark, which will probably be integrated into the project soon. Also, Spark does not care about transactions, but otherwise it is the MPP DBMS engine.

    RDD - the core concept of Spark

    The key to understanding Spark is RDD: Resilient Distributed Dataset. In essence, this is a reliable distributed table (in fact, RDD contains an arbitrary collection, but it is most convenient to work with tuples, as in a relational table). RDD can be completely virtual and just know how it originated, so that, for example, in the event of a node failure, recover. And it can be materialized - distributed, in memory or on disk (or in memory with extrusion to disk). Also, internally, RDD is partitioned - this is the minimum amount of RDD that will be processed by each work node.

    Everything interesting that happens in Spark happens through operations on RDD. That is, usually an application for Spark looks like this - we create an RDD (for example, we get data from HDFS), we mess it up (map, reduce, join, groupBy, aggregate, reduce, ...), we do something with the result - for example, we throw it back into HDFS.

    Well, already on the basis of this understanding, Spark should be considered as a parallel environment for complex analytical bunch of tasks, where there is a master who coordinates the task, and a bunch of work nodes that participate in the execution.

    Let's look at such a simple application in detail (we will write it on Scala - this is an occasion to learn this fashionable language):

    Spark application example (not all inclusive, e.g. include)

    We will separately analyze what happens at each step.

    def main(args: Array[String]){
      // Инициализация, не особо интересно
     val conf = new SparkConf().setAppName(appName).setMaster(master) 
     val sc = new SparkContext(conf)
     // Прочитаем данные из HDFS, сразу получив RDD
     val myRDD  = sc.textFile("hdfs://mydata.txt")
     // Из текстового файла мы получаем строки. Не слишком интересные данные.
     // Мы из этих строк сделаем кортежи, где первый элемент (сделаем его потом 
     // ключем) - первое "слово" строки 
     val afterSplitRDD = myRDD.map( x => ( x.split(" ")( 0 ), x ) )
     // Сделаем группировку по ключу: ключ - первый элемент кортежа
     val groupByRDD = afterSplitRDD.groupByKey( x=>x._1 )
     // Посчитаем кол-во элементов в каждой группе
     val resultRDD = groupByRDD.map( x => ( x._1, x._2.length ))
     // Теперь можно записать результат обратно на HDFS

    What is going on there?

    Now let's go over this program and see what happens.

    Well, firstly, the program runs on the master of the cluster, and before any parallel processing of the data goes, there is an opportunity to do something quietly in one thread. Further - as it is already probably noticeable - each operation on the RDD creates a different RDD (except saveAsTextFile). At the same time, RDDs are all created lazily, only when we ask to either write to a file, or for example upload them to the master, execution begins. That is, the execution occurs as in the query plan, by the conveyor, where the conveyor element is a partition.

    What happens to the very first RDD we made from an HDFS file? Spark is well integrated with Hadoop, therefore, on each working node its subset of data will be uploaded, and it will be downloaded according to partitions (which coincide with the blocks in the case of HDFS). That is, all the nodes downloaded the first block, and the execution went further according to plan.

    After reading from the disk, we have a map - it runs trivially on each working node.

    Next comes groupBy. This is no longer a simple pipeline operation, but a real distributed grouping. For the good, it’s better to avoid this operator, because while it is not implemented very cleverly, it poorly tracks data locality and will be comparable in performance to distributed sorting. Well, this is information for consideration.

    Let's think about the state of affairs at the time of execution of groupBy. All RDDs were previously pipelined, that is, they did not save anything anywhere. In the event of a failure, they would again pull the missing data from the HDFS and pass it through the pipeline. But groupBy breaks the pipeline and as a result we get a cached RDD. In case of loss, now we will be forced to redo all RDD to groupBy completely.

    To avoid a situation where due to failures in a complex application for Spark it is necessary to recalculate the entire pipeline, Spark allows the user to control caching with the persist statement. It can cache in memory (in this case, recounting occurs when data is lost in memory - it can happen when the cache overflows), to disk (not always fast enough), or to memory with ejection to disk in case of cache overflow.

    After, we again have a map and an entry in HDFS.

    Well, now it’s more or less clear what is happening inside Spark at a simple level.

    But what about the details?

    For example, I want to know how the groupBy operation works. Or the reduceByKey operation, and why it is much more efficient than groupBy. Or how join and leftOuterJoin work. Unfortunately, most of the details so far are easiest to learn only from the Spark source or by asking a question on their mailing list (by the way, I recommend subscribing to it if you will do something serious or non-standard on Spark).

    Even worse, we understand what is happening in the various Spark connectors. And how much you can use them at all. For example, we temporarily had to abandon the idea of ​​integrating with Cassandra because of their incomprehensible support for the Spark connector. But there is hope that quality documentation will appear in the near future.

    What interesting things do we have on top of Spark?

    • SparkSQL: SQL engine on top of Spark. As we have already seen, Sparke already has almost everything for this, except for storage, indexes and its own statistics. This seriously complicates the optimization, but the SparkSQL team claims that they are sawing their new optimization framework, and the AMP LAB (laboratory where Spark grew up from) is not going to refuse the Shark project - a complete replacement for Apache HIVE
    • Spark MLib: This is essentially a replacement for Apache Mahaout, only much more serious. In addition to efficient parallel machine learning (not only with RDD, but also with additional primitives), SparkML works much better with local data using the Breeze native linear algebra package, which will attract Fortran code to your cluster. Well, a very well-designed API. A simple example: we are simultaneously training on a cross-validation cluster.
    • BlinkDB: A very interesting project - inaccurate SQL queries on top of large amounts of data. We want to calculate average for some field, but we want to do it in no more than 5 seconds (having lost accuracy) - please. We want a result with an error of no more than a given one - it is also suitable. By the way, pieces of this BlinkDB can be found inside Spark (this can be considered as a separate quest).
    • Well, many, many things are now written on top of Spark, I have listed only the most interesting projects from my point of view

    Also popular now: