Introduction to Apache Spark

    Hi, Habr!


    Last time we looked at the wonderful Vowpal Wabbit tool , which is useful in cases when you have to learn from samples that do not fit into RAM. Recall that the feature of this tool is that it allows you to build primarily linear models (which, by the way, have good generalizing ability), and high quality algorithms are achieved through the selection and generation of features, regularization, and other additional techniques. Today we’ll look at a tool that is more popular and designed to process large amounts of data - Apache Spark .

    We will not go into details about the history of this tool, as well as its internal structure. Focus on practical things. In this article, we will look at the basic operations and basic things that can be done in Spark, and next time we will take a closer look at the machine learning library MlLib , as well as GraphXfor graph processing (the author of this post mainly uses this tool for this - this is just the case when the graph often needs to be kept in RAM on the cluster, while Vowpal Wabbit is often enough for machine learning). There will not be much code in this manual, because The basic concepts and philosophy of Spark are considered. In the following articles (about MlLib and GraphX) we will take some dataset and take a closer look at Spark in practice.

    Immediately, Spark natively supports Scala , Python, and Java . We will consider examples in Python, because it’s very convenient to work directly in the IPython Notebook , unloading a small part of the data from the cluster and processing, for example, with a packagePandas - it turns out to be a pretty convenient bunch.

    So, let's start with the fact that the main concept in Spark is RDD (Resilient Distributed Dataset) , which is a Dataset on which you can make two types of transformations (and, accordingly, all work with these structures is in the sequence of these two actions).


    The result of applying this operation to an RDD is a new RDD. As a rule, these are operations that transform elements of a given dataset in any way. Here is an incomplete of the most common transformations, each of which returns a new dataset (RDD):

    .map (function) - applies the function function to each element of the

    dataset .filter (function) - returns all elements of the dataset on which the function function returned the true value

    .distinct ( [numTasks]) - returns a dataset that contains unique elements of the original dataset.

    It is also worth noting about operations on sets, the meaning of which is clear from the names:

    .union (otherDataset)

    .intersection (otherDataset)

    .cartesian (otherDataset) - the new dataset contains all kinds of pairs (A, B), where the first element belongs to the original dataset, and the second to the argument dataset


    Actions are applied when it is necessary to materialize the result - as a rule, save data to disk, or display some of the data in the console. Here is a list of the most common actions that can be applied to RDD:

    .saveAsTextFile (path) - saves data to a text file (in hdfs, to the local machine or to any other supported file system - the full list can be found in the documentation)

    .collect () - returns dataset elements as an array. As a rule, this is applied in cases when there is already little data in the dataset (various filters and transformations have been applied) —and visualization or additional data analysis is necessary, for example, using the Pandas

    .take (n) package — returns the first n elements of the dataset as an array

    .count () - returns the number of elements in the dataset

    .reduce (function) - a familiar operation for those familiar with MapReduce . It follows from the mechanism of this operation that the function function (which takes 2 arguments as input returns one value) must be commutative and associative.

    These are the basics that you need to know when working with the tool. Now let's take a little practice and show how to load data into Spark and do simple calculations with them.

    When starting Spark, the first thing to do is create a SparkContext (in simple words, this is the object that is responsible for implementing lower-level operations with the cluster - more details - see the documentation), which at startupSpark-Shell is created automatically and is immediately available ( sc object )

    Data loading

    There are two ways to upload data to Spark:

    a). Directly from the local program using the .parallelize (data) function

    localData = [5,7,1,12,10,25]
    ourFirstRDD = sc.parallelize(localData)

    b) From supported repositories (e.g. hdfs) using the .textFile (path) function

    ourSecondRDD = sc.textFile("path to some data on the cluster")

    At this point, it is important to note one feature of data storage in Spark and at the same time the most useful function .cache () (partly due to which Spark has become so popular), which allows you to cache data in RAM (taking into account the availability of the latter). This allows you to perform iterative calculations in RAM, thereby getting rid of IO-overhead'a. This is especially important in the context of machine learning and graph computing, as most algorithms are iterative - from gradient methods to algorithms like PageRank

    Work with data

    After loading the data into the RDD, we can do various transformations and actions on it, as mentioned above. For example:

    Let's see the first few elements:

    for item in 
         print item

    Or immediately load these elements into Pandas and work with the DataFrame:

    import pandas as pd
    pd.DataFrame( x: x.split(";")[:]).top(10))

    In general, as you can see, Spark is so convenient that there’s probably no point in writing various examples, or you can just leave this exercise to the reader - many calculations are written in just a few lines.

    Finally, we show only an example of transformation, namely, we calculate the maximum and minimum elements of our dataset. As you can easily guess, this can be done, for example, using the .reduce () function :

    localData = [5,7,1,12,10,25]
    ourRDD = sc.parallelize(localData)
    print ourRDD.reduce(max)
    print ourRDD.reduce(min)

    So, we examined the basic concepts necessary for working with the tool. We did not consider working with SQL, working with <key, value> pairs (which is easy - just apply a filter to RDD, for example, to select a key, and then it's easy to use built-in functions like sortByKey , countByKey , join , etc.) - the reader is invited to familiarize themselves with this independently, and if you have questions, write in the comments. As already noted, next time we will examine in detail the MlLib library and, separately, GraphX

    Also popular now: