Apache Spark as the core of the project. Part 1
Hello colleagues.
Spark recently appeared on our project. Along the way we have run into many difficulties and learned a lot. I want to systematize this knowledge for myself and, at the same time, share it with others. So I decided to write a series of articles about using Apache Spark. This first article is introductory.
Quite a lot has already been written about Spark, including more than once on Habr itself, so I will have to repeat myself a little.
Apache Spark is a framework for building distributed data processing applications. Spark provides an API for working with data that covers loading, saving, transforming and aggregating, plus many smaller conveniences, for example the ability to run locally for developing and debugging code.
In addition, Spark is responsible for the distributed execution of your application: it distributes your code across the cluster nodes, breaks the work into subtasks, builds an execution plan and monitors its progress. If a node fails and a subtask fails with it, the subtask is restarted.
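To get a feel for that API before the full example later in the article, here is a minimal sketch of a load-transform-save pipeline; the file paths and the filter condition are made up for illustration:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class AccessLogFilter {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("access-log-filter").setMaster("local[2]")
        );

        // load: one RDD element per line of the input file (path is hypothetical)
        JavaRDD<String> lines = sc.textFile("data/access.log");

        // transform: keep only the lines that mention errors
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

        // save: write the result back out as text (path is hypothetical)
        errors.saveAsTextFile("data/errors");

        sc.stop();
    }
}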
Spark's flexibility also lies in the fact that your applications can run under several different cluster managers:
- Standalone mode. You deploy the Spark infrastructure yourself; it manages all the cluster resources and runs your applications.
- YARN. A resource management platform that is part of the Hadoop ecosystem; your Spark application can be run on a Hadoop cluster managed by it.
- Mesos. Another cluster resource management system.
- Local mode. Created for development and debugging, to make our lives easier.
All these systems have their own advantages, relevant to different tasks and requirements; the sketch below shows how the choice is reflected in application code.
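In code, the choice mostly comes down to the master URL passed to SparkConf (or to spark-submit). A sketch with made-up hostnames and the usual default ports:
import org.apache.spark.SparkConf;

public class MasterUrlExamples {
    public static void main(String[] args) {
        // Local mode: everything runs in the current JVM, here with 2 worker threads
        SparkConf local = new SparkConf().setAppName("demo").setMaster("local[2]");

        // Standalone mode: point at the Spark master of your own cluster (hostname is made up)
        SparkConf standalone = new SparkConf().setAppName("demo").setMaster("spark://spark-master:7077");

        // Mesos: point at the Mesos master (hostname is made up)
        SparkConf mesos = new SparkConf().setAppName("demo").setMaster("mesos://mesos-master:5050");

        // YARN: the cluster address comes from the Hadoop configuration, so only the mode is given;
        // in practice this is usually set via spark-submit rather than hard-coded
        SparkConf yarn = new SparkConf().setAppName("demo").setMaster("yarn-client");
    }
}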
Why is Spark becoming No. 1?
Let's see why Spark's popularity has been growing lately, and why it has begun to supplant the good old Hadoop MapReduce (hereinafter simply MR).
The reason is a new architectural approach that significantly outperforms classic MR applications.
Here's the thing: MR development began in the 2000s, when RAM was expensive and 64-bit systems had not yet conquered the world. The developers took the only sensible path at the time: they implemented the exchange of intermediate data through the hard drive (or, to be precise, through the distributed file system HDFS). That is, all intermediate data between the Map and Reduce phases was written to HDFS. As a result, a lot of time was spent on disk I/O and on data replication between the nodes of the Hadoop cluster.
Spark appeared later and under completely different conditions. Its intermediate data is serialized and kept in RAM, and data exchange between nodes happens directly over the network, without unnecessary abstractions. It is worth noting that disk I/O is still used (at the shuffle stage), but far less intensively.
In addition, the initialization and launch of Spark tasks is much faster thanks to JVM optimizations. MapReduce launches a new JVM for every task, with all the ensuing consequences (loading all the JAR files, JIT compilation, etc.), while Spark keeps a running JVM on each node and controls task launch through RPC calls.
Finally, Spark is built around the RDD (Resilient Distributed Dataset) abstraction, which is more general than MapReduce. In fairness, it must be said that there is Cascading, a wrapper over MR designed to add flexibility.
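To illustrate the point about generality: an RDD pipeline is not forced into a single Map phase followed by a single Reduce phase; you can freely chain filters, joins and aggregations. A small sketch with made-up data:
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class RddFlexibility {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("rdd-flexibility").setMaster("local[2]")
        );

        // visits: (user_id, site), profiles: (user_id, country) - made-up data
        JavaPairRDD<String, String> visits = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("0000", "habrahabr.ru"),
                new Tuple2<>("0001", "qwe.ru"),
                new Tuple2<>("0000", "abc.ru")
        ));
        JavaPairRDD<String, String> profiles = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("0000", "RU"),
                new Tuple2<>("0001", "DE")
        ));

        // filter, join and aggregate in one pipeline - no rigid Map/Reduce structure
        JavaPairRDD<String, Integer> visitsPerCountry = visits
                .filter(v -> !v._2().equals("habrahabr.ru")) // drop one site
                .join(profiles)                              // (user_id, (site, country))
                .mapToPair(t -> new Tuple2<>(t._2()._2(), 1))// (country, 1)
                .reduceByKey((a, b) -> a + b);               // visits per country

        System.out.println(visitsPerCountry.collect());

        sc.stop();
    }
}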
There is one more very important point: Spark lets you develop applications not only for batch processing tasks but also for working with data streams (stream processing), while offering a unified approach and a unified API (though with slight differences).
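As a taste of that, here is a minimal sketch using the classic DStream-based Spark Streaming API, where the same mapToPair/reduceByKey chain from the batch example below applies almost verbatim to a stream; the socket source, host and port are made up:
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]");
        // micro-batches of 10 seconds
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // a stream of log lines arriving over a TCP socket (host/port are made up)
        JavaDStream<String> visits = jssc.socketTextStream("localhost", 9999);

        // the same pair-and-reduce logic as in the batch example below
        JavaPairDStream<String, Integer> counts = visits
                .mapToPair(s -> new Tuple2<>(s.split(",")[0], 1))
                .reduceByKey((a, b) -> a + b);

        counts.print();

        jssc.start();
        jssc.awaitTermination();
    }
}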
And what does it look like in code?
The Spark API is well documented on the official site, but for the integrity of the story let's briefly get acquainted with it using an example that you can run locally:
We will count, for each user, the total number of site visits, and return the top 10 users sorted in descending order.
import java.io.Serializable;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class UsersActivities {
    public static void main(String[] args) {
        final JavaSparkContext sc = new JavaSparkContext(
                new SparkConf()
                        .setAppName("Spark user-activity")
                        .setMaster("local[2]")                 //"local" means running in local mode
                        .set("spark.driver.host", "localhost") //this is also needed for local mode
        );

        //This could have been a load from a file: sc.textFile("users-visits.log");
        //but I decided to apply parallelize() to the input data, for clarity
        List<String> visitsLog = Arrays.asList(
                "user_id:0000, habrahabr.ru",
                "user_id:0001, habrahabr.ru",
                "user_id:0002, habrahabr.ru",
                "user_id:0000, abc.ru",
                "user_id:0000, yxz.ru",
                "user_id:0002, qwe.ru",
                "user_id:0002, zxc.ru",
                "user_id:0001, qwe.ru"
                //and so on, use your imagination :)
        );

        JavaRDD<String> visits = sc.parallelize(visitsLog);

        //turn each record into a pair: key (user_id), value (1 - the fact of a visit)
        //(user_id:0000 : 1)
        JavaPairRDD<String, Integer> pairs = visits.mapToPair(
                (String s) -> {
                    String[] kv = s.split(",");
                    return new Tuple2<>(kv[0], 1);
                }
        );

        //sum up the visit facts for each user_id
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey(
                (Integer a, Integer b) -> a + b
        );

        //sort by value and return the first 10 records
        List<Tuple2<String, Integer>> top10 = counts.takeOrdered(
                10,
                new CountComparator()
        );

        System.out.println(top10);
    }

    //Note that the comparator must be Serializable. Otherwise (e.g. with an anonymous class) we get
    //SparkException: Task not serializable
    //http://stackoverflow.com/questions/29301704/apache-spark-simple-word-count-gets-sparkexception-task-not-serializable
    public static class CountComparator implements Comparator<Tuple2<String, Integer>>, Serializable {
        @Override
        public int compare(Tuple2<String, Integer> o1, Tuple2<String, Integer> o2) {
            return o2._2() - o1._2();
        }
    }
}
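With the sample data above, the program should print something like [(user_id:0000,3), (user_id:0002,3), (user_id:0001,2)]; the relative order of users with equal counts is not guaranteed.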
It is worth saying that the Spark API is available for Scala, Java and Python, although it was originally designed specifically for Scala. Be that as it may, we use Java 8 on our project and are generally quite happy with it; we see no reason to switch to Scala so far.
In the next article we will look at stream processing in detail: why it is needed, how we use it in our project, and what Spark SQL is.