Why learn Spark?

    Why should developers learn Spark? How can you master the technology at home? What can Spark do, what can't it do, and what does its future hold? Alexei Zinoviev, Java and Big Data trainer at EPAM, answers these questions in this interview.


    - You're a Java and Big Data trainer. What does that mean? What do you do?

    - At EPAM, I prepare and deliver trainings at teams' request for senior and lead engineers (or, as we say in IT slang, seniors and leads). Digging into every topic that starts with the letter J at a deep level is beyond any one person, so I specialize in the following: Java Concurrency, JVM internals (those very guts), Spring Boot, Kafka, Spark, Lombok, the actor model - in general, everything that helps both raise a developer's own productivity and speed up their application. Of course, if needed, I can prepare a training on Java EE or design patterns, but there is already plenty of such material both inside EPAM and outside it.

    - You have listed quite a few different topics.

    - Even within these topics, so many new questions and tasks keep appearing that almost every morning I have to tell myself: "Okay, stop, that's not something I do." So, by process of elimination, I end up with a handful of areas I actually work in. One of them is Spark. The family of Spark frameworks keeps growing and expanding, so even here you have to pick one thing to become a real expert in. This year I chose Structured Streaming, to understand what happens at the level of its sources and to be able to solve problems quickly.

    - Why should a developer learn to work with Spark?

    - Three years ago, if you wanted to do Big Data, you had to be able to deploy Hadoop, configure it, write bloody MapReduce jobs by hand, and so on. Knowing Apache Spark is just as important now. True, at an interview any Big Data engineer will still be grilled on Hadoop - but perhaps not as thoroughly, and nobody will demand production experience with it.

    If with Hadoop, building integration bridges to other data formats, platforms and frameworks was long and painful, with Spark we see a different picture: the community that develops it competes for the honor of hooking up the next NoSQL database by writing a connector for it.

    As a result, many large companies are paying attention to Spark and migrating to it: most of their wishes are already implemented there. Previously they would clone Hadoop in broad strokes, each with its own peculiarities - support for extra operations, some kind of internal optimizer, and so on.

    - There is also a whole zoo of Spark frameworks. What can be done with it?

    - Firstly, the Spark zoo helps you quickly build reports and extract facts and aggregates from large volumes of data, both sitting statically and pouring rapidly into your Data Lake.

    Secondly, it solves the problem of combining machine learning with distributed data that is spread across a cluster and processed in parallel. This is done quite easily, and thanks to the R and Python connectors, data scientists who are far removed from the problems of building high-performance backends can still use Spark's capabilities.

    Thirdly, it copes with the problem of integrating everything with everything: everyone writes Spark connectors. Spark can also be used as a quick filter to reduce the dimensionality of incoming data. For example, to pipe a stream from Kafka through filtering and aggregation and land it in MySQL - why not?
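    As an illustration of that kind of pipeline, here is a rough Structured Streaming sketch in Scala (my example, not from the interview): it reads a hypothetical Kafka topic, filters and aggregates the stream, and lands each micro-batch in MySQL over JDBC. The topic, broker address, table and credentials are placeholders, and the foreachBatch sink assumes Spark 2.4+ with the spark-sql-kafka-0-10 package and a MySQL JDBC driver on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object KafkaToMysql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-filter-aggregate")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Raw stream of events from Kafka (broker and topic are placeholders)
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    // Filter out empty messages and count events per key
    val counts = raw
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .filter($"value".isNotNull && length($"value") > 0)
      .groupBy($"key")
      .count()

    // Land every micro-batch in MySQL over JDBC (foreachBatch needs Spark 2.4+)
    val query = counts.writeStream
      .outputMode("complete")
      .option("checkpointLocation", "/tmp/kafka-to-mysql-checkpoint")
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.write
          .format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/analytics")
          .option("dbtable", "event_counts")
          .option("user", "spark")
          .option("password", "secret")
          .mode("overwrite")   // complete mode recomputes the whole aggregate
          .save()
      }
      .start()

    query.awaitTermination()
  }
}
```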

    - And are there problems that Spark can't cope with?

    - Of course there are - after all, we're not at a framework fair where I'd sell you the perfect hammer that can also paint walls. If we take the same machine learning, work on building the ideal framework is still in progress: many spears are being broken over the final API design, and some algorithms simply don't parallelize (there are only papers and single-threaded implementations).

    There is also a certain problem in that Spark Core has already gone through three generations of API: RDD, DataFrame, Dataset. Many components are still built on RDD (I mean Streaming, most MLlib algorithms, processing of large graphs).
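    To make the three generations concrete, here is a small sketch (my own, not from the interview) of the same per-user aggregation written against the RDD, DataFrame and Dataset APIs; the Purchase case class is invented purely for the illustration.

```scala
import org.apache.spark.sql.SparkSession

// Invented case class, just for the illustration
case class Purchase(user: String, amount: Double)

object ThreeGenerationsOfApi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-dataframe-dataset")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val data = Seq(Purchase("anna", 10.0), Purchase("boris", 20.0), Purchase("anna", 5.0))

    // 1. RDD: the original low-level API, opaque to the Catalyst optimizer
    val totalsRdd = spark.sparkContext.parallelize(data)
      .map(p => (p.user, p.amount))
      .reduceByKey(_ + _)

    // 2. DataFrame: untyped rows, but the query runs through Catalyst
    val totalsDf = data.toDF().groupBy("user").sum("amount")

    // 3. Dataset: a typed API on top of the same optimized engine
    val totalsDs = data.toDS()
      .groupByKey(_.user)
      .mapValues(_.amount)
      .reduceGroups(_ + _)

    totalsRdd.collect().foreach(println)
    totalsDf.show()
    totalsDs.show()

    spark.stop()
  }
}
```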

    - What can you say about the new Spark frameworks?

    - None of them are yet good enough to be used in production. The most mature right now is Structured Streaming, which has emerged from the experimental underground. But it still can't, for example, join two streams: you have to roll back and write a mix of DStreams and DataFrames. On the other hand, there are almost no problems with developers breaking the API from version to version. Everything is quite calm here, and code written for Spark a couple of years ago will run now with minor changes.

    - Where is Spark heading? What tasks will it be able to solve in the near future?

    - Spark is moving toward a totally tabular, DataFrame-style view of reality - everywhere, for all components. This will make it possible to painlessly drop RDD support in Spark 3.0 and focus entirely on the engine that optimizes the "Spark assembler" into which your high-level set of operations on tables is compiled. Spark is also pursuing deep integration with deep learning, in particular through the TensorFrames project.
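    That compilation of table operations into an optimized plan is easy to observe yourself: calling explain(true) on any DataFrame prints the plans Catalyst produces. A minimal sketch over made-up data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ExplainThePlan {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("catalyst-explain")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up data, just enough to build a plan over
    val orders = Seq(
      ("anna", "books", 12.0),
      ("boris", "games", 40.0),
      ("anna", "games", 8.0)
    ).toDF("user", "category", "price")

    // A high-level set of operations on a "table"...
    val report = orders
      .filter($"price" > 10)
      .groupBy($"category")
      .agg(avg($"price").as("avg_price"))

    // ...which Catalyst turns into an optimized physical plan;
    // explain(true) prints the parsed, analyzed, optimized and physical plans
    report.explain(true)

    spark.stop()
  }
}
```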

    - And what should we expect in a year, say?

    - I think that in 2018 there will be more monitoring and deployment tools and other services than there are now, offering a "Spark cluster in one click, fully integrated with everything and with a visual designer" for a reasonable price, or even almost for free - paying only for server time.

    - On YouTube there are many videos on how to install Spark in two clicks, but little material on what to do next. What do you recommend?

    - I can recommend several resources:


    - What level of developers should master Spark?

    - You can, of course, send someone who has only done a couple of labs in Pascal or Python to write Spark code. They will launch Hello World without any problems, but what would be the point?
    It seems to me that Spark is worth studying for developers who have already worked in the Bloody Enterprise, written backends and stored procedures. The same goes for those with solid experience tuning DBMSs and optimizing queries, who haven't yet forgotten their Computer Science, who like to think about how to process data while lowering the constant factor in an algorithm's complexity estimate. If you have been a team lead for several years now and digging through source code is not your thing, it's better to walk on past Spark.

    - Is it possible to master Spark at home?

    - You can start with a laptop with at least 8 GB of RAM and a couple of cores. Just install IDEA Community Edition + Scala Plugin + sbt (Maven also works), pull in a couple of dependencies and off you go. This will work even under Windows, but of course it's better to set it all up right away on Ubuntu / CentOS.
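    For that "couple of dependencies", a minimal build.sbt might look like the sketch below; the versions are illustrative for the Spark 2.x / Scala 2.11 era and should be adjusted to whatever is current.

```scala
// build.sbt - a minimal sketch; the versions are illustrative for the Spark 2.x era
name := "spark-playground"

scalaVersion := "2.11.12"   // Spark 2.x builds target Scala 2.11

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.1",
  "org.apache.spark" %% "spark-sql"  % "2.3.1"
)
```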

    After that, you can spin up a small Spark cluster in the cloud for a project that collects data from the Web, or process any open dataset from github.com/caesar0301/awesome-public-datasets . And, of course, read my GitBook.
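    A first step on such an open dataset can be as simple as pointing Spark at a downloaded CSV; in this sketch the file path is a placeholder for whichever dataset you pick.

```scala
import org.apache.spark.sql.SparkSession

object OpenDatasetProfile {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("open-dataset-profile")
      .master("local[*]")   // a laptop is enough for a first pet project
      .getOrCreate()

    // Placeholder path: any CSV downloaded from an open-data collection will do
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/some-open-dataset.csv")

    df.printSchema()
    df.describe().show()   // a quick statistical profile of the numeric columns

    spark.stop()
  }
}
```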

    - What difficulties do you usually face when working with Spark?

    - What works on a small dataset (testing approaches and certain JVM settings) often behaves differently on large volumes in production.

    Another difficulty for a Java developer is that you need to learn Scala. Most of the codebase and the function signatures mean reading Scala code with a dictionary. However, this is a pleasant difficulty.

    And last but not least: even a pet project on a "small cluster" with a "medium-sized dataset" is quite expensive. The Amazon bills are noticeably bigger than for a toy web project built to try out the next Java framework.

    On September 9 in St. Petersburg I will run a training on Apache Spark for Java developers. I will share my experience and explain which Spark components you should use right away, how to set up your environment and build your ETL process, how to work with the latest version of Spark, and more.
