Welcome to Spark ... on Java: Interview with Evgeny Borisov

    Big Data is a problem: the amount of information grows every day and piles up like a snowball. The good news is that this problem has solutions; in the JVM world alone there are plenty of projects built to process ever larger volumes of data.

    The Apache Spark framework appeared in 2012: written in Scala, it is designed to improve performance for certain classes of tasks when working with Big Data. Over four years the project has matured and grown to version 2.0, which (in fact, starting already from versions 1.3-1.5) has a powerful and convenient API for working with Java. To understand who needs all this, and which tasks should and should not be solved with Spark, we talked to Evgeny Borisov, the author of the training "Welcome to Spark", which will be held on October 12-13 in St. Petersburg.



    JUG.RU: Evgeny, welcome! Let's start from the beginning. Tell us briefly what Spark is and what it's all about.

    Evgeny: First of all, Apache Spark is a framework: an API that lets you process Big Data with an unlimited number of resources and machines, and it scales on its own. To make it completely clear for Java developers: imagine good old JDBC, which lets you talk to a database, read something from it and write something into it. Spark also lets you write data out and read it back in, only your code will scale without limit.
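
    To illustrate the analogy, here is a minimal sketch of a Spark job in Java (assuming the Spark 2.x Java API; the file paths and column names are invented for the example): it reads data, transforms it and writes a result, and the same code runs unchanged on one machine or on a cluster.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class OrdersReport {
    public static void main(String[] args) {
        // The entry point; cluster size is a deployment concern, not a code concern
        SparkSession spark = SparkSession.builder()
                .appName("orders-report")
                .getOrCreate();

        // "Read something" - like a query over JDBC, but executed across the cluster
        Dataset<Row> orders = spark.read()
                .option("header", "true")
                .csv("orders.csv");            // illustrative path

        // Transform: this code does not change whether there is 1 machine or 100
        Dataset<Row> totals = orders
                .filter(col("status").equalTo("PAID"))
                .groupBy(col("customerId"))
                .count();

        // "Write something" back out
        totals.write().parquet("totals.parquet");

        spark.stop();
    }
}
```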

    And here a reasonable question arises: where can you read from and where can you write to? You can work with Apache Hadoop HDFS, you can work with Amazon S3. In general, there are lots of distributed storages; for many of them a Spark API has already been written, for others it is being written. For example, Apache Cassandra has its own connector for this (in DataStax Enterprise), which makes it possible to use Spark with Cassandra. Finally, you can also work with the local file system: there is no point in doing that in production (there is nothing to scale), but this option is usually used for testing.
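
    As a sketch of that last point (the URIs are illustrative, and the HDFS and S3 reads assume reachable storage and the corresponding connectors on the classpath), the read call looks the same regardless of where the data lives:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StorageExamples {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("storage-examples")
                .master("local[*]")              // local mode: handy for tests only
                .getOrCreate();

        // Local file system - nothing to scale, but convenient in unit tests
        Dataset<Row> local = spark.read().json("src/test/resources/events.json");

        // HDFS - same API, different URI (assumes a reachable namenode)
        Dataset<Row> fromHdfs = spark.read().json("hdfs://namenode:8020/data/events");

        // Amazon S3 - same API again, provided the hadoop-aws connector is available
        Dataset<Row> fromS3 = spark.read().json("s3a://my-bucket/data/events");

        System.out.println(local.count());
        spark.stop();
    }
}
```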

    More and more information accumulates every year, and accordingly there is a desire to process it with an unlimited amount of resources


    JUG.RU: So Spark runs on a distributed infrastructure. Does this mean the project is purely "enterprise", or can it in principle be used in personal projects as well?

    Evgeny: Today it is rather an enterprise framework, but Big Data is spreading at such a rate that soon there will be no getting away from it: more and more information accumulates every year, and accordingly there is a desire to process it with an unlimited amount of resources. Today you have little information and code that processes just that, and when more accumulates, you will have to rewrite everything. Right? But if everything was processed with Spark from the start, you can simply grow the cluster by a few machines, and the code does not need to change at all.

    JUG.RU: You say that Spark is still an enterprise-class framework. Then an important question: what about stability and backward compatibility?

    Evgeny: Spark is written in Scala, and since Scala's backward compatibility is rather poor, Spark suffers a little from this too. It happens that you upgrade and some functionality suddenly falls off, but this happens to a much lesser extent than in Scala itself. Still, the API here is much more stable and nothing is all that critical: "breakages" are usually resolved locally, point by point.

    The second version of Spark has just been released; it looks very cool, but so far I can't say how much has broken there, since nobody has switched to it yet. I will have time to prepare a review topic for the training and show what has changed and what has been updated.

    It is worth adding that even though Spark itself is written in Scala, I am by no means advocating that those who do not know Scala should write in it. I am often reproached: "there you go, bashing Scala again." I am not bashing Scala! I simply think that if someone does not know Scala but wants to write for Spark, that is not a reason to learn Scala.

    JUG.RU: Okay, so we have established that in principle it is better to write for Spark in Scala, but if you can't do Scala, then you can write in Ja...

    Evgeny: No, no, no! I did not say that it is better to write in Scala. I said that for people who already know Scala well, writing for Spark in it is completely normal.

    But if a person says: "Well, I don't know Scala, but I need to write for Spark because I've realized how cool it is. So do I now have to learn Scala?", I say that this makes absolutely no sense! And it was funny for me to read comments from people who wrote: "we tried to write in Java, since we didn't know Scala but did know Java, and then everything got so bad that in the end we had to switch to Scala, and now we are happy."

    Today this is not the case at all. It would have been true if we were talking about Java 7 and the old Spark, still built only on RDDs: yes, back then you really ended up with three-story constructions that were completely impossible to understand.

    Today we have Java 8, and Spark has DataFrames (since version 1.3), which provide an API that looks the same no matter what you call it from, Scala or Java: you can live perfectly well without Scala.
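
    For instance, here is roughly what DataFrame code looks like from Java 8 (a sketch against the Spark 2.x Java API; the CSV file and its columns are invented for the example):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

public class DataFrameInJava {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("dataframe-in-java").master("local[*]").getOrCreate();

        Dataset<Row> people = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("people.csv");          // illustrative file with name, city, age columns

        // Column-oriented, declarative code - reads almost the same as the Scala version
        people.filter(col("age").geq(18))
              .groupBy(col("city"))
              .agg(avg(col("age")).alias("avgAge"))
              .orderBy(col("avgAge").desc())
              .show();

        spark.stop();
    }
}
```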

    JUG.RU: If I can write in Java 8, what is the entry threshold for Spark? Will I have to learn a lot, read smart books?

    Evgeny: Very low, especially if you already have hands-on experience with Java 8. Take Java 8 streams: the ideology is very similar, and most methods are even called the same. You can start and figure it out on your own.
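
    A small illustration of that similarity (a sketch, with the Spark part assuming the 2.x Java API): the same map / filter / count pipeline, first on a plain Java 8 stream, then on a Spark Dataset.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class StreamsVsSpark {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "java", "scala", "stream");

        // Plain Java 8 stream: runs on one JVM
        long longWords = words.stream()
                .map(String::toUpperCase)
                .filter(w -> w.length() > 4)
                .count();

        // Spark Dataset: same ideology (map, filter, count), but distributed
        SparkSession spark = SparkSession.builder()
                .appName("streams-vs-spark").master("local[*]").getOrCreate();

        Dataset<String> ds = spark.createDataset(words, Encoders.STRING());
        long longWordsDistributed = ds
                .map((MapFunction<String, String>) String::toUpperCase, Encoders.STRING())
                .filter((FilterFunction<String>) w -> w.length() > 4)
                .count();

        System.out.println(longWords + " / " + longWordsDistributed);
        spark.stop();
    }
}
```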

    The training is needed rather to work through all sorts of subtleties, tricks and nuances. Besides, since the training is for Java developers, I will show how you can wire everything up with Spring, how you can build the infrastructure with it so that Spark's performance tricks can be applied via annotations and everything works out of the box.

    JUG.RU: Everywhere people write and say that Spark is great for working with Big Data, and that's understandable. But a question heard far less often is: what is Spark not suitable for? What are its limitations, and for which tasks is it definitely not worth taking?

    Evgeny: There is definitely no point in taking Spark for tasks that do not scale: if a task by its very nature does not scale, Spark will not help.

    As an example: window functions. They do exist in Spark (in general, you can do everything in Spark), but they work really slowly. Over time this should get better; they are moving in that direction.
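
    For context, this is what a window function looks like in the Java API (a sketch over the Spark 2.x API; the Parquet file and column names are invented): it picks the latest call per subscriber.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.row_number;

public class WindowExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("window-example").master("local[*]").getOrCreate();

        Dataset<Row> calls = spark.read().parquet("calls.parquet");   // illustrative data

        // Latest call per subscriber: a window partitioned by subscriber, ordered by time
        WindowSpec perSubscriber = Window.partitionBy(col("subscriberId"))
                                         .orderBy(col("callTime").desc());

        calls.withColumn("rn", row_number().over(perSubscriber))
             .filter(col("rn").equalTo(1))
             .show();

        spark.stop();
    }
}
```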

    JUG.RU: That, by the way, is a good question: where is Spark heading? It is clear that it is already possible to process data quickly and well.

    Evgeny: The first Spark had RDD (Resilient Distributed Dataset), which lets you process data with code. But data usually has a columnar structure, and it turns out that you cannot refer to columns by name: that simply does not exist in the RDD API. So if you have a file with a huge number of columns and a huge number of rows, and you write logic that processes all of it, you end up with very unreadable code.
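
    A sketch like this (the CSV and the column positions are invented) shows the problem: with RDDs every column is just an index into an array.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class RddByIndex {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("rdd-by-index").master("local[*]").getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

        // A CSV with many columns; with RDDs you can only address them by position
        JavaRDD<String[]> rows = sc.textFile("calls.csv").map(line -> line.split(","));

        // parts[3] is the country, parts[7] is the duration... good luck remembering that
        JavaRDD<String[]> longForeignCalls = rows
                .filter(parts -> !"local".equals(parts[3]) && Double.parseDouble(parts[7]) > 60);

        System.out.println(longForeignCalls.count());
        spark.stop();
    }
}
```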

    DataFrames made it possible to process the data while keeping its structure and using column names: the code became much more readable, and people familiar with SQL felt right at home in this world. On the other hand, the ability to fine-tune the logic started to be missed.
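
    Since the structure is preserved, you can even register a DataFrame and query it with plain SQL; a sketch (assuming Spark 2.x, with an invented file and columns):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SqlOnDataFrames {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sql-on-dataframes").master("local[*]").getOrCreate();

        Dataset<Row> calls = spark.read()
                .option("header", "true")
                .csv("calls.csv");               // illustrative file

        // The structure is preserved, so columns have names...
        calls.createOrReplaceTempView("calls");

        // ...and anyone who knows SQL feels at home
        spark.sql("SELECT country, count(*) AS cnt FROM calls GROUP BY country ORDER BY cnt DESC")
             .show();

        spark.stop();
    }
}
```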

    As a result, some things turned out to be convenient to do with RDDs and others with DataFrames. The second Spark united all of this in a structure called a Dataset, where you can work either way without leaving a single API. Plus, everything started working much faster. So if we talk about where Spark is heading: they keep making lots of different optimizations, and the framework runs faster and faster.
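
    A sketch of that "both worlds in one API" idea (the Call bean and the CSV file are invented for the example): typed lambdas where you need fine-grained logic, column expressions where names read better.

```java
import java.io.Serializable;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class DatasetBothWorlds {

    // A simple bean describing one record; invented for the example
    public static class Call implements Serializable {
        private String country;
        private double duration;
        public String getCountry() { return country; }
        public void setCountry(String country) { this.country = country; }
        public double getDuration() { return duration; }
        public void setDuration(double duration) { this.duration = duration; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("dataset-both-worlds").master("local[*]").getOrCreate();

        // One structure, one API...
        Dataset<Call> calls = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("calls.csv")                // illustrative file
                .as(Encoders.bean(Call.class));

        // ...RDD-style typed lambdas where you need fine-grained logic...
        Dataset<Call> longCalls = calls.filter((FilterFunction<Call>) c -> c.getDuration() > 60);

        // ...and DataFrame-style column expressions where names read better
        longCalls.groupBy(col("country")).count().show();

        spark.stop();
    }
}
```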

    The framework runs faster and faster.


    JUG.RU: Clearly, the movement is toward speed and flexibility. Now is a good time to ask about infrastructure: what tools does Spark work with? In your JPoint talk you explain in detail that you can work with Hadoop, you can work without Hadoop, and so on.

    But in the comments to the previous article the opinion was voiced that Spark without YARN is no good, and that you won't get any decent resource management there.


    Evgeny: I do not agree with that opinion. Let's first look at how it all starts: there is something that coordinates the work of the workers to which our code is sent and which run in parallel. YARN, of course, coordinates all of this much faster, and it also knows how to monitor the state of the workers and restart them if necessary. But you can work without YARN if you have to. There is Spark Standalone, which is, of course, slower and not as powerful, but besides that there are alternatives: Apache Mesos, for example, and others are being developed right now. I am sure that in five years there will be plenty of them; far from everything is tied to YARN. Moreover, for distributed storage there is also a whole bunch of tools, as I said at the beginning.
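
    In the code itself the choice of cluster manager barely shows up; a sketch (the host names are invented, and in practice the master is usually passed via spark-submit rather than hard-coded):

```java
import org.apache.spark.sql.SparkSession;

public class Masters {
    public static void main(String[] args) {
        // The same application code can be pointed at different cluster managers.
        // Usually the master is supplied by spark-submit (e.g. --master yarn),
        // but for a sketch it can be set in code:
        SparkSession spark = SparkSession.builder()
                .appName("masters-sketch")
                // .master("spark://master-host:7077")   // Spark Standalone
                // .master("mesos://mesos-host:5050")    // Apache Mesos
                .master("local[*]")                      // no cluster at all: tests
                .getOrCreate();

        System.out.println(spark.sparkContext().master());
        spark.stop();
    }
}
```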

    JUG.RU: We've more or less sorted out the theory, and it's also clear what Spark is not needed for. Can you give examples of using Spark from your own experience? Surely there was something interesting in the Big Data area.

    Evgeny: I don't know about "interesting"; after all, I mostly worked on enterprise projects, and there is not much fun there. What was interesting is that writing was fast and convenient.

    Of the interesting cases, I can recall a service for telephone companies: imagine you flew to another country, did not change your SIM card, and a roaming provider has to be selected. How is it selected, based on what? Cheap, expensive, profitable, unprofitable: to make such decisions, telephone companies have to analyze all their data: every call, who called whom, where to, where from, whether the connection was good, everything is recorded. This particular project computed such data for all calls around the world: it analyzed all of it and produced various statistics.

    The second example is Slice. There are people who open up access to their mailboxes so that their purchases, orders, tickets and so on can be analyzed from the mail, in order to get better-targeted ads and offers. Here, again, a wild number of emails has to be processed; all of it is stored in Redshift on Amazon: everything needs to be structured, computed and processed quickly in order to produce statistics, on the basis of which clients then serve targeted advertising or recommendations. Here we bolted Spark on to improve performance; without it everything worked very slowly.

    JUG.RU: Does Spark process data in real time?

    Evgeny: You can do both: in real time and in batches.
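
    As a sketch of what "both" means (Structured Streaming appeared in Spark 2.0; the directory and the field name are invented): the same aggregation can run once over existing files or continuously over arriving ones.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class BatchAndStreaming {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("batch-and-streaming").master("local[*]").getOrCreate();

        // Batch: read everything that is already there
        Dataset<Row> batch = spark.read().json("events/");     // illustrative directory
        batch.groupBy("type").count().show();

        // Streaming (Structured Streaming): the same kind of query,
        // but over data that keeps arriving in the directory
        Dataset<Row> stream = spark.readStream()
                .schema(batch.schema())                         // streaming sources need an explicit schema
                .json("events/");

        StreamingQuery query = stream.groupBy("type").count()
                .writeStream()
                .outputMode("complete")
                .format("console")
                .start();

        query.awaitTermination();
    }
}
```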

    JUG.RU: I see. What about data validation? Are there tools that simplify checking data for integrity or correctness?

    Evgeny: Well, this is all solved at the code level: you take a million lines of data and the first thing you do is throw out the invalid ones. By the way, statistics are usually collected on them: how much such data there is and why it is invalid, and that is also done with Spark.
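
    A minimal sketch of that pattern (the input file, the "phone" validity rule and the "source" column are all invented for the example):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class ValidationSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("validation-sketch").master("local[*]").getOrCreate();

        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .csv("input.csv");                               // illustrative file

        // Validity is whatever the business says it is; here: a phone number must be present
        Dataset<Row> valid = raw.filter(col("phone").isNotNull());
        Dataset<Row> invalid = raw.filter(col("phone").isNull());

        // Statistics on the rejected records - also done with Spark
        System.out.println("valid: " + valid.count() + ", invalid: " + invalid.count());
        invalid.groupBy(col("source")).count().show();           // e.g. which source produces bad data

        valid.write().parquet("clean.parquet");
        spark.stop();
    }
}
```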

    JUG.RU: Finally, I can't help touching on the "Java vs Scala in Spark" question once more, more out of curiosity than anything. Which side are you on?

    Evgeny: I am more likely on the Java side, although many people dislike me for it. I can be understood: I wrote Java for fifteen years, then wrote Groovy for several years, which is of course a step up from Java, but with the release of Java 8 things are no longer so clear-cut. Now, when starting a new project, I think every time about whether to start it in Java 8 or in Groovy.

    But Scala is a completely different world! And it is harder to understand as a tool; some of the macro-patterns just aren't there. There was a period when I had to write in Scala, and I suffered terribly. Naturally, when you suffer, you go to other people for advice. You ask one person how to build the architecture, and he says one thing; you ask another and get a completely different answer! In the Java world everything is much more settled, there is much more experience, more people, a community: I have services, DAOs, dependency injection, Spring, and for that there is Spring Data or, say, Spring MVC. Throw away this whole ton of knowledge that people have accumulated and go to Scala, learning everything from scratch? What for? In Java everything works no worse. I could understand it if Scala ran two to three times faster or its API were ten times more convenient, but that is not the case.

    I recalled a funny incident. After the report, a man approached me in Lviv and said:
    “Listen, you don’t like Scala, huh?”
    “I did not say that,” I answer.
    - Well, it feels that way.
    “Well, I just think that for a project that already has Java programmers, there's no point in wasting time moving them over to Scala.”
    - And have you written in Scala yourself?
    - Well, I wrote a little.
    - For how long?
    - About six months.
    - Ha, half a year... You couldn't possibly understand Scala in half a year! You need at least two or three years.

    At that moment I realized I was absolutely right. That's exactly where the entry threshold is high, at the level of two to three years. After all, the question is not whether the language is good or bad. The point is that if you can write in Java, write in Java; with Spark, in any case, it is quick and easy. Less and less often do the developers on my projects run into situations where they google a solution and find it for Scala but not for Java. There used to be a lot of that, and now there is practically none.

    Actually, I'll say again that Scala's mission changed not so long ago: instead of converting developers to Scala (which hasn't happened in 12 years), their goal now is to make all Scala products usable from Java. This is already felt very strongly: with every new version, the same Spark becomes more and more tuned for Java as well, and with every version the difference gets smaller and smaller.

    If a year ago I compared the numbers of Spark projects in Java and Scala on GitHub and the split was around 3000 to 7000, now these numbers are closer


    If a year ago I compared the numbers of Spark projects in Java and Scala on GitHub and the split was around 3000 to 7000, now these numbers are closer. And even then the gap was not that wide: that is still a huge number of programmers who write Java for Spark, and everything works fine for them.

    JUG.RU: Evgeny, thank you, and see you at the training!



    In general, if you find the topic of Java on Spark interesting, we will be glad to see you at our training. There will be many exercises and live coding, and in the end you will leave with enough knowledge to start working with Spark independently in the familiar Java world. Details can be found on the corresponding page.

    And if the training does not interest you, you can meet Evgeny at the Joker conference, where he will give two talks:
    - Myths about Spark, or Can an ordinary Java developer use Spark (a very short version of the training);
    - Maven vs Gradle: At the dawn of automation (with Baruch Sadogursky).
