Announcement of Moscow Spark #4
Hello! New year, new Spark, new Moscow Spark! We are opening the new season of our wonderful event on April 19 at the Attic of Rambler & Co. The framework does not stand still, and neither do we: this time we will introduce a new community site and try out a format with a guest star from abroad.
1. What's new in Spark 2.3? - Pavel Klemenkov, Chief Data Scientist @ Nvidia / Data Wizard @ BigDataTeam
In this talk, I will cover what I consider the three main new features of Apache Spark: continuous streaming, streaming ML, and vectorized UDFs. Using examples, we will look at how continuous streaming differs from micro-batch processing, how much faster it is, and what limitations come with it. We will tackle a pressing problem for every machine learning specialist: how to get a model into production, here using the new unified Streaming ML interface. Finally, we will look at how the developers relieved the last remaining pain point of PySpark performance with UDF vectorization.
2. MOOC on Big Data: give everyone a cluster and check the solutions! - Oleg Ivchenko, Assistant @ MIPT / Data Wizard @ BigDataTeam, and Pavel Akhtyamov, Analyst Developer @ Vicman Development / Data Wizard @ BigDataTeam
Last year, our team (BigDataTeam), together with Yandex, launched the Big Data for Data Engineers specialization. What makes this specialization unique is that student solutions are tested on a real cluster. Launching this infrastructure and integrating it with Coursera turned out to be rather laborious and posed many interesting engineering problems. We will talk about them, namely:
1) how to build a Spark cluster with Jupyter inside a Docker container
2) how to build your own pipeline for checking Coursera assignments using the LTI interface
3) how to transfer a Jupyter notebook to the production cluster and check it there
3. Apache Spark on Kubernetes the easy way - Dmitry Lakhvich [KrivdaTheTriewe], Senior Research Engineer @ Tookitaki / Data Engineer @ Maksimatelekom
One of the innovations of Apache Spark 2.3 is experimental support for Kubernetes in the main branch. In this talk, I will cover the architecture of Kubernetes itself, its deployment and basic setup in a minimal configuration, and the deployment of Apache Spark applications on Kubernetes. We will go over some subtleties of the settings, as well as the question of why we need yet another scheduler and what benefits it brings.
The event is free, but registration is required.
We have pizza and tea!
Starts at 19:00
Location: Varshavskoye Highway 9, building 1, entrance 5. Attic of Rambler & Co
Be sure to register and bring your passport so that the business center's security lets you in!
Come, it will be interesting!