Into a master's program without exams: the new "Big Data" track at the "I am a Professional" Olympiad

    We continue our series about “I am a Professional”, the competition for bachelor's, master's and specialist degree students. It is run with the support of the country's strongest universities. Today we cover a new competition track, “Big Data”, which is supervised by ITMO University.

    Sberbank is the general partner of the “I am a Professional” Olympiad for the tracks supervised by ITMO University: Computer Science, Information and Cyber Security, and Big Data.

    Christoph Scholz / Flickr / CC BY-SA

    A few words about the “I am a Professional” competition

    The Olympiad is held for students of various specialties.

    This year 54 tracks are registered, including mathematics, artificial intelligence, software engineering, Internet of Things, photonics and many others.

    Why participate? Winners can enter Russian universities without entrance exams and intern at major partner companies of the Olympiad, such as Yandex, Sberbank and MRG. Students who perform well will have the opportunity to attend winter schools, where they can meet industry experts.

    Format of participation. Registration is open until November 22. The online qualifying round runs from November 24 to December 9. Students who have completed at least two online courses from the list approved by the organizers may skip it. The final stages begin in February 2019.

    They will be held in person at universities across the country. ITMO University supervises five tracks of the Olympiad at once. We have already covered some of them, including “Robotics”. Today we present the “Big Data” track, which is new to this year's Olympiad.

    The “Big Data” track: what you need to know

    There are many events and seminars in the world devoted to Big Data.

    The international conferences SIGMOD, SIGKDD and ICML are worth mentioning. More and more events of this kind are taking place in our country as well, for example DataFest, the Big Data Conference from Rusbase and numerous meetups on technologies for managing and analyzing Big Data.

    ITMO University also takes part in such events and holds its own, including the YSC (Young Science Conference) series, a lecture by German Gref and a recent closed workshop at MRG. Big data plays an important role in the development of new IT systems and of solutions in other fields. ITMO University actively applies and develops Big Data technologies across the board.
    For example, staff of the ITMO University Department of High Performance Computing created Exarch, a semantic distributed data store. It provides fast access to data and optimizes data processing. Exarch can cut the execution time of simple tasks in half compared with tools like HDFS and Cassandra.
    Given the university's experience and research interests in working with big data, we could not miss the opportunity to open such a track within the “I am a Professional” project. The track is supervised by Alexander V. Bukhanovsky, Doctor of Technical Sciences and director of the Mega-Faculty of Translational Information Technologies at ITMO University. He and his team, which includes graduate students of the university, are now preparing the tasks.

    The Big Data track has two sub-tracks: “Data Analysis, Statistics and Machine Learning” and “Distributed Computing and Systems Technologies”. The first concerns the mathematics and methods of processing large amounts of data. The second is built around programming and high-performance computing aimed at optimizing analytical processes.

    Participants will use the Yandex.Contest platform and the most popular programming languages for working with Big Data: Java, Scala and Python.

    Java and Scala are more commonly used by data engineers for ETL and ELT and for implementing basic algorithms. Python is more often the tool of data scientists. All three languages are supported by Apache Spark, currently the most widely used solution for big data processing.

    Note that the online qualifying stage will not include programming tasks. This is due to a limitation of the Yandex.Contest platform: real datasets cannot be connected for processing. This will be resolved by the in-person stage of the competition.

    Preparing for the Olympiad

    A special program has been prepared for participants, including three webinars on the track's subject area. Lecturers from leading universities explain and work through examples of Olympiad tasks.

    Here is an example of one of the basic questions on big data.
    A large set of assorted raster photographs in 64-bit BMP format is evenly distributed across 1000 independent storage nodes on a single local network. A cluster of 100 compute nodes is used to detect faces in these files.

    In a single run of the processing job on all nodes, the speedup compared with a single node was only 52 times. Does this mean that:

    • A. The cluster is too small, and more compute nodes are needed to improve efficiency;
    • B. The images differ in size, so higher efficiency objectively cannot be achieved;
    • C. The communication channel between the storage and the cluster is too weak;
    • D. It is not yet clear. A series of additional experiments in different configurations is needed.

    Answer: D. The cause cannot be established from a single measurement, because depending on the conditions it could be either option A or option C.
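    The numbers in this question can be checked with a few lines of Python. The sketch below computes the parallel efficiency from the stated speedup and, purely as an illustration, the serial fraction that Amdahl's law would imply if that model applied. Amdahl's law is an assumption introduced here, not something the question states, and the function names are hypothetical. A single measurement fits this model just as well as, say, a bandwidth-limited one, which is why the question calls for additional experiments.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of the ideal linear speedup that was actually achieved."""
    return speedup / nodes


def amdahl_serial_fraction(speedup, nodes):
    """Serial fraction f implied by Amdahl's law, S = 1 / (f + (1 - f) / N).

    Solving for f gives f = (N / S - 1) / (N - 1).
    Assumption: the slowdown is caused only by a non-parallelizable part.
    """
    return (nodes / speedup - 1) / (nodes - 1)


def amdahl_speedup(serial_fraction, nodes):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)


# Values from the question: 100 compute nodes, observed speedup of 52.
efficiency = parallel_efficiency(52, 100)
serial = amdahl_serial_fraction(52, 100)

print(f"efficiency: {efficiency:.2f}")
print(f"implied serial fraction (Amdahl assumption): {serial:.4f}")
print(f"predicted speedup on 200 nodes under that assumption: "
      f"{amdahl_speedup(serial, 200):.1f}")
```

    Running the job on several cluster sizes and comparing the measured speedups with such a prediction is one way to see whether the model holds or whether something else, such as the storage channel, is the bottleneck.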

    The first lecture was given by Alexander Bukhanovsky:

    The second lecture covers the technological aspects of big data processing. It was given by Alexander Visheratin, a senior researcher at the Research and Technological Institute of ITMO University:

    In general, to solve the Olympiad tasks you need to study the typical mechanisms that underlie basic Big Data processing operations. These include patterns in the Apache Spark and Apache Flink frameworks, such as the shuffle and broadcast operations. It is also worth studying how iterative machine learning algorithms, such as Expectation-Maximization, work on big data. Knowledge of the data structures and storage organization principles used in modern stores such as Cassandra or ClickHouse will not hurt either.

    We also recommend the Yandex courses on Big Data processing:

    By the way, completing two of these courses lets you skip the qualifying round of the “Big Data” track and go straight to the in-person stage of the Olympiad.
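    To make the difference between shuffle and broadcast more concrete, here is a minimal plain-Python simulation of the two join strategies. It does not use the Spark or Flink APIs; the function names and the sample data are hypothetical and serve only to illustrate the idea. In a shuffle join both datasets are repartitioned by key so that matching records meet in the same partition. In a broadcast join the small dataset is copied to every worker and the large one stays where it is.

```python
from collections import defaultdict


def hash_partition(records, num_partitions):
    """Shuffle step: route each (key, value) record to a partition by key hash."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def shuffle_join(left, right, num_partitions=4):
    """Join two datasets by repartitioning both sides on the join key."""
    left_parts = hash_partition(left, num_partitions)
    right_parts = hash_partition(right, num_partitions)
    result = []
    # Records with equal keys land in partitions with the same index.
    for lpart, rpart in zip(left_parts, right_parts):
        index = defaultdict(list)
        for key, value in rpart:
            index[key].append(value)
        for key, lvalue in lpart:
            for rvalue in index.get(key, []):
                result.append((key, lvalue, rvalue))
    return result


def broadcast_join(large_partitions, small):
    """Join an already partitioned large dataset with a small one.

    The small dataset is turned into a lookup table that every
    partition reads, so the large dataset is never moved.
    """
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    result = []
    for part in large_partitions:
        for key, lvalue in part:
            for rvalue in lookup.get(key, []):
                result.append((key, lvalue, rvalue))
    return result


# Hypothetical sample data: a "large" event log and a "small" user table.
events = [("u1", "click"), ("u2", "view"), ("u1", "buy"), ("u3", "view")]
users = [("u1", "Anna"), ("u2", "Boris")]

shuffled = sorted(shuffle_join(events, users))
broadcasted = sorted(broadcast_join([events[:2], events[2:]], users))
print(shuffled == broadcasted)
```

    Both strategies produce the same joined records. The trade-off is in data movement: the shuffle join moves both datasets across the network, while the broadcast join moves only the small one, which is why it is usually preferred when one side fits in each worker's memory.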
