Welcome to HadoopKitchen
We are eager to tell you about our new initiative, which should interest both programmers and many other IT specialists: on September 27, next Saturday, the first meeting of HadoopKitchen will be held in the Moscow office of Mail.Ru Group. Why Hadoop, and why might this meeting be interesting for non-programmers?
- Hadoop is the center of a whole ecosystem, with numerous projects and technologies built around it.
- Many companies rely entirely on commercial Hadoop distributions.
- Hadoop is part of the product lines of almost all major IT vendors, which speaks to its relevance and popularity.
The program of the first Hadoop meeting is packed: as many as four speakers will present. All of them are excellent specialists with extensive experience that they want to share with the audience. Read on for the program of the event and the talk abstracts.
Program of the event:
11:00 Registration and welcome coffee.
12:00 Alexey Filanovsky (Cloudera Certified Developer for Apache Hadoop, Senior Sales Consultant, Oracle) will talk about the interesting new features of Hadoop v2. This will not be a dry list with brief descriptions: Alexey will also walk through different scenarios for using these features and share some examples from practice.
The Hadoop ecosystem is gaining popularity by leaps and bounds. More and more users run it not only for synthetic tests or to satisfy their own curiosity, but also in enterprise production environments. This explains the rapid development of the product: more users means more feature requests for the developers. This talk will highlight the main features that have appeared in Hadoop v2.
13:00 Nikita Makeev (Data Team Lead, IponWeb) will share how to extend the capabilities of Hadoop Streaming when working with the modern data formats Avro and Parquet.
Map-Reduce, Avro and Parquet without Java. Almost. Hadoop Streaming is a great way to harness Hadoop in particular and to batch process large amounts of data in general. You barely need to know Java; it is enough to roughly understand how MapReduce works and to be able to write in some programming language that can process lines of text. Almost any problem that can be solved with MapReduce can also be solved with Hadoop Streaming. The advantages are obvious: ease of development, no trouble finding people, low entry costs.
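To make the idea concrete, here is a minimal sketch of a Hadoop Streaming job in Python. It is not taken from the talk: the function names and the `map` / `reduce` command-line convention are assumptions made for illustration. The mapper reads lines of text and emits tab-separated key/value pairs; the framework sorts them by key; the reducer sums the values for each key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'key<TAB>1' for the first whitespace-separated field of each line."""
    for line in lines:
        fields = line.split()
        if fields:  # skip empty lines
            yield f"{fields[0]}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key. The input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: run the script as `script.py map` or `script.py reduce`.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Locally, the same pipeline can be imitated with `cat input.txt | python script.py map | sort | python script.py reduce`. On a cluster, the two commands would be passed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options, together with `-input` and `-output`.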
One of the most common uses of Hadoop Streaming is the processing of text logs or other data presented as text. However, more complex formats than just text are rapidly gaining popularity. Is it possible to retain the ability to process data using scripting languages and at the same time use all the advantages that modern data formats, such as Avro and Parquet, have?
We solve this problem using a certain amount of Java code and JSON as a connecting link. As usual, there are nuances and peculiarities everywhere, and often unique pitfalls of their own, which we will talk about.
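The abstract does not describe the actual implementation, so the following is only a sketch of what the script side of such a setup might look like. It assumes that the Java layer (not shown) hands each Avro or Parquet record to the script as one JSON object per line; the function name and the `user_id` and `bytes` fields are hypothetical.

```python
import json
import sys


def map_json_records(lines, key_field="user_id"):
    """Emit 'key<TAB>json' for every input line that holds one JSON record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        if key_field not in record:
            continue  # records without the key field are skipped in this sketch
        # Re-serialize with sorted keys so the output is stable across runs.
        yield f"{record[key_field]}\t{json.dumps(record, sort_keys=True)}"


if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] == "map":
    for out in map_json_records(sys.stdin):
        print(out)
```

With this approach the script still only processes lines of text, while schemas and binary encoding stay on the Java side.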
14:00 Maxim Lapan (Lead Programmer for Search, Mail.Ru Group) will tell the story of how Hadoop clusters are managed at Mail.Ru Group, including the difficulties the development team ran into as the system grew and expanded. The talk will focus on the practical side of operating the Hadoop/HBase cluster that has been used in the Mail.Ru Search project for the past three years. During this time, the system has grown from 30 to 400 servers, and the storage volume from 400 TB to 9 PB. Topics to be addressed:
- how we reinvented our own Bigtop: the structure and logic of our rpm-package assemblies, support for multiple clusters, user work, configuration features of Hadoop components;
- monitoring and analysis of cluster performance: how we monitor the operation of clusters, what metrics we use;
- administration issues of large Hadoop/HBase installations.
15:00 Lunch. War is war, but lunch is on schedule.
From 15:45 to 17:45, in the World Cafe format, everyone will be able to take part in jointly identifying and discussing the most pressing issues of operating Hadoop.
At 18:00, Alexey Grishchenko (Pivotal Enterprise Architect, EMC Corporation) will present the features and nuances of the Pivotal HAWQ architecture and talk about its interaction with Hadoop. The talk will cover the following topics:
- The current state of the market for solutions that provide an SQL interface to data in HDFS. This topic has recently been gaining popularity very quickly, largely because of the spread of Hadoop in the corporate sector. I will briefly cover the main existing solutions and the fundamental problems that all such systems face.
- Pivotal HAWQ components and their interaction with HDFS. Here I will describe in detail which components our DBMS consists of, how they are placed on a cluster, how they are connected to HDFS, and how they store data.
- Detailed analysis of the query execution process. Using a simple query as an example, I will describe its execution step by step, from the moment the query arrives in the system to the return of data to the client application. I will also briefly cover what distinguishes query processing in HAWQ from other systems.
- Options for accessing custom data storage formats on HDFS, as well as various external systems. Here I will talk about the PXF framework and how it can be extended, and I will give an example of a component I implemented.
- Other HAWQ capabilities and the direction of further development. I will talk about using HAWQ for data mining tasks, and outline where our platform is heading and what changes to expect.
Be sure to bring an identification document with you: we have strict security. You will also need to register.