Big Data from A to Z. Part 5.1: Hive - SQL engine over MapReduce

Hello, Habr! We continue our series of articles on data analysis tools and techniques. The next two articles in the series are devoted to Hive, a tool for SQL lovers. In the previous articles we examined the MapReduce paradigm and the techniques and strategies for working with it. Perhaps to many readers some of the MapReduce solutions seemed somewhat cumbersome. Indeed, more than 40 years after the invention of SQL, it seems rather strange to write more than one line of code for a task like “count the number of transactions by region”.

On the other hand, classic DBMSs such as MySQL, PostgreSQL, or Oracle lack this kind of scalability when processing large amounts of data: once the volume grows beyond a certain point, supporting them becomes a serious headache. Apache Hive was invented precisely to combine these two advantages:





  • The scalability of MapReduce
  • The ease of SQL for querying data

Under the cut, we explain how this is achieved, how to start working with Hive, and what the limitations of its use are.

General description


Hive appeared in the depths of Facebook in 2007, and a year later its source code was opened and handed over to the Apache Software Foundation. Initially Hive was a set of scripts on top of Hadoop Streaming (see the second article of our series); later it grew into a full-fledged framework for querying data on top of MapReduce.

The current version of Apache Hive (2.0) is a mature framework that can run not only on top of MapReduce but also on top of Apache Spark (Spark will get separate articles in this series) and Apache Tez.

Apache Hive is used in production by companies such as Facebook, Grooveshark, Last.fm, and many others. At Data-Centric Alliance we use Hive as the main log storage of our advertising platform.

Architecture



Hive is an engine that turns SQL queries into chains of MapReduce jobs. The engine includes components such as the Parser (parses incoming SQL queries), the Optimizer (optimizes queries for greater efficiency), the Planner (plans the execution of jobs), and the Executor (runs jobs on the MapReduce framework).
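
You can peek at the plan these components produce with the EXPLAIN statement. A purely illustrative example, using the employee table that we will create later in this article (the output, omitted here, lists the stages the query is compiled into):

EXPLAIN
SELECT * FROM employee WHERE salary > 40000;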

Hive also needs a metadata store to work. The point is that SQL operates on objects such as databases, tables, columns, rows, cells, and so on, while the data Hive works with is stored simply as files on HDFS, so the mapping between Hive objects and the actual files has to be kept somewhere.

A regular relational DBMS such as MySQL, PostgreSQL, or Oracle is used as the metastore.

Command line interface


The easiest way to try Hive is through its command-line interface. The modern utility for working with Hive is called beeline (greetings to our partners from the mobile operator of the same name :)). On any machine of a Hadoop cluster (see our Hadoop tutorial) with Hive installed, just type the command:

$beeline

Next, you need to connect to the hive server:

beeline> !connect jdbc:hive2://localhost:10000/default root root
Connecting to jdbc:hive2://localhost:10000/default
Connected to: Apache Hive (version 1.1.0-cdh5.7.0)
Driver: Hive JDBC (version 1.1.0-cdh5.7.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000/default>

Here root root are the username and password. After that, you get a command prompt where you can enter Hive commands.
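
For instance, just to look around, you can list the available databases and then leave the shell (the exact output depends on your installation, so it is omitted here):

0: jdbc:hive2://localhost:10000/default> SHOW DATABASES;
0: jdbc:hive2://localhost:10000/default> !quit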

It is also sometimes convenient not to type SQL queries into the beeline command line but to save and edit them in a file first, and then execute all the queries from that file. To do this, run beeline with the connection parameters and the -f parameter specifying the name of the file with the queries:

beeline -u jdbc:hive2://localhost:10000/default -n root -p root -f sorted.sql
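
For short one-off queries, beeline can also take the query text directly on the command line via the -e parameter, for example:

beeline -u jdbc:hive2://localhost:10000/default -n root -p root -e "SHOW DATABASES;"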

Data units


Hive operates on the following objects:

  1. Database
  2. Table
  3. Partition
  4. Bucket

We will analyze each of them in more detail:

Database


A database is the analogue of a database in a relational DBMS: a namespace that contains tables. The command for creating a new database looks like this:

CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database_name>

Database and Schema in this context are one and the same. The optional addition of IF NOT EXISTS, as you might guess, creates a database only if it does not already exist.

An example of creating a database:

CREATE DATABASE userdb;

To switch to the appropriate database, use the USE command:

USE userdb;
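
We will keep using userdb throughout this article, but for reference: a database that is no longer needed can be dropped, and the CASCADE keyword removes all of its tables along with it (a destructive operation, shown here only for completeness):

-- drop the database together with all the tables it contains
DROP DATABASE IF EXISTS userdb CASCADE;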


Table


A table in Hive is the analogue of a table in a classic relational database. The main difference is that the data of Hive tables is stored simply as regular files on HDFS: plain text CSV files, binary SequenceFiles, more complex columnar Parquet files, and other formats. In any case, the data a Hive table is defined over can easily be read outside of Hive as well.

There are two types of tables in hive:

A classic table, whose data is added by means of Hive. Here is an example of creating such a table (example source):

CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Here we created a table whose data will be stored as plain text files with tab-separated columns. After that, data can be loaded into the table. Suppose our user's home folder on HDFS contains (recall that you can upload a file with hadoop fs -put) a file sample.txt of the following form:

1201	Gopal	45000	Technical manager
1202	Manisha	45000	Proof reader
1203	Masthanvali	40000	Technical writer
1204	Kiran	40000	Hr Admin
1205	Kranthi	30000	Op Admin
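
If sample.txt is still on the local disk, it can be uploaded to HDFS, for example, like this:

# upload the local file into the user's home directory on HDFS
hadoop fs -put sample.txt /user/root/sample.txt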

We can load this data with the following command:

LOAD DATA INPATH '/user/root/sample.txt'
OVERWRITE INTO TABLE employee;

After that, Hive moves the data from our file into the Hive warehouse. You can verify this by reading the file directly from the warehouse directory on HDFS:

[root@quickstart ~]# hadoop fs -text /user/hive/warehouse/userdb.db/employee/*
1201  Gopal       45000    Technical manager
1202  Manisha     45000    Proof reader
1203  Masthanvali 40000    Technical writer
1204  Kiran       40000    Hr Admin
1205  Kranthi     30000    Op Admin
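
The storage location and format of a table can also be inspected from Hive itself (the rather verbose output is omitted here):

0: jdbc:hive2://localhost:10000/default> DESCRIBE FORMATTED employee;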

Classic tables can also be created as the result of a SELECT query over other tables:

0: jdbc:hive2://localhost:10000/default> CREATE TABLE big_salary as SELECT * FROM employee WHERE salary > 40000;
0: jdbc:hive2://localhost:10000/default> SELECT * FROM big_salary;
+-----------------+------------------+--------------------+-------------------------+--+
| big_salary.eid  | big_salary.name  | big_salary.salary  | big_salary.destination  |
+-----------------+------------------+--------------------+-------------------------+--+
| 1201            | Gopal            | 45000              | Technical manager       |
| 1202            | Manisha          | 45000              | Proof reader            |
+-----------------+------------------+--------------------+-------------------------+--+

Note that the SELECT that creates the table in this case already launches a MapReduce job.
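
Ordinary analytical queries, like the “count the transactions by region” example from the beginning of the article, look exactly as they would in any DBMS and likewise run as MapReduce jobs. A purely illustrative example for our table, counting employees per position:

SELECT destination, count(*) AS cnt
FROM employee
GROUP BY destination;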

An external table, whose data is loaded by external systems without Hive's involvement. To work with external tables, specify the EXTERNAL keyword when creating the table, as well as the path to the folder where the files are stored:

CREATE EXTERNAL TABLE IF NOT EXISTS employee_external ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/root/external_files/';

After that, the table can be used in the same way as regular Hive tables. The most convenient part is that you can simply copy a file into the right folder on HDFS, and Hive will automatically pick up the new files when the table is queried. This is very handy when working with logs, for example.
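
For example, it is enough to upload another file of the same format into the folder (the file name here is just an illustration):

# add one more tab-separated file to the external table's folder
hadoop fs -put more_employees.txt /user/root/external_files/

and the new rows will show up the next time the table is queried, e.g. with SELECT count(*) FROM employee_external;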

Partition


Since Hive is an engine for translating SQL queries into MapReduce jobs, even the simplest query against a table usually leads to a full scan of its data. To avoid a full scan when filtering on certain columns, the table can be partitioned by those columns. This means that data corresponding to different column values will be physically stored in different folders on HDFS.

To create a partitioned table, you need to specify which columns it will be partitioned by:

CREATE TABLE IF NOT EXISTS employee_partitioned ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
PARTITIONED BY (birth_year int, birth_month string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

When loading data into such a table, you must explicitly state which partition it goes into:

LOAD DATA INPATH '/user/root/sample.txt' OVERWRITE
INTO TABLE employee_partitioned
PARTITION (birth_year=1998, birth_month='May');

Now let's see what the directory structure looks like:

[root@quickstart ~]# hadoop fs -ls /user/hive/warehouse/employee_partitioned/
Found 1 items
drwxrwxrwx   - root supergroup          0 2016-05-08 15:03 /user/hive/warehouse/employee_partitioned/birth_year=1998
[root@quickstart ~]# hadoop fs -ls -R /user/hive/warehouse/employee_partitioned/
drwxrwxrwx   - root supergroup          0 2016-05-08 15:03 /user/hive/warehouse/employee_partitioned/birth_year=1998
drwxrwxrwx   - root supergroup          0 2016-05-08 15:03 /user/hive/warehouse/employee_partitioned/birth_year=1998/birth_month=May
-rwxrwxrwx   1 root supergroup        161 2016-05-08 15:03 /user/hive/warehouse/employee_partitioned/birth_year=1998/birth_month=May/sample.txt

As you can see, each partition corresponds to a separate folder on HDFS. Now, if we run a query with a restriction on the partition values in the WHERE clause, MapReduce will read its input only from the corresponding folders.
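
For example, a query like the following (purely illustrative) will read only the files under birth_year=1998/birth_month=May:

SELECT name, salary
FROM employee_partitioned
WHERE birth_year = 1998 AND birth_month = 'May';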

Partitioning works similarly for external tables, except that the corresponding directory structure has to be created manually.
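
Besides creating the folders, you also have to tell Hive about each new partition. A sketch, assuming a partitioned external table named employee_external_partitioned (a hypothetical name) and files already uploaded into the folder:

-- register one partition explicitly, pointing it at its folder
ALTER TABLE employee_external_partitioned
ADD PARTITION (birth_year=1998, birth_month='May')
LOCATION '/user/root/external_files/birth_year=1998/birth_month=May';

-- or let Hive scan the table's location and register all partition folders it finds
MSCK REPAIR TABLE employee_external_partitioned;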

Partitioning is very convenient, for example, for splitting logs by date, since queries for statistics almost always carry a date restriction. This can significantly reduce query time.

Bucket


Partitioning helps reduce processing time when queries usually carry restrictions on the values of a particular column. However, it is not always applicable, for example, when the number of distinct values in a column is very large, say, a user ID in a system with several million users.

In that case, splitting the table into buckets comes to the rescue: rows end up in the same bucket when the hash function computed over a certain column gives the same value.

When working with bucketed tables, remember to enable bucketing support in Hive (otherwise Hive will treat them as regular tables):

set hive.enforce.bucketing=true;

To create a table split into buckets, the CLUSTERED BY clause is used:

set hive.enforce.bucketing=true;
CREATE TABLE employee_bucketed ( eid int, name String, salary String, destination String)
CLUSTERED BY(eid) INTO 10 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Since the LOAD command simply moves data into the Hive warehouse, it is not suitable here: the data first has to be preprocessed and split correctly into buckets. So a bucketed table must be populated with an INSERT from another table (for example, an external one):

set hive.enforce.bucketing=true;
FROM employee_external INSERT OVERWRITE TABLE employee_bucketed SELECT *;

After the command completes, you can check that the data was really split into 10 parts:

[root@quickstart ~]# hadoop fs -ls /user/hive/warehouse/employee_bucketed
Found 10 items
-rwxrwxrwx   1 root supergroup   31555556 2016-05-08 16:04 /user/hive/warehouse/employee_bucketed/000000_0
-rwxrwxrwx   1 root supergroup   31555556 2016-05-08 16:04 /user/hive/warehouse/employee_bucketed/000001_0
-rwxrwxrwx   1 root supergroup   31555556 2016-05-08 16:04 /user/hive/warehouse/employee_bucketed/000002_0
-rwxrwxrwx   1 root supergroup   31555556 2016-05-08 16:04 /user/hive/warehouse/employee_bucketed/000003_0
-rwxrwxrwx   1 root supergroup   31555556 2016-05-08 16:04 /user/hive/warehouse/employee_bucketed/000004_0
-rwxrwxrwx   1 root supergroup   31555556 2016-05-08 16:04 /user/hive/warehouse/employee_bucketed/000005_0
-rwxrwxrwx   1 root supergroup   31555556 2016-05-08 16:04 /user/hive/warehouse/employee_bucketed/000006_0
-rwxrwxrwx   1 root supergroup   31555556 2016-05-08 16:04 /user/hive/warehouse/employee_bucketed/000007_0
-rwxrwxrwx   1 root supergroup   31555556 2016-05-08 16:04 /user/hive/warehouse/employee_bucketed/000008_0
-rwxrwxrwx   1 root supergroup   31555556 2016-05-08 16:04 /user/hive/warehouse/employee_bucketed/000009_0

Now, when querying data for a specific user, we won't need to scan the entire table, only 1/10 of it.
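
Bucketing also makes sampling cheap: for instance, the following query (purely illustrative) reads a single bucket, i.e. roughly a tenth of the data, instead of the whole table:

SELECT * FROM employee_bucketed
TABLESAMPLE (BUCKET 1 OUT OF 10 ON eid);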

Checklist for using Hive


We have now covered all the objects Hive operates on. Once the tables are created, you can work with them much like tables in an ordinary database. However, don't forget that Hive is still an engine that runs MapReduce jobs over plain files and is not a full replacement for a classic DBMS. Careless use of heavy commands such as JOIN can lead to very long-running jobs. So before building your architecture around Hive, think it over a few times. Here is a short checklist for when Hive is appropriate:

  • There is a lot of data to process, and it will not fit on the disk of a single machine (otherwise it is better to consider classic SQL systems).

  • The data is mostly appended and rarely updated (if updates are frequent, consider using HBase, for example; see our previous article).

  • The data has a well-defined structure and splits cleanly into columns.

  • The data processing patterns are well described by a declarative query language (SQL).

  • Query response time is not critical (since Hive runs on top of MapReduce, you should not expect interactivity).

Conclusion


In this article, we looked at Hive's architecture and the data units Hive operates on, and gave examples of creating and populating Hive tables. In the next article of the series, we will look at advanced Hive features, including:

  • Transactional model
  • Indices
  • User-defined functions
  • Hive integration with non-HDFS data storage


Author’s Youtube channel on data analysis

Links to previous articles in the series:

» Big Data from A to Z. Part 1: Principles of working with big data, MapReduce paradigm
» Big Data from A to Z. Part 2: Hadoop
» Big Data from A to Z. Part 3: Techniques and strategies for developing MapReduce applications
» Big Data from A to Z. Part 4: Hbase
