Hive vs Pig. Why so much ETL?


Better to lose a day, but then fly there in five minutes. (c)



Hello, colleagues.
I want to share my thoughts on how the Hive and Pig frameworks in the Hadoop ecosystem differ. These are two very similar products with the same goal: to take on the entire technical implementation of MapReduce and, in return, let you describe data processing at a more abstract level. In this article we will see what queries look like in each of the two systems and try to understand in which cases one or the other solution should be used.

Hive


So, let's start with Hive. Its main feature is HQL, an SQL-like query language. It lets you work with data in the familiar way, as if you were dealing with a regular relational database. Scripts can be run either through the interactive console or from the command line.

Hive offers:
  • SQL-like HQL
  • An interactive console
  • Built-in aggregation functions
  • Support for user-defined functions (UDF)
  • Data represented as tables

Hive can work:
  • with text files (you can specify a delimiter character)
  • with compressed text files (Gzip, Bzip2)
  • with arrays, dictionaries, unions
  • with a large number of built-in functions for collections, dates, strings, and JSON
  • with mathematical functions (rounding, logarithms, roots, trigonometry)
  • with aggregation functions (sum, min, max, avg, ...)
  • and if all of the above is not enough, you can plug in custom functions, as well as custom mappers and reducers (Python, Java); a minimal sketch follows this list
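As a hedged illustration of those last two points, here is a minimal sketch; the logs table, the payload column, and the my_mapper.py script are hypothetical names, not from the article:

-- Built-in string, date, math, and JSON functions (hypothetical 'logs' table)
SELECT upper(user_id),
       to_date(date_field),
       round(sqrt(win_price), 2),
       get_json_object(payload, '$.geo.country')
FROM logs;

-- Streaming each record through a custom Python mapper (hypothetical 'my_mapper.py')
ADD FILE my_mapper.py;
SELECT TRANSFORM (user_id, win_price)
  USING 'python my_mapper.py'
  AS (user_id string, price_bucket string)
FROM logs;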


A simple example:
-- Create an external table (describes the log structure)
CREATE EXTERNAL TABLE win_bids_log (
date_field string,
request_id string,
user_ssp_id string,
dsp_id string,
win_price int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 'hdfs://input/bid-logs';
CREATE EXTERNAL TABLE win_bids_by_dsp (
dsp_id string,
win_bids_count int,
win_price int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'hdfs://output/win-bids-by-dsp';
INSERT OVERWRITE TABLE win_bids_by_dsp
SELECT dsp_id, COUNT(dsp_id), SUM(win_price) FROM win_bids_log  GROUP BY dsp_id;

As you can see, everything is quite simple and clear, and it is pleasant to write queries in a familiar language. But this happiness lasts only until you have to deal with more complex queries.

A more complicated example:
INSERT OVERWRITE TABLE movieapp_log_stage 
SELECT * FROM ( 
	SELECT custid, movieid, 
	 CASE WHEN genreid > 0 THEN genreid ELSE -1 END genreid, time, 
	 CAST((CASE recommended WHEN 'Y' THEN 1 ELSE 0 END) AS INT) recommended, activity, 
	 CAST(null AS INT) rating, price 
	 FROM movieapp_log_avro 
	 WHERE activity IN (2,4,5,11) 
	UNION ALL 
	SELECT 
	 m1.custid, m1.movieid, 
	 CASE WHEN m1.genreid > 0 THEN m1.genreid ELSE -1 END genreid, m1.time, 
	 CAST((CASE m1.recommended WHEN 'Y' THEN 1 ELSE 0 END) AS INT) recommended, 
	 m1.activity, m1.rating, 
	 CAST(null as float) price 
	 FROM movieapp_log_avro m1 
	 JOIN (
		 SELECT custid,movieid, 
		 CASE WHEN genreid > 0 THEN genreid ELSE -1 END genreid,MAX(time) max_time, 
		 activity 
		 FROM movieapp_log_avro 
		 GROUP BY custid, movieid, genreid, activity 
	 ) m2 
	 ON ( 
		 m1.custid = m2.custid 
		 AND m1.movieid = m2.movieid 
		 AND m1.genreid = m2.genreid 
		 AND m1.time = m2.max_time 
		 AND m1.activity = 1 
		 AND m2.activity = 1 
	 ) 
) union_result;


Of course, you can figure it out, but you have to admit that this query clearly lacks structure. It would be nice to lay it all out step by step, with comments, wouldn't it?

To sum up:

Hive pros:
  • Good old SQL works well for describing queries, and everyone already knows it.
  • MapReduce under the hood. Hive takes on all the overhead of the MR plumbing: describing data models, input and output formats, and chains of MR jobs.
  • Interactivity. Good for analyzing data across different slices.
  • Development speed.
  • No dependencies, compilation, or builds (all of this is hidden).


Hive cons:
  • Not everything fits into the HQL paradigm.
  • Works well for simple queries, but as complexity grows they become harder and harder to understand, especially if the query was written by someone else.


Pig



Now let's talk about Pig. It is based on the procedural language Pig Latin, which takes some time to learn. Let's work through it and, along the way, find out how it differs from Hive.

Pig offers:
  • Pig Latin
  • An interactive console
  • Built-in aggregation functions
  • Support for user-defined functions (UDF)
  • Data represented as structures (Tuple, Bag)

Pig can work:
  • with text files (you can specify a delimiter character)
  • with compressed text files (Gzip, Bzip2)
  • with arrays, dictionaries, unions
  • with a large number of built-in functions for dates, strings, and structures
  • with mathematical functions (rounding, logarithms, roots, trigonometry)
  • with aggregation functions (sum, min, max, avg, ...)
  • and if all of the above is not enough, you can plug in custom functions (Jython, Java); a minimal sketch follows this list
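As a hedged illustration of a Jython UDF, here is a minimal sketch; the file udfs.py and the function price_bucket are hypothetical names, not from the article:

-- udfs.py (Jython) would contain (hypothetical):
--   @outputSchema("bucket:chararray")
--   def price_bucket(bid):
--       return 'high' if bid > 50 else 'low'
REGISTER 'udfs.py' USING jython AS myudf;
data = LOAD '/data/pig/example/' USING PigStorage('\t') AS (time, bid_id, user_id, dsp_id, bid:int);
buckets = FOREACH data GENERATE dsp_id, myudf.price_bucket(bid) AS bucket;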

As you can see, Pig can do the same things as Hive. The only difference is in the data representation and the language. But it is precisely this difference that takes Pig to a whole new level.

Let's take a closer look at Pig.
This framework works with special data structures, Tuple and Bag (a short sketch follows this list).
  • Tuple is an ordered set of fields, a structure whose fields can be accessed by index and/or name.
  • Bag is a collection (set) of Tuples.
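To make these structures concrete, here is a hedged Grunt-console sketch; the sample path and field names are hypothetical, and the schemas in the comments show roughly what DESCRIBE prints:

-- each row of a relation is a Tuple; GROUP produces a Bag per key
a = LOAD '/tmp/sample' USING PigStorage('\t') AS (key:chararray, value:int);
DESCRIBE a;
-- a: {key: chararray, value: int}
b = GROUP a BY key;
DESCRIBE b;
-- b: {group: chararray, a: {(key: chararray, value: int)}}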


Pig Latin basic operators (the walkthrough below covers the first few; a sketch of the rest follows this list):
  • LOAD
  • STORE
  • GENERATE
  • JOIN
  • GROUP
  • FILTER
  • UNION
  • DISTINCT
  • ORDER
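The walkthrough below exercises LOAD, FOREACH...GENERATE, GROUP, FLATTEN, and the aggregation functions. As a hedged sketch of the remaining operators, assuming the same tab-separated bid log and a hypothetical output subdirectory:

bids      = LOAD '/data/pig/example/' USING PigStorage('\t')
            AS (time, bid_id, user_id, dsp_id, bid:int);
big_bids  = FILTER bids BY bid > 50;           -- keep only large bids
dsp_only  = FOREACH big_bids GENERATE dsp_id;  -- project a single field
dsp_ids   = DISTINCT dsp_only;                 -- unique bidders
sorted    = ORDER big_bids BY bid DESC;        -- sort by bid amount
STORE sorted INTO '/data/pig/out/big-bids' USING PigStorage('\t');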

Let's look at an example of how data can be transformed step by step in Pig. We will work with the log file of an RTB exchange. The data looks as follows:
  • time - event time
  • bid_id - bid identifier
  • user_id - user identifier
  • dsp_id - bidder (DSP) identifier
  • bid - bid amount

Pig - data loading (LOAD)

The LOAD function is used for loading; we also specify the separator character '\t' and the data signature (types can be specified where needed).
-- clean the HDFS output directory (Pig supports Hadoop commands)
fs -rm -f -r -skipTrash /data/pig/out
-- load the data into the 'raw_data' relation
raw_data = LOAD '/data/pig/example/' USING PigStorage('\t') AS (time, bid_id, user_id, dsp_id, bid:int);

At the output we get the following structure (a Tuple). In queries, its fields can be accessed through a dot, for example: raw_data.dsp_id
raw_data -> tuples with named fields
------------------------------------------------------------------------------------------- 
time,  bid_id,  user_id,  dsp_id,  bid
------------------------------------------------------------------------------------------- 
(2014.02.14 14:08:27.711,  56949,  45234534553459,  DSP-2,  12)
(2014.02.14 14:08:28.712,  61336,  45221696259999,  DSP-1,  56)
(2014.02.14 14:08:29.713,  74685,  45221699381039,  DSP-2,  89)
(2014.02.14 14:08:30.714,  56949,  45221695781716,  DSP-1,  21)
(2014.02.14 14:08:25.715,  27617,  45221682863705,  DSP-3,  22)

Pig - iterative data processing (FOREACH - GENERATE)
FOREACH - GENERATE allows you to iterate over a data set, applying arbitrary operations to each record, or simply passing certain fields to the output and dropping everything that is not needed.
-- Normalize the data: trim the timestamp with SUBSTRING
norm_data = FOREACH raw_data GENERATE SUBSTRING(time, 0, 10) AS date, dsp_id, bid;

At the output we get the same set, but with a trimmed date and only the fields we need: date, dsp_id, bid.

norm_data -> tuples with named fields and a trimmed date
---------------------------------------
date,   dsp_id,   bid
---------------------------------------
(2014.02.14,   DSP-2,   12)
(2014.02.14,   DSP-1,   56)
(2014.02.14,   DSP-2,   89)
(2014.02.14,   DSP-1,   21)

Pig - data grouping (GROUP)
GROUP allows you to group data, and it produces a rather non-trivial structure.
-- Group by dsp_id and date
group_norm_data = GROUP norm_data BY (dsp_id, date);

At the output we get:
the group as a key, which can be accessed via the 'group' prefix,
and the aggregated collection, accessed via the 'norm_data' prefix.
group_norm_data -> (group as key) : [ (norm_data), (norm_data) ]
----------------------------------------------------------------------------------
 ( group),  bag of norm_data
----------------------------------------------------------------------------------
( (DSP-1,  2014.02.14),  {(2014.02.14,  DSP-1,  56),  (2014.02.14,  DSP-1,  21)} )
( (DSP-1,  2014.02.17),  {(2014.02.17,  DSP-1,  34),  (2014.02.17,  DSP-1,  24)} )
( (DSP-2,  2014.02.14),  {(2014.02.14,  DSP-2,  89),  (2014.02.14,  DSP-2,  12)} )

Pig - flattening aggregates (FLATTEN)
Sometimes it is necessary to unroll aggregates into a flat structure ("straighten" them).
The FLATTEN function exists for this.
-- Unroll the aggregates into a flat structure
ft_group_norm_data = FOREACH group_norm_data GENERATE FLATTEN(group), FLATTEN(norm_data);

From a complex grouped structure we get a flat set of Tuples.
ft_group_norm_data -> tuples with named fields
----------------------------------------------------------------------
dsp_id,  date,  date,  dsp_id,  bid
-----------------------------------------------------------------------
(DSP-1, 2014.02.14, 2014.02.14, DSP-1, 56)
(DSP-1, 2014.02.14, 2014.02.14, DSP-1, 21)
(DSP-1, 2014.02.15, 2014.02.15, DSP-1, 15)
(DSP-1, 2014.02.15, 2014.02.15, DSP-1, 31)
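Note that after FLATTEN the group-key fields and the norm_data fields are duplicated. If that gets in the way, a further FOREACH can project away the duplicates; a minimal sketch:

-- keep a single copy of each field; '::' disambiguates the two sources
clean_data = FOREACH ft_group_norm_data
             GENERATE group::dsp_id AS dsp_id, group::date AS date, norm_data::bid AS bid;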

Pig - aggregation functions (SUM)
Let's compute something, for example the sum of daily bids made by each bidder.
-- Compute the sum of daily bids made by each bidder
sum_bids_dsp = FOREACH group_norm_data GENERATE group, SUM(norm_data.bid) AS bids_sum;


sum_bids_dsp -> group : bids_sum
------------------------------------------------------
   group,   bids_sum
------------------------------------------------------
( (DSP-1, 2014.02.16),    82)
( (DSP-1, 2014.02.17),    58)
( (DSP-2, 2014.02.14),    101)
( (DSP-2, 2014.02.16),    58)

Pig - GROUP ALL
It is often necessary to count the number of records in a data set. Simply applying COUNT to the relation will not work: the data must first be collapsed into a single group, and only then can the aggregation functions be applied.
-- Compute the total sum and the number of groups.
-- To do this, collapse everything into a single group.
group_all = GROUP sum_bids_dsp ALL;


At the output we get the group "all" and a collection of all the previous tuples.
( all, { ((DSP-1,2014.02.14),77), ((DSP-1,2014.02.15),67), ((DSP-1,2014.02.16),82),((DSP-1,2014.02.17),58),((DSP-2,2014.02.14),101),((DSP-2,2014.02.16),58),((DSP-2,2014.02.17),123),((DSP-3,2014.02.14),22),((DSP-3,2014.02.15),109),((DSP-3,2014.02.16),136),((DSP-3,2014.02.17),81) } )

Now let's compute the count and the total sum:
summary = FOREACH group_all GENERATE COUNT(sum_bids_dsp), SUM(sum_bids_dsp.bids_sum);

Output:
------------------------------------------------------
  count,   sum
------------------------------------------------------
(11,         914)
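To persist the results into the HDFS directory we cleaned at the very first step, a final STORE would look like this (a sketch; it reuses the /data/pig/out path from the beginning of the walkthrough):

-- write the per-bidder daily sums to the output directory cleaned earlier
STORE sum_bids_dsp INTO '/data/pig/out' USING PigStorage('\t');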

In my opinion, this is exactly what you need: data processing is laid out in an orderly way, everything breaks easily into steps, and each stage can be annotated with comments.

To sum up:

Pig pros:
  • The procedural approach. Orderliness! The language lets you break logic into blocks, and each step can be described in detail with comments.
  • MapReduce under the hood. Pig takes on all the overhead of the MR plumbing: describing data models, input and output formats, and chains of MR jobs.
  • Interactivity. Good for analyzing data across different slices.
  • Development speed. No dependencies or builds.

Pig cons:
  • Not everything fits into Pig Latin.
  • Pig Latin, together with its data structures, is more complex than HQL.
  • UDFs are written in Jython, which may limit the use of some libraries.

Summary:

  • Hive is good for small, uncomplicated queries. HQL is similar to SQL, so you can start working with this framework very quickly.
  • Pig requires learning its language and data structures. But once you figure it out, you get a more powerful tool that makes complex, multi-stage processing easier to implement, and you end up with simple, orderly code with comments exactly where they are needed.

If you and your colleagues know SQL well, work with it daily, and are not put off by monstrous queries, then Hive is a great solution. However, if you work with SQL only occasionally and your data workflows do not fit into simple queries, then it is definitely worth spending a day and figuring out Pig. Down the road it can save a lot of time for you and your colleagues.
