HPE Vertica and Exasol Analytical Database Engine Performance Comparison Using TPC-H Benchmark

Brief information about Vertica DBMS
Vertica is a relational, column-oriented analytic MPP database. There are plenty of articles on Habr describing its main features (links to some of them are at the end of this article), so instead of repeating them I will mention only a few facts about Vertica that I find interesting:
- Facebook uses Vertica for its internal analytic tasks. Two years ago it was a cluster of hundreds of servers holding dozens of petabytes of data. I could not find up-to-date information about this project; perhaps someone will share a reliable link in the comments.
- Vertica was developed by a team led by Michael Stonebraker (the project was originally called C-Store). It was written from scratch specifically for analytical workloads, drawing on Michael's extensive prior experience (Ingres, Postgres, Informix, and other DBMSs). For a contrast in approaches, recall Vertica's rival Greenplum (now owned by Dell): an MPP DBMS built on a modified PostgreSQL.
- In 2016 Hewlett Packard Enterprise (HPE) sold its software business, including Vertica, to Micro Focus. How this will affect Vertica's development is not yet clear, but I very much hope the deal will not ruin an excellent product.
- In the context of a comparison with Exasol, it is important to note that Vertica is not an in-memory database; moreover, Vertica has no buffer pool at all. The database is designed primarily for processing data volumes that significantly exceed the amount of RAM, and by dropping the buffer cache it frees up a significant share of server resources. At the same time, Vertica makes effective use of the file system, in particular its caching.
TPC-H Benchmark
For those who have not read my previous two articles, I will briefly describe the TPC-H benchmark. It is designed for comparing the performance of analytical systems and data warehouses. The benchmark is used by many vendors of both DBMSs and server hardware. Many results are available on the TPC-H page; to publish results officially one must satisfy all the requirements of the 136-page specification. I was not planning to publish my tests officially, so I did not strictly follow all the rules. Note that the official rankings do not contain a single result for Vertica.
TPC-H lets you generate data for 8 tables with a given scale factor, which determines the approximate data volume in gigabytes. For all the tests whose results I publish here, I limited myself to 2 GB.

The benchmark includes 22 SQL queries of varying complexity. Note that the queries produced by the qgen utility usually have to be adjusted for a specific DBMS, but, like Exasol, Vertica supports the ANSI SQL-99 standard, so the queries for these two DBMSs were completely identical. Two types of load were generated for the test:
- 8 virtual users in parallel, each running all 22 queries 3 times in a loop;
- 2 virtual users in parallel, each running all 22 queries 12 times in a loop.
In both cases, the total execution time of 528 SQL queries was measured.
Test environment
Notebook with the following specifications:
Intel Core i5-4210 CPU @ 1.70 GHz (4 virtual processors); 16 GB DDR3; SSD.
OS:
MS Windows 8.1 x64
VMware Workstation 12 Player
Guest OS: Ubuntu 14.04.4 x64 (memory: 8 GB; processors: 4)
DBMS:
Vertica Analytic Database v7.2.2-1 (single node)
Physical Data Model in Vertica
Both the amount of disk space used and query performance in Vertica depend heavily on the sort order and the compression (encoding) of projection columns. For this reason I ran my tests in several stages. At the first stage only superprojections were created, like this:
CREATE TABLE ORDERS (
    O_ORDERKEY      INT NOT NULL,
    O_CUSTKEY       INT NOT NULL,
    O_ORDERSTATUS   CHAR(1) NOT NULL,
    O_TOTALPRICE    NUMERIC(15,2) NOT NULL,
    O_ORDERDATE     DATE NOT NULL,
    O_ORDERPRIORITY CHAR(15) NOT NULL,
    O_CLERK         CHAR(15) NOT NULL,
    O_SHIPPRIORITY  INT NOT NULL,
    O_COMMENT       VARCHAR(79) NOT NULL)
PARTITION BY (date_part('year', ORDERS.O_ORDERDATE));
The two largest tables, ORDERS and LINEITEM, were partitioned by year. Since the benchmark ran on a single node, no segmentation (sharding) was used. In the later stages the physical structure was optimized with the Database Designer; more on that below.
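For illustration, the sort order, per-column encoding, and (on a multi-node cluster) segmentation of a projection can also be defined by hand. Below is a minimal sketch for the ORDERS table; the projection name, sort columns, and encodings are hypothetical examples, not the ones actually used in this test:

-- A hand-made projection: sorted by date and customer,
-- with explicit per-column encodings, segmented across nodes by hash.
CREATE PROJECTION tpch.orders_by_date
(
    O_ORDERKEY,
    O_CUSTKEY,
    O_ORDERSTATUS ENCODING RLE,   -- low-cardinality column, RLE compresses it well
    O_TOTALPRICE,
    O_ORDERDATE ENCODING RLE,
    O_ORDERPRIORITY,
    O_CLERK,
    O_SHIPPRIORITY,
    O_COMMENT
)
AS SELECT O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE,
          O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT
   FROM tpch.ORDERS
   ORDER BY O_ORDERDATE, O_CUSTKEY
   SEGMENTED BY HASH(O_ORDERKEY) ALL NODES;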
Loading data into Vertica
To load the data from text files, I used the following command:
COPY tpch.lineitem FROM LOCAL 'D:\lineitem.tbl' exceptions 'D:\l_error.log';
Loading all the files took 5 min 21 sec (versus 3 min 37 sec in Exasol). Loaded this way, the data first goes row by row into in-memory WOS containers (default wosdata pool setting: maxmemorysize = 25%), and is then automatically moved to disk into column-oriented ROS containers. I also tried loading from files and from Oracle with the Pentaho DI ETL tool (aka Kettle); it turned out much slower, even with the dedicated Vertica plugin.
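As a side note, for large bulk loads the WOS can be bypassed and the data written straight to ROS using the DIRECT load method, and the memory available to WOS can be adjusted on the wosdata pool. A sketch, reusing the hypothetical file path from above (the pool value is only an example):

-- load straight into ROS containers, bypassing WOS
COPY tpch.lineitem FROM LOCAL 'D:\lineitem.tbl' EXCEPTIONS 'D:\l_error.log' DIRECT;
-- adjust the memory available to WOS (example value)
ALTER RESOURCE POOL wosdata MAXMEMORYSIZE '25%';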
Test results

* In the previous test, query execution in Exasol took significantly less time because of result caching (some queries in the test never change, while for others parameter values are generated). Vertica has no such cache, so to level the playing field I disabled it in Exasol:
alter session set QUERY_CACHE = 'OFF';
Test sequence in Vertica
Stage 1. First run
The first test run was performed right after loading the data into the superprojections, without collecting statistics. The execution time was 581 seconds for 2 sessions and 680 seconds for 8 sessions. On repeated runs the time decreased only slightly (see the table above).
The following table shows how the data was organized in Exasol and Vertica after the initial load:

Query to check the space used on disk and in memory in Vertica:
SELECT
ANCHOR_TABLE_NAME,
PROJECTION_NAME,
USED_BYTES/1024/1024 as USED_Mb,
ROS_USED_BYTES/1024/1024 as ROS_Mb,
WOS_USED_BYTES/1024/1024 as WOS_Mb
FROM PROJECTION_STORAGE
WHERE ANCHOR_TABLE_SCHEMA='tpch'
order by 1,3 desc;
The table shows that Vertica compresses the data a bit better, even though the projections were not created optimally. During testing I also tried optimizing the structure with Database Designer based on the loaded data alone, without supplying the queries. The compression ratio came out to about 6.
Stage 2. Statistics collection
After collecting statistics on the tables, the execution time unexpectedly grew by about 30%. Analysis of the statistics and of the query execution plans showed that for most queries the execution time slightly decreased or stayed the same, but for a couple of queries it increased significantly. Those queries join many tables, including ORDERS and LINEITEM, and the plan with the lower cost actually had the longer run time.
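For reference, statistics in Vertica are collected with the ANALYZE_STATISTICS function. A minimal sketch (table names assume the tpch schema used above):

-- collect statistics for individual tables
SELECT ANALYZE_STATISTICS('tpch.orders');
SELECT ANALYZE_STATISTICS('tpch.lineitem');
-- or for all objects the user has access to at once
SELECT ANALYZE_STATISTICS('');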
Stage 3. Structure optimization with DB Designer
A comprehensive design was created with the “Query performance (larger footprint)” option, based on 21 of the TPC-H benchmark queries (one query is skipped because a view is created for it before execution). The result is the following numbers:

As a result, 9 additional projections were created, yet the total amount of data on disk hardly changed, thanks to the optimized structure (column sort order and different encodings). However, the new structure slowed the problematic queries down even further, and with them the overall execution time.
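For those who prefer to drive Database Designer from SQL rather than from the Management Console or admintools, the sequence looks roughly like the sketch below. The design name and file paths are hypothetical, and the exact function signatures should be checked against the documentation for your Vertica version:

-- create a comprehensive design optimized for query performance
SELECT DESIGNER_CREATE_DESIGN('tpch_design');
SELECT DESIGNER_SET_DESIGN_TYPE('tpch_design', 'comprehensive');
SELECT DESIGNER_SET_OPTIMIZATION_OBJECTIVE('tpch_design', 'query');
-- feed it the tables and the benchmark queries
SELECT DESIGNER_ADD_DESIGN_TABLES('tpch_design', 'tpch.*');
SELECT DESIGNER_ADD_DESIGN_QUERIES('tpch_design', '/home/dbadmin/tpch_queries.sql', 'true');
-- build the design and deploy the proposed projections
SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY('tpch_design',
       '/home/dbadmin/tpch_design.sql', '/home/dbadmin/tpch_deploy.sql');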
Stage 4. Manual optimization
Given my previous experience with Vertica on other data models (mostly star schemas), I expected better results from the database, so I decided to dig deeper into the bottlenecks. The following steps were taken:
- Analysis of statistics and query execution plans via the v_monitor system tables: query_requests, query_plan_profiles, execution_engine_profiles, query_events;
- Analysis of the database's recommendations returned by the ANALYZE_WORKLOAD() function;
- Creation of several additional projections;
- Changes to the general resource pool settings (see the sketch below).
None of this brought significant improvement.
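For completeness, a rough sketch of the kind of statements used in those steps (the pool parameter values are illustrative only, not the ones from the test):

-- the slowest statements according to the profiling data
SELECT request, request_duration_ms
FROM v_monitor.query_requests
ORDER BY request_duration_ms DESC
LIMIT 10;

-- recommendations of the Workload Analyzer for the whole catalog
SELECT ANALYZE_WORKLOAD('');

-- example of tweaking the general resource pool
ALTER RESOURCE POOL general PLANNEDCONCURRENCY 4 MAXCONCURRENCY 4;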
Next, the problematic query, which accounted for about 30% of the total execution time of all queries, was rewritten:
Original query:
select
nation,
o_year,
sum(amount) as sum_profit
from
(
select
n_name as nation,
extract(year from o_orderdate) as o_year,
l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
from
part,
supplier,
lineitem,
partsupp,
orders,
nation
where
s_suppkey = l_suppkey
and ps_suppkey = l_suppkey
and ps_partkey = l_partkey
and p_partkey = l_partkey
and o_orderkey = l_orderkey
and s_nationkey = n_nationkey
and p_name like '%thistle%'
) as profit
group by
nation,
o_year
order by
nation,
o_year desc;
The query rewritten with an optimizer hint:
select
/*+SYNTACTIC_JOIN */
n_name as nation,
extract(year from o_orderdate) as o_year,
sum(l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity) as amount
from lineitem
join orders on o_orderkey = l_orderkey
join partsupp on ps_suppkey = l_suppkey and ps_partkey = l_partkey
join part on p_partkey = l_partkey and p_name like '%thistle%'
join supplier on s_suppkey = l_suppkey
join nation on s_nationkey = n_nationkey
group by
n_name,
extract(year from o_orderdate)
order by 1, 2 desc;
As a result, the query ran about 4 times faster.
In the end, the best test time for 2 sessions, with the structure created by Database Designer and this one query rewritten, was 531 seconds (the very first run without any optimizations took 581 seconds).
I stopped there, since the goal was never to squeeze out the maximum by rewriting queries, changing the model, or other not entirely “fair” methods.
Conclusions
This test once again confirms the rule that you should always choose the tools and implementation options that are optimal for your specific task or project. With the constraints I imposed, the TPC-H benchmark is “inconvenient” for Vertica for the following reasons:
- All of the data fit into RAM, and Vertica is not an in-memory DB;
- The normalized TPC-H model and optimizer mistakes. When ad-hoc query performance is a top priority, a denormalized model (such as a star schema) suits Vertica's presentation/storage layer better. Exasol, on the other hand, copes well even with a normalized model, which I consider a big advantage, since it makes it possible to reduce the number of DWH layers.
Vertica is not overloaded with extra functionality and is relatively simple to develop for and administer, but Exasol is even simpler in this respect and does almost everything for you. Whether flexibility or simplicity is better depends on the specific task.
Vertica and Exasol license prices are comparable, and both offer free editions with limitations. When choosing an analytical DBMS, I would recommend considering both products.
Useful links about Vertica
- Overview of architecture and key features;
- Alexey Konstantinov (ascrus) probably shares more practical experience in Russian than anyone else; thanks to him for that. I recommend all of his publications on Habr and in his blog;
- An introductory article and practical experience from alexzaitsev;
- Choosing a data warehouse methodology for Vertica by Nikolai Golov (azathot). Anchor Modeling (6NF) is an interesting and unexpected choice for Vertica. Nikolai often shares his practical experience at various events, for example at Highload++ or HPE conferences;
- Official on-line documentation with many examples;
- A short article on what Vertica lacks, but what many are looking for.
Thank you for your attention. One of the leaders among analytical databases is Teradata; perhaps it will be a candidate for a future comparison.