New Version of HP Vertica: Excavator (7.2)


    In late October, a new version of HP Vertica was released. The development team continued its glorious tradition of naming big-data releases after construction equipment and gave the new version the code name Excavator.

    Having studied the new features of this version, I think the name was chosen well: everything needed for working with big data in HP Vertica has already been implemented; what remains is to balance and improve what exists, that is, to dig.

    You can see the full list of new features in this document: http://my.vertica.com/docs/7.2.x/PDF/HP_Vertica_7.2.x_New_Features.pdf

    I will briefly go over the changes that are most significant from my point of view.

    Licensing policy changed


    In the new version, the algorithms for calculating the data size counted against the license have been changed:
    • For columnar table data, the 1-byte separator for numeric and date-time fields is no longer taken into account;
    • Data in the flex zone is counted as 1/10 of the size of the loaded JSON.

    Thus, after upgrading to the new version, the licensed size of your storage will decrease, which will be especially noticeable on large data warehouses occupying tens or hundreds of terabytes. A rough illustration of the new accounting is sketched below.
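
    As a back-of-the-envelope illustration (my own sketch, not the official audit algorithm; the row counts and field widths are made up):

        # Rough sketch of the new license accounting; not the official audit code.

        def columnar_license_bytes(num_rows, data_bytes_per_row, num_numeric_datetime_fields):
            # Before 7.2, each numeric/date-time field also cost 1 separator byte.
            old_size = num_rows * (data_bytes_per_row + num_numeric_datetime_fields)
            # In 7.2 the separator byte is no longer counted.
            new_size = num_rows * data_bytes_per_row
            return old_size, new_size

        def flex_license_bytes(loaded_json_bytes):
            # In 7.2 flex-zone data counts as 1/10 of the loaded JSON size.
            return loaded_json_bytes // 10

        old, new = columnar_license_bytes(10**9, 40, 5)
        print(old - new)                   # ~5 GB less counted per billion rows
        print(flex_license_bytes(10**12))  # 1 TB of loaded JSON counts as 100 GB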

    Added official support for RHEL 7 and CentOS 7


    A Vertica cluster can now be deployed on more modern Linux distributions, which I think should delight system administrators.

    Optimized database catalog storage


    The format for storing the database catalog in Vertica had remained much the same for many versions. Given the growth not only of the data in the databases themselves, but also of the number of objects in them and the number of nodes in clusters, it had ceased to be efficient enough for highly loaded data warehouses. The new version optimizes the catalog to reduce its size, which has a positive effect on the speed of synchronizing it between nodes and of working with it during query execution.

    Improved integration with Apache solutions


    Integration with Apache Kafka has been added.

    This solution lets you organize real-time stream loading through Kafka: the integration collects JSON data from the streams and loads it in parallel into the flex zone of Vertica storage. All this makes it easy to set up streaming data loads without expensive software or the resource-intensive development of your own ETL processes. A sketch of such a load follows below.
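
    A minimal sketch of what such a load can look like, run through the new native Python driver (host, table, topic, and broker names are made up; the KafkaSource/KafkaJSONParser spellings follow the 7.2 Kafka integration as I read it, so treat them as approximate):

        import vertica_python  # the new native Python driver (pip install vertica-python)

        conn = vertica_python.connect(host='vertica01', port=5433, user='dbadmin',
                                      password='', database='dwh')
        cur = conn.cursor()
        # Pull 30 seconds of JSON messages from topic 'clicks' (partition 0,
        # starting at the earliest offset) into a flex table.
        cur.execute("""
            COPY public.clicks_flex
            SOURCE KafkaSource(stream='clicks|0|-2',
                               brokers='kafka01:9092',
                               duration=INTERVAL '30 seconds')
            PARSER KafkaJSONParser()
        """)
        conn.close()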

    Support has also been added for loading Avro files from Apache HDFS. Avro is a fairly popular format for storing data on HDFS, and its support was sorely missed before.
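
    A sketch of such a load, assuming the bundled HDFS connector (the Hdfs source) and the Avro flex parser; the webhdfs URL and table name are made up and the spellings are approximate:

        import vertica_python

        conn = vertica_python.connect(host='vertica01', port=5433, user='dbadmin',
                                      password='', database='dwh')
        cur = conn.cursor()
        # Load an Avro file straight from HDFS via webhdfs; favroparser is the
        # Avro parser, Hdfs() the connector source (spellings approximate).
        cur.execute("""
            COPY public.events
            SOURCE Hdfs(url='http://namenode:50070/webhdfs/v1/data/events.avro')
            PARSER favroparser()
        """)
        conn.close()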

    Well, Vertica working alongside Hadoop has become so common among customers that the Hadoop integration no longer requires installing a separate package: it is now included with Vertica itself. Just do not forget to remove the old Hadoop integration package before installing the new version!

    Added drivers for Python


    Python now has its own native, full-featured driver, officially supported by HP, for working with Vertica. Previously, Python developers had to make do with ODBC drivers, which created inconvenience and extra difficulties. Now they can work with Vertica easily and simply.
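
    Connecting now takes a few lines (a minimal sketch; the host and credentials are made up):

        import vertica_python  # pip install vertica-python

        conn_info = {
            'host': 'vertica01',
            'port': 5433,
            'user': 'dbadmin',
            'password': '',
            'database': 'dwh',
        }

        conn = vertica_python.connect(**conn_info)
        cur = conn.cursor()
        cur.execute('SELECT node_name, node_state FROM nodes')
        for node_name, node_state in cur.fetchall():
            print(node_name, node_state)
        conn.close()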

    Improved JDBC driver functionality


    The ability to run multiple queries simultaneously in one session (Multiple Active Result Sets) has been added. When building a complex analytical report with different sections, a session can now launch all the necessary queries at once and receive their data as each one progresses. Data that the session has not yet fetched is cached on the server's side.

    Hash calculation functionality for field values has also been added, analogous to calling the HASH function in Vertica. This makes it possible, even before loading records into a warehouse table, to calculate on which nodes they will be placed according to the specified segmentation key.

    Enhanced management of the cluster node recovery process


    Functionality has been added that lets you set the recovery priority of tables when nodes are being restored. This is useful if you need to balance cluster recovery yourself, determining which tables should be restored first and which are better restored last.

    Added new backup engine functionality


    • Backups can now be made to the local hosts of the cluster nodes;
    • A schema or table can be selectively restored from a full or object-level backup;
    • Using the COPY_PARTITIONS_TO_TABLE function, you can organize shared data storage between several tables of the same structure. After partition data is copied from one table to another, both tables physically reference the same ROS containers of the copied partitions; as the partitions of either table are subsequently changed, each table gets its own version of the changes. This makes it possible to snapshot table partitions into other tables at high speed, with a guarantee that the source table's original data remains intact, and without the cost of storing the copied data on disk (see the sketch after this list);
    • With object recovery, you can specify the behavior when a restored object already exists: Vertica can create it if it is not yet in the database, skip it if it already exists, recreate it from the backup, or create a new object alongside the existing one, prefixing its name with the backup name and date.
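
    For instance, snapshotting two months of a partitioned fact table into a same-structure work table might look like this (the table names and partition keys are made up; COPY_PARTITIONS_TO_TABLE itself is the documented function):

        import vertica_python

        conn = vertica_python.connect(host='vertica01', port=5433, user='dbadmin',
                                      password='', database='dwh')
        cur = conn.cursor()
        # Copy partitions '2015-09'..'2015-10' of fact_sales into fact_sales_snap.
        # After the call both tables reference the same ROS containers, so the
        # snapshot is fast and costs no extra disk until either table changes.
        cur.execute("""
            SELECT COPY_PARTITIONS_TO_TABLE('public.fact_sales',
                                            '2015-09', '2015-10',
                                            'public.fact_sales_snap')
        """)
        print(cur.fetchall())  # the function returns a status message
        conn.close()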


    Improved optimizer performance


    When joining tables using the HASH JOIN method, join processing could take quite a long time if both joined tables had a large number of records: a hash table of values had to be built on the inner join input, and then the outer input was scanned, looking up each row's hash in the created hash table. In the new version, probing of the hash table is parallelized, which should significantly improve the speed of joins using this method.
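
    The idea in miniature (a toy Python sketch of a parallel probe, not Vertica's actual implementation):

        from concurrent.futures import ThreadPoolExecutor

        # Build phase: hash table on the inner input (key -> payload).
        hash_table = {key: key * 10 for key in range(1000000)}
        # Probe phase: the outer input is split into chunks probed in parallel.
        outer_input = list(range(500000, 1500000))

        def probe(chunk):
            # Each worker scans its chunk, looking keys up in the shared hash table.
            return [(k, hash_table[k]) for k in chunk if k in hash_table]

        step = 100000
        chunks = [outer_input[i:i + step] for i in range(0, len(outer_input), step)]
        with ThreadPoolExecutor(max_workers=4) as pool:
            joined = [pair for part in pool.map(probe, chunks) for pair in part]

        print(len(joined))  # 500000 rows matched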

    For query plans, you can now use hints in the query text to script the plan: specify the explicit join order of the tables, the algorithms for joining and segmenting them, and list projections that may or may not be used when executing the query. This gives more flexibility in getting the optimizer to build effective query plans. And so that BI systems can benefit from such optimization on typical queries without having to embed hint descriptions, Vertica adds the ability to save such a scripted plan for a query template. Any session executing a query that matches the saved template will receive the already-described optimal query plan and run with it.
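
    A sketch of saving a plan for a query template (the feature is known as directed queries; the statement spellings below are approximate and the query itself is made up):

        import vertica_python

        conn = vertica_python.connect(host='vertica01', port=5433, user='dbadmin',
                                      password='', database='dwh')
        cur = conn.cursor()
        # Save the optimizer's current plan under a name for this query template
        # (statement spelling approximate).
        cur.execute("""
            CREATE DIRECTED QUERY OPTIMIZER 'sales_by_day_opt'
            SELECT sale_date, SUM(amount)
            FROM public.fact_sales
            GROUP BY sale_date
        """)
        # Once activated, any session whose query matches the template
        # receives the saved plan instead of a freshly built one.
        cur.execute("ACTIVATE DIRECTED QUERY 'sales_by_day_opt'")
        conn.close()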

    To speed up queries with many calculations in computed fields or in conditions, including LIKE predicates, Vertica adds JIT compilation of query expressions. Previously, expressions were interpreted, which greatly degraded the speed of queries containing, for example, dozens of LIKE expressions.

    Extended data integrity check functionality


    Previously, of the constraints declared on tables, Vertica enforced only the NOT NULL condition when loading and changing data. PK, FK, and UK constraints were checked only for single-row DML INSERT and UPDATE statements, and for the MERGE statement, whose execution algorithm directly depends on the integrity of the PK and FK constraints. Violations of any constraint could be found using a special function that returned the list of records violating them.

    The new version can enable checking of all constraints for bulk DML statements and COPY, on all tables or only the ones you need. This lets you implement checks on the cleanliness of loaded data more flexibly and choose between load speed and the convenience of integrity checking. If data arrives in the warehouse from reliable sources and in large volumes, it is reasonable not to enable constraint checks on such tables. If the volume of incoming data is not critical but its cleanliness is in question, it is easier to enable the checks than to implement your own verification mechanism in ETL. A sketch of such a check follows below.
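
    The "special function" mentioned above is ANALYZE_CONSTRAINTS; a post-load check of a suspect table might look like this (the table name is made up):

        import vertica_python

        conn = vertica_python.connect(host='vertica01', port=5433, user='dbadmin',
                                      password='', database='dwh')
        cur = conn.cursor()
        # ANALYZE_CONSTRAINTS returns one row per detected violation
        # (the constraint name plus the offending column values).
        cur.execute("SELECT ANALYZE_CONSTRAINTS('public.dim_customer')")
        for violation in cur.fetchall():
            print(violation)
        conn.close()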

    Deprecation announcements


    Alas, product development always not only adds functionality but also sheds what is outdated. Not much has been declared obsolete in this version of Vertica, but there are a couple of significant announcements worth thinking about:
    • Ext3 file system support
    • Pre-join projection support

    Both points are quite significant for Vertica customers. Those who have worked with this server for a long time may easily have clusters on the old ext3 file system, and I know that many use pre-join projections to optimize queries against constellation schemas. In any case, no specific version has been named for removing support of these features, and I think Vertica customers have at least a couple more years to prepare.

    Summing up my impressions of the new version


    This article lists only half of what was added to Vertica. The scale of the functional expansion is impressive, but I have covered only what is relevant to all data warehousing projects. If you use full-text search, geolocation, advanced security, and other cool features implemented in Vertica, you can read about all the changes at the link I gave at the beginning of the article or in the documentation for the new version of Vertica:
    https://my.vertica.com/docs/7.2.x/HTML/index.htm

    Speaking for myself: having worked on different projects with high-load data warehouses of tens of terabytes on HP Vertica, I rate the changes in the new version very positively. It delivers much of what I wanted, and it eases the development and maintenance of data warehouses.
