UPD: A continuation of this article is available here: habrahabr.ru/company/avito/blog/322510
What this article is about
A year has flown by since work began on the design and implementation of a data warehouse on the Vertica platform.
Habr already has articles about the Vertica DBMS itself; I especially recommend this one: "HP Vertica, the first project launched in the Russian Federation", because its author helped us a great deal at the initial stage. Alex, thanks again.
I would like to talk about the methodology we used to design the physical structure of the warehouse so as to make full use of HP Vertica's capabilities.
This article is devoted to justifying the chosen methodology; in the next one I will talk about the techniques that let us analyze data containing tens of billions of records.
Problem statement
Consider the highly loaded site of a large Russian Internet company (now it can be said: it's Avito ;)).
The company's activity is described by the following figures: ~10 million active users, ~100 million page views per day, about 1 thousand new items posted by users per minute, and ~10 thousand user searches per minute.
A rough estimate of the number of actions to be stored in the warehouse is 100 million new records per day (~100 GB of new data per day).
That is, when building a classic data warehouse that never erases previously received data, the storage volume after 3 months of operation will reach 10 TB of raw data. Big Data as it is.
We needed to build a warehouse that would store at least 6 months of data, allow it to be analyzed and visualized, and lag behind real life as little as possible (by about a minute in the best case).
To settle the question of platform choice right away: the warehouse runs on HP Vertica, an MPP storage system; see the introductory article mentioned at the beginning.
Methodology Choice
Several methodologies are currently popular for building data warehouses:
- First, there is Kimball and building the warehouse as a combination of "stars" (star schemas). One of the most popular methodologies, since it is the one taught in every course that covers data warehousing.
- Second, there is Inmon. More precisely, not so much Inmon as inertia: his approach to warehouse design contains a number of fine theses, such as normalization, but no unambiguous algorithm for transforming a business model into a data model (into DBMS tables). There is always a shortcut, though: take the tables of the source system that feeds the warehouse, copy them over AS IS, tweak them a little, and there is your warehouse. Almost according to Inmon.
- Third, there is Data Vault. A relatively new methodology, but already fairly well known in Russia; there is even a Wikipedia article about it in Russian. A decent option: it offers both an ideology and an algorithm for constructing models.
- Fourth, there is Anchor Modeling. A brand new methodology, sometimes shocking, because it involves storing data in sixth normal form (6NF).
The data warehouse being designed had to satisfy the following interrelated requirements:
- Heterogeneity: the warehouse must accept data from source systems of very different natures (a relational OLTP database and a NoSQL store of free-form JSON documents containing web traffic data).
- Flexibility and extensibility: the warehouse data model cannot be frozen. The company's business keeps expanding, new services are added, new branches open. At any moment the data model may need to be extended to reflect new types of data.
- Historicity: storing only the current state of objects is not acceptable. The warehouse must keep a complete history of changes to all attributes of all entities. For example, if a user posted an ad on the site and then changed its title and price three times, the warehouse should make it possible to evaluate the ad's popularity in each interval between edits.
- Velocity: estimates of the volume of data arriving at the warehouse every hour were given above. Not only must new data be added, old records must also be updated: when a batch of ads arrives, it is necessary not only to load the new ones but also to check whether previously loaded fields have changed.
- Volume: the warehouse must provide deep data retention, which, combined with the arrival-rate estimates, means the data model has to cope with maintaining historicity and flexibility on volumes of billions of records and tens of terabytes.
Based on the analysis of these requirements, the "Star" methodology was discarded first. It provides no standard mechanism for non-destructive model extension: to add new entities (new dimensions, new facts) to the model, the old tables have to be extended, which means modifying the ETL processes.
Thus every extension carries the risk of making a mistake and breaking functionality that already works.
In addition, the Star model handles historical data poorly. Historicity of dimension data is implemented via slowly changing dimensions of type 2 (SCD2); however, this approach leads to a multiple increase in the number of rows in the fact table, and historicity of the facts themselves is extremely difficult to implement in this model.
Inmon's approach is too general; it can serve as a basis for building your own methodology... once all the existing, ready-made, and proven methodologies have been discarded.
The Data Vault methodology looked extremely promising in the context of this task. Data Vault does not divide tables into facts and dimensions; it allows the history of any data field to be tracked independently, with the exception of business keys. Historicity is maintained through slowly changing dimensions of type 2 (SCD2) with two dates: from_date, when a version of the data becomes valid, and to_date, when it expires.
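To make the two-date pattern concrete, here is a minimal SQL sketch of a Data Vault style satellite; the table and column names (sat_item_price, item_hub_id and so on) are purely illustrative, not our actual model:

```sql
-- Illustrative Data Vault satellite: one row per version of the attribute,
-- validity tracked by two dates (slowly changing dimension type 2).
CREATE TABLE sat_item_price (
    item_hub_id INTEGER       NOT NULL,  -- reference to the hub (surrogate of the business key)
    price       NUMERIC(12,2),
    from_date   TIMESTAMP     NOT NULL,  -- version becomes valid
    to_date     TIMESTAMP     NOT NULL   -- version expires ('9999-12-31' for the open version)
);

-- Loading a new version takes two statements:
-- 1) close the currently open version...
UPDATE sat_item_price
   SET to_date = '2014-09-01 12:00:00'
 WHERE item_hub_id = 42
   AND to_date = '9999-12-31';

-- 2) ...and insert the new one.
INSERT INTO sat_item_price (item_hub_id, price, from_date, to_date)
VALUES (42, 1500.00, '2014-09-01 12:00:00', '9999-12-31');

-- Finding the version valid at a given moment is then a simple range predicate:
SELECT price
  FROM sat_item_price
 WHERE item_hub_id = 42
   AND TIMESTAMP '2014-09-01 13:00:00' >= from_date
   AND TIMESTAMP '2014-09-01 13:00:00' <  to_date;
```

It is exactly the UPDATE in step 1 that becomes the bottleneck on an MPP column store, as discussed below.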
Data Vault supports non-destructive model growth: when new business entities or new relationships between existing entities are identified, the Data Vault model can be extended without changing old tables or ETL processes. The old logic simply keeps working regardless of the changes.
Data Vault also greatly simplifies parallelization of ETL processes, because Hubs (entities) and Satellites (entity attributes) are loaded completely independently. Only when loading Links (relationships between entities) can dependencies arise that require some ETL processes to run before others.
All of the above advantages of Data Vault also apply to Anchor Modeling, so the final decision was made on the basis of load experiments with production data in a Vertica DBMS environment.
The experiments showed the following differences between models based on the Data Vault and Anchor Modeling methodologies:
- Maintaining historicity. Data Vault implements slowly changing dimensions of type 2 with two dates: the date a version of the data becomes valid and the date it expires. In other words, when a new version arrives, the old version must be updated to set its expiration date. Anchor Modeling stores only one date, the date a version becomes valid. This approach makes it possible to implement insert-only ETL: data is loaded with inserts alone, without updates. On conventional DBMSs the two-date approach is slightly slower during ETL (one insert plus one update) but much faster when searching for the current version of the data (the two dates allow the current record to be found by checking each version independently for whether the current date falls between them). For MPP databases, however, the situation changes: an update is far slower than an insert (by tens or hundreds of times in the Vertica DBMS). At the same time, the window functions of the ANSI SQL:2003 standard (windowing, the OVER (PARTITION BY ... ORDER BY ...) construct) make it possible to find the current version of the data from the start date alone, with practically no loss of performance compared to two dates (see the first sketch after this list). Thus the inefficiency of the update operation in Vertica makes maintaining historicity according to the Anchor Modeling methodology preferable.
- Maintaining high ETL throughput as volumes grow. Data Vault stores entity attributes grouped into satellite tables: if an entity has 50 attributes with comparable probabilities of change, Data Vault recommends keeping them in a single table. The Anchor Modeling methodology in the same case requires 50 tables, one per attribute. At first glance the Anchor Modeling approach looks redundant, even though it satisfies sixth normal form: assembling the current values of all 50 attributes is only practical through dedicated materialized views, without which collecting the attributes is too much work for an analyst using the warehouse. But how different are the two approaches on Big Data? As an experiment, web traffic records with 20 columns were loaded in parallel into two different structures: a single wide table of 20 columns, corresponding to the Data Vault methodology, and 20 maximally normalized narrow tables, corresponding to Anchor Modeling. In the first week of the experiment the two approaches did not differ noticeably in speed, and Data Vault gave better performance for data analysis (no need to join 20 tables into one data mart). However, after a month of loading, when the number of records in each table exceeded 5 billion, the following results were obtained:
• A regular run of the ETL process loading the new data of the past 20 minutes brought in ~5 million new rows.
• Collecting the data from the source systems, cleaning and enriching it, identical for both methodologies, took about 10 minutes.
• Loading the data into the narrow Anchor Modeling tables ran as 20 independent streams; all of them started at the same time and each finished in 40 seconds to 2 minutes.
• Loading the data into the single wide Data Vault table took 7 to 9 minutes.
• A control run of loading into the wide Data Vault table, without the Anchor Modeling loads running in parallel, showed the same figures of 7 to 9 minutes.
The experiment showed that once the number of rows already stored in a table grows to 5 billion, inserting into the Data Vault model becomes roughly 4 times slower than inserting into the Anchor Modeling model.
After a month of operation, the ETL processes populating the wide Data Vault table needed a total of 17 to 19 minutes to load a 20-minute interval of data from the external systems (versus 11-12 minutes for the Anchor Modeling table structure).
Further degradation in the performance of the ETL processes populating the wide Data Vault table would have broken near-real-time synchronization of the warehouse with the operational data: the warehouse was beginning to lag behind the production systems and would have required moving old historical data from the wide Data Vault table into an archive. The Anchor Modeling model showed no such shortcomings.
- Maintaining manageability and extensibility of the warehouse. Although Vertica is a columnar database and might seem indifferent to the number of columns in a table, the example above is far from the only demonstration that many narrow tables are better than one wide one. For example, over time you may need to repartition a table. Vertica makes this operation straightforward: it creates a copy of the table and repartitions the copy. Simple, it would seem. But repartitioning is usually done in order to drop old historical partitions when disk space runs out, and erasing data row by row in Vertica is an extremely inefficient operation: with millions of rows it is merely possible, with billions it can take days and block all other processes in the system. And when disk space is scarce, there may simply not be enough room to create a copy of a table with a large number of columns; even if there is, repartitioning such a table takes about a day. If, instead of one wide table, many narrow tables of 3-4 columns are used, then even at 50-60 billion rows they can be repartitioned within hours, and the process can be run table by table, which makes it easier for the system to find the space it needs (see the second sketch after this list).
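To illustrate the first point, here is a minimal SQL sketch of an Anchor Modeling style attribute table with insert-only loading; again, the names (item_price_anchor, item_id, from_dttm) are illustrative rather than our production model:

```sql
-- Illustrative Anchor Modeling attribute table: one narrow table per attribute,
-- holding only the anchor key, the value, and the date the value became effective.
CREATE TABLE item_price_anchor (
    item_id   INTEGER       NOT NULL,
    price     NUMERIC(12,2),
    from_dttm TIMESTAMP     NOT NULL
);

-- ETL is insert-only: a new version is simply appended, nothing is updated.
INSERT INTO item_price_anchor (item_id, price, from_dttm)
VALUES (42, 1500.00, '2014-09-01 12:00:00');

-- The current version is recovered with an ANSI SQL:2003 window function,
-- which Vertica executes efficiently even without an expiration date column:
SELECT item_id, price, from_dttm
FROM (
    SELECT item_id, price, from_dttm,
           ROW_NUMBER() OVER (PARTITION BY item_id
                              ORDER BY from_dttm DESC) AS rn
    FROM item_price_anchor
) v
WHERE rn = 1;
```

In practice queries of this kind are hidden behind views or materialized data marts, so analysts do not have to write the window function by hand for every attribute.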
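And for the second point, a sketch of how narrow tables keep partition management cheap. The partition expression and the DROP_PARTITION meta-function shown here follow the Vertica syntax of the versions we worked with; treat the exact names as an assumption and check the documentation for your release:

```sql
-- The same illustrative attribute table, this time created with monthly partitioning.
CREATE TABLE item_price_anchor (
    item_id   INTEGER       NOT NULL,
    price     NUMERIC(12,2),
    from_dttm TIMESTAMP     NOT NULL
)
PARTITION BY EXTRACT(YEAR FROM from_dttm) * 100 + EXTRACT(MONTH FROM from_dttm);

-- Dropping a whole historical month is a partition-level operation,
-- not a row-by-row DELETE of billions of records:
SELECT DROP_PARTITION('public.item_price_anchor', '201409');

-- If the partitioning scheme ever has to change, each narrow 3-4 column table
-- is repartitioned independently (and relatively quickly), for example:
-- ALTER TABLE public.item_price_anchor
--     PARTITION BY EXTRACT(YEAR FROM from_dttm) REORGANIZE;
```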
Conclusions
In brief: the Anchor Modeling methodology turned out to be surprisingly well suited to building Big Data storage on HP Vertica.
The examples above are only part of what we had to deal with, and it almost always turned out that had we not followed the Anchor Modeling path, things would have been much harder for us.
However, I do not suggest that Anchor Modeling by itself will solve all of an architect's problems. The strength and the weakness of this methodology lie in its very rigid structuring of all operations: for the warehouse to evolve, the generation of data models, of the code that creates database objects, and of the ETL procedures must be fully automated.
We solved this with our own Python code, with the metadata described in Excel. As a result the entire infrastructure was built in a month, and two months after first meeting Vertica the warehouse was already solving its first business problems and had reached a volume of 1 TB.
That is about it. If the topic is of interest, I plan to write about other nuances of working with highly normalized storage in HP Vertica, in particular how to emulate the Map-Reduce algorithm on top of SQL when ordinary SQL no longer copes.