Data Warehouse Metrics

Greetings.

When creating, or even just maintaining, a data warehouse, there inevitably comes a stage when the multitude of user wishes runs into the physical limitations of the DBMS used for storage. After all, no one has infinite disk space, processor power, or an arbitrarily long window in which to refresh the data.

At this point, if they have not arisen earlier, management starts asking questions: what exactly takes up so much space in the database, why has the load not finished yet, and so on.

To know what to answer, you need to do some accounting. Building a data warehouse is a long process; the people who designed the architecture may be long gone, not to mention that business requirements change, sometimes as quickly as new versions of the Firefox browser are released.


When my customer asked how much more data we could store and process in the current warehouse, I could only say that I could give an approximate answer, but first I would have to measure everything, that is, collect data warehouse metrics.

I tried to put together a set of basic metrics that should be calculated first, and from which derived metrics can be obtained, which, in turn, could be used for forecasting.

Once this set was ready, I decided to google the topic and found an article by Bill Inmon with a list of metrics that would be nice to have for most data warehouses.

Beginning of translation

Everything has metrics: roads have speed limits, people have weight, days have a temperature, cars have tachometers.

Metrics help us organize our reasoning and make meaningful comparisons. Even if we do not think in terms of metrics as such, we still compare events and the conditions under which they occur. Data warehouses are no exception. We need measurable characteristics if we are going to compare the data warehouses of different companies. It is part of human nature to want criteria by which to describe oneself or one's work and to compare them with the achievements of others.

Given the above, the list of metrics for the data warehouse system could be as follows:

Size
  • In warehouse tables.
  • In warehouse indexes.
  • Volume of the staging area (temporary tables and/or spool).

ETL
  • Number of ETL programs.
  • Run frequency.
  • Amount of data processed by the ETL.
  • Where the ETL programs are executed.

Datamarts
  • Number of datamarts in the warehouse.
  • Amount of data in the datamarts.
  • Data refresh frequency.

Detailed warehouse data
  • Data access speed (bytes per second).
  • Physical record size.

Warehouse structures
  • The number of tables.
  • The number of rows in each table.
  • The average row size for each table.

Data exported from the warehouse
  • Number of rows per table leaving the warehouse.
  • Export frequency per table.
  • Export criteria.

Warehouse queries
  • Number of queries processed per day.
  • Average amount of data per query.

History
  • The period of time for which the warehouse keeps history.

Other
  • Use of data mining.
  • Adaptive datamarts.
  • Decision Support Applications (DSS).
  • ERP applications.
  • ODS.
  • Use of tape or another type of long-term storage.
  • Archive storage.


These are the kinds of metrics you could use to measure your data warehouse.

Of course, this list can be expanded and modified in many ways. One possible improvement is adding a time dimension to these metrics, since it is better to track the values over a period of time. Then, in addition to the size metric itself, you could also calculate how much the size has changed, for example.
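
As a rough sketch, suppose the collected size values are stored in a hypothetical table dwh_size_metrics(metric_date, owner, bytes); then the change in size per owner between collection dates could be computed with an analytic LAG:

-- dwh_size_metrics is an illustrative table holding one row per owner and collection date.
select owner,
       metric_date,
       bytes,
       bytes - lag(bytes) over (partition by owner order by metric_date) as bytes_delta
  from dwh_size_metrics
 order by owner, metric_date;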

Or you can break some tables down by company department. It often makes sense to group data together even if such a grouping does not exist in the physical database tables. Of course, you can also show which DBMSs are used for which tables, for example:
  • Table ABC - Teradata
  • Table DEF - IBM UDB
  • Etc.

So, who can use data warehouse metrics?
  • Database administrators.
  • Database architects.
  • Database developers.
  • Warehouse users.
  • Contracted (outside) administrators.
  • System programmers.
  • And many others...

End of translation


I selected four groups of metrics, considering that such a set would cover the need to collect metrics in our particular project, a multi-tenant data warehouse (Multitenant DWH). These groups are:
  1. Row-count metrics for the tables.
  2. Space metrics (in bytes) for each schema; the warehouse data resides in several Oracle DBMS schemas.
  3. Metrics for the data loading process (ETL).
  4. Metrics for the largest fact table in the warehouse.


Metrics from each group are collected as of a specific date and for each individual owner (tenant) by PL/SQL procedures that run after each loading process.
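
To give an idea of the shape of this, here is a minimal sketch; the object names (dwh_row_metrics, collect_row_metrics) are illustrative, not the actual project objects:

-- Illustrative storage: one row per collection date, owner and table.
create table dwh_row_metrics (
  metric_date  date,
  owner_id     number,
  schema_name  varchar2(30),
  table_name   varchar2(30),
  row_cnt      number
);

-- Skeleton of a collection procedure called after each loading process for a given owner.
create or replace procedure collect_row_metrics(p_owner_id in number) is
begin
  null; -- the actual counting logic goes here (see the row metrics section below)
end;
/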

Let's consider each group in more detail.

Row Metrics

This group of metrics records how many rows each owner had in the warehouse schemas at each point in time the metrics were calculated.

Using these indicators, you can also see the relationship between the different layers of the warehouse: how many tables make up the detailed data, the datamarts, and the staging area; when a new table appeared; or how many reference (dimension) tables there are, on average, per fact table.

It also makes it possible to compare owners quantitatively by the contents of each specific table.

The calculation uses the standard approach: generating and running SQL in a loop over all tables from a given list of schemas.
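
A minimal sketch of that approach (reusing the illustrative dwh_row_metrics table from above and counting per table; a per-owner breakdown would add a filter on the owner column):

declare
  v_cnt number;
begin
  -- Loop over all tables in the warehouse schemas and count their rows with dynamic SQL.
  for t in (select owner, table_name
              from all_tables
             where owner in ('SCHEMA1', 'SCHEMA2'))  -- list of warehouse schemas
  loop
    execute immediate 'select count(*) from ' || t.owner || '.' || t.table_name into v_cnt;

    insert into dwh_row_metrics (metric_date, schema_name, table_name, row_cnt)
    values (trunc(sysdate), t.owner, t.table_name, v_cnt);
  end loop;
  commit;
end;
/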

Volume metrics

For the metrics of the volume occupied by the warehouse tables, I decided to start with the space occupied by the data as reported by the DBMS:

SELECT owner, to_date(v_cur_date, 'RRRR-MM-DD'), sum(bytes)
  from dba_segments
 where owner in ('comma-separated list of schemas')
 group by owner


In addition to this metric, I also want to add the calculation of the space the data occupies on disk. By comparing the two indicators you can derive, for example, a measure of ETL efficiency in terms of disk space usage.
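
One way to estimate the physically allocated space (a sketch, assuming access to the DBA views and that the warehouse tablespaces are known) is to compare the allocated datafile size with the free space per tablespace:

-- Allocated datafile size minus free space gives the space actually used on disk per tablespace.
select df.tablespace_name,
       df.allocated_bytes,
       df.allocated_bytes - nvl(fs.free_bytes, 0) as used_bytes
  from (select tablespace_name, sum(bytes) as allocated_bytes
          from dba_data_files
         group by tablespace_name) df
  left join (select tablespace_name, sum(bytes) as free_bytes
               from dba_free_space
              group by tablespace_name) fs
    on fs.tablespace_name = df.tablespace_name;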

ETL Metrics

The ETL tool used for loading is Oracle Warehouse Builder. The script below collects execution statistics for a given loading process, aggregating the number of processed rows from the subordinate mappings.

select ae.top_level_execution_audit_id, ae.execution_name, ae.created_on, ae.elapse_time,
       dt.Selected, dt.Inserted, dt.Updated, dt.Deleted
from owbsys.All_Rt_Audit_Executions ae,
 (
  select sum(coalesce(au.number_records_selected, 0)) as Selected,
         sum(coalesce(au.number_records_inserted, 0)) as Inserted,
         sum(coalesce(au.number_records_updated, 0))  as Updated,
         sum(coalesce(au.number_records_deleted, 0))  as Deleted
  from OWBSYS.ALL_RT_AUDIT_MAP_RUNS au
  where au.execution_audit_id in (select execution_audit_id
                                  from owbsys.All_Rt_Audit_Executions ae
                                  where ae.top_level_execution_audit_id = <ID of the last run>)
 ) dt
where ae.top_level_execution_audit_id = <ID of the last run>


If you want to track the performance of each individual mapping, you can expand the set of collected data.
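
For example, a per-mapping breakdown could look like the sketch below (the map_name and elapse_time columns of ALL_RT_AUDIT_MAP_RUNS are assumed here, in addition to the record counters used above):

-- Per-mapping statistics for one top-level loading process, slowest mappings first.
select au.map_name,
       au.elapse_time,
       coalesce(au.number_records_selected, 0) as Selected,
       coalesce(au.number_records_inserted, 0) as Inserted,
       coalesce(au.number_records_updated, 0)  as Updated,
       coalesce(au.number_records_deleted, 0)  as Deleted
  from owbsys.All_Rt_Audit_Map_Runs au
 where au.execution_audit_id in (select execution_audit_id
                                   from owbsys.All_Rt_Audit_Executions ae
                                  where ae.top_level_execution_audit_id = <ID of the last run>)
 order by au.elapse_time desc;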

Largest Table Metrics

The largest fact table consists of transactions performed once a month for each active employee of an owner. The following script:
select p_Customer_Num, t.month_start_date, count(distinct f.Person_Id) as uniq, count(f.Person_Id) as t_cnt
from dm.time_dimension t
left outer join dm.trans_fact f on f.time_id = t.id and f.customer = p_Customer_Num
where month_start_date between p_Start_Date and p_End_Date
group by month_start_date;

collects, for each data owner, the number of rows per month plus the number of unique employees per month.

This metric is the anchor point around which the remaining collected metrics can be linked in our BI tool, yielding derived indicators such as the following (a rough sketch of the first one is shown after the list):

  • Average cost in space (GB) of adding a new owner, per 1,000 employees.
  • Average cost in time (hours) of adding a new owner, per 1,000 employees.
  • Average cost to the row-oriented DBMS (millions of rows) of adding a new owner, per 1,000 employees.
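
As an illustration only (the dwh_size_metrics and dwh_fact_metrics tables are hypothetical stand-ins for the collected metrics), the first indicator could be derived roughly like this:

-- Space cost (GB) per 1,000 employees for each owner, using the latest collected values.
select s.owner_id,
       round(s.bytes / power(1024, 3), 2) as size_gb,
       round((s.bytes / power(1024, 3)) / (f.uniq_employees / 1000), 2) as gb_per_1000_employees
  from (select owner_id,
               max(bytes) keep (dense_rank last order by metric_date) as bytes
          from dwh_size_metrics
         group by owner_id) s
  join (select owner_id,
               max(uniq_employees) keep (dense_rank last order by month_start_date) as uniq_employees
          from dwh_fact_metrics
         group by owner_id) f
    on f.owner_id = s.owner_id;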


You can also build a number of nice charts showing the average daily load time, how that time depends on the number of owners processed, and so on, and compare metric values after ETL changes, DBMS improvements or degradations, and the like. This helps to evaluate more objectively the results of improvements in the warehouse-building process and to answer the questions I started this article with.

