Data quality in the data warehouse

The quality of data in the warehouse is an essential prerequisite for obtaining valuable information. Poor quality sets off a negative chain reaction in the long run.
First, trust in the information provided is lost. People begin to use the Business Intelligence applications less, and the potential of those applications remains untapped.
As a result, further investment in the analytical project is called into question.

Responsibility for data quality


Improving data quality is a major undertaking in BI projects. However, it is not the exclusive domain of technical specialists.
Data quality is also shaped by factors such as:

Corporate culture

  • Are employees themselves interested in producing good-quality data?
  • If not, why not? There may be a conflict of interest.
  • Are there corporate rules that define who is responsible for quality?

Processes

  • What data is produced at the end of the process chains?
  • Are the operational systems configured in such a way that users have to resort to workarounds to reflect a real-world situation?
  • Do the operational systems themselves perform data validation and verification? (See the sketch after this list.)
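
As a minimal sketch of what validation at the source could look like (the table and column names here are made up for illustration), declarative constraints in the operational database reject bad records before they ever reach the warehouse:

-- Hypothetical operational table: declarative constraints validate data on entry
CREATE TABLE CUSTOMER (
  CUSTOMER_ID NUMBER        NOT NULL PRIMARY KEY,
  EMAIL       VARCHAR2(100) NOT NULL,
  STATUS      VARCHAR2(10)  NOT NULL,
  -- reject obviously malformed e-mail addresses at the source
  CONSTRAINT CHK_CUSTOMER_EMAIL  CHECK (EMAIL LIKE '%@%'),
  -- allow only known status codes
  CONSTRAINT CHK_CUSTOMER_STATUS CHECK (STATUS IN ('ACTIVE', 'INACTIVE'))
);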

Everyone in the organization is responsible for the quality of data in the reporting systems.

Definition and meaning


Quality is the confirmed fulfillment of customer expectations.

Data quality, however, has no universal definition; it always reflects the context of use. The data warehouse and the BI system serve different purposes than the operational systems from which the data is taken.

For example, in an operational system a customer attribute may not be a required field. In the warehouse, that same attribute may be used as a dimension, where it must be populated. This, in turn, introduces the need to fill it with default values.
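
As a minimal sketch of this idea (staging and dimension names are assumed for illustration), the load step might substitute a default value wherever the source attribute is empty:

-- Hypothetical load step: the source field is optional, but the dimension
-- attribute is mandatory, so empty values are replaced with a default
INSERT INTO DIM_CUSTOMER (DIM_CUSTOMER_ID, REGION)
SELECT
  CUSTOMER_ID,
  NVL(REGION, 'n/a')  -- default value marks "not provided in the source"
FROM STG_CUSTOMER;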

Data warehouse requirements change constantly, and they are usually stricter than those of the operational systems. It can also be the other way around, when detailed information from an operational system does not need to be stored in the warehouse at all.

To make data quality measurable, quality standards must be written down. The people who use the information and figures in their daily work should be involved in writing them. The result of this involvement can be a rule that tells you at a glance whether something is an error or not. The rule should then be captured as a script or code for subsequent automated verification.
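
For example (a sketch with assumed table and column names, in the spirit of the view shown later in this article), a business rule agreed with the users, such as "an order amount is never negative", becomes a query that anyone can run:

-- Business rule agreed with the users: order amounts are never negative.
-- Zero rows returned means the rule holds; any row is an error to investigate.
SELECT ORDER_ID, AMOUNT
FROM FACT_ORDERS
WHERE AMOUNT < 0;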

Data quality improvement


It is impossible to clean up and fix every conceivable error while loading data into the warehouse. Good data quality can only be achieved through the close cooperation of all participants. The people who enter data into the operational systems need to learn which of their actions lead to errors.

Data quality is a process. Unfortunately, many organizations have no strategy for continuous improvement. Many limit themselves to merely storing data and do not use the full potential of their analytical systems. As a rule, 70-80% of a data warehouse budget is spent on data integration, while the control and improvement process remains unfinished, if it is started at all.

Tools


Software tools can help automate the monitoring and improvement of data quality. For example, they can fully automate the technical verification of warehouse structures: field formats, default values, compliance with table and column naming requirements.
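
Such technical checks do not even require the data itself; the database catalog is enough. A minimal sketch for Oracle, assuming a naming convention (invented here for illustration) in which surrogate key columns end in _ID and must be numeric and mandatory:

-- Find columns that look like surrogate keys but violate the
-- assumed convention: *_ID columns must be NUMBER and NOT NULL
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, NULLABLE
FROM USER_TAB_COLUMNS
WHERE COLUMN_NAME LIKE '%\_ID' ESCAPE '\'
  AND (DATA_TYPE != 'NUMBER' OR NULLABLE = 'Y');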

Checking the content is harder. As warehouse requirements change, the interpretation of the data may change as well. The tool itself can turn into a huge project that requires ongoing support.

Tip


Relational databases, on which warehouses are usually built, offer an excellent facility: views. Views can be used to check data quickly if you know the specifics of the content. Every discovered error or data problem can be captured as a query against the database.

In this way, a knowledge base about the content is built up. Such queries should, of course, be fast. As a rule, maintaining views takes less human time than solutions built on persistent tables, and a view is always ready to display the result of its check.
For important reports, the view can include a column with the addressee to notify. It makes sense to use the same BI tools to report on the state of data quality in the warehouse.

Example


The query is written for the Oracle database. In this example, the test returns a numeric value that can be interpreted as needed. The T_MIN and T_MAX values can be used to tune the alarm threshold. The REPORT field was once used as the message text in a commercial ETL product that could not send e-mails properly, so rpad is a workaround (a "crutch").

For a large table you can add, for example, AND ROWNUM <= 10: ten errors are enough to raise an alarm, so there is no need to count them all.

CREATE OR REPLACE VIEW V_QC_DIM_PRODUCT_01 AS
SELECT
  -- the check passes when the measured value falls inside the tolerance band
  CASE WHEN OUTPUT >= T_MIN AND OUTPUT <= T_MAX
  THEN 'OK' ELSE 'ERROR' END AS RESULT,
  DESCRIPTION,
  TABLE_NAME,
  OUTPUT,
  T_MIN,
  T_MAX,
  -- fixed-width message line for the ETL tool (the rpad "crutch")
  rpad(DESCRIPTION,60,' ') || rpad(OUTPUT,8,' ') || rpad(T_MIN,8,' ') || rpad(T_MAX,8,' ') AS REPORT
FROM (-- the test itself
  SELECT
    'DIM_PRODUCT' AS TABLE_NAME,
    'Count of blanks' AS DESCRIPTION,
    COUNT(*) AS OUTPUT,
    0 AS T_MIN,   -- lower tolerance bound
    10 AS T_MAX   -- upper tolerance bound
  FROM DIM_PRODUCT
  WHERE DIM_PRODUCT_ID != -1 -- skip the default (dummy) row
  AND ATTRIBUTE IS NULL );   -- count blank attribute values
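
Individual check views like this one can then be combined into a single overview, as suggested in the tip above. A sketch under the assumption that a second check view V_QC_DIM_PRODUCT_02 exists; the addressee values are, of course, placeholders:

-- Hypothetical master view: collect all checks in one place,
-- with an addressee to notify when a check reports ERROR
CREATE OR REPLACE VIEW V_QC_ALL AS
SELECT RESULT, DESCRIPTION, TABLE_NAME, REPORT,
       'dwh.team@example.com' AS ADDRESSEE
FROM V_QC_DIM_PRODUCT_01
UNION ALL
SELECT RESULT, DESCRIPTION, TABLE_NAME, REPORT,
       'product.owner@example.com' AS ADDRESSEE
FROM V_QC_DIM_PRODUCT_02;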

This publication draws on material from the book:
Ronald Bachmann, Dr. Guido Kemper.
Raus aus der BI-Falle: Wie Business Intelligence zum Erfolg wird.
