azhkhanov March 6, 2019 at 11:39

Operating vs analytic databases: column vs row storage

From the sandbox

Databases can be implemented using Excel, GSheet, or using large ORM systems. In my practice of business analytics, I came across different solutions. And since I came to business analysis from finance and audit, each time I met a new system, I asked myself questions - how do they all differ from each other and what tasks do they solve? Found some answers. This article will cover two main purposes of databases:

1 - operations accounting,
2 - data analysis

The first type of tasks is solved by OLTP systems: from O n L ine T ransaction P rocessing. The second type decide OLAP system from O n L ine A nalytical P rocessing

OLTP

The OLTP storage model can be compared to phone book entries. The row in the table is presented as an index and the corresponding data index: (indexN, data). Therefore, such a table cannot be called a table. It is rather an ordinary book, with numbered lines. If you need to write a new operation to the book, add a line, assign an index and close the book. Labels sticking out from the book that allow you to quickly O (log n), find the desired line and do CRUD.

For operations accounting purposes this is a friendly display. But it is not hostile to data analysis, in which it is not the lines themselves that are important to us, but the calculations based on the contents of these lines. And if you make an analytical query based on the contents of the rows, i.e. for non-indexed fields, then such queries will work more slowly.

Indexing all records, as you know, is not an option. Although the book becomes like a table, as attributes become available for quick search, it also slows down the creation of new and updating existing rows. Because these operations will require re-sorting the entire array.

The tradeoff between OLAP and OLTP

In 1C solutions, a compromise is implemented as follows. Events when writing to the database are written to several places at once. In one place, records have few indices and are optimized for OLTP loads; in another place, records are indexed by all fields and adapted for OLAP loads. Such tables are called accumulation registers and information registers. Since writing to several places increases the occupied space by several times, for saving not all transaction attributes fall into the registers, but only those that are considered important for this section of analytical accounting. Such a compromise is called the ROLAP model, i.e. relational analytic mapping.

OLAP

In SAP, the German counterpart 1C went further. The relational OLTP model in this software can be replicated to the OLAP model. SAP HANA implements a storage column structure. This means that the "tables" are stored there not as a set of rows, but as a set of columns.

A similar storage scheme is implemented in such solutions as Google Bigquery, Microsoft SSAS Tabular, Amazon Redshift, Yandex ClickHouse.

The difference between column storage and row storage

If in a row-wise structure the data is stored in the form of "horizontal" tuples, each of which is a transaction:

period, product, department
(Q1, SKU1, 1)
(Q1, SKU2, 1)
(Q1, SKU1, 1)
...
(Q2, SKU1, 1)
(Q2, SKU1, 1)
(Q3, SKU1, 1)
(Q3, SKU1, 1)
...

Then in the column such data is stored "vertically":

(Q1, Q1, Q1, ... Q2, Q2, Q3, Q3, ...)
(SKU1, SKU2, SKU1, ... SKU1, SKU1, SKU1, SKU1, ...)
(1,1,1, ... 1,1,1,1, ...)

Repetitions can be optimized conditionally as follows:

period = (Q1, {start: 0, count: n}, Q2, {start: n+1; count: m}, ...)
product = (SKU1, {start: 0, count: 1}, SKU2, {start: 1; count: 1}, SKU1, {start: 2; count: m}, ...)
department = (1,{start:0, count:m}...)

If there is a column for which such optimization will not reduce the initial volume, then the data is stored in its original form.

The column table engine itself selects the column sorting sequence, but if you know your data and sort them manually, this often increases compression and eases analytical workloads. My compression of individual tables exceeded 300 times. In practice, such a data storage structure:

allows you to compress data to the level when they are placed in RAM, i.e. make in-memory computations available that are not comparable in speed with queries to relational databases
sets its own rules for building a data model, no longer requiring such normalization as in OLTP
defines its semantics for constructing analytic expressions.

The specifics of expressions are described in detail:
here - for Google BigQuery.
here is for Microsoft DAX.

BI as a column base infrastructure

BI is a solution serving analytical load. And they make life much easier if built on top of column databases. This can be a homemade ClickHouse-Grafana-Python bunch or a Google stack bunch: Bigquery-Data Studio-Dataprep-Dataflow or the monolithic Power BI.

Multidimensional cubes are another OLAP alternative to column storage. But for me, MDX expressions, when compared to SQL in BQ or DAX, are redundant and complex.

Tags: