Columnar and hybrid storage of records in the Teradata DBMS

    Until recently, all DBMSs working with structured data (and not only them) could be divided into two categories: those storing records row-wise and those storing records column-wise. This is a fundamental difference that affects what table rows look like at the level of the DBMS's internal storage mechanisms. For a long time Teradata belonged to the first group, but starting with version 14 it became possible to choose how to store the data of a particular table: as columns or as rows. Thus, hybrid storage appeared. In this article we want to talk about why this is needed, how it is implemented and what benefits it provides.

    What is Teradata Columnar?


    Before we talk about column-wise storage, let's say a few words about how data is usually stored in rows. Take a relational table with columns and rows:

    How do we write this table's data to disk in row format? First we write the first row, then the second, the third, and so on:


    How can we minimize the load on the disk system when reading this table? We can use different access methods:
    1. Index access - if only a few rows need to be read.
    2. Access to individual partitions - if the table is very large (for example, it holds several years of transactions, but we only need the data for the last few weeks). These are row partitions (partition by row).
    3. A full table scan - if a large percentage of the table's rows needs to be read.

    In other words, minimizing disk load rests on reading only the rows we need rather than the entire table.

    But what about the columns? What if a SQL query uses only some of the table's columns rather than all of them? With row-wise storage we read every row from disk in full. If the table has 100 columns and a given SQL query needs only 5 of them, we are forced to read the other 95 columns from disk even though the query does not use them.

    This is where the idea of storing data in columns rather than rows comes in. In the column storage format - columnar - the approach is this: first write the first column to disk, then the second, the third, and so on:


    Splitting the table this way creates column partitions (partition by column). If a query needs only certain columns, we read only the corresponding column partitions from disk, significantly reducing the number of I/O operations needed to fetch the data the SQL query uses.
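To make the difference concrete, here is a toy Python sketch (an illustration only, not Teradata internals; the table and values are invented): the same table serialized row-wise and column-wise, and how much data a query touches when it needs only one column.

```python
# Toy illustration of row-wise vs column-wise layout (not Teradata internals).
table = [
    ("Ivanov", "Moscow", 100),
    ("Petrov", "Sochi", 200),
    ("Sidorov", "Sochi", 300),
]
columns = ["Name", "City", "Amount"]

# Row format: rows are written to disk one after another.
row_layout = [value for row in table for value in row]

# Column format: each column is written contiguously, as its own partition.
column_layout = {
    name: [row[i] for row in table] for i, name in enumerate(columns)
}

# A query that needs only "City" reads 1 of 3 column partitions
# instead of every value of every row.
values_read_row_format = len(row_layout)                 # all 9 values
values_read_column_format = len(column_layout["City"])   # only 3 values
print(values_read_row_format, values_read_column_format)
```

With 100 columns and 5 needed by the query, the same arithmetic gives the 95-columns-of-wasted-reads effect described above.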

    An interesting feature of this approach is that row partitioning and column partitioning can be used simultaneously in the same table, in other words, partitions within partitions. First we go to the needed column partitions, then inside them we read only the needed row partitions.

    Summarizing the above, Teradata Columnar is a data storage method in the Teradata DBMS that allows a table to use two partitioning methods simultaneously:
    • Horizontal partitions - by rows
    • Vertical partitions - by columns

    Benefits of Teradata Columnar


    The benefits are as follows:
    • Increased query performance - by reading only the needed column partitions, eliminating the need to read every row of the table in full. This is exactly where we started the article.
    • Efficient data compression via the automatic compression mechanism. This is an additional pleasant feature that column storage opens up: column data is much more convenient to compress. Some even put this advantage first, and quite reasonably.

    Both of these reduce the load on the disk system. In the first case we read less data, so we perform fewer I/O operations. In the second, the table occupies less disk space, so less I/O is required to read it.

    Reducing the load on the disk system reduces application response times, since queries complete faster. Overall throughput of the Teradata system also increases: individual queries consume a smaller share of the system's I/O capacity, so the system can execute more of them.

    Reading data from a table stored in columns


    Storing data by columns significantly changes the mechanics of reading from such a table. In effect, we need to "assemble" data from separate column partitions to produce the rows the SQL query should return.

    It works as follows. If the query has a WHERE clause on some column, we scan that column first. We filter the data, and for the rows that satisfy the WHERE clause, we go to the neighboring column partitions and read the values of the remaining columns for those rows.

    Example: which customers live in Sochi?
    In the row format (for comparison), the values of all columns are read when filtering on the City column (the WHERE City = 'Sochi' condition).


    In the column format, only the City and Cust_Number column data is read: first we filter on the City column, then we look up the matching Cust_Number values.
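The Sochi example can be sketched in a few lines of Python (a toy model with invented data, not Teradata's scan code): filter one column partition, remember the matching positions, then fetch only those positions from the other partition.

```python
# Toy sketch of a columnar scan: filter one column partition, then fetch
# matching positions from another (illustration only, data is invented).
city = ["Moscow", "Sochi", "Kazan", "Sochi"]   # City column partition
cust_number = [101, 102, 103, 104]             # Cust_Number column partition

# Step 1: scan only the City partition, remember positions that match.
positions = [i for i, c in enumerate(city) if c == "Sochi"]

# Step 2: read only those positions from the Cust_Number partition.
result = [cust_number[i] for i in positions]
print(result)  # [102, 104]
```

No other column partition is touched at all, which is where the I/O savings come from.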


    If the WHERE clause references several columns, the most selective of them is processed first, i.e. the filter that discards the most rows, then the next WHERE column, and so on. After that, all other columns involved in the query are read.

    As for the mechanics of moving between column partitions to "assemble" one row from separate columns, the transition is positional. When reading a specific row, we know that in every column partition the value we need sits at the same Nth position relative to the start of the partition. A special address structure, the rowid, is used to locate a row: it contains the partition number and the row number. Swapping the partition number inside the rowid while keeping the row number the same lets us jump between column partitions and gather all the column values of a single row together.
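A minimal Python model of this positional assembly (it mimics the idea, not Teradata's actual on-disk rowid format; partition numbers and data are invented):

```python
# Toy model of positional row assembly via a rowid (partition_no, row_no).
# Not Teradata's real rowid layout, just the idea.
partitions = {
    1: ["Ivanov", "Petrov"],   # partition 1: Name column
    2: ["Moscow", "Sochi"],    # partition 2: City column
    3: [100, 200],             # partition 3: Amount column
}

def read_value(rowid):
    """Fetch one column value addressed by (partition number, row number)."""
    partition_no, row_no = rowid
    return partitions[partition_no][row_no]

# Assemble row 1 by swapping the partition number in the rowid
# while keeping the row number the same.
row_no = 1
row = [read_value((p, row_no)) for p in sorted(partitions)]
print(row)  # ['Petrov', 'Sochi', 200]
```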

    Hybrid data storage in one table


    And what if the table has many columns? Then assembling rows from the individual column partitions incurs more overhead. This effect can be avoided by creating partitions not for each column but for groups of columns. If we know that certain columns are often used together and rarely separately, we place them in one partition. Inside such a partition, the data of those columns can be stored row-wise, as a kind of sub-table. This storage method is called hybrid: part of the table's data is stored by columns, and part of the data in the same table is stored by rows.
    Example (different table than before):


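A rough Python sketch of the hybrid idea (column names, grouping and values are invented for illustration): columns that are usually queried together share one partition, and inside it their values are stored as mini-rows.

```python
# Toy sketch of hybrid storage (illustration only).
# Suppose Address and Phone are almost always read together,
# so they share one partition, stored row-wise inside it.
single_column_partitions = {
    "Name": ["Ivanov", "Petrov"],
    "Balance": [100, 200],
}
# One partition holds the (Address, Phone) column group as mini-rows.
address_phone_partition = [
    ("Moscow, Tverskaya 1", "+7-495-0001"),
    ("Sochi, Morskaya 5", "+7-862-0002"),
]

# Reading both Address and Phone for row 1 costs a single partition access
# instead of two separate column-partition accesses.
address, phone = address_phone_partition[1]
print(address, phone)
```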
    Note that the term "hybrid storage" can refer not only to the storage method described above, but also to the placement of data blocks on disks of different speeds, which we covered in a separate article on Habrahabr.

    Disk space saving


    We covered partition elimination in somewhat more detail than data compression, so let us give compression its due attention.

    Partition by column means that the values of an individual column sit next to each other, which is very convenient for compression. For example, if a column has few distinct values, you can build a "dictionary" of frequently occurring values and use it to compress the data. Moreover, there can be several such dictionaries for one column: one for each of the "containers" into which the column is split when stored on disk (analogous to data blocks in row storage).

    Conveniently, Teradata builds the dictionaries automatically from the data loaded into the table, and if the data changes over time, the compression dictionaries change with it.

    Besides dictionaries there are other compression methods: run-length encoding, trim compression, null compression, delta encoding, and storing UNICODE as UTF8. We won't go into the details of each of them; we will only note that they can be applied individually or in combination to the same data, and Teradata can dynamically switch the compression mechanism for a column if another one produces a better result.
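Two of these techniques are easy to sketch in Python (toy illustrations of the general ideas, not Teradata's actual implementations; the data is invented):

```python
# Toy sketches of dictionary encoding and run-length encoding
# (general ideas only, not Teradata's implementations).
from itertools import groupby

cities = ["Sochi", "Sochi", "Moscow", "Sochi", "Moscow", "Moscow"]

# Dictionary encoding: store each distinct value once, keep small codes.
dictionary = sorted(set(cities))               # ['Moscow', 'Sochi']
codes = [dictionary.index(c) for c in cities]  # [1, 1, 0, 1, 0, 0]

# Run-length encoding: collapse runs of equal adjacent values
# into (value, run_length) pairs.
rle = [(value, len(list(run))) for value, run in groupby(cities)]
# [('Sochi', 2), ('Moscow', 1), ('Sochi', 1), ('Moscow', 2)]
print(codes, rle)
```

Both work best exactly when like values sit next to each other, which is what column partitions guarantee.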

    The purpose of compression is to reduce the volume of data (in gigabytes) that the table occupies on disk. This lets you store more data (in terms of table rows) on the same hardware, and keep more data on fast SSDs, if present.

    When to use Columnar?


    Teradata Columnar is a great feature that can increase query performance and compress data. However, it should not be seen as a universal solution: the gain appears only for suitable data and SQL queries with certain characteristics. In other cases the effect can even be negative, for example when every query uses all the table's columns and the data compresses poorly.
    Good candidates for column storage are tables with many columns where each SQL query uses relatively few of them. Different queries may use different columns; what matters is that each individual query touches only a small subset. In such cases the number of I/O operations drops significantly and I/O performance improves.

    Splitting row data into separate columns on insert and "assembling" column values back into rows on select consume more CPU for column tables than for ordinary ones. So bear in mind that on systems with a significant shortage of CPU resources (so-called CPU-bound systems), Teradata Columnar should be used with care, as it can reduce overall system performance by exhausting CPU capacity.

    One of the requirements for column tables is that data be loaded in large INSERT-SELECT batches. The reason is this: inserting data row by row is very inefficient for a column table, because instead of a single row write (one I/O operation), as for an ordinary table, the column values must be written to the different column partitions separately, drastically increasing the number of I/O operations. However, this applies to inserting a single row. When many rows are inserted at once, writing that large set of rows is comparable in cost to splitting the same volume of data into columns and writing each column to its column partition.
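A back-of-the-envelope Python sketch of this effect (all numbers are invented for illustration): row-by-row inserts pay one write per column partition per row, while a batch amortizes writes across large blocks.

```python
# Back-of-the-envelope illustration of why column tables prefer batch
# loads. All numbers are invented; this is not a Teradata cost model.
import math

num_columns = 100       # column partitions in the table
rows_per_block = 1000   # rows that fit into one data block

def writes_single_row_inserts(num_rows):
    # Each inserted row touches every column partition separately.
    return num_rows * num_columns

def writes_batch_insert(num_rows):
    # A batch writes each column partition in large blocks.
    return num_columns * math.ceil(num_rows / rows_per_block)

print(writes_single_row_inserts(10_000))  # 1,000,000 block writes
print(writes_batch_insert(10_000))        # 1,000 block writes
```

Under these toy assumptions the batch load needs three orders of magnitude fewer writes, which is the intuition behind the INSERT-SELECT requirement.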

    For the same reason, UPDATE and DELETE operations are expensive in column tables: they must visit several column partitions to do their work. Expensive does not mean impossible, though. If the volume of such changes is relatively small, Teradata Columnar remains quite suitable.

    Another feature of column tables is the absence of a primary index, the column that determines how rows are distributed across AMPs in a Teradata system. Primary index access is the fastest way to retrieve rows in Teradata. Column tables have no such index; the No-Primary-Index (NoPI) mechanism is used instead: Teradata distributes the data evenly itself, but without the possibility of accessing it via a primary index. This means column storage should not be used for tables whose usage pattern relies on a primary index.

    These rules are not absolute; each situation should be analyzed separately. For example, you can create additional indexes to mitigate the absence of a primary index. Or another example: you can create the table in row format and put a join index in column format on top of it (a join index is Teradata's materialized view, and it too can be stored column-wise). Then queries that use many columns go to the table itself, while queries that need only a few columns use the materialized view.

    Conveniently, you do not have to declare all tables columnar: you can create some tables in column format and the rest in row format.
    Summarizing the above, Teradata Columnar is used for tables with the following properties:
    • SQL queries touch only subsets of the table's columns
      OR
      SQL queries touch only subsets of the table's rows
      > The best result is when both hold
    • Data can be loaded with large INSERT-SELECTs
    • Few or no UPDATE / DELETE operations


    If you look closely, many very large tables that grow over time have exactly these properties. Do you want them to take up less disk space and to be queried faster? Then the Teradata Columnar functionality is worth your attention.

    To help those doing physical modeling decide which tables should or should not be created in column format, there is a special tool: the Columnar Analysis Tool. It analyzes how SQL queries use a given table and gives recommendations on the applicability of column storage. There is also, of course, a lot of useful information in the Teradata 14 documentation.
