Advanced Physical Modeling Techniques in Teradata

Continuing our series on physical modeling techniques in the Teradata DBMS, we want, as promised in the previous article, to cover additional techniques that are not directly related to ordinary indexes (although we will touch on a few special index types you should be aware of).

In addition to ordinary indexes, the Teradata DBMS offers a number of specific and sometimes subtle techniques that can be used in physical modeling. Below is a description of most of them; if you have questions or want to know more, we will be happy to answer in the comments.

Join index

This is a special kind of index that is a stored and automatically maintained result of an SQL query. The index gets its name because the query may include a join of tables. It may also contain no joins at all, in which case it is called a Single Table Join Index. This technique is worth considering when there are frequent queries that join several tables and/or perform aggregations. Having received a user query that can be answered from the Join Index data alone, the optimizer will most likely prefer that access path. There are also ways to help the optimizer understand how to join the Join Index with the base table if, for example, the index lacks some of the fields the user requested.
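
As a rough sketch (reusing the Employee and Department tables that appear later in this article; the exact column list is illustrative), a Join Index covering such a join could be created like this:

CREATE JOIN INDEX EmpDept_JI AS
SELECT   E.Employee_Number, E.Last_Name, E.Dept_Number, D.Dept_Name
FROM     Employee E
INNER JOIN Department D
ON       E.Dept_Number = D.Dept_Number
PRIMARY INDEX (Dept_Number);

A query that needs only these columns can then be satisfied from EmpDept_JI without touching the base tables; an aggregate Join Index (with SUM or COUNT in the SELECT list) can likewise pre-compute frequent aggregations.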

A Join Index can be limited not only in width (the number of base table columns it includes) but also in depth. When creating a Join Index, you can specify a WHERE clause so that only the necessary rows are included in the index. This suits queries with very specific filtering conditions: those conditions can be built into the index, reducing its size and the resources spent on maintaining it. Keep in mind, however, that the optimizer will consider such an index only for queries whose filtering conditions are the same as (or stricter than) those used when creating the index.
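
Sketched in the same spirit, such a "narrowed" index (often referred to as a sparse join index) might look like this; the Employee_Status column here is hypothetical:

CREATE JOIN INDEX Active_Emp_JI AS
SELECT   Employee_Number, Last_Name, Dept_Number
FROM     Employee
WHERE    Employee_Status = 'Active'   -- only these rows get into the index
PRIMARY INDEX (Employee_Number);

Only queries whose WHERE clause contains Employee_Status = 'Active' (or a stricter condition) will be able to use it.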

In addition, since a Join Index is almost the same as a table, you can build secondary indexes on it and collect statistics on it.
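
For example, continuing with the hypothetical EmpDept_JI sketched above, something along these lines:

COLLECT STATISTICS ON EmpDept_JI COLUMN (Dept_Number);
CREATE INDEX (Last_Name) ON EmpDept_JI;   -- a NUSI on the join index itself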

Hash Ordered NUSI

By default, a NUSI is distributed across the AMPs in the same way as the base table on which it is built, which gives a uniform distribution and ease of maintenance. Within each AMP, the NUSI rows are ordered by the hash of all the columns included in the index. It is, however, possible to create the index so that it is ordered by the hash of one specific column. Such an index noticeably speeds up queries with equality conditions on that column: the value being compared is hashed, and that hash is quickly located in the index (which is sorted by it).
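
A hedged sketch, with an illustrative column list on the Employee table:

CREATE INDEX (Last_Name, First_Name) ORDER BY HASH (Last_Name) ON Employee;

A query with an equality predicate such as WHERE Last_Name = 'Ivanov' can then probe the index by the hash of the literal.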

Value Ordered NUSI

In addition to ordering by hash, you can have the index ordered by the value of a specific column. Such indexes help queries whose filter conditions are ranges or inequalities. When such an index is created, it is sorted by the given column and stored on disk in that order; the benefit of range lookups on a sorted list hardly needs further comment.
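
Roughly like this (the Hire_Date column is illustrative; the value-ordering column must be a single numeric or DATE column of four bytes or less):

CREATE INDEX (Hire_Date) ORDER BY VALUES (Hire_Date) ON Employee;

-- helps range queries such as:
SELECT  Employee_Number, Last_Name
FROM    Employee
WHERE   Hire_Date BETWEEN DATE '2014-01-01' AND DATE '2014-03-31';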

Intentional “skew” of data

The Shared Nothing MPP system delivers the best performance with the most even distribution of data. This issue has received much attention in this and other articles. But there are times when it makes sense to create a “skew” of data intentionally, and this will positively affect performance.

For example, suppose you have a system with 144 AMPs and a small reference (lookup) table of, say, 200 records distributed more or less evenly. When you select data from this table, all 144 AMPs go to their disks to extract one or two records each. Does it make sense to bother that many AMPs for such a trifling operation? In such situations you can create an artificial data "skew": add a dummy column to the reference table, make it the Primary Index, and fill it with the same value for every row. Then, when the table is accessed, only one AMP does the work - it quickly extracts all the records and redistributes them to the other AMPs. This approach cannot be recommended for every such case; just keep in mind that it is possible.
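
A minimal sketch of such a deliberately skewed lookup table (all names here are hypothetical):

CREATE TABLE Country_Ref
( Dummy_PI    BYTEINT NOT NULL DEFAULT 1   -- the same value in every row: all rows land on one AMP
, Country_Cd  CHAR(2) NOT NULL
, Country_Nm  VARCHAR(100)
)
PRIMARY INDEX (Dummy_PI);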

Referential integrity


Teradata, like other relational databases, supports referential integrity, but it has a number of features worth mentioning. The first is that the columns of the referenced parent table do not have to be declared as a Primary Key: Teradata only requires that they form a unique index, primary or secondary. Defining a PK remains an available option, though, because any uniqueness constraint in Teradata physically translates into a unique index.

Other features relate to individual types of referential integrity.

Standard RI

The standard referential integrity check, executed for each inserted, updated, or deleted row. It is added with the following command:
ALTER TABLE Employee ADD CONSTRAINT fk1 FOREIGN KEY (Dept) REFERENCES Department (Dept);

A peculiarity of this type of referential integrity in Teradata is that when it is added to existing tables, the constraint is created and activated even if its conditions are violated (so operations that would violate referential integrity become impossible). In this case Teradata automatically creates an error table, named after the child table with the suffix "_0", into which it places the child-table rows that did not satisfy the constraint at the moment it was enabled. It is then the user's responsibility to resolve the problems with these rows and drop the error table. This implementation lets you enable and activate referential integrity without waiting for the data-linkage problems to be resolved, while making it easy to detect that problems exist (there is an error table - there are problems).

Batch RI

The implementation of this type of referential integrity in Teradata has no special features. The check is performed when the DML statement completes; a detected violation rolls back the whole transaction; violations existing at the time the constraint is created result in an error. It is added with the following command:

ALTER TABLE Employee ADD CONSTRAINT fk1 FOREIGN KEY (Dept_Number) 	
REFERENCES WITH CHECK OPTION Department (Dept_Number);


Soft RI

This type of referential integrity is specific to Teradata. Any of the types described above helps the optimizer but at the same time increases the load on the system by honestly performing the checks. Soft RI could be called "referential integrity on trust", because adding it to tables does not switch on any data-linkage control mechanism. Creating this constraint merely adds information to the data dictionary about how tables A and B are related. Why would you need this? With good data quality, you give the optimizer more information while avoiding costly referential integrity checks. It is added with the following command:
ALTER TABLE Employee ADD CONSTRAINT fk1 FOREIGN KEY (Dept_Number) 	
REFERENCES WITH NO CHECK OPTION Department (Dept_Number);


What does RI give to the optimizer?

Having a referential integrity constraint defined between tables (of any of the types listed above) allows the optimizer to perform Join Elimination, i.e. to skip the join when executing a query that does not reference columns of the parent table in its SELECT or WHERE clauses. For example, suppose we have the following view:
REPLACE VIEW	EmpDept 
AS SELECT	Employee_Number, Last_Name, First_Name, E.Dept_Number, Dept_Name
FROM		Employee E 
INNER JOIN 	Department D
ON 		E.Dept_Number = D.Dept_Number;

Here Department is the parent table and Employee is the child table, with referential integrity defined between them.

When the following query is executed, the join will not be performed, because the query does not reference any columns of the parent table:
SELECT 	Employee_Number, Last_Name, Dept_Number  
FROM 	EmpDept;

Note: when using Soft RI, be sure of the quality of your data; otherwise join elimination can produce incorrect results.

Data compression


As mentioned above, one of the main goals of optimizing a physical data model is to reduce the number of input/output operations. Compressed data takes up less disk space and requires fewer block reads. The data then has to be decompressed, which consumes additional resources.

Data compression can be implemented both in software and in hardware. We will not consider hardware compression here, as it is outside the topic of physical design; let us dwell on software compression in more detail.

Note that compression makes sense for tables that occupy a really significant amount of space. If a table is small to begin with, you are unlikely to notice any serious benefit from compressing it.

Null Compression

Let us agree right away that in this section an empty value means only NULL. Philosophical debates about NULL being "nothing" rather than an empty value are welcome, but in the comments.

If you have a table with a large percentage of NULLs in fixed-width columns, then, given a large number of rows, it makes sense to consider compressing the NULLs. This is done very simply: in the DDL for creating the table (or in an ALTER TABLE), the keyword COMPRESS is specified after the column attributes. The system then keeps, for each such column, only a special presence bit that effectively says: "there was a NULL here, but we squeezed it out and it is gone."
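
A minimal sketch, with a hypothetical table and columns:

CREATE TABLE Employee_Contact
( Employee_Number   INTEGER NOT NULL
, Termination_Date  DATE COMPRESS      -- COMPRESS with no value list compresses the NULLs
, Fax_Extension     INTEGER COMPRESS
)
PRIMARY INDEX (Employee_Number);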

Compression of specific values

It often happens that, in addition to NULLs, tables contain frequently occurring values. For example, it may be a statistically dominant surname - say, Ivanov. Any column (especially a character one) for which a set of statistically frequent values can be identified is a good candidate for compression. There are less obvious cases too - for example, you decide to keep the client status as text (do not ask why). A client table in a data warehouse usually holds tens of millions of rows. With just a limited set of statuses - "Active", "Inactive", "Blocked" - you can apply compression and cut the cost for tens of millions of rows (in our example, from VARCHAR(16) down to two bits).
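
A sketch of how this might be declared (the table is hypothetical, and multi-value compression of VARCHAR columns assumes a sufficiently recent Teradata release):

CREATE TABLE Client
( Client_Id  INTEGER NOT NULL
, Last_Name  VARCHAR(50) COMPRESS ('Ivanov', 'Petrov', 'Sidorov')    -- statistically frequent surnames
, Status     VARCHAR(16) COMPRESS ('Active', 'Inactive', 'Blocked')  -- the full set of statuses
)
PRIMARY INDEX (Client_Id);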

The list of values for compression can be specified for each column separately. Such lists are stored in the table header, and when the table is accessed, the compressed values (flags) are replaced with the real values. The additional load from this operation is so small that it can be neglected.

When designing models, pay attention to the data demographics that have already been mentioned several times in this article. If a table contains millions of rows and some of its values can be compressed, doing so will speed up queries against the table.

Block Level Compression (BLC)

Block-level compression, or BLC, is a type of compression applied to entire data blocks before they are written to disk. It can be enabled either for the whole system or for individual tables. With this type of compression, the gain in disk space is paid for with additional CPU load, so it is important to understand how well that trade-off suits you.
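
As a hedged sketch only - assuming a release that supports the BLOCKCOMPRESSION table option and using hypothetical names - the table-level setting looks roughly like this:

CREATE TABLE Sales_History, BLOCKCOMPRESSION = MANUAL   -- per-table BLC setting (other values include DEFAULT and NEVER)
( Sale_Id    INTEGER NOT NULL
, Client_Id  INTEGER
, Sale_Amt   DECIMAL(18,2)
, Sale_Dt    DATE
)
PRIMARY INDEX (Sale_Id);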

Once this type of compression is enabled for a table, you cannot simply switch it off, because it affects the physical layout of the blocks: you would have to copy the table aside without compression and then replace the original with that copy.

Column compression

As we described in one of the previous articles (Column and hybrid storage of records in the Teradata DBMS), the Teradata DBMS supports both columnar and row storage of records. When designing a physical data model, you should take the best of each option. When a table uses column storage, automatic compression becomes possible. Its algorithm relies on the fact that if a data block contains values of a single column, the probability of compressing them well is quite high (especially for non-unique values). The values of one column are compressed at the container level (see the article above). If the system determines that compression gives no gain for a given value, it simply does not compress it; if it sees that compression makes sense, it checks which of the built-in compression algorithms gives the greatest gain and uses that one. For example, if you enable automatic compression for a "balance" column, the system itself decides which values to compress and how.

For the designer of the model this means one thing: if you decide to store a table column-wise, you presumably have a reason for it. And if so, analyze the demographics - perhaps automatic compression should be enabled for a number of columns.
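
A hedged sketch of a column-stored table (hypothetical names; the NO PRIMARY INDEX / PARTITION BY COLUMN form assumes a release with Teradata Columnar support, where autocompression of the column containers is applied by default):

CREATE TABLE Account_Balance
( Account_Id  INTEGER
, Balance_Dt  DATE
, Balance     DECIMAL(18,2)
)
NO PRIMARY INDEX
PARTITION BY COLUMN;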

What to choose?

The table below will help you decide on the type of compression.
| | Compression of specific values | Block compression | Column compression |
| --- | --- | --- | --- |
| Ease of use | Easy to apply to well-studied data | Turn it on and forget about it | Turn it on and forget about it |
| Is analysis needed? | You need to analyze the data demographics | You need to weigh the gain in space against the CPU cost | You need to understand for which columns compression makes sense and for which it does not |
| Flexibility | Works for a large number of situations and data types | Can be combined with other types of compression | Can be applied only to tables with column storage |
| Impact | Minimal impact on CPU consumption | Affects the CPU | For each row, it is determined whether it was compressed or not |
| Application area | Replacing specific values | Compresses all data | Compresses a set of columns of the table |

Physical Design Result


The result of physical modeling should be a set of DBMS objects that, at a minimal cost in system resources, provide the required performance for the main user workload:
  • First of all, these are tables (including storage attributes such as compression parameters) and primary indexes (including partitioning parameters), which require no additional expenditure of system resources. Ideally, primary indexes should cover the basic need for access to the rows of the tables for which they are created and maximize the number of scenarios in which tables are joined locally on the AMPs. Creating primary indexes is a mandatory part of physical design for the Teradata platform.
  • Secondly, these are secondary indexes designed to provide alternative access paths to the required table rows. Unlike primary indexes, the system has to spend additional resources to maintain them, so when deciding to create one you must weigh the benefit of using it against the cost of maintaining it. Unused secondary indexes should be removed.
  • Thirdly, these are other objects that need to be created to achieve the desired performance. Such objects include, for example, the various types of join and hash indexes.

Throughout its life, the physical model should undergo regular revision. This is due not only to changes in data demographics but also to changes in the nature of the workload, expansion of the model, and so on.

To summarize


An attentive reader has already understood that physical design of a data model is a process of preparation (before the model is implemented) and of refinement (improvement after implementation). You cannot predict the workload and the data 100%, but 50% is achievable. Next time it will be 60%, and over time probably 80%. Fine-tuning will therefore take less and less time. The skill of designing physical models grows with experience, so the more experiments and projects you carry out, the better you will know how to work with your models in the future.

Learn the basics and don't be afraid to experiment. We hope that the mechanisms of how the Teradata DBMS works with data described in this article, the criteria for choosing indexes, and the physical modeling techniques will be useful for specialists already working with our DBMS and will spark interest in those who have yet to start.

And, as always, we are ready to answer in the comments any questions you may have.
