Joining historical tables

From time to time I have to deal with tasks where I need to join two or more historical tables within an existing DBMS so that the output gives clean historical intervals. What for? So that a report can correctly display data as of the date selected by the user, or so that an application can pull that data in for processing.
Colleagues and fellow developers often run into similar tasks and ask each other for advice on how best to solve them.
In this article I want to share my experience of how various situations of this kind have been solved.

I will mention right away that by the phrase "historical table" I mean a table maintained as SCD Type 2 or SCD Type 6.

We will also be talking mainly about data warehouses, although some of the approaches described below also apply to OLTP solutions. It is assumed that referential integrity is maintained in the tables being joined, either naturally or by generating placeholder records at load time (records that carry the key but default values for the mutable attributes).
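For reference, a minimal sketch of what such a historical (SCD Type 2) table might look like; the table and column names here and in the examples below are purely illustrative, and Oracle syntax is assumed:

-- One row per key per validity interval
CREATE TABLE first_table (
    key_id    NUMBER        NOT NULL,  -- business key
    attr_1    VARCHAR2(100),           -- mutable attribute
    start_dt  DATE          NOT NULL,  -- start of the validity interval
    end_dt    DATE                     -- end of the interval; open rows hold a far-future date or NULL
);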

It would seem simple: there are two tables, each with a key, mutable attributes, and the start and end dates of each record's validity interval. Join them by key and work something out with the intervals. But it is not that simple.
Visually, the task is as follows:

[figure: two historical tables joined by key into a single set of resulting intervals]

How to achieve this result in an optimal way depends on your situation.

Option one - there is not much data, the tables are small, and everything can be read on the fly each time

In this case you can simply create a view that joins the tables to each other, taking the intersection of the intervals into account.

The join condition can be described schematically as follows:

First_Table.Start_dt <= Second_Table.End_dt AND Second_Table.Start_dt <= First_Table.End_dt

At this point the reader may start picturing this in their head and looking for cases that this condition might not cover. Graphically, the possible ways the intervals can intersect look like this:

[figure: possible ways two intervals can intersect]

These cases also hold if you swap "Interval A" and "Interval B". And, of course, the end of either interval can be open. How an open end is modeled (whether a far-future date or NULL is used) is a design question that is not critical for the solution, but when writing the SQL you need to take into account which option your system uses.

Inquisitive minds can verify that the condition covers all of these cases.

But wait: joining two rows from different tables this way gives you two start dates and two end dates, and you have to pick from them. For the resulting interval produced by two intersecting intervals, the boundaries are the later of the two start dates and the earlier of the two end dates. In terms of Oracle functions this is GREATEST(Start_DT) and LEAST(End_DT), respectively. The resulting interval can then be joined to a third table; the result of that join, after calculating the new start and end dates, can be joined to a fourth table, and so on.

Depending on the DBMS used, the SQL ends up more or less verbose, but the result is correct. All that remains is to wrap it in CREATE VIEW and create a resulting data mart, which consumers (reports, applications, users) will query, specifying the business date they are interested in.
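A minimal sketch of such a view for two tables, assuming the illustrative names from above:

CREATE VIEW v_result AS
SELECT
    a.key_id,
    a.attr_1,
    b.attr_2,
    GREATEST(a.start_dt, b.start_dt) AS start_dt,  -- later of the two starts
    LEAST(a.end_dt, b.end_dt)        AS end_dt     -- earlier of the two ends
FROM first_table  a
JOIN second_table b
  ON b.key_id   = a.key_id
 AND a.start_dt <= b.end_dt   -- the intervals intersect
 AND b.start_dt <= a.end_dt;

A consumer then simply filters by the date it needs, for example WHERE :report_dt BETWEEN start_dt AND end_dt.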

Option two - a lot of data

If there is a lot of data and the plain view described in the first option does not meet the performance requirements, there is an alternative: you can calculate the data mart incrementally. Yes, I am talking about saving the results of the mart calculation in a table and pointing data consumers at that table.

When and how should the data be calculated? Each time the load of the detailed data completes.
In this case the task becomes trivial; the main thing is to know the business date of the data that has just been loaded (reporting date, effective date, whatever you call it). Tracking that is really the job of the ETL/ELT framework that controls these processes; what we need is the date (or a discrete set of dates). A procedure is created (stored procedure or ETL/ELT job, that is a matter of religion) that takes a date or a set of dates as input. It then loops over them and executes SQL that joins all the tables by key, with a historical condition of the form WHERE Input_Date BETWEEN Current_Table.Start_DT AND Current_Table.End_DT applied to each table. In most systems I have seen, this kind of SQL works quite fast, because the records are first filtered by a very selective condition and then joined quickly.
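A sketch of the snapshot query for a single input date, with an illustrative bind variable:

SELECT
    a.key_id,
    a.attr_1,
    b.attr_2,
    :input_date AS report_dt
FROM first_table  a
JOIN second_table b
  ON b.key_id = a.key_id
WHERE :input_date BETWEEN a.start_dt AND a.end_dt   -- row of the first table valid on that date
  AND :input_date BETWEEN b.start_dt AND b.end_dt;  -- row of the second table valid on that date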
The results are then written into the mart through a change-detection mechanism: since the mart itself is in essence an SCD Type 2 or SCD Type 6 table, you need to check whether each resulting record has actually changed.
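One possible sketch of that change detection, assuming the snapshot above has been stored in an intermediate table snapshot_tab, that open mart rows carry a far-future end date, and ignoring NULL-safe attribute comparison for brevity:

-- Close the current mart version where the attributes differ from the new snapshot
UPDATE mart_result m
   SET m.end_dt = :input_date - 1
 WHERE m.end_dt = DATE '9999-12-31'
   AND EXISTS (SELECT 1
                 FROM snapshot_tab s
                WHERE s.key_id = m.key_id
                  AND (s.attr_1 <> m.attr_1 OR s.attr_2 <> m.attr_2));

-- Insert a new version for keys that are new or whose attributes have changed
INSERT INTO mart_result (key_id, attr_1, attr_2, start_dt, end_dt)
SELECT s.key_id, s.attr_1, s.attr_2, :input_date, DATE '9999-12-31'
  FROM snapshot_tab s
 WHERE NOT EXISTS (SELECT 1
                     FROM mart_result m
                    WHERE m.key_id = s.key_id
                      AND m.end_dt = DATE '9999-12-31'
                      AND m.attr_1 = s.attr_1
                      AND m.attr_2 = s.attr_2);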

The disadvantage of this option is that a lot of work can be done for nothing. For example, nothing changed in the data for the loaded date, but we still build a snapshot of all the data as of that date and compare it with the mart, only to find out that nothing changed, or that changes affected just 0.1% of the rows.

Option three - the data model is a "snowflake" around one key (or a limited set of keys), and there may even be a lot of data

This option is typical for data warehouses whose model is close to third normal form.
A schematic version of the model looks something like this:

[figure: a snowflake of historical tables around a common key]

With such a scheme you can optimize the calculation and pick, out of the whole set of dates, only those on which something actually changed. How? Very simply: select the distinct combinations of Key_id and Start_dt from all the tables, i.e. a UNION (note: not UNION ALL) of the Key_id, Start_dt pairs. Whether to use UNION, or DISTINCT over a UNION ALL (or a UNION ALL of DISTINCTs), depends on the DBMS and the religion of the developer.
Such a query gives us the set of keys and the dates on which something happened to those keys. This set can then be joined to the tables on equality of Key_id and on the obtained Start_dt falling within the validity interval of the record in the particular table. Many will point out that absolutely all keys will show up in this set, since each of them got into the warehouse at least once. Still, this option can pay off in performance if the physical data model lets you query efficiently by specific Key_id, and when there is a substantial subset of Key_id values that change state often.
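A sketch of this approach for a snowflake of three illustrative tables:

WITH change_points AS (
    SELECT key_id, start_dt FROM table_a
    UNION                                   -- UNION, not UNION ALL: deduplicate the pairs
    SELECT key_id, start_dt FROM table_b
    UNION
    SELECT key_id, start_dt FROM table_c
)
SELECT
    cp.key_id,
    cp.start_dt,
    a.attr_1,
    b.attr_2,
    c.attr_3
FROM change_points cp
JOIN table_a a
  ON a.key_id = cp.key_id AND cp.start_dt BETWEEN a.start_dt AND a.end_dt
JOIN table_b b
  ON b.key_id = cp.key_id AND cp.start_dt BETWEEN b.start_dt AND b.end_dt
JOIN table_c c
  ON c.key_id = cp.key_id AND cp.start_dt BETWEEN c.start_dt AND c.end_dt;

Each resulting row marks the start of a new interval in the mart; the end date can then be derived from the next change point of the same key (for example with the LEAD analytic function) or by running the change-detection load from option two.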

The described option can be implemented in two ways, depending on the DBMS and the data volume. You can do it through a view using a WITH clause (in Oracle you can even add the /*+ materialize */ hint to speed things up), or in two steps through a procedure and an intermediate table. The intermediate table can in fact be a single large (possibly temporary) one, reused for many marts, as long as the key combinations match in data types. That is, in the first step the procedure makes the selection and stores its results in the table; in the second step the table from step one is joined with the detailed data. Step 1.5 can be gathering statistics on the table, if that is justified in the particular case.
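A sketch of the two-step variant with an intermediate table (names are illustrative, Oracle syntax assumed):

-- Step 1: store the change points in an intermediate table
INSERT INTO tmp_change_points (key_id, start_dt)
SELECT key_id, start_dt FROM table_a
UNION
SELECT key_id, start_dt FROM table_b;

-- Step 1.5 (optional): gather statistics so the optimizer knows the real size of the table
-- EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'TMP_CHANGE_POINTS');

-- Step 2: join the intermediate table with the detailed data
SELECT cp.key_id, cp.start_dt, a.attr_1, b.attr_2
  FROM tmp_change_points cp
  JOIN table_a a
    ON a.key_id = cp.key_id AND cp.start_dt BETWEEN a.start_dt AND a.end_dt
  JOIN table_b b
    ON b.key_id = cp.key_id AND cp.start_dt BETWEEN b.start_dt AND b.end_dt;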

There can be many more options, since real life and the imagination of developers and customers are much richer, but I hope that at least the options described here will help someone save time or serve as a basis for even better solutions.
