xtender August 12, 2010 at 10:12

Versioning and data history

When developing databases, it is often required to provide support for versioning and storing the history of objects. For example, an employee may change the position, and the position may, in turn, change the salary - in multidimensional modeling, this is called Slowly changing dimensions (hereinafter SCD) - rarely changing dimensions, that is, dimensions whose non-key attributes tend to change over time. In total, there are 6 main types (methods) of SCD , which determine how the history of changes can be reflected in the model.

Type 0

It consists in the fact that the data after the first hit in the table is never further changed. This method is practically not used by anyone, because It does not support versioning. It is only needed as a zero reference point for the SCD methodology.

Type 1

Type 1 is the usual rewriting of old data with new ones. In its pure form, this method also does not contain versioning and is used only where the story is actually not needed. However, in some DBMSs for this type it is possible to add limited versioning support using the DBMS itself (for example, Flashback query in Oracle) or by tracking changes through triggers.

Advantages:

No redundancy added
Very simple structure

Disadvantages:

Does not store stories

Type 2

This method consists in creating for each version a separate record in the table with the addition of the key attribute field of this version, for example: version number, date of change or date of the beginning and end of the period of existence of the version.

Example:

ID	NAME	POSITION_ID	DEPT	DATE_START	DATE_END
1	Kolya	21	2	08/11/2010 10:42:25	01/01/9999
2	Denis	23	3	08/11/2010 10:42:25	01/01/9999
3	Boris	26	2	08/11/2010 10:42:25	01/01/9999
4	Sheldon	22	3	08/11/2010 10:42:25	01/01/9999
5	a penny	25	2	08/11/2010 10:42:25	01/01/9999

In this example, the default version date is '01 .01.9999 ', instead of which it would be possible to specify, say, null, but then there would be a problem with creating a primary key from ID, DATE_START and DATE_END, and, in addition, the selection condition for a certain date is simplified (" where snapshot_date between DATE_START and DATE_END" instead of " where snapshot_date>DATE_START and (snapshot_date < DATE_END or DATE_END is null)".
With this implementation, when an employee is dismissed, it will be possible to simply change the end date of the current version to the date of dismissal instead of deleting employee records.

Advantages:

Keeps a complete and unlimited version history
Convenient and easy access to data of the required period

Disadvantages:

Provokes for redundancy or the establishment of additional tables for storing mutable dimension attributes
It complicates the structure or adds redundancy in cases where analytics will need to reconcile the data in the fact table with specific versions of the dimension and the fact may not be consistent with the current version of the dimension for this fact. (For example, the client’s revision or address has changed, but operation / delivery according to old values)

Type 3

The record itself contains additional fields for previous attribute values. When new data is received, the old data is overwritten with the current values.

	ID	UPDATE_TIME	LAST_STATE	CURRENT_STATE
1	1	08/11/2010 12:58:48	0	1
2	2	11/11/2010 12:29:16	1	1

Advantages:

Small amount of data
Easy and quick access to history

Disadvantages:

Limited story

Type 4

The history of changes is contained in a separate table: the main table is always overwritten by current data with the transfer of old data to another table. Usually this type is used to audit changes or create archive tables (as I said, in Oracle the same 4th type can be obtained from the 1st using flashback archive). The subtype or hybrid of this option (with the second type), as it seems to me, should be considered partitioning based on the current version with the permitted movement of lines, but this is already beyond the scope of modeling and most likely relates to administration.

Example:

select * from emp

ID	NAME	POSITION_ID	DEPT
1	Kolya	21	2
2	Denis	23	3
3	Boris	26	2
4	Sheldon	22	3
5	a penny	25	2

select * from emp_history

ID	NAME	POSITION_ID	DEPT	DATE
1	Kolya	21	1	11/11/2010 14:12:13
2	Denis	23	2	11/11/2010 14:12:13
3	Boris	26	1	11/11/2010 14:12:13
4	Sheldon	22	2	11/11/2010 14:12:13

Advantages:

Quick work with current versions

Disadvantages:

Separation of a single entity into different tables

Hybrid Type / Type 6 (1 + 2 + 3)

Type 6 was invented by Ralph Kimball as a combination of the above methods and is designed for situations that they do not take into account or for the convenience of working with data. It consists in introducing additional redundancy: type 2 is taken as a basis, a surrogate attribute is added for an alternative version overview (type 3), and one or all previous versions (type 1) are overwritten.
Example:

VERSION	ID	NAME	POSITION_ID	DEPT	DATE_START	DATE_END	Current
1	1	Kolya	21	2	08/11/2010 10:42:25	01/01/9999	1
1	2	Denis	23	3	08/11/2010 10:42:25	01/01/9999	1
1	3	Boris	26	2	08/11/2010 10:42:25	08/11/2010 11:42:25	0
2	3	Boris	26	2	08/11/2010 11:42:26	01/01/9999	1

In this example, for example, adding a surrogate key adds the ability to refer from the fact tables to a specific version of the dimension, which may not belong to the time of the fact itself, and the indicator of the current version can help to partition by current versions (although it would be better to call sectioning by the latest version, as the version can become outdated without changing the record itself). However, the indicator of the current version can be created both as a virtual computable field, without impairing normalization, if it is necessary in the table (if the DBMS supports such fields, they appeared in Oracle in version 11), and as a field in the view from this table.
In general, any combination of the main types of SCD refers to the hybrid type, so both their disadvantages and advantages depend on your particular implementation, but one thing is certain - the choice of a hybrid type can be determined only by the complexity of your model and almost always (in any case, I don’t know cases where it can be otherwise), you can do with the main 4 types.

Let me add some tips on SCD implementations:

Try to implement a mechanism for changing records in stored procedures - it is categorically undesirable for the change code to be scattered in different places, even if the change code is stored in clearly defined places in your external application;
If you want to make a smooth transition from the 1st model to the second, you can do this:
1) change the table according to type 2 SCD with renaming, for example, in table_name_scd2
2) create an updated view with the name of the old table that will output data in that same structure as the old table;
3) if you do not make all changes in stored procedures (I hope this is temporary :)) that you have already changed, then create triggers that will populate new fields if they are not set by the request (when: new.start_date is null. ..) and log it in order to then make sure that you have changed everything
In cases of using the start and end fields of the version, in addition to using the primary key, which includes the identifier of the object and the start and end dates of the version, you will need to control the integrity by creating a restriction on non-intersection of version dates. It’s very good if your DBMS supports check constraints based on non-deterministic functions that allow you to do this (by the way, I would like to know which DBMSs support this), but if it is not, then you can check the condition in the trigger before creating or changing and throw an exception in case of violation. Example for Oracle:
```
create or replace trigger T_EMP_CHECK
 before insert or update on emp 
 for each row
declare
    f_ok number;
begin
 select count(*) into f_ok
 from dual
 where exists( select null 
               from emp e 
               where 
                    e.id         =  :new.id
                and e.date_start <= :new.date_end
                and e.date_end   >= :new.date_start
             );
 if f_ok>0 then
    raise DUP_VAL_ON_INDEX;
 end if;
end T_EMP_CHECK;
```
When switching from type 1 to type 4, you just need to create a before update trigger, in which you will store the records in a new table for archive records

PS. Habralyudi, share what interesting hybrid implementations you met?

Tags:

Versioning and data history

Type 0

Type 1

Advantages:

Disadvantages:

Type 2

Advantages:

Disadvantages:

Type 3

Advantages:

Disadvantages:

Type 4

Advantages:

Disadvantages:

Hybrid Type / Type 6 (1 + 2 + 3)

Also popular now: