Versioning and data history

    When developing databases, it is often required to provide support for versioning and storing the history of objects. For example, an employee may change the position, and the position may, in turn, change the salary - in multidimensional modeling, this is called Slowly changing dimensions (hereinafter SCD) - rarely changing dimensions, that is, dimensions whose non-key attributes tend to change over time. In total, there are 6 main types (methods) of SCD , which determine how the history of changes can be reflected in the model.



    Type 0


    It consists in the fact that the data after the first hit in the table is never further changed. This method is practically not used by anyone, because It does not support versioning. It is only needed as a zero reference point for the SCD methodology.

    Type 1


    Type 1 is the usual rewriting of old data with new ones. In its pure form, this method also does not contain versioning and is used only where the story is actually not needed. However, in some DBMSs for this type it is possible to add limited versioning support using the DBMS itself (for example, Flashback query in Oracle) or by tracking changes through triggers.

    Advantages:
    • No redundancy added
    • Very simple structure

    Disadvantages:
    • Does not store stories


    Type 2


    This method consists in creating for each version a separate record in the table with the addition of the key attribute field of this version, for example: version number, date of change or date of the beginning and end of the period of existence of the version.

    Example:
    IDNAMEPOSITION_IDDEPTDATE_STARTDATE_END
    1Kolya21208/11/2010 10:42:2501/01/9999
    2Denis23308/11/2010 10:42:2501/01/9999
    3Boris26208/11/2010 10:42:2501/01/9999
    4Sheldon22308/11/2010 10:42:2501/01/9999
    5a penny25208/11/2010 10:42:2501/01/9999


    In this example, the default version date is '01 .01.9999 ', instead of which it would be possible to specify, say, null, but then there would be a problem with creating a primary key from ID, DATE_START and DATE_END, and, in addition, the selection condition for a certain date is simplified (" where snapshot_date between DATE_START and DATE_END" instead of " where snapshot_date>DATE_START and (snapshot_date < DATE_END or DATE_END is null)".
    With this implementation, when an employee is dismissed, it will be possible to simply change the end date of the current version to the date of dismissal instead of deleting employee records.

    Advantages:
    • Keeps a complete and unlimited version history
    • Convenient and easy access to data of the required period

    Disadvantages:
    • Provokes for redundancy or the establishment of additional tables for storing mutable dimension attributes
    • It complicates the structure or adds redundancy in cases where analytics will need to reconcile the data in the fact table with specific versions of the dimension and the fact may not be consistent with the current version of the dimension for this fact. (For example, the client’s revision or address has changed, but operation / delivery according to old values)


    Type 3


    The record itself contains additional fields for previous attribute values. When new data is received, the old data is overwritten with the current values.

       IDUPDATE_TIMELAST_STATECURRENT_STATE
    1108/11/2010 12:58:4801
    2211/11/2010 12:29:1611
    Advantages:
    • Small amount of data
    • Easy and quick access to history

    Disadvantages:
    • Limited story


    Type 4


    The history of changes is contained in a separate table: the main table is always overwritten by current data with the transfer of old data to another table. Usually this type is used to audit changes or create archive tables (as I said, in Oracle the same 4th type can be obtained from the 1st using flashback archive). The subtype or hybrid of this option (with the second type), as it seems to me, should be considered partitioning based on the current version with the permitted movement of lines, but this is already beyond the scope of modeling and most likely relates to administration.

    Example:
    select * from emp
    

    IDNAMEPOSITION_IDDEPT
    1Kolya212
    2Denis233
    3Boris262
    4Sheldon223
    5a penny252


    select * from emp_history
    

    IDNAMEPOSITION_IDDEPTDATE
    1Kolya21111/11/2010 14:12:13
    2Denis23211/11/2010 14:12:13
    3Boris26111/11/2010 14:12:13
    4Sheldon22211/11/2010 14:12:13

    Advantages:
    • Quick work with current versions

    Disadvantages:
    • Separation of a single entity into different tables


    Hybrid Type / Type 6 (1 + 2 + 3)


    Type 6 was invented by Ralph Kimball as a combination of the above methods and is designed for situations that they do not take into account or for the convenience of working with data. It consists in introducing additional redundancy: type 2 is taken as a basis, a surrogate attribute is added for an alternative version overview (type 3), and one or all previous versions (type 1) are overwritten.
    Example:
    VERSIONIDNAMEPOSITION_IDDEPTDATE_STARTDATE_ENDCurrent
    11Kolya21208/11/2010 10:42:2501/01/99991
    12Denis23308/11/2010 10:42:2501/01/99991
    13Boris26208/11/2010 10:42:2508/11/2010 11:42:250
    23Boris26208/11/2010 11:42:2601/01/99991


    In this example, for example, adding a surrogate key adds the ability to refer from the fact tables to a specific version of the dimension, which may not belong to the time of the fact itself, and the indicator of the current version can help to partition by current versions (although it would be better to call sectioning by the latest version, as the version can become outdated without changing the record itself). However, the indicator of the current version can be created both as a virtual computable field, without impairing normalization, if it is necessary in the table (if the DBMS supports such fields, they appeared in Oracle in version 11), and as a field in the view from this table.
    In general, any combination of the main types of SCD refers to the hybrid type, so both their disadvantages and advantages depend on your particular implementation, but one thing is certain - the choice of a hybrid type can be determined only by the complexity of your model and almost always (in any case, I don’t know cases where it can be otherwise), you can do with the main 4 types.

    Let me add some tips on SCD implementations:
    • Try to implement a mechanism for changing records in stored procedures - it is categorically undesirable for the change code to be scattered in different places, even if the change code is stored in clearly defined places in your external application;
    • If you want to make a smooth transition from the 1st model to the second, you can do this:
      1) change the table according to type 2 SCD with renaming, for example, in table_name_scd2
      2) create an updated view with the name of the old table that will output data in that same structure as the old table;
      3) if you do not make all changes in stored procedures (I hope this is temporary :)) that you have already changed, then create triggers that will populate new fields if they are not set by the request (when: new.start_date is null. ..) and log it in order to then make sure that you have changed everything
    • In cases of using the start and end fields of the version, in addition to using the primary key, which includes the identifier of the object and the start and end dates of the version, you will need to control the integrity by creating a restriction on non-intersection of version dates. It’s very good if your DBMS supports check constraints based on non-deterministic functions that allow you to do this (by the way, I would like to know which DBMSs support this), but if it is not, then you can check the condition in the trigger before creating or changing and throw an exception in case of violation. Example for Oracle:
      create or replace trigger T_EMP_CHECK
       before insert or update on emp 
       for each row
      declare
          f_ok number;
      begin
       select count(*) into f_ok
       from dual
       where exists( select null 
                     from emp e 
                     where 
                          e.id         =  :new.id
                      and e.date_start <= :new.date_end
                      and e.date_end   >= :new.date_start
                   );
       if f_ok>0 then
          raise DUP_VAL_ON_INDEX;
       end if;
      end T_EMP_CHECK;
      

    • When switching from type 1 to type 4, you just need to create a before update trigger, in which you will store the records in a new table for archive records


    PS. Habralyudi, share what interesting hybrid implementations you met?

    Also popular now: