About the strategy and format of data storage in the Hadoop era
Data storage strategy
Modern computing makes it possible to store practically unlimited amounts of data. As a result, the need to delete data in order to free up space for new data has virtually disappeared.
This brings a number of advantages. At one end, it preserves the natural relationship between data and the objects they describe: nature has conservation laws, and the same should hold for data that reflects natural objects. At the other end, it eliminates purely technological problems of keeping data consistent over time.
Thus, the storage strategy should be based on the paradigm of “soft” deletion: data is not removed but marked as having lost relevance from a certain point in time.
Exactly the same applies to data changes. Updates should not overwrite previous data, but record that, starting from a certain point in time, the data has different values.
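To make the idea concrete, here is a minimal sketch of an append-only log in which updates and deletions are just new records. The names `Record`, `value_at` and the integer timestamps are my own assumptions for illustration, not part of any particular storage system.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass(frozen=True)
class Record:
    ts: int                # point in time from which this record applies
    key: str               # identifier of the described object
    value: Any = None      # value in force from ts onwards
    deleted: bool = False  # soft-delete marker: the key loses relevance at ts


def value_at(log: List[Record], key: str, ts: int) -> Optional[Any]:
    """Return the value of key as of time ts, or None if it is absent or deleted."""
    current = None
    for rec in sorted(log, key=lambda r: r.ts):
        if rec.key == key and rec.ts <= ts:
            current = None if rec.deleted else rec.value
    return current


log = [
    Record(ts=1, key="price", value=100),
    Record(ts=5, key="price", value=120),     # update: appended, nothing overwritten
    Record(ts=9, key="price", deleted=True),  # soft delete: history stays readable
]
```

Since nothing is overwritten, `value_at(log, "price", 3)` can still answer with the value that was in force before the update and before the deletion.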
If there is a strong desire to free up space by clearing unused data from the storage, a compaction strategy can be applied: make a copy of the storage and write into it only the data that was current at a chosen point in time in the past.
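A rough sketch of such compaction, under my own assumptions: records are hypothetical `(ts, key, value, deleted)` tuples, and records later than the cutoff are carried over unchanged so that recent history is not lost.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical record layout: (ts, key, value, deleted)
Record = Tuple[int, str, Any, bool]


def compact(log: List[Record], cutoff: int) -> List[Record]:
    """Build a compacted copy of the log.

    Up to the cutoff only the record that was current for each live key is
    kept; records after the cutoff are copied as they are.
    """
    state: Dict[str, Record] = {}
    tail: List[Record] = []
    for rec in sorted(log, key=lambda r: r[0]):
        ts, key, _value, deleted = rec
        if ts > cutoff:
            tail.append(rec)      # newer than the cutoff: keep full history
        elif deleted:
            state.pop(key, None)  # not current at the cutoff any more
        else:
            state[key] = rec      # latest value seen so far for this key
    return sorted(state.values(), key=lambda r: r[0]) + tail


log = [
    (1, "a", 10, False),
    (2, "a", 11, False),
    (3, "b", 5, False),
    (4, "b", None, True),
    (6, "a", 12, False),
]
compacted = compact(log, cutoff=5)
```

The original log is left untouched; the compacted copy can replace it once it has been verified.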
This reasoning is nothing new; it has already been implemented in Hadoop-type big data storages.
Data storage format
Data describing an entity is, as a rule, a set of attributes whose composition reflects the relevant characteristics of that entity. For simplicity, we will assume a relational model consisting of tuples.
Thus, the data is stored as tuples of a certain type, which may change over time and lose relevance.
We also take into account that modern big data storages often have a key-value structure with a primary index on the key and optional indexes on other attributes.
With these considerations in mind, the following data storage format is proposed.
I should note right away that this format is not original: it is inspired by the data storage structure of the 1C objects called “Register”. The difference is that here the format is made universal, so that all data is stored in it.
So, a format is proposed for records about entities and their attributes. It is built on the concept of a workflow and on the following definitions:
- An operation is an atomic change of a single data entity.
- An entity consists of attributes.
- An entity has a type that determines the composition of its attributes.
- Entities of the same type are stored in the same stream.
- A workflow is a table-like storage object that holds the operations that relate to entities of the same type and change their state.
Accordingly, each operation consists of an operation header and a set of attributes that depend on the type of entity:
- OpID - unique identifier of the operation
- OpTS - timestamp of the operation
- OpType - type of operation
- OpClass - stream name
- OpUser - user of the system that issued the command
- OpDoc - transaction document, that is, the document that created the operation; it may be unset
- OpComment - comment on the operation
- ID - identifier of the entity to which the operation belongs
- Parameters - stream-dependent operation attributes
OpID and ID can be of any type, but for now it may make sense to use a UID.
OpTS should most likely be of a timestamp type, supplemented with an ordinal index for the case when several operations fall into the same time interval, so that the order of operations is unambiguous.
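One possible way to obtain such a pair is sketched below. The `OpClock` class and its `stamp` method are hypothetical names of mine; the clock source is passed in so that the example does not depend on the real time.

```python
from typing import Callable, Optional, Tuple


class OpClock:
    """Issues (timestamp, ordinal) pairs that are strictly increasing."""

    def __init__(self, now: Callable[[], int]):
        self._now = now                      # clock source, e.g. milliseconds
        self._last_ts: Optional[int] = None
        self._seq = 0

    def stamp(self) -> Tuple[int, int]:
        ts = self._now()
        if self._last_ts is not None and ts <= self._last_ts:
            # Same time interval (or the clock stepped back): reuse the
            # last timestamp and advance the ordinal index instead.
            ts = self._last_ts
            self._seq += 1
        else:
            self._last_ts, self._seq = ts, 0
        return (ts, self._seq)


ticks = iter([100, 100, 100, 101])           # three operations share interval 100
clock = OpClock(lambda: next(ticks))
stamps = [clock.stamp() for _ in range(4)]
```

Comparing the pairs as tuples gives an unambiguous order even for operations that fall into the same interval.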
OpType can be of any type, for example one or several characters, or a number.
OpClass, OpUser and OpComment can be either strings or references to a directory.
OpDoc provides a link to the document, but it may be absent. This is a link to the top level.
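The operation header described above could be sketched as a record type like the following. The Python field names, the use of UUID strings for OpID and the separate `op_seq` field for the ordinal index are my assumptions; the format itself does not prescribe them.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Operation:
    op_type: str                        # OpType: here a single character
    op_class: str                       # OpClass: stream name
    entity_id: str                      # ID of the entity the operation belongs to
    params: Dict[str, Any] = field(default_factory=dict)  # Parameters
    op_user: Optional[str] = None       # OpUser
    op_doc: Optional[str] = None        # OpDoc: may be absent
    op_comment: str = ""                # OpComment
    op_id: str = field(default_factory=lambda: str(uuid.uuid4()))  # OpID
    op_ts: datetime = field(default_factory=lambda: datetime.now(timezone.utc))  # OpTS
    op_seq: int = 0                     # ordinal index within one OpTS interval

    def order_key(self):
        """Key that sorts operations by time, then by ordinal index."""
        return (self.op_ts, self.op_seq)


op = Operation(op_type="A", op_class="customer", entity_id="c-1",
               params={"name": "Ann"}, op_user="admin")
```

The record is frozen on purpose: once an operation is stored, it is not supposed to be edited in place.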
Operations are divided into basic and service.
Basic operations
There are three basic operations: add, update, and delete.
- Operation “A” (add) records the creation of a new entity of a certain type and sets a set of attributes.
- Operation “U” (update) records a change of an entity of a certain type and sets new values for a certain set of attributes.
- Operation “D” (delete) records the end of the existence of an entity of a certain type.
Operations A and U need not set all attributes, only some of them. Attributes that an operation does not set may hold NULL or some other special “not set” value. No such special value exists at the moment, but it would be good to introduce one.
As a result, obtaining the actual attribute values of an entity at a specific point in time requires searching backwards through its operations and, for each attribute, taking the most recent value that differs from the special (not set) value.
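A minimal sketch of this backward search for a single entity follows. It considers only A and U operations; the `NOT_SET` sentinel, the `(op_ts, op_type, params)` tuple layout and the function name are assumptions made for the example.

```python
from typing import Any, Dict, List, Tuple


class _NotSet:
    """Special value meaning 'this operation did not set the attribute'."""

    def __repr__(self) -> str:
        return "NOT_SET"


NOT_SET = _NotSet()

# Hypothetical layout of one operation of a single entity.
Op = Tuple[int, str, Dict[str, Any]]  # (op_ts, op_type, params)


def attributes_at(ops: List[Op], attrs: List[str], ts: int) -> Dict[str, Any]:
    """Resolve attribute values as of ts by walking the operations backwards."""
    result: Dict[str, Any] = {name: NOT_SET for name in attrs}
    missing = set(attrs)
    for op_ts, op_type, params in sorted(ops, key=lambda o: o[0], reverse=True):
        if op_ts > ts or op_type not in ("A", "U"):
            continue
        for name in list(missing):
            value = params.get(name, NOT_SET)
            if value is not NOT_SET:      # most recent real value wins
                result[name] = value
                missing.discard(name)
        if not missing or op_type == "A":
            break                         # nothing left to find, or start of the entity
    return result


ops = [
    (1, "A", {"name": "Ann", "city": "Oslo"}),
    (2, "U", {"city": "Rome"}),
    (3, "U", {"name": "Anna"}),
]
state = attributes_at(ops, ["name", "city", "phone"], ts=2)
```

A dedicated sentinel is used instead of `None` so that NULL can remain a legitimate attribute value.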
When a U operation is issued, the system must check whether an A operation exists for this entity and, if not, change the operation type to A.
Operation D closes the existence of an entity. When the attribute values of this entity are queried for a point in time after this operation, the value “not set” should be returned for all of its attributes. When a D operation is issued, the system must check that an A operation exists for this entity and, if not, refuse to save the D operation.
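The two checks for U and D could look roughly like this. The `Stream` class is a hypothetical in-memory stand-in for a real stream, and I interpret "an A operation exists" as "there is an A that has not yet been closed by a D", which is my assumption.

```python
from typing import Any, Dict, List, Optional, Tuple

Op = Tuple[int, str, str, Dict[str, Any]]  # (op_ts, op_type, entity_id, params)


class Stream:
    """In-memory stand-in for one stream of operations."""

    def __init__(self) -> None:
        self.ops: List[Op] = []
        self._ts = 0

    def exists(self, entity_id: str) -> bool:
        """True if the latest A or D operation of the entity is an A."""
        for _ts, op_type, eid, _params in reversed(self.ops):
            if eid == entity_id and op_type in ("A", "D"):
                return op_type == "A"
        return False

    def issue(self, op_type: str, entity_id: str,
              params: Optional[Dict[str, Any]] = None) -> Op:
        alive = self.exists(entity_id)
        if op_type == "U" and not alive:
            op_type = "A"                  # no A yet: the update becomes an add
        elif op_type == "D" and not alive:
            raise ValueError("D refused: entity %r does not exist" % entity_id)
        # An A for an entity that already exists is stored as is here;
        # the text does not say how that case should be handled.
        self._ts += 1
        op = (self._ts, op_type, entity_id, dict(params or {}))
        self.ops.append(op)
        return op


stream = Stream()
first = stream.issue("U", "e1", {"x": 1})   # stored as A
second = stream.issue("U", "e1", {"x": 2})  # stored as U
stream.issue("D", "e1")
```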
As an additional feature, this structure of operations makes it possible to track history at the level of the whole entity, not just of its attributes: an entity with the same ID can exist at different times with different attributes. That is, we may have several A N*U D blocks (an A, some number of U operations, then a D) within which the entity exists, while between a D and the next A it does not exist.
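As an illustration, the periods of existence of one entity can be derived from its operations as follows. The function names and the `(op_ts, op_type)` tuples are hypothetical, and I assume that the entity no longer exists at the timestamp of the D operation itself.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, Optional[int]]  # (start, end); end is None while the entity lives


def lifetimes(ops: List[Tuple[int, str]]) -> List[Span]:
    """Split the (op_ts, op_type) operations of one entity into A ... D blocks."""
    spans: List[Span] = []
    start: Optional[int] = None
    for ts, op_type in sorted(ops):
        if op_type == "A" and start is None:
            start = ts                     # a new block begins
        elif op_type == "D" and start is not None:
            spans.append((start, ts))      # the block is closed
            start = None
    if start is not None:
        spans.append((start, None))        # still open
    return spans


def exists_at(ops: List[Tuple[int, str]], ts: int) -> bool:
    """True if ts falls inside one of the A ... D blocks."""
    return any(s <= ts and (e is None or ts < e) for s, e in lifetimes(ops))


ops = [(1, "A"), (2, "U"), (5, "D"), (8, "A"), (9, "U")]
spans = lifetimes(ops)
```

For a point in time between the D at 5 and the A at 8, `exists_at` returns `False`, so a query for that moment should return "not set" for every attribute.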
Service operations
There can be many service operations, and their set can be extended. A few examples for consideration:
- Operation “N” (invalid operation) - the system should ignore this operation. Operations of other types can be changed to N to exclude them from processing.
- Operation “C” (cache) - this operation can be created at a certain frequency and stores the attribute values as of a certain point in time, in order to reduce the cost of searching deep into the history for attribute values. Details of the operation's parameters can be stored, for example, in the comment or in the operation code itself. Of course, when basic operations are applied, operations of type C must be recalculated or replaced by N.
- Operation “S” (group operations) - this operation can be created at a certain frequency and stores aggregate values (for example, sums, averages and so on) of numeric attributes over a certain period. Details of the operation's parameters can be stored, for example, in the comment or in the operation code itself. Of course, when basic operations are applied, operations of type S must be recalculated or replaced by N.
- Operation “G” (group attributes) - this operation may be similar to U, but it lets certain system commands return not one value of an attribute (or attributes) but several: one value comes from the A / U operation, and the others from the G operations located between adjacent A / U operations.
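To show how N and C could affect the backward search for attribute values, here is a small sketch for the operations of one entity. It assumes that a C operation carries the values of all attributes as of its timestamp; that layout, the `NOT_SET` sentinel and the function name are my own assumptions. D operations are left out for brevity.

```python
from typing import Any, Dict, List, Tuple

NOT_SET = object()  # special value: the operation did not set the attribute

Op = Tuple[int, str, Dict[str, Any]]  # (op_ts, op_type, params)


def attributes_at(ops: List[Op], attrs: List[str], ts: int) -> Tuple[Dict[str, Any], int]:
    """Return the attribute values as of ts and the number of operations read."""
    result: Dict[str, Any] = {name: NOT_SET for name in attrs}
    missing = set(attrs)
    scanned = 0
    for op_ts, op_type, params in sorted(ops, key=lambda o: o[0], reverse=True):
        if op_ts > ts or op_type == "N":
            continue                      # N operations are ignored completely
        if op_type not in ("A", "U", "C"):
            continue                      # other service operations carry no state
        scanned += 1
        for name in list(missing):
            value = params.get(name, NOT_SET)
            if value is not NOT_SET:
                result[name] = value
                missing.discard(name)
        if not missing or op_type in ("A", "C"):
            break                         # a C snapshot ends the search early
    return result, scanned


ops = [
    (1, "A", {"name": "Ann", "city": "Oslo"}),
    (2, "U", {"city": "Rome"}),
    (3, "C", {"name": "Ann", "city": "Rome"}),  # cached state as of ts=3
    (4, "N", {"city": "Paris"}),                # a cancelled update
    (5, "U", {"name": "Anna"}),
]
values, scanned = attributes_at(ops, ["name", "city"], ts=5)
```

Here the search reads only the U at 5 and the C at 3; the operations before the snapshot are never touched, and the cancelled update at 4 has no effect.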
Service operations are not mandatory, but they can provide additional services to the storage system and improve its performance. Their set may differ from system to system.