Variable Data is Long - DataSizeVariable (DSV)

Hello!

I have long wanted to write an article. I myself do not like long texts with a bit of useful information, so I will try to make this as saturated as possible.

The general theme is efficient data packaging, serialization and deserialization of objects.
The main goal is to share our thoughts on this and discuss the DSV data structure.

Problem :
The mechanisms of binary serialization currently known to me (2013-09-19 18:09:56) have insufficient flexibility or redundancy in the space occupied. For example:
QString s1 (“123”); -> 4 bytes of data size = 0x00000003, 3 bytes of useful data = “123”, efficiency = 3/7;
U32 val1 (123); -> 4 bytes of data (0x0000007B), 1 byte of which is significant = 123 (0x7B), efficiency = 1/4.

Possible solution:
Level 1 - natural numbers:
DataSizeVariable (DSV) - variable data is long with minimal excessive memory overhead. The DSV format describes the preservation and restoration of non-negative integers in the range [0; ∞]. This provides theoretically unlimited scalability and binary data compatibility according to the same rules from 8-bit controllers to servers and clusters.

The essence of the format is that the most significant bit of each (8-bit) byte is a sign of its extension, the remaining bits (7 bits) are informational. Thus, having analyzed it, we are able to determine the boundary of the current value (number). If it is “0”, then the number is over, if “1” - in the next byte, the number continues. Data is allocated from the more significant parts to the less significant, from left to right (big-endian), one byte at a time. All the useful bits of the number are packed in 7 bits into one byte of the DSV format and, if necessary, a sign of expanded value is added (high, 8th bit). Examples:

image

Level 2 - objects:
At this level, a number in the DSV format is used for objects of arbitrary size.
Object1_SizeDataInDSV, Object1_Data, Object2_SizeDataInDSV, Object2_Data, ...

Level 3 - sequences of objects:
At this level, a DSV format number is used to indicate the size of the sequence of objects whose elements are objects in the DSV format.
Sequence1_SizeDataInDSV, Sequence1_Data (Object1_SizeDataInDSV, Object1_Data, Object2_SizeDataInDSV, Object2_Data, ...), Sequence2_SizeDataInDSV, Sequence2_Data (...), ...

So you can build an hierarchy of objects like XML.
Since the DSV format is binary, unlike XML, the direct and inverse transformations are 10 ... 1000 times faster and take up 2 ... 5 times less memory (due to the lack of the need to convert data to text and vice versa).

If someone knows projects similar in functionality, please tell me.
If someone is interested in implementing, then here link to the source code of the mini-library.

Also popular now: