Data serialization, or the dialectics of communication: simple serializers

    Good day, dear reader. In this article we will look at the most popular data serialization formats and run some tests with them. This is the first article on the topic of data serialization, and in it we cover simple serializers, the ones that do not require the developer to make big changes in the code to integrate them.

    Sooner or later you, like our company, may run into a situation where the number of services in your product grows dramatically, and all of them also turn out to be very "talkative." Whether this happened because of a move to the now-fashionable microservice architecture, or because you got a pile of small feature requests and implemented them as a pile of services, does not really matter. The important thing is that from this moment on your product has two new problems: what to do with the increased amount of data flying between the individual services, and how to prevent chaos in the development and support of that many services. Let me say a little more about the second problem: when the number of services grows to a hundred or more, they can no longer be developed and supported by a single team, so you hand out batches of services to different teams. The key here is that all these teams use the same format for their RPC; otherwise you run into the classic problems where one team cannot support another team's services, or two services simply do not fit together without generously patching the seam with crutches. But we will talk about that in a separate article. Today we will focus on the first problem, the growing volume of data, and think about what we can do about it. And since we are properly lazy, we do not want to do much at all: we want to add a couple of lines to the common code and profit right away. That is where this article starts, namely with serializers whose integration does not require big changes to our beautiful RPC.

    The format issue is actually rather painful for our company, because our current products use XML to exchange information between components. No, we are not masochists; we are well aware that XML was a reasonable choice 10 years ago, and that is exactly the point: the product is already 10 years old, and it contains many legacy architectural decisions that are hard to "cut out" quickly. After some reflection and holy wars we decided that we would use JSON to store and transfer data, but we need to pick one of the options for packing JSON, since the size of the transmitted data is critical for us (I will explain why below).

    We have put together a list of criteria by which we will choose the format that suits us:

    • Efficiency of data packing. Our product will handle a huge number of input events from various sources. Each event is triggered by some user action. Most events are small and contain meta-information about what happened (a letter was sent, something was posted on Facebook, and so on), but they may also carry payload data, sometimes of a rather large size. On top of that, the number of such events is very large; several dozen TB can easily be transmitted per day, so saving on event size is crucial for us.

    • Ability to work from different languages. Since our new project is written in C++, PHP and JS, we primarily care about these languages, but given that a microservice architecture allows heterogeneous development environments, support for additional languages will come in handy. Go, for example, is quite interesting to us, and it is possible that some services will be implemented in it.

    • Support for versioning / evolving data structures. Our products live at customer sites for quite a long time without updates (the update process is not simple at all), so at some point there will be too many different versions in the field, and it is important that we can evolve the storage format without losing compatibility with already packed data.

    • Ease of use. We have experience using the Thrift protocol to build communication between components. Honestly, it is not always easy for developers to figure out how its RPC works and how to add something to existing code without breaking the old parts. So the easier the serialization format is to use, the better, since the level at which a C++ developer and a JS developer deal with such things is completely different :)

    • Random-access reads / writes. Since we also intend to use the chosen format for data storage, it would be great if it supported partial deserialization, so that we do not have to read the whole object every time (and objects are often far from small). Besides reading, the ability to change a piece of data without reading out all the content would be a big plus.

    After analyzing a decent number of options, we selected the following candidates for ourselves:

    1. JSON
    2. BSON
    3. MessagePack
    4. CBOR

    These formats do not require an IDL description of the transmitted data; the schema is carried inside the data itself. This greatly simplifies the work and in most cases lets you add support by writing no more than 10 lines of code, as the sketch below shows.
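
    To give a feel for the scale, here is a minimal sketch in Go of what such an integration looks like, using the standard encoding/json package as a stand-in; the Event structure is a made-up example, not something from our product:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // Event is a hypothetical message passed between services.
    type Event struct {
        Type string `json:"type"`
        User string `json:"user"`
        Size int    `json:"size"`
    }

    func main() {
        // Serialize: one line.
        data, err := json.Marshal(Event{Type: "mail.sent", User: "bob", Size: 1024})
        if err != nil {
            panic(err)
        }

        // Deserialize: one line.
        var ev Event
        if err := json.Unmarshal(data, &ev); err != nil {
            panic(err)
        }
        fmt.Println(string(data), ev.Type)
    }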

    We are also well aware that some characteristics of a protocol or serializer depend heavily on its implementation: something that packs perfectly in C++ may pack poorly in JavaScript. Therefore, for our experiments we will use implementations for JS and Go and run tests against them. For good measure, we will run the JS implementation both in the browser and in Node.js.

    So, let's take a look.

    JSON


    The simplest of the interaction formats under consideration. We will use it as a baseline when comparing the other formats, since in our current projects it has already shown both its strengths and all its weaknesses.

    Pros:

    • It supports almost all the data types we need. One could complain about the lack of support for binary data, but base64 gets the job done here (see the sketch after this list).
    • It is human-readable, which makes debugging easy.
    • It is supported by a huge number of languages (although those who have used JSON in Go will understand that I am being a bit sly here).
    • Versioning can be implemented via JSON Schema.
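
    A quick illustration of the base64 point above: Go's standard encoding/json, for one, base64-encodes []byte fields automatically, so binary data costs nothing extra in code (the Attachment type is a made-up example):

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // Attachment is a hypothetical type carrying a binary payload.
    type Attachment struct {
        Name string `json:"name"`
        Data []byte `json:"data"` // encoding/json emits this as a base64 string
    }

    func main() {
        data, _ := json.Marshal(Attachment{Name: "logo.png", Data: []byte{0x89, 0x50, 0x4e, 0x47}})
        fmt.Println(string(data)) // {"name":"logo.png","data":"iVBORw=="}
    }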

    Cons:

    • Although JSON is compact compared to XML, in our project, where gigabytes of data are transmitted per day, it is still quite wasteful both for the channels and for storage. The only plus of keeping native JSON that we see is when using PostgreSQL for storage, with its facilities for working with jsonb.
    • There is no support for partial deserialization. To get something from the middle of a JSON document, you first have to deserialize everything that comes before the desired field. This also prevents using the format for stream processing, which can be useful in network communication.

    Let's see what we have with performance. Right away we will try to account for JSON's size drawback and also run tests with JSON packed using zlib (sketched below). The libraries used for the tests are listed in the test sources (links below).
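
    For reference, a minimal sketch (Go, standard library only) of the kind of JSON-plus-zlib packing we benchmark below; it illustrates the approach, not the exact benchmark code:

    package main

    import (
        "bytes"
        "compress/zlib"
        "encoding/json"
        "fmt"
        "io"
    )

    func pack(v interface{}) ([]byte, error) {
        raw, err := json.Marshal(v)
        if err != nil {
            return nil, err
        }
        var buf bytes.Buffer
        w := zlib.NewWriter(&buf)
        w.Write(raw) // compress the JSON bytes
        w.Close()    // flush the zlib stream
        return buf.Bytes(), nil
    }

    func unpack(data []byte, v interface{}) error {
        r, err := zlib.NewReader(bytes.NewReader(data))
        if err != nil {
            return err
        }
        defer r.Close()
        raw, err := io.ReadAll(r)
        if err != nil {
            return err
        }
        return json.Unmarshal(raw, v)
    }

    func main() {
        packed, _ := pack(map[string]string{"hello": "world"})
        var out map[string]string
        _ = unpack(packed, &out)
        fmt.Println(len(packed), out["hello"])
    }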


    You can find the source and all test results at the following links:

    Go - https://github.com/KyKyPy3/serialization-tests
    JS (node) - https://github.com/KyKyPy3/js-serialization-tests
    JS (browser) - http://jsperv.com/serialization-benchmarks/5

    Empirically, we found that the test data should be as close as possible to the real thing, because test results on different data sets differ dramatically. So if it is important for you not to pick the wrong format, always test on data closest to your reality. We test on data close to ours; you can look at it in the test sources.

    Here is what we got for JSON in terms of speed. Below are the benchmark results for the respective environments:

    JS (Node)
    Json encode              21,507 ops/sec (86 runs sampled)
    Json decode               9,039 ops/sec (89 runs sampled)
    Json roundtrip            6,090 ops/sec (93 runs sampled)
    Json compress encode      1,168 ops/sec (84 runs sampled)
    Json compress decode      2,980 ops/sec (93 runs sampled)
    Json compress roundtrip     874 ops/sec (86 runs sampled)

    JS (browser)
    Json roundtrip            5,754 ops/sec
    Json compress roundtrip     890 ops/sec

    Go
    Json encode                5000    391100 ns/op    24.37 MB/s     54520 B/op    1478 allocs/op
    Json decode                3000    392785 ns/op    24.27 MB/s     76634 B/op    1430 allocs/op
    Json roundtrip             2000    796115 ns/op    11.97 MB/s    131150 B/op    2908 allocs/op
    Json compress encode       3000    422254 ns/op     0.00 MB/s     54790 B/op    1478 allocs/op
    Json compress decode       3000    464569 ns/op     4.50 MB/s    117206 B/op    1446 allocs/op
    Json compress roundtrip    2000    881305 ns/op     0.00 MB/s    171795 B/op    2915 allocs/op

    And here is what we got in terms of data size:

    JS (Node)
    Json               9482 bytes
    Json compressed    1872 bytes

    JS (Browser)
    Json               9482 bytes
    Json compressed    1872 bytes

    At this stage we can conclude that although compressing JSON gives an excellent result size-wise, the loss in processing speed is simply disastrous. Another conclusion: JS works fine with native JSON, which cannot be said about Go. It is quite possible that JSON processing in other languages will also not be comparable to JS. For now, let's put the JSON results aside and see how the other formats do.

    BSON


    This data format came from MongoDB, which actively promotes it. The format was originally designed for data storage and was not intended for transmission over the network. Honestly, after a brief search on the Internet we did not find a single serious product that uses BSON internally. But let's see what this format can give us.

    Pros:

    • Support for additional data types.
      According to the specification, in addition to the standard JSON data types BSON also supports types such as Date, ObjectId, Null and binary data. Some of them (ObjectId, for example) are mostly used in MongoDb and will not always be useful to others. But some of the extra types give us real bonuses. If we store a date in our object, then with JSON we have only one option: one of the ISO-8601 variants, in string form. So if we want to filter a collection of JSON objects by date, we have to turn the strings into Date objects during processing and only then compare them. BSON, on the other hand, stores dates as Int64 (exactly like the Date type does) and takes on all the work of serializing / deserializing to and from Date. Therefore we can compare dates without deserialization, simply as numbers, which is clearly faster than the classic JSON variant.

    • BSON supports so-called random reads / writes of its data.
      BSON stores the lengths of strings and binary data, which lets you skip attributes you are not interested in, whereas JSON reads data sequentially and cannot skip an element without reading its value to the end. So if we store large binary blobs inside the format, this feature can play an important role for us (a sketch of such skipping follows this list).
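
    To make the skipping idea concrete, here is a hand-rolled sketch in Go that extracts one top-level string field from a raw BSON document using the stored lengths, without decoding anything else. It handles only the string and int32 element types and illustrates the principle rather than being a complete parser:

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
    )

    // findString scans the element list of a BSON document and returns the
    // value of the named top-level string field, skipping other elements
    // by their stored lengths instead of decoding them.
    func findString(doc []byte, field string) (string, bool) {
        pos := 4 // skip the leading int32 with the total document size
        for doc[pos] != 0x00 { // 0x00 terminates the element list
            typ := doc[pos]
            pos++
            // the element name is a NUL-terminated cstring
            end := bytes.IndexByte(doc[pos:], 0x00)
            name := string(doc[pos : pos+end])
            pos += end + 1
            switch typ {
            case 0x02: // string: int32 length (incl. NUL) + bytes + NUL
                n := int(binary.LittleEndian.Uint32(doc[pos:]))
                if name == field {
                    return string(doc[pos+4 : pos+4+n-1]), true
                }
                pos += 4 + n
            case 0x10: // int32: fixed 4 bytes, skipped without decoding
                pos += 4
            default: // other element types are omitted in this sketch
                return "", false
            }
        }
        return "", false
    }

    func main() {
        // the {"hello": "world"} document from the layout shown below
        doc := []byte("\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00")
        v, ok := findString(doc, "hello")
        fmt.Println(v, ok) // world true
    }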

    Cons:

    • Data size.
      As for the size of the resulting document, everything is ambiguous: in some situations the object will come out smaller, in others larger; it all depends on what is inside the BSON object. Why so? The specification answers: to speed up access to the elements of an object, the format stores additional information, such as the sizes of large elements.

    So, for example, the JSON object

    {"hello": "world"}

    will turn into this:

    \x16\x00\x00\x00                  // total document size
    \x02                               // 0x02 = type String
    hello\x00                          // field name
    \x06\x00\x00\x00world\x00          // field value
    \x00                               // 0x00 = type EOO ('end of object')
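
    You can reproduce this layout with any BSON implementation. A minimal sketch in Go (assuming the gopkg.in/mgo.v2/bson package; the exact library does not matter, since the bytes are dictated by the specification):

    package main

    import (
        "fmt"

        "gopkg.in/mgo.v2/bson" // assumed BSON library, not necessarily the one from our tests
    )

    func main() {
        doc, err := bson.Marshal(bson.M{"hello": "world"})
        if err != nil {
            panic(err)
        }
        // prints the 22 bytes annotated above:
        // 16 00 00 00 02 68 65 6c 6c 6f 00 06 00 00 00 77 6f 72 6c 64 00 00
        fmt.Printf("% x\n", doc)
    }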
    

    The specification says that BSON was designed as a format with fast serialization / deserialization, at least because it stores numbers as real Int types and does not waste time parsing them from strings. Let's check. The libraries used for the tests can be found in the test sources.


    And here are the results we got (for clarity, I also added the JSON results):

    JS (Node)
    Json encode              21,507 ops/sec (86 runs sampled)
    Json decode               9,039 ops/sec (89 runs sampled)
    Json roundtrip            6,090 ops/sec (93 runs sampled)
    Json compress encode      1,168 ops/sec (84 runs sampled)
    Json compress decode      2,980 ops/sec (93 runs sampled)
    Json compress roundtrip     874 ops/sec (86 runs sampled)
    Bson encode               93.21 ops/sec (76 runs sampled)
    Bson decode                 242 ops/sec (84 runs sampled)
    Bson roundtrip            65.24 ops/sec (65 runs sampled)

    JS (browser)
    Json roundtrip            5,754 ops/sec
    Json compress roundtrip     890 ops/sec
    Bson roundtrip              374 ops/sec

    Go
    Json encode                5000    391100 ns/op    24.37 MB/s     54520 B/op    1478 allocs/op
    Json decode                3000    392785 ns/op    24.27 MB/s     76634 B/op    1430 allocs/op
    Json roundtrip             2000    796115 ns/op    11.97 MB/s    131150 B/op    2908 allocs/op
    Json compress encode       3000    422254 ns/op     0.00 MB/s     54790 B/op    1478 allocs/op
    Json compress decode       3000    464569 ns/op     4.50 MB/s    117206 B/op    1446 allocs/op
    Json compress roundtrip    2000    881305 ns/op     0.00 MB/s    171795 B/op    2915 allocs/op
    Bson encode               10000    249024 ns/op    40.42 MB/s     70085 B/op     982 allocs/op
    Bson decode                3000    524408 ns/op    19.19 MB/s    124777 B/op    3580 allocs/op
    Bson roundtrip             2000    712524 ns/op    14.13 MB/s    195334 B/op    4562 allocs/op

    And here is what we got in terms of data size:

    JS (Node)
    Json               9482 bytes
    Json compressed    1872 bytes
    Bson             112710 bytes

    JS (Browser)
    Json               9482 bytes
    Json compressed    1872 bytes
    Bson               9618 bytes

    Although BSON gives us additional data types and, most importantly, the ability to partially read / modify data, in terms of output size things look very sad for it, so we are forced to continue our search.

    MessagePack


    The next format to land on our table is MessagePack. It has become quite popular lately; I personally learned about it while tinkering with Tarantool.

    If you look at the format's website, you can:

    • Find out that the format is actively used by products such as Redis and Fluentd, which inspires confidence in it.
    • See the loud slogan "It's like JSON. But fast and small."

    We will have to check how true that is, but first let's see what the format offers us.

    By tradition, let's start with the pros:

    • The format is fully compatible with JSON.
      When converting data from MessagePack to JSON we lose nothing, which cannot be said, for example, of BSON. True, there are a number of restrictions on the various data types:

      1. Integer values are limited to the range from −(2^63) to (2^64)−1;
      2. The maximum length of a binary object is (2^32)−1 bytes;
      3. The maximum byte length of a string is (2^32)−1;
      4. An array can contain at most (2^32)−1 elements;
      5. An associative array (map) can contain at most (2^32)−1 elements.

    • It packs data quite well.
      For example, {"a": 1, "b": 2} occupies 13 bytes in JSON, 19 bytes in BSON and only 7 bytes in MessagePack, which is pretty good (a small sketch verifying this follows the cons list below).
    • The supported data types can be extended.
      MsgPack allows you to extend its type system with your own types. Since a type in MsgPack is encoded by a number, and values from −1 to −128 are reserved by the format (this is stated in the specification), values from 0 to 127 are available for use. So we can register extensions that denote our own data types.
    • It is supported in a huge number of languages.
    • There is an RPC package (though this is not so important for us).
    • There is a streaming API.

    Cons:

    • No support for partial data modification.
      Unlike BSON, even though MsgPack stores the size of each field, partially modifying the data in place will not work. Suppose we have a serialized representation of the JSON {"a": 1, "b": 2}. BSON uses 5 bytes to store the value of key 'a', which lets us change the value from 1 to 2000 (which takes 3 bytes) without any problems. But MessagePack stores that value in a single byte, and since 2000 takes 3 bytes, we cannot change the value of 'a' without shifting the data of parameter 'b'.
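
    As promised, a minimal sketch checking those byte counts in Go. I assume the github.com/vmihailenco/msgpack package here, but any MessagePack implementation should give the same sizes, since they follow from the format itself:

    package main

    import (
        "fmt"

        "github.com/vmihailenco/msgpack" // assumed MessagePack library
    )

    func main() {
        // {"a": 1, "b": 2}: fixmap header + two one-byte keys + two fixints = 7 bytes
        doc, err := msgpack.Marshal(map[string]int{"a": 1, "b": 2})
        if err != nil {
            panic(err)
        }
        fmt.Println(len(doc)) // 7

        // 1 fits into a positive fixint (1 byte), while 2000 needs uint16
        // (1 type byte + 2 bytes), which is why an in-place change from 1
        // to 2000 cannot work without shifting the rest of the data.
        one, _ := msgpack.Marshal(1)
        big, _ := msgpack.Marshal(2000)
        fmt.Println(len(one), len(big)) // 1 3
    }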

    Now let's see how fast it is and how well it packs data. The libraries used for the tests can be found in the test sources.


    The results we obtained are as follows:

    JS (Node)
    Json encode              21,507 ops/sec (86 runs sampled)
    Json decode               9,039 ops/sec (89 runs sampled)
    Json roundtrip            6,090 ops/sec (93 runs sampled)
    Json compress encode      1,168 ops/sec (84 runs sampled)
    Json compress decode      2,980 ops/sec (93 runs sampled)
    Json compress roundtrip     874 ops/sec (86 runs sampled)
    Bson encode               93.21 ops/sec (76 runs sampled)
    Bson decode                 242 ops/sec (84 runs sampled)
    Bson roundtrip            65.24 ops/sec (65 runs sampled)
    Msgpack encode            4,758 ops/sec (79 runs sampled)
    Msgpack decode            2,632 ops/sec (91 runs sampled)
    Msgpack roundtrip         1,692 ops/sec (91 runs sampled)

    JS (browser)
    Json roundtrip            5,754 ops/sec
    Json compress roundtrip     890 ops/sec
    Bson roundtrip              374 ops/sec
    Msgpack roundtrip         1,048 ops/sec

    Go
    Json encode                5000    391100 ns/op    24.37 MB/s     54520 B/op    1478 allocs/op
    Json decode                3000    392785 ns/op    24.27 MB/s     76634 B/op    1430 allocs/op
    Json roundtrip             2000    796115 ns/op    11.97 MB/s    131150 B/op    2908 allocs/op
    Json compress encode       3000    422254 ns/op     0.00 MB/s     54790 B/op    1478 allocs/op
    Json compress decode       3000    464569 ns/op     4.50 MB/s    117206 B/op    1446 allocs/op
    Json compress roundtrip    2000    881305 ns/op     0.00 MB/s    171795 B/op    2915 allocs/op
    Bson encode               10000    249024 ns/op    40.42 MB/s     70085 B/op     982 allocs/op
    Bson decode                3000    524408 ns/op    19.19 MB/s    124777 B/op    3580 allocs/op
    Bson roundtrip             2000    712524 ns/op    14.13 MB/s    195334 B/op    4562 allocs/op
    Msgpack encode             5000    306260 ns/op    27.36 MB/s     49907 B/op     968 allocs/op
    Msgpack decode            10000    214967 ns/op    38.98 MB/s     59649 B/op    1690 allocs/op
    Msgpack roundtrip          3000    547434 ns/op    15.31 MB/s    109754 B/op    2658 allocs/op

    And here is what we got in terms of data size:

    JS (Node)
    Json               9482 bytes
    Json compressed    1872 bytes
    Bson             112710 bytes
    Msgpack            7628 bytes

    JS (Browser)
    Json               9482 bytes
    Json compressed    1872 bytes
    Bson               9618 bytes
    Msgpack            7628 bytes

    Of course, MessagePack does not squeeze the data as much as we would like, but at least it behaves quite consistently in both JS and Go. For now it is the most attractive candidate for our tasks, but we still have one last patient to examine.

    CBOR


    Honestly, this format is very similar to MessagePack in its capabilities; it seems it was developed as a replacement for MessagePack. It also supports data type extensions and is fully JSON-compatible. Among the differences I noticed only support for arrays / strings of indefinite length, which, in my opinion, is a rather odd feature. If you want to know more about this format, there was an excellent article on Habr: habrahabr.ru/post/208690. Well, let's see how CBOR does on performance and data size.
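
    For the curious, here is what that indefinite-length feature looks like on the wire, per the CBOR specification (RFC 7049): a sketch in Go that hand-assembles an indefinite-length text string from two chunks and compares it with the definite-length encoding of the same value. The decoding side assumes the github.com/fxamacker/cbor/v2 package:

    package main

    import (
        "fmt"

        "github.com/fxamacker/cbor/v2" // assumed CBOR library
    )

    func main() {
        // Indefinite-length text string per RFC 7049:
        // 0x7f opens it, each chunk is an ordinary definite-length string,
        // and 0xff ("break") terminates it.
        indefinite := []byte{
            0x7f,                          // start of indefinite-length text string
            0x65, 'h', 'e', 'l', 'l', 'o', // chunk 1: 5-byte string "hello"
            0x61, '!',                     // chunk 2: 1-byte string "!"
            0xff,                          // break
        }

        var s string
        if err := cbor.Unmarshal(indefinite, &s); err != nil {
            panic(err)
        }
        fmt.Println(s) // hello!

        // The definite-length encoding of the same value is just 7 bytes:
        // 0x66 (text string, length 6) followed by the bytes of "hello!".
        def, _ := cbor.Marshal("hello!")
        fmt.Printf("% x\n", def) // 66 68 65 6c 6c 6f 21
    }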

    The libraries used for the tests can be found in the test sources.


    And, of course, here are the final results of our tests, covering all the formats considered:

    JS (Node)
    Json encode              21,507 ops/sec ±1.01% (86 runs sampled)
    Json decode               9,039 ops/sec ±0.90% (89 runs sampled)
    Json roundtrip            6,090 ops/sec ±0.62% (93 runs sampled)
    Json compress encode      1,168 ops/sec ±1.20% (84 runs sampled)
    Json compress decode      2,980 ops/sec ±0.43% (93 runs sampled)
    Json compress roundtrip     874 ops/sec ±0.91% (86 runs sampled)
    Bson encode               93.21 ops/sec ±0.64% (76 runs sampled)
    Bson decode                 242 ops/sec ±0.63% (84 runs sampled)
    Bson roundtrip            65.24 ops/sec ±1.27% (65 runs sampled)
    MsgPack encode            4,758 ops/sec ±1.13% (79 runs sampled)
    MsgPack decode            2,632 ops/sec ±0.90% (91 runs sampled)
    MsgPack roundtrip         1,692 ops/sec ±0.83% (91 runs sampled)
    Cbor encode               1,529 ops/sec ±4.13% (89 runs sampled)
    Cbor decode               1,198 ops/sec ±0.97% (88 runs sampled)
    Cbor roundtrip              351 ops/sec ±3.28% (77 runs sampled)

    JS (browser)
    Json roundtrip            5,754 ops/sec ±0.63%
    Json compress roundtrip     890 ops/sec ±1.72%
    Bson roundtrip              374 ops/sec ±2.22%
    MsgPack roundtrip         1,048 ops/sec ±5.40%
    Cbor roundtrip              859 ops/sec ±4.19%

    Go
    Json encode                5000    391100 ns/op    24.37 MB/s     54520 B/op    1478 allocs/op
    Json decode                3000    392785 ns/op    24.27 MB/s     76634 B/op    1430 allocs/op
    Json roundtrip             2000    796115 ns/op    11.97 MB/s    131150 B/op    2908 allocs/op
    Json compress encode       3000    422254 ns/op     0.00 MB/s     54790 B/op    1478 allocs/op
    Json compress decode       3000    464569 ns/op     4.50 MB/s    117206 B/op    1446 allocs/op
    Json compress roundtrip    2000    881305 ns/op     0.00 MB/s    171795 B/op    2915 allocs/op
    Bson encode               10000    249024 ns/op    40.42 MB/s     70085 B/op     982 allocs/op
    Bson decode                3000    524408 ns/op    19.19 MB/s    124777 B/op    3580 allocs/op
    Bson roundtrip             2000    712524 ns/op    14.13 MB/s    195334 B/op    4562 allocs/op
    MsgPack encode             5000    306260 ns/op    27.36 MB/s     49907 B/op     968 allocs/op
    MsgPack decode            10000    214967 ns/op    38.98 MB/s     59649 B/op    1690 allocs/op
    MsgPack roundtrip          3000    547434 ns/op    15.31 MB/s    109754 B/op    2658 allocs/op
    Cbor encode               20000     71203 ns/op   117.48 MB/s     32944 B/op      12 allocs/op
    Cbor decode                3000    432005 ns/op    19.36 MB/s     40216 B/op    2159 allocs/op
    Cbor roundtrip             3000    531434 ns/op    15.74 MB/s     73160 B/op    2171 allocs/op

    And here is what we got in terms of data size:

    JS (Node)
    Json               9482 bytes
    Json compressed    1872 bytes
    Bson             112710 bytes
    Msgpack            7628 bytes
    Cbor               7617 bytes

    JS (Browser)
    Json               9482 bytes
    Json compressed    1872 bytes
    Bson               9618 bytes
    Msgpack            7628 bytes
    Cbor               7617 bytes

    Comments, I think, are almost redundant here; everything is clearly visible from the results: CBOR turned out to be the slowest format in the JS tests (though, curiously, its Go encoder was the fastest of the lot).

    Conclusions


    So what conclusions did we draw from this comparison? After some thought and a long look at the results, we came to the conclusion that none of the formats fully satisfied us. Yes, MsgPack proved to be a very good option: it is easy to use and behaves consistently, but after consulting with colleagues we decided to take a fresh look at other binary data formats, not based on JSON: Protobuf, FlatBuffers, Cap'n Proto and Avro. What we did with them and what we ultimately chose will be covered in the next article.

    Posted by Roman Efremenko (KyKyPy3uK)
