The dark side of protobuf
Among developers it is often believed that the protobuf serialization format and its implementation are a special, outstanding technology that can solve all real and potential performance problems by the mere fact of being used in a project. Perhaps this perception is fostered by the ease of use of the technology and by the authority of Google itself.
Unfortunately, on one project I had to come up against certain traits that are never mentioned in the promotional documentation but strongly affect the technical characteristics of a project.
Everything below applies only to the protobuf implementation for the Java platform, and mainly to version 2.6.1, although I saw no fundamental changes in the already released version 3.0.0 either.
Note also that this article does not claim to be a complete review. The technology's strong points (for example, its multi-language support and excellent documentation) are covered on the official website. This article talks only about the problems and, perhaps, will help you make a more informed decision. Some of the problems relate to the format itself, others to the implementation. It should also be said that most of the problems described here show up only under certain conditions.
A Maven project with the dependencies already wired up, ready for your own experiments, is available on GitHub.
0. The need for preprocessing
This is the smallest problem; I did not even want to include it in the list, but let it be mentioned for completeness. To obtain the Java code, you need to run the protoc compiler. The catch is that this compiler is a native application, with a separate executable for each platform, so you cannot get away with simply wiring up a Maven plugin. At a minimum, you need an environment variable on the developer machines and on the CI server pointing to the executable, after which it can be invoked from a Maven/Ant script.
Alternatively, you could write a Maven plugin that keeps all the binaries in its resources, unpacks the one for the current platform into a temporary folder and launches it from there. I do not know, perhaps someone has already done this.
All in all, a minor sin, so we will forgive it.
1. Impractical code
Unfortunately, for the Java platform the protoc generator produces very impractical code. Instead of generating neat anemic containers and, separately, serializers for them, the generator crams everything into one big class with nested classes. The generated beans can neither be embedded into your own class hierarchy nor even trivially implement the java.util.Serializable interface so that they could be pushed to the other side. In short, they are only usable as highly specialized DTOs. If that suits you, this is not a problem at all; just do not look inside.
2. Redundant copying, low performance
This is where my problems became entirely objective. For each described entity (let us call it "Bean"), the generated code creates two classes (and one interface, but it is not important in this context). The first class is the immutable Bean, a read-only chunk of data; the second is the mutable Bean.Builder, which can actually be edited and have values set on it.
Why it was done this way remains a mystery. Some say the authors belong to the sect of FP adherents; some claim they were trying to get rid of cyclic dependencies during serialization (and how did that help them?); some say that in protobuf of the first version the classes were simply mutable and people kept shooting themselves in the foot with them.
One could say architectural tastes differ, but with this design, to get a byte representation you have to create a Bean.Builder, fill it in, and then call its build() method. To modify a bean, you have to create its builder via the toBuilder() method, change the value, and then call build() again.
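To make this concrete, here is a minimal sketch of that round-trip, assuming a hypothetical generated message Bean with an id and a name field (the actual class, field and setter names depend entirely on your .proto file):

```java
// A minimal sketch, assuming a hypothetical protoc-generated message "Bean"
// with fields "id" and "name"; real names depend on your .proto file.
Bean bean = Bean.newBuilder()        // mutable builder
        .setId(42)
        .setName("example")
        .build();                    // copies every field into the immutable Bean

byte[] bytes = bean.toByteArray();   // the byte representation for the wire

// Changing a single field requires a full round-trip through the builder:
Bean renamed = bean.toBuilder()      // copies every field back into a builder
        .setName("renamed")
        .build();                    // ...and copies them all over again
```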
And that would be fine, except that on every call to build() and toBuilder() all the fields are copied from an instance of one class into an instance of the other. If all you need is to get a byte array for serialization, or to change a couple of fields, this copying is a substantial overhead. On top of that, this code appears (I am still investigating) to have a long-standing problem that causes even fields whose values were never set in the builder to be copied.
You are unlikely to notice this if your beans are small, with few fields. However, I inherited a whole library in which individual beans had up to three hundred fields. Calling build() for such a bean takes about 50 microseconds in my case, which caps throughput at no more than 20,000 beans per second.
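The order of magnitude can be reproduced with a crude timing loop along these lines (not a rigorous benchmark; use JMH for trustworthy numbers; the wide Bean is the same hypothetical message as above):

```java
// Crude timing sketch; use JMH for trustworthy numbers. Assumes the
// hypothetical wide "Bean" standing in for a message with ~300 fields.
Bean.Builder builder = Bean.newBuilder();
// ... populate a few hundred fields here ...

final int iterations = 100_000;
long sink = 0;                        // keeps the JIT from eliding the calls
long start = System.nanoTime();
for (int i = 0; i < iterations; i++) {
    sink += System.identityHashCode(builder.build()); // build() copies all fields
}
long nanosPerCall = (System.nanoTime() - start) / iterations;
System.out.println("~" + nanosPerCall + " ns per build() (sink=" + sink + ")");
```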
The irony is that in my case other tests show that serializing a similar bean via Jackson/JSON is two to three times faster (when not all fields are initialized and most of them can be skipped during serialization).
3. Loss of reference identity
If you have a graph-like structure in which beans refer to one another, I have bad news for you: protobuf is not suitable for serializing such structures. It stores beans by value, without tracking whether a given bean has already been serialized.
In other words, if bean1 and bean2 both reference the same bean, then after a serialization-deserialization round-trip you will get a bean1 that references one copy (bean3) and a bean2 that references another copy (bean4).
I am sure that in the vast majority of cases such functionality is not needed, and in simple DTOs it is even contraindicated. However, the problem also surfaces in more natural cases. For example, if you add the same bean to a collection 100 times, it will be saved all 100 times rather than once. Or suppose you serialize a list of lots (order items). Each lot is a small bean with its own details (quantity, price, date) plus a reference to a hefty product description. If you serialize this head-on, the product description is serialized as many times as there are lots, even if all the lots point to the same product. The way around this is to store the products separately, as a dictionary, but that means extra work during both serialization and deserialization.
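A sketch of that dictionary-style normalization, assuming hypothetical proto3 messages where Lot carries a numeric productId instead of an embedded Product, and a wrapper Catalog holds the lots plus a map from id to Product (all names here are illustrative):

```java
// Illustrative only: hypothetical generated types for a layout like
//   message Catalog {
//     repeated Lot lots = 1;
//     map<int64, Product> products = 2;   // proto3 map field
//   }
// where Lot stores a productId instead of embedding the Product.
Catalog.Builder catalog = Catalog.newBuilder();
for (Lot lot : lots) {
    catalog.addLots(lot);                          // lots reference products by id
}
for (Product product : products) {
    catalog.putProducts(product.getId(), product); // each product stored once
}
byte[] bytes = catalog.build().toByteArray();
```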
The described behavior is entirely expected and natural for text formats such as JSON/XML. From a binary format, though, you expect something a little different, especially since standard Java serialization handles this exactly as you would expect.
4. Compactness in question
It is believed that protobuf is a super-compact format. In reality, the compactness of its serialization comes down to just a few factors:
- Var-int and var-long types, both signed and unsigned, are implemented and used by default. Fields of these types save space when the actual values stored in them are small, that is, when the distribution over the range of values is uneven and the bulk of the values cluster around zero. For example, storing the value 23L takes only one byte instead of eight; on the other hand, Long.MAX_VALUE takes nine bytes, and a negative value in a plain int64 field takes all ten (see the encoding sketch after this list).
- Instead of full metadata (field names), only numeric field identifiers are stored. This is exactly why we specify the identifiers in the proto files, and why they must be unique and never change. The identifiers themselves are stored as var-ints, so it makes sense to start numbering them from 1.
- Fields that never had a value assigned through a setter are not saved. To make this work, whenever a value is assigned via a setter, protobuf also sets the bit corresponding to that field in a separate bitmask. This did not come without problems: when you assign the value 0L, the bit is still set, even though there is clearly no need to save such a field, since 0 is the default value in most languages. Jackson, by contrast, looks at the field's actual value when deciding whether to serialize it.
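To make the varint arithmetic concrete, here is a small self-contained sketch (my own illustration, not protobuf source code) that computes how many bytes a value occupies in the base-128 varint encoding protobuf uses for int64 fields:

```java
// Own illustration of protobuf's base-128 varint encoding for int64 fields.
// Each byte carries 7 payload bits; negative values are sign-extended to
// 64 bits and therefore always occupy the maximum ten bytes.
public class VarintDemo {
    static int varintSize(long value) {
        int bytes = 1;
        while ((value & ~0x7FL) != 0) { // more than 7 significant bits left?
            value >>>= 7;               // unsigned shift: negatives use all 64 bits
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(varintSize(23L));            // 1 byte instead of eight
        System.out.println(varintSize(Long.MAX_VALUE)); // 9 bytes
        System.out.println(varintSize(-1L));            // 10 bytes
    }
}
```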
All of this is wonderful, but if we look at the byte representation of the DTOs of an average (I will not speak for everyone) modern service, we will see that most of the space is taken up by strings, not primitives: logins, names, titles, descriptions, comments, resource URIs, often in several variants (image URLs at different resolutions). What does protobuf do with strings? Nothing special: it simply writes them to the stream as UTF-8. Remember meanwhile that national characters take two, or even three, bytes each in UTF-8.
Suppose an application produces data in which, by byte representation, strings take 75% and primitives 25%. In that case, even if our primitive-optimization algorithm shrank the space they need to zero, we would save only a quarter.
In some cases compact serialization is critical, for example for mobile applications on poor or expensive connections. In such cases additional compression on top of protobuf is unavoidable, otherwise we would be wastefully pushing around the redundant data sitting in the strings. But then it suddenly turns out that the comparable combination [JSON + GZIP] yields only a slightly larger size than [PROTOBUF + ZIP]. Of course, the [JSON + GZIP] option will also cost more CPU, but at the same time it is often more convenient too.
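A minimal sketch of what "compression on top" looks like in Java, using the JDK's GZIPOutputStream over whatever bytes the serializer produced (the payload here is a stand-in):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipDemo {
    // Compress an already-serialized payload (protobuf or JSON bytes alike).
    static byte[] gzip(byte[] payload) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(payload);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in payload; in a real comparison this would be the output
        // of the protobuf or Jackson serializer for the same bean.
        byte[] json = "{\"name\":\"example\",\"description\":\"...\"}"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(json.length + " -> " + gzip(json).length + " bytes");
    }
}
```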
protoc v3
Protobuf version 3 introduces a new Java Nano generation mode. It is not in the documentation yet, and the runtime for this mode is still in alpha, but you can try it today with the "--javanano_out" switch.
In this mode the generator produces anemic beans with public fields (no setters, no getters) and simple serialization methods. There is no extra copying, so problem #2 is solved. The remaining problems remain, and worse, in the presence of circular references the serializer crashes with a StackOverflowError.
The decision whether to serialize a field is made from its current value rather than from a separate bitmask, which somewhat simplifies the beans themselves.
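Schematically, a nano-style bean looks something like this (a hand-written illustration of the idea, not actual generated code):

```java
// Hand-written illustration of a nano-style anemic bean; actual
// javanano-generated code differs in its details.
public final class BeanNano {
    public long id;          // public fields: no getters, setters or builder
    public String name = "";

    // Serialization decides per field from the current value:
    // a field equal to its default (0, empty string) is simply skipped.
}
```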
protostuff
An alternative implementation of the protobuf protocol. I have not battle-tested it, but at first glance it looks very solid. It does not require proto files (though it can work with them if needed), so problems #0, #1 and #2 are resolved. Moreover, it can serialize not only to its own format but also to JSON, XML and YAML. Also interesting is the ability to stream data from one format into another without fully deserializing it into an intermediate bean.
Unfortunately, if you hand a regular POJO to serialization without a schema, annotations or proto files (this is possible too), protostuff saves all of the object's fields in a row, regardless of whether they were ever initialized, and this again hurts compactness badly when not all fields are filled. As far as I can see, though, this behavior can be corrected if desired by overriding a couple of classes.
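For reference, a minimal sketch of schema-less runtime serialization with protostuff (assuming the io.protostuff artifacts and their runtime module; older releases lived under the com.dyuproject.protostuff package):

```java
import io.protostuff.LinkedBuffer;
import io.protostuff.ProtostuffIOUtil;
import io.protostuff.Schema;
import io.protostuff.runtime.RuntimeSchema;

public class ProtostuffDemo {
    // A plain POJO: no proto file, no annotations.
    public static class User {
        String name;
        int age;
    }

    public static void main(String[] args) {
        // The schema is derived from the class at runtime via reflection.
        Schema<User> schema = RuntimeSchema.getSchema(User.class);

        User user = new User();
        user.name = "example";
        user.age = 30;

        LinkedBuffer buffer = LinkedBuffer.allocate(512);
        byte[] bytes;
        try {
            bytes = ProtostuffIOUtil.toByteArray(user, schema, buffer);
        } finally {
            buffer.clear(); // buffers are reusable; clear after each use
        }

        User copy = schema.newMessage();
        ProtostuffIOUtil.mergeFrom(bytes, copy, schema);
        System.out.println(copy.name + " / " + copy.age);
    }
}
```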