Protocol Buffers Reverse Engineering
By reverse engineering, in this context, I mean restoring a message schema as close as possible to the original one used by the developers. There are several ways to get there. First, if we have access to the client application and the developers neither hid the debugging symbols nor linked against the LITE version of the protobuf library, then getting the original .proto files is easy. Second, if the developers used the LITE build of the library, this complicates the reverser's life, but it does not make reversing pointless: with some skill, even in this case you can restore .proto files quite close to the originals.
In this article, I would like to describe some techniques for reversing protobuf messages, which is how my protodec project came about. Note that everything here relates to the encoding format of protobuf version 2 messages (version 3 is not yet supported, and neither are packed fields).
Preparation
To start, let's create the objects to study. We need 2 files:
addressbook.proto
package tutorial;
option optimize_for = LITE_RUNTIME;
message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
enum PhoneType {
MOBILE = 0;
HOME = 1;
WORK = 2;
}
message PhoneNumber {
required string number = 1;
optional PhoneType type = 2 [default = HOME];
}
repeated PhoneNumber phone = 4;
}
message AddressBook {
repeated Person person = 1;
}
tut.cpp
#include <cassert>
#include <iostream>
#include <string>
#include "addressbook.pb.h"
int main() {
GOOGLE_PROTOBUF_VERIFY_VERSION;
tutorial::AddressBook book;
tutorial::Person * person = book.add_person();
person->set_id(1234);
person->set_name("John Doe");
person->set_email("jdoe@example.com");
tutorial::Person_PhoneNumber * phone = person->add_phone();
phone->set_number("555-4321");
phone->set_type(tutorial::Person_PhoneType_HOME);
std::string data = book.SerializeAsString();
assert(!data.empty());
std::cout.write(&data[0], data.size());
google::protobuf::ShutdownProtobufLibrary();
}
Save them and build everything. If you do not know what protoc is, read the introduction to the Protobuf library for your programming language first.
protoc --cpp_out=. addressbook.proto && g++ addressbook.pb.cc tut.cpp `pkg-config --cflags --libs protobuf` -s -o tut.lite.exe && ./tut.lite.exe > A
Delete or comment out the second line of the addressbook.proto file (the optimize_for option) and run the command:
protoc --cpp_out=. addressbook.proto && g++ addressbook.pb.cc tut.cpp `pkg-config --cflags --libs protobuf` -o tut.exe && ./tut.exe > B
After executing the above commands, we have two executables, tut.lite.exe and tut.exe, built with the LITE and the full version of the libprotobuf library, respectively. Both programs do the same thing: they create a protobuf message and write it to std::cout. We also got two binary files named A and B. The first was generated by the lite version of the program, the second by the full version. Their contents are identical.
[Screenshot: the binary representation of this message and its text view.]
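The layout of those bytes follows directly from the protobuf wire format, so it can be reproduced by hand. Below is a minimal sketch that uses only the standard library. The helper names (put_varint, put_bytes, put_int, sample_message) are made up for illustration and are not part of libprotobuf or protodec. It assumes that fields are written in field-number order, which is what the C++ serializer does for this message.

```cpp
#include <cstdint>
#include <string>

// Append an unsigned integer in base-128 varint encoding.
void put_varint(std::string& out, std::uint64_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<char>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<char>(v));
}

// Append a length-delimited field (wire type 2): tag, length, payload.
void put_bytes(std::string& out, std::uint32_t field, const std::string& payload) {
    put_varint(out, (field << 3) | 2);
    put_varint(out, payload.size());
    out += payload;
}

// Append a varint field (wire type 0): tag, value.
void put_int(std::string& out, std::uint32_t field, std::uint64_t value) {
    put_varint(out, (field << 3) | 0);
    put_varint(out, value);
}

// Rebuild the bytes that tut.cpp serializes, innermost message first.
std::string sample_message() {
    std::string phone;
    put_bytes(phone, 1, "555-4321");           // PhoneNumber.number
    put_int(phone, 2, 1);                      // PhoneNumber.type = HOME

    std::string person;
    put_bytes(person, 1, "John Doe");          // Person.name
    put_int(person, 2, 1234);                  // Person.id
    put_bytes(person, 3, "jdoe@example.com");  // Person.email
    put_bytes(person, 4, phone);               // Person.phone

    std::string book;
    put_bytes(book, 1, person);                // AddressBook.person
    return book;
}
```

Calling sample_message() yields 47 bytes that start with 0a 2d (field 1, length 45), followed by 0a 08 and the text "John Doe". Note that no field or message names appear anywhere in the output, only field numbers and wire types.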
Delete addressbook.proto and try to restore it.
Recovering the message schema from Descriptor data in the executable
Take a look at the contents of the addressbook.pb.cc file generated earlier by the protoc utility. The function of interest is protobuf_AddDesc_addressbook_2eproto. One of the first things it does is call ::google::protobuf::DescriptorPool::InternalAddGeneratedFile, whose first argument is a serialized Descriptor protobuf message describing the structure of the original messages.
// ...
void protobuf_AddDesc_addressbook_2eproto() {
static bool already_here = false;
if (already_here) return;
already_here = true;
GOOGLE_PROTOBUF_VERIFY_VERSION;
::google::protobuf::DescriptorPool::InternalAddGeneratedFile(
"\n\021addressbook.proto\022\010tutorial\"\332\001\n\006Person"
"\022\014\n\004name\030\001 \002(\t\022\n\n\002id\030\002 \002(\005\022\r\n\005email\030\003 \001("
"\t\022+\n\005phone\030\004 \003(\0132\034.tutorial.Person.Phone"
"Number\032M\n\013PhoneNumber\022\016\n\006number\030\001 \002(\t\022.\n"
"\004type\030\002 \001(\0162\032.tutorial.Person.PhoneType:"
"\004HOME\"+\n\tPhoneType\022\n\n\006MOBILE\020\000\022\010\n\004HOME\020\001"
"\022\010\n\004WORK\020\002\"/\n\013AddressBook\022 \n\006person\030\001 \003("
"\0132\020.tutorial.Person", 299);
::google::protobuf::MessageFactory::InternalRegisterGeneratedFile(
"addressbook.proto", &protobuf_RegisterTypes);
Person::default_instance_ = new Person();
Person_PhoneNumber::default_instance_ = new Person_PhoneNumber();
AddressBook::default_instance_ = new AddressBook();
Person::default_instance_->InitAsDefaultInstance();
Person_PhoneNumber::default_instance_->InitAsDefaultInstance();
AddressBook::default_instance_->InitAsDefaultInstance();
::google::protobuf::internal::OnShutdown(&protobuf_ShutdownFile_addressbook_2eproto);
}
// ...
It stores information about enumerations, the import list, messages, the names and data types of their fields, and so on. The format is not a secret and ships with the source code; see google/protobuf/descriptor.proto. This data is used for reflection, for debug output of message contents, and so on.
The protodec utility searches a binary file for Descriptor data and can save the .proto files recovered from it. To do this, run the command:
protodec --grab tut.exe
In response, protodec prints the recovered schema: in the end we get almost the original source .proto file.
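To give an idea of how such Descriptor data can be located in a binary, here is a rough sketch of one possible heuristic. It is not protodec's actual algorithm. It relies on what is visible in the generated code above: the embedded data starts with byte 0x0A (field 1, length-delimited), then a length, then the file name "addressbook.proto". The function name find_descriptor_candidates is hypothetical, and the sketch assumes file names shorter than 128 bytes that end in ".proto".

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return offsets where serialized Descriptor data for a .proto file may start:
// tag byte 0x0A (field 1, length-delimited), a one-byte length, and a
// printable file name of that length ending in ".proto".
std::vector<std::size_t> find_descriptor_candidates(const std::string& blob) {
    static const std::string suffix = ".proto";
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i + 2 <= blob.size(); ++i) {
        if (static_cast<unsigned char>(blob[i]) != 0x0A) continue;
        std::size_t len = static_cast<unsigned char>(blob[i + 1]);
        // Assumption: the name is shorter than 128 bytes, so its length fits in one byte.
        if (len <= suffix.size() || len >= 0x80) continue;
        if (i + 2 + len > blob.size()) continue;
        const std::string name = blob.substr(i + 2, len);
        if (name.compare(len - suffix.size(), suffix.size(), suffix) != 0) continue;
        bool printable = true;
        for (unsigned char c : name) {
            if (c < 0x20 || c > 0x7E) { printable = false; break; }
        }
        if (printable) hits.push_back(i);
    }
    return hits;
}
```

Each returned offset is only a candidate. A real tool would still have to parse the bytes that follow as a descriptor message and discard offsets that fail to parse.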
Recovering a schema from message bytes
If there is no access to the application (for example, it runs somewhere on a server), then the Descriptor data will also be hard to get. The same applies if the application is built with LITE optimization: reflection is not used, so the Descriptor description of the .proto files is not generated at compile time, and we cannot restore the original .proto files with the method described above. In this case, you can try to analyze the contents of the protobuf messages themselves. Note that they must all have exactly the same structure (the same root message). We need as many such messages as possible; the more data they contain, the better the final result.
The protodec program can reconstruct the schema, including field types, of a protobuf message loaded from a file. To do this, run the command:
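Analyzing message contents starts with splitting the bytes into fields without knowing the schema. The following is a minimal sketch of that step, assuming only the standard library. The names Field, read_varint and parse_fields are made up for illustration and do not come from protodec. Groups (wire types 3 and 4) are deliberately not handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Field {
    std::uint32_t number;     // field number from the tag
    std::uint32_t wire_type;  // 0 varint, 1 fixed64, 2 length-delimited, 5 fixed32
    std::uint64_t value;      // varint value (wire type 0 only)
    std::string payload;      // raw bytes (wire types 1, 2, 5)
};

// Read one base-128 varint starting at pos; pos moves past the bytes consumed.
bool read_varint(const std::string& in, std::size_t& pos, std::uint64_t& out) {
    out = 0;
    for (int shift = 0; shift < 64 && pos < in.size(); shift += 7) {
        std::uint64_t byte = static_cast<unsigned char>(in[pos++]);
        out |= (byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return true;
    }
    return false;  // truncated or longer than 10 bytes
}

// Split one message into its top-level fields without knowing the schema.
// Returns false if the bytes do not form a valid field sequence.
bool parse_fields(const std::string& in, std::vector<Field>& fields) {
    std::size_t pos = 0;
    while (pos < in.size()) {
        std::uint64_t tag = 0;
        if (!read_varint(in, pos, tag)) return false;
        Field f{static_cast<std::uint32_t>(tag >> 3),
                static_cast<std::uint32_t>(tag & 7), 0, std::string()};
        if (f.number == 0) return false;  // field number 0 is never valid
        std::uint64_t len = 0;
        switch (f.wire_type) {
            case 0: if (!read_varint(in, pos, f.value)) return false; break;
            case 1: len = 8; break;
            case 5: len = 4; break;
            case 2: if (!read_varint(in, pos, len)) return false; break;
            default: return false;  // groups (3, 4) are not handled in this sketch
        }
        if (f.wire_type != 0) {
            if (len > in.size() - pos) return false;
            f.payload = in.substr(pos, static_cast<std::size_t>(len));
            pos += static_cast<std::size_t>(len);
        }
        fields.push_back(f);
    }
    return true;
}
```

A length-delimited payload can then be fed to parse_fields again. If it parses cleanly, it is a candidate nested message; if not, it is more likely a string or raw bytes. This is only a heuristic, since some short strings happen to be valid field sequences, which is one more reason to look at many messages.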
protodec --schema A
This output means that 3 messages were detected in this protobuf message (loaded from file A). Comparing with the original addressbook.proto, the correspondence is easy to guess: MSG1 is Person::PhoneNumber, MSG2 is Person, and MSG3 is AddressBook. Here are the most striking discrepancies:
- Field MSG3.fld1 must be repeated. The problem is that in the original message AddressBook.person contains only one element, and at the binary level a repeated field with a single element is indistinguishable from a non-repeated one. If AddressBook.person held at least 2 elements, it would be detected correctly. That is why we need several messages of this schema, filled in as fully as possible;
- Some required fields must be optional. This problem is also solved by analyzing a large number of messages, which shows which fields are always present (required) and which are sometimes absent (optional);
- The MSG2.fld2 field must be int32, but it is reported as int64. At the wire level, protobuf stores all these integer types (int32, int64, uint32, uint64, sint32, sint64, bool, enum) as a Varint. Only the context tells you whether the numbers in this field are signed or unsigned; int64 is chosen so that the field can hold the largest possible integer value for the programming language used.
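The discrepancies listed above can be illustrated with a short sketch, assuming only the standard library. All names here (as_int64, as_int32, as_sint64, Label, infer_labels) are hypothetical and are not taken from protodec. The first part shows one raw Varint read three different ways. The second part guesses required, optional and repeated from how often each field number occurs across several instances of the same message type.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// The same raw varint read three ways; the wire format does not say which is right.
std::int64_t as_int64(std::uint64_t raw) { return static_cast<std::int64_t>(raw); }
std::int32_t as_int32(std::uint64_t raw) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(raw));
}
std::int64_t as_sint64(std::uint64_t raw) {  // ZigZag decoding used by sint32/sint64
    return static_cast<std::int64_t>(raw >> 1) ^ -static_cast<std::int64_t>(raw & 1);
}

enum class Label { Required, Optional, Repeated };

// Each inner vector lists the field numbers seen in one observed instance
// of the same message type, in the order they appeared.
std::map<std::uint32_t, Label> infer_labels(
        const std::vector<std::vector<std::uint32_t>>& instances) {
    std::map<std::uint32_t, std::size_t> present;  // instances containing the field
    std::map<std::uint32_t, Label> labels;
    for (const auto& inst : instances) {
        std::map<std::uint32_t, std::size_t> count;
        for (std::uint32_t n : inst) ++count[n];
        for (const auto& kv : count) {
            ++present[kv.first];
            if (kv.second > 1) labels[kv.first] = Label::Repeated;
            else labels.emplace(kv.first, Label::Required);  // provisional guess
        }
    }
    // A field missing from at least one instance cannot be required.
    for (auto& kv : labels) {
        if (kv.second != Label::Repeated && present[kv.first] < instances.size())
            kv.second = Label::Optional;
    }
    return labels;
}
```

With a single instance in which every field occurs once, everything comes out as required, which matches the discrepancies described above. The guesses only improve as more instances with more data are fed in.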
The names of both fields and messages are generated automatically; this metadata cannot be obtained from the body of the protobuf message itself, because it is simply not there. You can gradually rename messages and fields as their purpose becomes more or less clear from the context of the messages being studied. Sometimes this information can also be found in the export list of the application itself. Any utility that can show it will do, for example IDA. This is how we fished out the names and field order for the tutorial::Person message, which has 4 fields.
We do the same for the rest of the messages and end up with almost the original .proto file.
Check
As a result, we got something like this .proto file:
tut2.proto
package ProtodecMessages;
message PHONE {
required string Number = 1;
required int64 Type = 2;
}
message PERSON {
required string Name = 1;
required int64 Id = 2;
required string Email = 3;
required PHONE Phone = 4;
}
message ADDRESSBOOK {
repeated PERSON Person = 1;
}
We will write a small program to check that our restored schema can be used to edit the original messages.
tut2.cpp
#include <cassert>
#include <cstdio>
#include <iostream>
#include <string>
#include "tut2.pb.h"
int main() {
GOOGLE_PROTOBUF_VERIFY_VERSION;
// read the contents of the protobuf message from std::cin
std::string data;
ProtodecMessages::ADDRESSBOOK book;
while (std::cin.peek() != EOF)
data.push_back((char)std::cin.get());
// did everything parse successfully? (parse outside assert() so it still runs with NDEBUG)
const bool parsed = book.ParseFromString(data);
assert(parsed);
assert(book.person_size() > 0);
// modify the message
ProtodecMessages::PERSON * person = book.mutable_person(0);
person->set_email("fake@name.com");
person->set_id(4321);
// write the modified message to std::cout
data = book.SerializeAsString();
assert(!data.empty());
std::cout.write(&data[0], data.size());
// Optional: Delete all global objects allocated by libprotobuf.
google::protobuf::ShutdownProtobufLibrary();
}
Compile and run:
protoc --cpp_out=. tut2.proto && g++ tut2.pb.cc tut2.cpp `pkg-config --cflags --libs protobuf` -o tut2.exe