tensor_sbis April 19, 2017 at 12:50

How to make your C++ code cross-platform?


Perhaps someone, having read the title, will ask: “Why do anything with your code?! C++ is a cross-platform language!” In general, that is true... but only as long as the code has no ties to the specific capabilities of the compiler and the target platform...

In real life, developers who solve a specific task for a specific platform rarely ask themselves: “Does this really conform to the C++ Standard? What if it is an extension of my compiler?” They write code, start the build, and fix the places their compiler complained about.

As a result, we get an application that is, to some extent, “tailored” to a specific compiler (and even to a specific version of it!) and to the target OS. Moreover, because the C++ standard library is so sparse, some things simply cannot be written without using a system-specific API.

That is how it was with us at Tensor. We wrote in MS Visual Studio 2010. Our products were 32-bit Windows applications. And, of course, the code was riddled with all kinds of ties to Microsoft technologies. One day we decided it was time to explore new horizons: time to teach VLSI to work under Linux and other OSs, and time to try switching to other hardware (POWER).

In this series of articles, I will describe how we turned our products into truly cross-platform applications: how we made them work on Linux, macOS, and even iOS and Android; how we launched our applications on a variety of hardware architectures (x86-64, POWER, ARM, and others); and how we taught them to work on big-endian machines.

The basis of all our products is our own VLSI Platform framework (hereinafter the "Platform"), which is comparable in scale to Qt. The Platform has almost everything a developer needs: from simple functions for quickly converting a number to a string to a powerful fault-tolerant application server.

On the basis of the Platform, our developers build their products (even mobile applications), which solve all kinds of business problems. We wanted to free their code (hereinafter we will call it “applied” code) from all ties to the target software and hardware platform by hiding all the specifics in the depths of our framework.

The VLSI Platform is written in C++, but this does not limit the application programmer's choice of language: in addition to C++, JavaScript, Python, and SQL can be used.

Our company is actively developing its products, so we had to "repair the train at full speed" :)

We had to work in such a way that other developers would not suffer from our activities and could comfortably continue developing their Windows functionality on MSVC. This requirement strongly influenced many technical decisions and complicated the work considerably.

To give the reader an idea of the scale of the work, here are some numbers:

The code volume of our framework is ~2 million lines
The volume of “applied” code (code based on the VLSI Platform that solves specific business problems) is difficult to estimate, but it is several times larger than that of the Platform
Over a thousand programmers in ten development centers

The boring introduction is over. Now let's get down to business and look at the problems we encountered.

Using the operating system API

As mentioned above, the C++ standard library is very meager; it lacks many features that are needed everywhere. For example, C++11 has no networking functionality... That is, as soon as we wanted to make the simplest HTTP request, we were forced to... write non-portable code!

The situation is even worse if you are not using the latest version of the compiler, as was the case with us: MSVS 2010 has dismal support for C++11 and lacks a huge part of the innovations in the core language and the standard library.

But, fortunately, such problems can be solved quite easily. There are several ways:

Write our own class with several platform-specific implementations based on the API calls of the target system. During the build, #ifdef preprocessor directives select the appropriate implementation.
Use cross-platform libraries: there are many ready-made cross-platform libraries (which, again, use platform-specific implementations internally) that greatly simplify the task. For example, to implement the HTTP client, we took cURL.
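
The first approach can be sketched roughly as follows. This is only an illustration, not code from the Platform: the function name current_process_id is hypothetical, and a process ID is used simply because both Windows (GetCurrentProcessId) and POSIX systems (getpid) provide one through different calls.

```cpp
#include <cstdint>

#if defined(_WIN32)
#  include <windows.h>   // GetCurrentProcessId
#else
#  include <unistd.h>    // getpid
#endif

// Hypothetical wrapper: callers see one portable function,
// while the preprocessor picks the system call at build time.
std::uint64_t current_process_id()
{
#if defined(_WIN32)
    return static_cast< std::uint64_t >( ::GetCurrentProcessId() );
#else
    return static_cast< std::uint64_t >( ::getpid() );
#endif
}
```

Application code calls only current_process_id() and never sees the #if branches, so supporting a new system means touching one place.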

Features of compiler implementations

Every program has bugs, and compilers are no exception. Therefore, even 100% standard-compliant code may fail to compile on some compiler.

Also, almost all compiler vendors consider it their duty to add capabilities not provided for in the Standard to their creations, and thereby provoke programmers into writing non-portable code.

What do we get in the end? Code written strictly according to the Standard may fail to compile on some compiler; code that compiles and runs with one compiler may fail to compile or to work with another...

Many problems of this class could be listed. Here is one of them:

throw std::exception( "something went wrong" ); // builds only with MSVC++: the Standard has no such constructor

This code compiles in MSVC++ because it defines an additional constructor:

exception( const char* msg ) noexcept;

Unfortunately, there are no general tricks for solving such problems. Only experience with the tools you use and good knowledge of the C++ Standard help here.
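
For this particular case, though, a portable alternative exists: std::runtime_error, which does have a standard constructor taking a const char*. A minimal sketch (the helper names fail_portably and catch_message are made up for the illustration):

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical helper used only for this illustration.
void fail_portably()
{
    // std::runtime_error has a standard constructor taking const char*.
    throw std::runtime_error( "something went wrong" );
}

// Catching by the base class still works, and what() returns the message.
std::string catch_message()
{
    try
    {
        fail_portably();
    }
    catch( const std::exception& e )
    {
        return e.what();
    }
    return std::string();
}
```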

In subsequent articles, I will return to this issue, describe the most common problems in detail, and propose ways of solving them.

Undefined behavior

The C++ Standard has an interesting term: “undefined behavior”. Here is its definition from Wikipedia:

Undefined behavior (also called unpredictable behavior in a number of sources) is the property of some programming languages (most noticeable in C), software libraries, and hardware to produce, in certain marginal situations, a result that depends on the implementation of the compiler (library, chip) and on random factors such as the state of memory or a triggered interrupt. In other words, the specification does not define the behavior of the language (library, chip) in every possible situation, but says: "under condition A, the result of operation B is undefined." Allowing such a situation in a program is considered a mistake; even if the program runs successfully with some compiler, it will not be cross-platform and may fail on another machine, in a different OS, or with different compiler settings.

If you allow undefined behavior in your program, it does not mean at all that it will crash or print errors to the console. Such a program may well work as expected. But any change in compiler settings, a switch to another compiler or compiler version, or even a modification of any piece of code can change the program's behavior and break everything!

Many situations with undefined behavior consistently produce the same result on one specific compiler, and your carefully tested application will run like a Swiss watch. But as soon as the environment changes (for example, when the program is built with another compiler), these bugs begin to show themselves and break the program completely.

The classic example of undefined behavior is going beyond the bounds of an array on the stack. Below is a simplified code snippet from one of our applications with this problem. The bug did not manifest itself under Windows for several years and only "fired" after porting to Linux:

std::string SomeFunction()
{
   char hex[9];
   // some code
   hex[9] = 0; // out-of-bounds write: the valid indices are 0..8
   return hex;
}

Apparently, MSVS aligned the buffer on the stack by adding a few bytes after it, so when we overwrote memory that was not ours, we hit an empty, unused spot. With GCC, the problem showed up in an interesting way: the program crashed far from this code, in another function (apparently, GCC inlined this function, and it began to overwrite the local variables of another function).

There are also subtler, more elusive situations with UB. For example, you can step on some very interesting rakes when using std::sort:
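
One way to avoid this kind of off-by-one is to let the library do the terminating, bounded by sizeof of the buffer. The sketch below is an assumption about what the elided "some code" might have done (formatting a 32-bit value as eight hex digits); the function name to_hex8 is hypothetical.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the elided "some code": formats a 32-bit value
// as 8 hexadecimal digits.
std::string to_hex8( std::uint32_t value )
{
    char hex[9];  // 8 digits + terminating zero; valid indices are 0..8
    // snprintf writes at most sizeof hex bytes and always zero-terminates,
    // so no manual "hex[N] = 0" is needed.
    std::snprintf( hex, sizeof hex, "%08" PRIx32, value );
    return hex;
}
```

If the terminator does have to be written by hand, indexing with sizeof hex - 1 rather than a literal keeps the index in step with the array size.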

std::vector< std::string > v = some_func();
std::sort( v.begin(), v.end(),
   []( const std::string& s1, const std::string& s2 )
{
   if( s1.empty() )
      return true;
   return s1 < s2;
} );

It would seem, where can there be UB here? The problem is in the “bad” comparator.
The comparator should return true if s1 needs to be placed before s2. Consider what our comparator produces when it is given two empty strings:

s1 = "";
s2 = "";
cmp( s1, s2 ) == true => s1 should go before s2
cmp( s2, s1 ) == true => s2 should go before s1

Thus, there are situations in which the comparator contradicts itself, that is, it does not define a strict weak ordering (see en.wikipedia.org/wiki/Weak_ordering#Strict_weak_orderings). We therefore violated the requirements that std::sort places on its arguments and got undefined behavior.

And this is not a made-up example. We caught this problem when moving to Linux. A comparator with a similar error worked for many years under Windows and... began to crash the application with SIGSEGV under Linux (i686). Interestingly, the bug behaves differently even on different Linux distributions (with different GCC versions on board): in some places the application crashes, in some it hangs, and in some it simply sorts in an unexpected order.

Situations with undefined behavior can often be caught by static analyzers (including those built into the compiler). Therefore, you should always set the maximum warning level in the build settings. And so that a useful warning does not get lost in a crowd of “unused variable” warnings, it is worth cleaning up the code once and then enabling the “treat warnings as errors” build option to keep new warnings from slipping in unnoticed.
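
A comparator that keeps the apparent intent (empty strings first) while satisfying strict weak ordering might look like the sketch below. The intent is an assumption on my part, and the names empty_first_less and sorted_copy are invented for the example. The key property is that comparing two equivalent elements returns false in both directions.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Assumed intent: empty strings first, the rest in ascending order.
bool empty_first_less( const std::string& s1, const std::string& s2 )
{
    if( s1.empty() || s2.empty() )
        return s1.empty() && !s2.empty();  // false when both are empty
    return s1 < s2;
}

std::vector< std::string > sorted_copy( std::vector< std::string > v )
{
    std::sort( v.begin(), v.end(), empty_first_less );
    return v;
}
```

Note that for std::string the plain operator< already orders the empty string before any non-empty one, so in this specific case the special branch could simply be dropped.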

Data models

The C++ Standard gives no strict guarantees about the representation of data types in memory; it defines only some relations between them (for example, sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long) <= sizeof(long long)) and provides ways of querying the characteristics of types.

On different systems, the representation of types can vary significantly. The sizes of the basic types are defined by the data model. A data model is the set of type sizes adopted by a given development environment. The table below lists the popular data models and shows the corresponding sizes of the main C++ types.

In the vast majority of cases, a programmer choosing a data type needs guarantees about its size. In practice, however, developers often simply rely on the sizes of the basic types on the system they work on. And again, when we move to another software or hardware platform, we get surprises: some code stops compiling, and some starts working differently or stops working altogether.

For example, the hash function below produces different results for the same data on different platforms, because unsigned long is 32 bits wide on some of them and 64 bits wide on others:
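
To see which data model a given build uses, the sizes can be checked directly. The sketch below is illustrative only: the helper name data_model_name is hypothetical, it assumes 8-bit bytes, and it recognizes just the three models most common on desktop and server systems.

```cpp
#include <string>

// Hypothetical helper: names the data model of the current build
// from the sizes of int, long and pointers (in bytes, assuming 8-bit bytes).
std::string data_model_name()
{
    const bool int32  = sizeof( int )   == 4;
    const bool long32 = sizeof( long )  == 4;
    const bool long64 = sizeof( long )  == 8;
    const bool ptr32  = sizeof( void* ) == 4;
    const bool ptr64  = sizeof( void* ) == 8;

    if( int32 && long32 && ptr32 ) return "ILP32";  // 32-bit Windows and Linux
    if( int32 && long32 && ptr64 ) return "LLP64";  // 64-bit Windows
    if( int32 && long64 && ptr64 ) return "LP64";   // 64-bit Linux and macOS
    return "other";
}
```

The difference between LLP64 and LP64 is exactly the width of long, which is why code relying on unsigned long behaves differently on 64-bit Windows and 64-bit Linux.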

unsigned long some_hash( const unsigned char* buf, size_t size )
{
    unsigned long res = 0;
    for( size_t i = 0; i < size; ++i )
        res = res * buf[i] + buf[i] + i;
    return res;
}

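Most of these problems can be resolved by using types with a guaranteed size, such as std::int8_t, std::int16_t, and so on:

#include <cstddef>  // std::size_t
#include <cstdint>  // std::uint32_t

std::uint32_t some_hash( const unsigned char* buf, std::size_t size )
{
    std::uint32_t res = 0;
    for( std::size_t i = 0; i < size; ++i )
        // the cast makes the truncation to 32 bits explicit on every platform
        res = static_cast< std::uint32_t >( res * buf[i] + buf[i] + i );
    return res;
}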

Signedness of char

I suspect not many developers have wondered whether char is signed. And if the question did arise, most would open their favorite development environment, write a small test program, and get an answer... that is true only for their system.

In fact, the C++ Standard does not specify whether char is signed. Because of this, there are compilers in which char is signed and compilers in which it is unsigned. This is yet another reason why your program may refuse to work after being built for another system.

For example, this code works as expected on Linux x86-64 but does not work on Linux POWER (when built with GCC with default parameters), where char is unsigned and the comparison is always true:

bool is_ascii( char s )
{
   return s >= 0;
}

To get rid of the uncertainty, just add an explicit cast to the desired type:

bool is_ascii( char s )
{
   return static_cast< signed char >( s ) >= 0;
}

In our example, we can also rewrite the check entirely with bit operations:

bool is_ascii( char s )
{
   return ( s & 0x80 ) == 0; // parentheses are required: == binds tighter than &
}
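
Instead of relying on a test run on one machine, the signedness of char can be queried from the standard library for each build. The sketch below also shows one more portable spelling of the ASCII check; the names char_is_signed and is_ascii_portable are hypothetical.

```cpp
#include <limits>

// True when plain char is a signed type in the current build.
constexpr bool char_is_signed()
{
    return std::numeric_limits< char >::is_signed;
}

// Hypothetical portable variant of the ASCII check: converting to
// unsigned char first makes the comparison independent of char's sign.
constexpr bool is_ascii_portable( char c )
{
    return static_cast< unsigned char >( c ) < 0x80;
}
```

A static_assert on char_is_signed() can document an assumption that some legacy code makes, so that a build for a platform with a different char fails loudly instead of misbehaving at run time.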

String representation

The C++ Standard leaves certain aspects unregulated, and each compiler resolves them at its own discretion.

For example, there is no guarantee how string constants will be represented in memory.
The MSVS compiler encodes string constants in Windows-1251, while GCC encodes string constants in UTF-8 by default.
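
The difference is easy to see when the bytes are spelled out explicitly. In the sketch below the same four-letter Cyrillic word is written once as Windows-1251 bytes and once as UTF-8 bytes; the variable and function names are made up for the illustration.

```cpp
#include <cstring>

// The Russian spelling of "Habr" (U+0425 U+0430 U+0431 U+0440), written byte
// by byte so the result does not depend on the encoding of this source file.
const char habr_cp1251[] = "\xD5\xE0\xE1\xF0";                  // Windows-1251
const char habr_utf8[]   = "\xD0\xA5\xD0\xB0\xD0\xB1\xD1\x80";  // UTF-8

// strlen counts bytes, not characters.
std::size_t cp1251_length() { return std::strlen( habr_cp1251 ); }
std::size_t utf8_length()   { return std::strlen( habr_utf8 ); }
```

Here cp1251_length() is 4 and utf8_length() is 8, although both arrays hold the same four characters.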

Because of such differences, the same code gives different results: strlen( "Хабр" ) gives 4 in a program built with MSVS and 8 in one built with GCC.

The same problems arise with data input and output. For example, our test program can save data to and read data from text files:

std::string readstr()
{
   std::ifstream f( "file.txt" );
   std::string s;
   std::getline( f, s );
   return s;
}

void writestr( const std::string& s )
{
  std::ofstream f( "file.txt" );
  f.write( s.c_str(), s.size() );
}

Everything works well as long as these files are written and read by applications built in the same environment. But what happens if the file is written by a Windows application and read by an application under Linux?.. We get garbled text (mojibake) :)

What to do in such cases? There is only one general principle: choose a single way of representing strings in program memory and explicitly encode and decode strings during I/O. Many developers use UTF-8 in their programs, and this is a very good solution.
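
As an illustration of explicit encoding at the I/O boundary, here is a minimal sketch that turns in-memory text into UTF-8 bytes before writing. It assumes the in-memory form is a sequence of UTF-32 code points (std::u32string), which is my choice for the example rather than something the article prescribes; the function name to_utf8 is hypothetical.

```cpp
#include <string>

// Hypothetical helper: encodes UTF-32 code points as UTF-8 bytes.
// Sketch only: it does not reject surrogates or values above U+10FFFF.
std::string to_utf8( const std::u32string& text )
{
    std::string out;
    for( char32_t cp : text )
    {
        if( cp < 0x80 )                      // 1 byte: ASCII
        {
            out += static_cast< char >( cp );
        }
        else if( cp < 0x800 )                // 2 bytes: includes Cyrillic
        {
            out += static_cast< char >( 0xC0 | ( cp >> 6 ) );
            out += static_cast< char >( 0x80 | ( cp & 0x3F ) );
        }
        else if( cp < 0x10000 )              // 3 bytes: rest of the BMP
        {
            out += static_cast< char >( 0xE0 | ( cp >> 12 ) );
            out += static_cast< char >( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
            out += static_cast< char >( 0x80 | ( cp & 0x3F ) );
        }
        else                                 // 4 bytes: supplementary planes
        {
            out += static_cast< char >( 0xF0 | ( cp >> 18 ) );
            out += static_cast< char >( 0x80 | ( ( cp >> 12 ) & 0x3F ) );
            out += static_cast< char >( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
            out += static_cast< char >( 0x80 | ( cp & 0x3F ) );
        }
    }
    return out;
}
```

With a helper like this on the write path, and a matching decoder on the read path, the bytes in the file no longer depend on which compiler built the program.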

But, as I mentioned above, we were “repairing the train at full speed” and could not break some of the invariants our code relied on (it had been developed on the assumption that strings are encoded in Windows-1251):

fixed character width: random access to a character by its index is possible
string constants can be written in Russian in the code

In UTF-8, characters can be represented by a different number of bytes, so it does not satisfy the first requirement. The second requirement is not met with UTF-8 in, for example, MSVC 2010, where string constants are encoded in Windows-1251. Therefore, we had to abandon UTF-8, and we decided... to abstract away completely from the encoding in which strings are represented and switched to “wide strings”.

This solution almost completely satisfied our requirements:

On almost all UNIX systems, “wide strings” are encoded in UTF-32, that is, the character width is fixed and matches the size of a wchar_t element
Windows uses UTF-16. With this encoding the situation is somewhat more complicated, since some characters are represented by surrogate pairs. But, fortunately, everything in Windows-1251, which our Windows applications used, is represented by a single two-byte code unit. Therefore, at the initial stage we did not support surrogate pairs at all and assumed that under Windows every character fits into one wchar_t element.
In C++, you can write "wide" string constants, for example L"Hello, Habr!". The compiler then takes care of transcoding the string from the encoding of the source file to the encoding in which wchar_t is represented on the target system.
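
A small sketch of what this buys in practice. The string is written with universal character names so that the example does not depend on the source file encoding; the names habr, habr_length, and habr_first are invented for the illustration.

```cpp
#include <cstddef>
#include <string>

// The Russian spelling of "Habr", given as universal character names.
const std::wstring habr = L"\u0425\u0430\u0431\u0440";

// Number of wchar_t elements: the same with UTF-16 and UTF-32 wchar_t,
// because none of these characters needs a surrogate pair.
std::size_t habr_length() { return habr.size(); }

// Random access by index yields a whole character for this text.
wchar_t habr_first() { return habr[0]; }
```

For text like this, size() counts characters rather than bytes on both Windows and UNIX-like systems, which is exactly the fixed-width invariant the old Windows-1251 code relied on.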

In addition, “wide strings” gave us a number of advantages:

The standard C and C++ libraries have many functions and classes for working with “wide strings”, so there is no need to write your own analogs of strlen, strstr, std::string, std::stringstream, and so on
Many third-party libraries support wide strings (for example, Boost)
Most of WinAPI can work with “wide strings”

On all the platforms we need, “wide characters” are Unicode. Thanks to this, our applications are no longer limited to the Latin and Cyrillic alphabets and support any language in the world.

In fact, fighting with encodings was the hardest part of porting our products. There is much more to tell about it, but let's leave that for the next articles :)

OS file system features

The Windows file system differs from the file systems of most UNIX-like systems in several ways:

It is case-insensitive
It allows the character "\" to be used as a path separator

What does this lead to? You can name your header file “FiLe.H” and write #include "myfolder\file.h" in the code. On Windows this code compiles, but on Linux you get an error saying that a file named “myfolder\file.h” was not found.

Fortunately, avoiding such problems is very simple: adopt rules for naming files (for example, name all files in lower case) and stick to them, and always use “/” as the path separator (Windows supports it too).

To rule out annoying errors completely, we put a simple hook on our git repositories that checks whether the include directives comply with these rules.

File system features also affect the application itself. For example:
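
The check such a hook performs could look roughly like the sketch below. This is not our actual hook, only an illustration of the rule: the function name include_is_portable is hypothetical, and it assumes every included path, including third-party headers, is expected to be lower-case.

```cpp
#include <cctype>
#include <string>

// Hypothetical check: an #include line passes if the path between the
// delimiters has no backslashes and no upper-case ASCII letters.
bool include_is_portable( const std::string& line )
{
    const std::string::size_type hash = line.find_first_not_of( " \t" );
    if( hash == std::string::npos || line.compare( hash, 8, "#include" ) != 0 )
        return true;   // not an include directive: nothing to check

    const std::string::size_type open = line.find_first_of( "\"<", hash + 8 );
    if( open == std::string::npos )
        return true;   // macro-style include: out of scope for this sketch

    const char close_ch = ( line[open] == '<' ) ? '>' : '"';
    const std::string::size_type close = line.find( close_ch, open + 1 );
    if( close == std::string::npos )
        return false;  // malformed directive

    for( std::string::size_type i = open + 1; i < close; ++i )
    {
        const unsigned char c = static_cast< unsigned char >( line[i] );
        if( c == '\\' || std::isupper( c ) )
            return false;
    }
    return true;
}
```

A hook would run a check like this over every changed source line and reject the commit on the first failure.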

std::string root_path = get_some_path();
std::string path = root_path + '\\' + fname;

If you have code that “glues” paths together with ordinary string concatenation and uses “\” as the separator, it will break, because on some operating systems the separator will be interpreted as part of the file name.

Of course, you can use '/', but on Windows it looks ugly, and in general there is no guarantee that no OS uses some other separator.

To solve this problem, we use the boost::filesystem library. It lets you form paths correctly for the current system:

boost::filesystem::path root_path = get_some_path();
boost::filesystem::path path = root_path / fname;

Conclusion

Developing cross-platform software in C++ is a non-trivial task. It is hardly possible to write a program that works on various software and hardware platforms without making any additional effort. Nor is it possible to develop a large C++ program that builds correctly with any compiler, for any OS and any hardware, without changes, even though C++ is a cross-platform language. But if you follow a number of rules, which I have briefly set out in this article, you can write code that runs on all the platforms you need. And porting such a program to a new OS or new hardware will no longer be so difficult.

In summary, to write cross-platform code you need to:

Know the C++ Standard well and understand what it allows, what is an extension of a particular compiler, and what leads to undefined behavior.
Avoid using the system API directly in the code: encapsulate platform-specific code in dedicated classes or use ready-made cross-platform libraries.
Take possible differences in type representation into account and do not depend on properties of basic types that the C++ Standard does not guarantee. For this, you can use the fixed-width types from the C++ standard library.
Decide on the format for representing strings in program memory. There are many options: for example, use UTF-8, as many programs do, or switch to “wide” strings altogether, abstracting away from the string representation format.
Take into account the features of file systems on different operating systems (in #include directives as well as in the logic of the program itself).

Author: Alexey Konovalov
