Once again about undefined behavior, or "why you shouldn't drive nails with a chainsaw"

    Much has been written about undefined behavior: quotations from the standards, explanations of how to interpret them, various instructive examples. But it seems that everyone who has tried to write about it misses an important point: in my opinion, no one has clearly bothered to explain where this concept came from in the first place and, most importantly, whom it is addressed to.

    Although in fact, if you recall the history of C, everything is quite obvious and, most importantly, logical. And to people who have not forgotten what C is and why it even exists, all the complaints of people who got "burned" by undefined behavior sound something like this: "I was hammering nails with a chainsaw... hammered and hammered, everything was fine, and then I pulled the handle, its teeth started moving, and it took my arm clean off... well, who builds them like that?"

    People who know what a chainsaw is try, of course, to explain that the handle is there precisely to be pulled, that this is how it is supposed to work, but people who are convinced they are holding some kind of hammer just tune them out, and, as a result, everyone keeps their own opinion.

    So what important secret do people overlook?

    The answer is simple: they forget why this language exists at all and what problems it solves.

    Let's recall what C is and why it was, in fact, invented. Everyone knows that it is a low-level language. Everyone knows that it is not a portable language (it's not Java, no). But many people forget (or simply don't think about) what kind of language it is and why it was invented. Rather, everyone knows some basic facts (oh yes, UNIX; K&R, of course; ANSI and ISO, heard of those somewhere), but the simplest consequences of these facts somehow get completely overlooked. Which leads to an absolutely terrible misunderstanding of the place of C (and the ANSI C and ISO C standards) in this world.

    So here it is. The C language was created, essentially, for one single purpose: to write the UNIX operating system. That is, low-level code. But wait a minute: if you follow the link to Wikipedia, you will find that UNIX is not only multi-tasking and multi-user, it is also portable.

    And it is precisely this combination that leads us to undefined behavior.

    Indeed: how can you make a program portable? The obvious option: write it in a portable language whose description spells out in detail what every operator does, how overflow is handled, what happens on access through an uninitialized pointer, and so on and so forth. That is what Java or ECMAScript does. But this, as you know, is inefficient. You would have to insert checking code in a heap of places, and some operations would become terribly inefficient "out of nowhere": shifting 1 left by 129 gives 2 on x86 but 0 on ARM, so however you define the value of the expression 1 << 129, you will have to generate inefficient code on one of the processors. Etc., etc. Not the best option for an OS kernel. And anyway, what kind of "low-level language" is it where the simplest things are translated into 10 assembler instructions (and that's if you're lucky and it's 10, not 100)?

    The UNIX (and C) developers went the other way. The language they made is not portable, but the UNIX written in it is very portable. How did they achieve this? With prohibitions. Does a shift by a very large count give different results on different processors? Then you may not use such shifts in a program. Is comparing pointers into different segments of memory expensive and difficult (remember the 8086)? Then you may not use such comparisons in programs. Does one processor throw an exception on overflow while another silently produces a negative number? Then you may not use such a construct in a program. Etc., etc. There are tens and hundreds of such rules, all designed to ensure portability of the code.

    Note: these rules are addressed to the programmer, not at all to the compiler developer. It is the programmer who needs portable code, not the compiler. For the compiler, these rules, on the contrary, provide freedom of implementation. If addition produces a negative number on one machine and throws an exception on another, what should the compiler do? It doesn't matter: the programmer must make sure this never happens! This is what lets UNIX (and its "ideological successor" Linux) support an absolutely incredible number of platforms. And many programs written in this non-portable language also run in far more places than programs written "in truly portable languages" such as C# or Java. The thing is that with C# or Java, portability rests primarily with the compiler, while in C it rests not with it at all, but entirely with the programmer.

    After C began to be used in a bunch of different places besides the UNIX kernel, people got together and wrote a standard (actually not just one: there are several, but they build on each other). And an important part of the work on this standard was collecting and classifying all these prohibitions imposed on the programmer. Without that, portability from one compiler to another could not be achieved.

    For example: may you use pointers that do not point "inside" an array? Generally speaking, you may not, since there are things like the iAPX 432 where all of this is checked in hardware. But sometimes you really want to. Remember begin and end from C++. Half-open intervals are convenient. With a gnashing of teeth it was decided that yes, every implementation must provide one element "past the end". Or rather, not the element itself but its address: you still cannot access the element lying "beyond the edge of the array", but you may take its address. And where the hardware does not allow it, let the compiler add one "invisible" element. In general, there were a lot of such decisions, where it was debated which prohibitions the programmer could still "live with" and which were already "no way".

    True, some of these prohibitions were "over the line". It turned out, for example, that you could not simply take and convert an int to an unsigned int. Because there is two's complement and there is ones' complement. Here the standardization committee decided: "No, enough is enough, these are some completely crazy restrictions we are imposing on the programmer." A "Solomonic decision" was made: to allow implementations to choose one of the options (sometimes explicitly listed in the standard, sometimes not), but to oblige each implementation to always use the same approach. That is, an implementation may use two's complement or ones' complement, but only ever one of them. And in any case you can convert a number from int to unsigned int and back and get... well, at least something.
    The standard actually says a little more: positive values representable in both types are guaranteed to be preserved, which makes life easier.

    But such options (they are called "implementation-defined behavior") were not enough for everything. The general prohibitions (known as "undefined behavior") were left in the standard, and the duty of dealing with the consequences was laid on the programmer: if a programmer wants his C program to work on different compilers, he must make sure the trouble never happens, while the compiler developers may do absolutely anything in such cases (that is, in fact, the whole point).
    By the way, there is one more related concept: "unspecified behavior". This is when several options are possible and the compiler, in each particular case, is free to choose one of them (say, the standard does not specify which function is called first in the expression f()+g(), and in each specific case the compiler is free to choose whichever is more convenient, but it must first call both of them and only then add the results). The goal is the same: to make life easier for compiler developers.

    Note that in many cases it is not at all easy to understand whether a prohibition is being violated. For example, the C standard forbids accessing the same memory through two pointers of different types, used interleaved. And why, actually? Very simple: remember the 8086 and 8087 processors. They work in parallel and, to a certain extent, independently. So if you do not specifically insert synchronization commands, the well-known trick where a number of type float is reinterpreted as a number of type int (or long on 16-bit compilers) may simply not work! The desired number (which the 8087 is supposed to produce) will simply not be in memory yet when the 8086 "arrives" there!

    Note: it is not forbidden to have two pointers to objects of different types (otherwise unions would make no sense); it is forbidden to use them "mixed": if you put in an int, take out an int; if you put in a float, take out a float. It is clear that checking such prohibitions is extremely difficult, and it is not required: this is a restriction not on the compiler but on the programmer! It is his duty, not the compiler's.

    A correct C program is required to observe these prohibitions, otherwise no portability will come of it! Well, and since a correct program never violates these prohibitions and never triggers undefined behavior, it would be a sin not to use that fact to speed the program up, wouldn't it? For example, the value of a pointer p before a call to realloc and after that call can be treated as two different "virtual variables". I hope you understand why this is useful: for example, one "virtual variable" can be kept in a register even if the address of the other is passed somewhere... they are different, they do not affect each other!

    And then ... we have what we have:
    #include <stdio.h>
    #include <stdlib.h>

    int main() {
      int *p = (int*)malloc(sizeof(int));
      int *q = (int*)realloc(p, sizeof(int));
      *p = 1;  /* deliberately undefined: p must not be used after realloc */
      *q = 2;
      if (p == q)
        printf("%d %d\n", *p, *q);
      free(q);
      return 0;
    }

    $ clang -O realloc.c ; ./a.out 
    1 2
    But... this is terrible, a nightmare, how can this be at all? Like this: once you have called realloc, there are no longer two variables in your program but three: p₁, p₂ and q. At the same time, the variable p₂ never received any value that could make it point to the piece of memory that q points to (it received, in fact, no value at all that the compiler is obliged to take into account). Therefore the values *p₂ and *q are also different. And we know which ones: *p₂ is 1 and *q is 2. So we can pass these values straight to printf. Savings! True, for some reason the p₂ == q check could not be optimized down to false, but oh well: maybe the next version of the compiler will manage it.

    Note that this result arises from a long chain of transformations, each of which is sensible and logical. Assuming, of course, that the programmer plays by the "rules of the game" and writes a program that violates no "prohibitions". Well, and if he did violate them... what can I say: he has only himself to blame; he should have studied the documentation more thoughtfully.

    That's all, really: the proposal, absolutely natural for people who use C on a single platform, to implement operations so that they match "common sense", that is, the properties of the target platform, turns out to be unnatural if we proceed from the main purpose of the C language: creating fast, low-level, but portable programs. For portable programs none of this is needed, simply because such behavior should never occur in their code at all. Well, and if you use a chainsaw to drive nails, then... watch your fingers!

    And for exactly the same reason, the question "could this lead to incorrect machine code that the compiler might theoretically generate?" is also meaningless: what machine code? There will be no code there at all! We are discussing not just a compiler, but a well-optimizing compiler! It not only may, it is obliged to simply throw such code out: if a piece of code necessarily triggers undefined behavior, then in a correctly written C program it is never executed and can therefore be thrown away, which will obviously only reduce the size of the program, and maybe even speed it up a bit. Pure profit!

    Of course, the developers of specific compilers may decide not to exploit some of these rules. For example, GCC offers as many as three ways of handling signed overflow: the default behavior (as described in the standard, when it can happily turn a finite loop into an infinite one), the -fwrapv mode, where two's-complement wraparound is supported, and the -ftrapv mode, where these cases are caught and an exception is raised. And for cases when you need to write code where different types are mixed, there is not only -fno-strict-aliasing but also the may_alias type attribute. And there is a lot more besides. But all of this consists of extensions to the standard which, in most cases, must be explicitly enabled.

    If you thoroughly know the behavior of your particular processor and want to use its quirks, then, alas, I have to disappoint you: not only do you have no right to expect behavior close to the hardware, but on the contrary, if the behavior of your processor differs from everyone else's, then with 99% probability using that feature from C is strictly forbidden. And this is not an oversight by the compiler developers, but quite the opposite: a direct consequence of the basic principle underlying the language.
