Undefined behavior != Unsafe programming
From the translator: I present a translation of two posts from John Regehr's blog. I have combined them into a single publication because, first, they are both quite short, and second, the second post is a follow-up to the first: it responds to a comment left on the first post on Hacker News.
Link to the first post
Link to the second post
Part 1. Undefined behavior != Unsafe programming
Undefined behavior (UB) in C and C++ is a hazard for developers, especially when code handles untrusted data. It is less well known that undefined behavior also exists in the intermediate representation (IR) of most optimizing ahead-of-time (AOT) compilers. For example, LLVM IR has undef and poison values in addition to C's "explosive" UB. When people start to worry about this, a typical reaction is: "What? LLVM IR is just as bad as C!" This article explains why that conclusion is wrong.
Undefined behavior is the result of a design decision: the refusal to systematically trap program errors at one particular level of the system. The responsibility for avoiding such errors is delegated to a higher level of abstraction. For example, it is obvious that a safe language can be compiled to machine code, and it is equally obvious that the unsafety of machine code in no way compromises the high-level guarantees made by the language implementation. Swift and Rust are compiled to LLVM IR; some of their safety guarantees are enforced by dynamic checks in the generated code, others are established by type checking and have no representation at the LLVM IR level. Either way, UB at the LLVM level is neither a problem for, nor detectable by, code written in the safe subsets of Swift and Rust. Even C can be used safely, if some tool in the development environment guarantees the absence of UB. The L4.verified project does exactly that.
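To make the dynamic-check case concrete, here is a minimal sketch in C (not from the original posts; the function checked_get is made up for illustration). Conceptually, the code a safe language emits for a bounds-checked array access looks like this: a run-time check guards the otherwise unsafe memory access, while statically enforced guarantees, such as the element type, leave no trace in the emitted code at all.
#include <stddef.h>
#include <stdlib.h>

/* Sketch of a dynamic safety check of the kind a safe language emits.
 * The check guards the unsafe operation; the guarantee that xs really
 * points at ints is purely static and has no run-time representation. */
int checked_get(const int *xs, size_t len, size_t i) {
    if (i >= len)
        abort();      /* a safe language would trap or raise an error here */
    return xs[i];     /* the unsafe operation, now known to be in bounds */
}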
The essence of undefined behavior is that it decouples error checks from the unsafe operations themselves. Because the checks are not tied to the unsafe operations, the optimizer can remove them or, for example, hoist them out of a loop. With a well-designed IR, the remaining unsafe operations can then be mapped onto processor operations with little or no overhead. For example, consider the following Swift code:
func add(a : Int, b : Int)->Int {
  return (a & 0xffff) + (b & 0xffff);
}
Although the Swift implementation must trap on integer overflow, the compiler sees that overflow is impossible here and generates the following LLVM IR:
define i64 @add(i64 %a, i64 %b) {
  %0 = and i64 %a, 65535
  %1 = and i64 %b, 65535
  %2 = add nuw nsw i64 %0, %1
  ret i64 %2
}
Not only has the checked addition been lowered to an unchecked add, but the add instruction has also been marked with the nsw and nuw attributes, which state that both signed and unsigned overflow are undefined. On their own these attributes bring no benefit, but they can enable additional optimizations once the function is inlined. When the Swift test suite is compiled to LLVM, about one in eight addition instructions has an attribute indicating that integer overflow is undefined.
In this particular example the nsw and nuw attributes are redundant, since an optimization pass could rediscover the fact that the addition cannot overflow. In general, however, these and similar attributes have real value: they spare the compiler from running expensive static analyses just to rediscover facts it already knew. Moreover, some facts cannot be rediscovered later, because the relevant information is lost at certain compilation steps.
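A source-level illustration of the same idea, as a small sketch of my own (the function plus_one_is_bigger is made up): in C, signed overflow is undefined behavior, which is the source-level counterpart of the nsw attribute, so a compiler is allowed to fold the comparison below to 1 without performing any range analysis on x.
/* Sketch: because signed integer overflow is undefined in C (the
 * counterpart of nsw at the IR level), the optimizer may assume that
 * x + 1 never wraps and fold this function to `return 1;`. */
int plus_one_is_bigger(int x) {
    return x + 1 > x;
}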
In summary, undefined behavior in a programmer-visible abstraction represents an aggressive and dangerous trade-off: it sacrifices program correctness in favor of performance and compiler simplicity. In contrast, UB at lower levels of the system, such as machine code or a compiler IR, is an internal design decision that does not affect how the programmer perceives the system. This kind of UB merely requires us to accept that safety checks can be decoupled from the unsafe operations they guard, so that programs can execute efficiently.
Part 2. Does an expressive programming language always have undefined behavior?
On Hacker News, someone left the following comment on my previous post:
IIRC, Gödel's incompleteness theorem implies that every language must have at least some undefined behavior.
Let us take a look at this claim, keeping in mind that incompleteness and undecidability are rather tricky topics. Many years ago I read, with pleasure, Gödel's Theorem: An Incomplete Guide to Its Use and Abuse (and I no doubt need to reread it).
First of all, there exist programming languages without UB: for example, a language in which every program prints "7". Whatever UB means (we have not yet given it a formal definition), it is clear that a language that always prints "7" has none.
There are also useful languages without UB, such as an expression language for computing elementary functions over IEEE floating-point numbers. Such languages are easy to reason about because they are not Turing-complete: every computation terminates, and we only need to ensure that it terminates with a defined result.
What the HN commenter presumably had in mind, however, is Rice's theorem: any non-trivial semantic property of the language recognized by a Turing machine is undecidable. So if f() is some arbitrary computation, we cannot, in general, decide whether the following program executes undefined behavior:
void f(void);   // some arbitrary computation, declared so the snippet compiles

int main() {
  f();
  return 1 / 0; // UB!
}
But this objection misses the point. Rice's theorem applies only to non-trivial properties. The way around it is to define a language in which freedom from UB is a trivial property. This is achieved by making every operation in the language a total function: defined for all possible operands. Programs in such a language either terminate in a well-defined state or do not terminate at all. Such exhaustively defined specifications are rarely written for real programming languages, because they require an enormous amount of work, especially if the language is open to interaction with other levels of the system, for example through inline assembly. But this difficulty is purely practical; in principle there is no problem.
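As a minimal sketch of what "every operation is a total function" means in practice (my own example, not from the original posts; total_div is a made-up name), here is integer division given a defined result for every pair of operands. A language built only from operations like this has no UB for Rice's theorem to say anything about.
#include <limits.h>

/* Sketch: division made total. The two cases that are undefined for C's
 * built-in division, division by zero and LONG_MIN / -1, are given an
 * arbitrary but fully defined result. */
long total_div(long n, long d) {
    if (d == 0)
        return 0;                 /* chosen defined result for n / 0 */
    if (n == LONG_MIN && d == -1)
        return LONG_MIN;          /* avoid the remaining overflow case */
    return n / d;
}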
Returning to the original HN comment: it spoke of the incompleteness theorems, not of the halting problem. I am not quite sure what to say to that; I simply do not see how Gödel's theorems bear on undefined behavior in programming languages.