How Clang Compiles a Function

Original author: John Regehr, Professor of Computer Science, University of Utah, USA
  • Transfer
I planned to write an article about how LLVM optimizes a function, but first you need to write how Clang translates C or C ++ to LLVM.

image


Consider the high-level aspects, not plunging into the depths of Clang. I want to draw attention to how the output of Clang relates to the input, while we will not consider the nontrivial possibilities of C ++. We use this small function, which I borrowed from the excellent lectures on cycle optimization :

boolis_sorted(int *a, int n){
  for (int i = 0; i < n - 1; i++)
    if (a[i] > a[i + 1])
      returnfalse;
  returntrue;
}

Since Clang does not make any optimizations, and since LLVM IR was originally designed to work with C and C ++, the conversion is relatively easy. I will use Clang 6.0.1 (or a close version, since this one has not yet been released) on x86-64.

The command line is as follows:

clang++ is_sorted.cpp -O0 -S -emit-llvm

In other words: compile the is_sorted.cpp file as C ++ and then we say the following LLVM toolchain: do not optimize, display the assembler, as a textual representation of LLVM IR. LLVM IR is voluminous, and cannot be quickly output or parsed, binary bitcode format is always preferable if a person does not need to look at this code. Here is the full LLVM IR, we will look at it in parts.

Let's start with the top of the file:

; ModuleID = 'is_sorted.cpp'
source_filename = "is_sorted.cpp"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

The whole text between the semicolon and the end of the line is a comment, which means that the first line does nothing, but if you are interested, in LLVM a “module” is a compilation unit, a container for code and data. The second line should not bother us either. The third line describes some assumptions made by the compiler; they do not play a role in this article, but you can read more here . The target troika is the legacy of gcc and we will not need it later.

The LLVM function has optional attributes:

; Function Attrs: noinline nounwind optnone uwtable

Some of them (like these) are supported by the front end, others are added later by optimization passes. These attributes have nothing to do with the meaning of the code, I will not discuss them here, but you can read about them here if you are interested.

And finally, our function:

define zeroext i1 @_Z9is_sortedPii(i32* %a, i32 %n) #0 {

“Zeroext” means that the return value of the function (i1, one-bit integer) must be expanded with zeros in the backend, to the width that the ABI requires. Then comes the “augmented” (mangled) function name, then the parameter list is basically the same as in C ++ code, except that i32 defines a 32-bit variable. # 0 connects the function to the attribute group at the end of the file.

Here is the first base unit:

entry:
  %retval = alloca i1, align 1
  %a.addr = alloca i32*, align 8
  %n.addr = alloca i32, align 4
  %i = alloca i32, align 4
  store i32* %a, i32** %a.addr, align 8
  store i32 %n, i32* %n.addr, align 4
  store i32 0, i32* %i, align 4
  br label %for.cond

Each LLVM instruction must be located inside the base unit: a set of instructions that has one input at the beginning and one output at the end. The last instruction of the base unit must be a terminating instruction : “dropping” into the next base unit is not allowed. Each function must have an input block that does not have predecessors (predecessors) that perform the transition to this block. This and other properties are checked when IR is parsed, these checks can also be invoked multiple times during the compilation process by the “module verifier”. The verifier is useful for debugging when the pass generates the wrong IR.

The first four instructions in this base block are alloca: allocating stack memory. The first three create variables, implicitly created during compilation, the fourth - a loop variable. Variables allocated in this way can only be accessed via the load and store instructions. The following three instructions initialize three stack slots, a.addr and n.addr are initialized using the values ​​passed to the function as parameters, and i is initialized to zero. The return value does not need to be initialized, any code that is not undefined in C and C ++ will have to take care of this. The last instruction is an unconditional transition to the next base unit (we don’t worry about it yet, most of the unnecessary transitions will be removed by the LLVM backend).

You might ask: Why does Clang allocate stack slots for a and n? Why doesn't he just use these values ​​directly? In this function, since a and n do not change, such a strategy will work, but this case will be taken into account by the optimizer, and is beyond the competence of Calng. If a and n can be modified, they should be in memory, and should not be SSA-values, which, by definition, can take on value only at one point in the program. Memory cells are outside the SSA world and can be modified at any time. This may seem strange, but this solution allows you to organize the work of all parts of the compiler in a natural and effective way.

I think of Clang as a degenerate SSA code generator that satisfies all SSA requirements, but only because information is exchanged between the base units through memory. The generation of non-degenerate code requires some care and some analysis, and the Clang developers refused to do this in order to share the responsibilities of code generation and optimization. I did not see the measurement results, but in my understanding, a lot of memory operations are generated, and then almost immediately most of them are deleted by the optimizer, without leading to a large overhead of compile time.

Consider how the for loop is broadcast. In general, it looks like this:

for (initializer; condition; modifier) {
  body
}

This translates to something like this:

  initializer
  goto COND
COND:
  if (condition)
    goto BODY
  elsegoto EXIT
BODY:
  body
  modifier
  goto COND
EXIT:

Of course, this translation is not Clang specific, any C and C ++ compiler does the same.

In our example, the loop initializer is minimized to an input base unit. The following base block is a verification of the cycle condition:

for.cond:                                         ; preds = %for.inc, %entry
  %0 = load i32, i32* %i, align 4
  %1 = load i32, i32* %n.addr, align 4
  %sub = sub nsw i32 %1, 1
  %cmp = icmp slt i32 %0, %sub
  br i1 %cmp, label %for.body, label %for.end

Clang also makes a useful comment that this base block can be reached either from for.inc or from an input base block. This block loads i and n from memory, reduces n (the nsw flag reflects the C property that the sign overflow is undefined; without this flag, LLVM uses the additional code semantics), compares the reduced value with i, using the slt command (signed less than, signed less than) and then finally branch to the for.body or for.end base unit.

Logging into the loop body is possible only from the for.cond block:

for.body:
  %2 = load i32*, i32** %a.addr, align 8
  %3 = load i32, i32* %i, align 4
  %idxprom = sext i32 %3 to i64
  %arrayidx = getelementptr inbounds i32, i32* %2, i64 %idxprom
  %4 = load i32, i32* %arrayidx, align 4
  %5 = load i32*, i32** %a.addr, align 8
  %6 = load i32, i32* %i, align 4
  %add = add nsw i32 %6, 1
  %idxprom1 = sext i32 %add to i64
  %arrayidx2 = getelementptr inbounds i32, i32* %5, i64 %idxprom1
  %7 = load i32, i32* %arrayidx2, align 4
  %cmp3 = icmp sgt i32 %4, %7
  br i1 %cmp3, label %if.then, label %if.end

The first two lines load a and i into SSA registers; i then expands to 64 bits and can participate in the address calculation. The getelementptr team (or gep for short) is the well-known LLVM command, it even has its own help section . Unlike machine language, LLVM does not treat pointers as integers. This facilitates alias analysis and other memory optimizations. This code loads a [i] and a [i + 1], compares them, and performs branching depending on the result.

The if.then block saves 0 to the stack slot for the return value of the function and performs an unconditional transition to the output block of the function:

if.then:
  store i1 false, i1* %retval, align 1
  br label %return

The else block is trivial:

if.end:
  br label %for.inc

And the block for adding one to the loop variable is also very simple:

for.inc:
  %8 = load i32, i32* %i, align 4
  %inc = add nsw i32 %8, 1
  store i32 %inc, i32* %i, align 4
  br label %for.cond

This code jumps back to the loop condition check.

If the loop ends normally, we return true:

for.end:
  store i1 true, i1* %retval, align 1
  br label %return

Finally, what we loaded into the stack return value slot is loaded and returned:

return:
  %9 = load i1, i1* %retval, align 1
  ret i1 %9

There is nothing special at the end of the function. The post has turned out longer than I thought, in the next post we will look at the optimization of the IR level for this function.

(Thanks to Xi Wang and Alex Rosenberg for the fixes sent)

Also popular now: