Why can LLVM call a never called function?

    Whatever your dragon has told you, he lied. Dragons are false. You do not know what awaits you on the other side.
    Michael Swanwick, "The Iron Dragon's Daughter"

    Not long ago, a post titled "How can a never-called function be called?" was published on Habr. Its conclusion is simple: in the presence of undefined behavior, the compiler is entitled to take any action, however unexpected. What interested me, though, was the mechanism of this optimization itself, and I would like to share the results of my little investigation with the Habr community.



    Let me recall in a nutshell what the point was. In the source code below, the EraseAll function should not be called from main, and indeed it is not called when compiled with -O0, yet it suddenly is called with -O1 and higher.

    #include <stdlib.h>
    typedef int (*Function)();
    static Function Do;
    static int EraseAll() {
      return system("rm -rf /");
    }
    void NeverCalled() {
      Do = EraseAll;  
    }
    int main() {
      return Do();
    }

    This is explained as follows: in the code above, the Do variable is a pointer to a function and is initially null. When we try to call a function through a null pointer, the behavior of the program is undefined (undefined behavior, UB), and the compiler may optimize the UB however it finds convenient. In this case, the compiler simply behaved as if the assignment Do = EraseAll had already happened.
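    As a side note, the UB disappears as soon as the program guards the indirect call. Below is a minimal sketch of my own (the NULL check and the CallDo helper are not in the original post):

    ```c
    #include <stdio.h>
    #include <stddef.h>

    typedef int (*Function)(void);
    static Function Do;  /* file-scope, so it starts out as NULL */

    static int PrintHello(void) {
        return printf("hello world\n");
    }

    void NeverCalled(void) {
        Do = PrintHello;
    }

    /* With an explicit NULL check there is no UB, so the optimizer
       may no longer assume that Do must have been assigned. */
    int CallDo(void) {
        if (Do != NULL)
            return Do();
        return -1;  /* the pointer was never set */
    }

    int main(void) {
        return CallDo() == -1;  /* exits with 1 when Do was never assigned */
    }
    ```

    Compiled with clang -O1, this version should no longer call PrintHello (or, in the dramatic original, EraseAll) out of thin air.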

    Why it did so is what we will now try to figure out. Throughout this text, LLVM and Clang version 5.0.0 are used as the compiler.

    To begin with, let's look at the IR code with -O0 and -O1. We'll slightly modify the source to make it less destructive:

    
    #include <stdio.h>
    typedef int (*Function)();
    static Function Do;
    static int PrintHello() {
      return printf("hello world\n");
    }
    void NeverCalled() {
      Do = PrintHello;  
    }
    int main() {
      return Do();
    }

    And generate the IR code at -O0 (unimportant details are omitted for clarity):

    ; ModuleID = 'test.c'
    source_filename = "test.c"
    target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
    target triple = "x86_64-unknown-linux-gnu"
    @Do = internal global i32 (...)* null, align 8
    @.str = private unnamed_addr constant [13 x i8] c"hello world\0A\00", align 1
    ; Function Attrs: noinline nounwind optnone uwtable
    define void @NeverCalled() #0 {
    entry:
      store i32 (...)* bitcast (i32 ()* @PrintHello to i32 (...)*), i32 (...)** @Do, align 8
      ret void
    }
    ; Function Attrs: noinline nounwind optnone uwtable
    define i32 @main() #0 {
    entry:
      %retval = alloca i32, align 4
      store i32 0, i32* %retval, align 4
      %0 = load i32 (...)*, i32 (...)** @Do, align 8
      %call = call i32 (...) %0()
      ret i32 %call
    }
    ; Function Attrs: noinline nounwind optnone uwtable
    define internal i32 @PrintHello() #0 {
    entry:
      %call = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([13 x i8], [13 x i8]* @.str, i32 0, i32 0))
      ret i32 %call
    }
    declare i32 @printf(i8*, ...) #1

    And at -O1:

    ; ModuleID = 'test.ll'
    source_filename = "test.c"
    target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
    target triple = "x86_64-unknown-linux-gnu"
    @.str = private unnamed_addr constant [13 x i8] c"hello world\0A\00", align 1
    ; Function Attrs: noinline nounwind optnone uwtable
    define void @NeverCalled() local_unnamed_addr #0 {
    entry:
      ret void
    }
    ; Function Attrs: noinline nounwind optnone uwtable
    define i32 @main() local_unnamed_addr #0 {
    entry:
      %retval = alloca i32, align 4
      store i32 0, i32* %retval, align 4
      %call = call i32 (...) bitcast (i32 ()* @PrintHello to i32 (...)*)()
      ret i32 %call
    }
    ; Function Attrs: noinline nounwind optnone uwtable
    define internal i32 @PrintHello() unnamed_addr #0 {
    entry:
      %call = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([13 x i8], [13 x i8]* @.str, i32 0, i32 0))
      ret i32 %call
    }
    declare i32 @printf(i8*, ...) local_unnamed_addr #1

    You can build the executables and verify that in the first case a segmentation fault occurs, while in the second “hello world” is printed. With other optimization levels the result is the same as with -O1.

    Now let's find the part of the compiler that performs this optimization. Recall that in the LLVM architecture the front end does not perform optimizations itself: cfe (Clang Frontend) always generates unoptimized code, which is what we see with -O0, and optimizations are performed by the opt utility:



    With -O1, the following optimization passes are performed:

    An impressive list (feel free to skip it)
    -targetlibinfo -tti -tbaa -scoped-noalias -assumption-cache-tracker -profile-summary-info -forceattrs -inferattrs -ipsccp -globalopt -domtree -mem2reg -deadargelim -domtree -basicaa -aa -instcombine -simplifycfg -basiccg -globals-aa -prune-eh -always-inline -functionattrs -domtree -sroa -basicaa -aa -memoryssa -early-cse-memssa -speculative-execution -domtree -basicaa -aa -lazy-value-info -jump-threading -lazy-value-info -correlated-propagation -simplifycfg -domtree -basicaa -aa -instcombine -libcalls-shrinkwrap -loops -branch-prob -block-freq -pgo-memop-opt -domtree -basicaa -aa -tailcallelim -simplifycfg -reassociate -domtree -loops -loop-simplify -lcssa-verification -lcssa -basicaa -aa -scalar-evolution -loop-rotate -licm -loop-unswitch -simplifycfg -domtree -basicaa -aa -instcombine -loops -loop-simplify -lcssa-verification -lcssa -scalar-evolution -indvars -loop-idiom -loop-deletion -loop-unroll -memdep -memcpyopt -sccp -domtree -demanded-bits -bdce -basicaa -aa -instcombine -lazy-value-info -jump-threading -lazy-value-info -correlated-propagation -domtree -basicaa -aa -memdep -dse -loops -loop-simplify -lcssa-verification -lcssa -aa -scalar-evolution -licm -postdomtree -adce -simplifycfg -domtree -basicaa -aa -instcombine -barrier -basiccg -rpo-functionattrs -globals-aa -float2int -domtree -loops -loop-simplify -lcssa-verification -lcssa -basicaa -aa -scalar-evolution -loop-rotate -loop-accesses -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -loop-distribute -branch-prob -block-freq -scalar-evolution -basicaa -aa -loop-accesses -demanded-bits -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -loop-vectorize -loop-simplify -scalar-evolution -aa -loop-accesses -loop-load-elim -basicaa -aa -instcombine -latesimplifycfg -domtree -basicaa -aa -instcombine -loops -loop-simplify -lcssa-verification -lcssa -scalar-evolution -loop-unroll -instcombine -loop-simplify -lcssa-verification -lcssa -scalar-evolution -licm -alignment-from-assumptions 
-strip-dead-prototypes -domtree -loops -branch-prob -block-freq -loop-simplify -lcssa-verification -lcssa -basicaa -aa -scalar-evolution -branch-prob -block-freq -loop-sink -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -instsimplify -simplifycfg -verify


    Disabling the passes one by one, we find the one we need: globalopt. We can leave only this pass enabled and verify that it, and nothing else, generates the code we are after. Its source lives in lib/Transforms/IPO/GlobalOpt.cpp; you can read it in full in the LLVM repository, so here I will limit myself to the functions that matter for understanding how it works.
    Let's see what this pass does. For a start, it implements the runOnModule method, i.e. it sees and optimizes an entire module at a time (which is logical in this case). The actual optimization is done by the optimizeGlobalsInModule function:

    static bool optimizeGlobalsInModule(
        Module &M, const DataLayout &DL, TargetLibraryInfo *TLI,
        function_ref<DominatorTree &(Function &)> LookupDomTree) {
      SmallSet<const Comdat *, 8> NotDiscardableComdats;
      bool Changed = false;
      bool LocalChange = true;
      while (LocalChange) {
        LocalChange = false;
        NotDiscardableComdats.clear();
        for (const GlobalVariable &GV : M.globals())
          if (const Comdat *C = GV.getComdat())
            if (!GV.isDiscardableIfUnused() || !GV.use_empty())
              NotDiscardableComdats.insert(C);
        for (Function &F : M)
          if (const Comdat *C = F.getComdat())
            if (!F.isDefTriviallyDead())
              NotDiscardableComdats.insert(C);
        for (GlobalAlias &GA : M.aliases())
          if (const Comdat *C = GA.getComdat())
            if (!GA.isDiscardableIfUnused() || !GA.use_empty())
              NotDiscardableComdats.insert(C);
        // Delete functions that are trivially dead, ccc -> fastcc
        LocalChange |=
            OptimizeFunctions(M, TLI, LookupDomTree, NotDiscardableComdats);
        // Optimize global_ctors list.
        LocalChange |= optimizeGlobalCtorsList(M, [&](Function *F) {
          return EvaluateStaticConstructor(F, DL, TLI);
        });
        // Optimize non-address-taken globals.
        LocalChange |= OptimizeGlobalVars(M, TLI, LookupDomTree,
                                          NotDiscardableComdats);
        // Resolve aliases, when possible.
        LocalChange |= OptimizeGlobalAliases(M, NotDiscardableComdats);
        // Try to remove trivial global destructors if they are not removed
        // already.
        Function *CXAAtExitFn = FindCXAAtExit(M, TLI);
        if (CXAAtExitFn)
          LocalChange |= OptimizeEmptyGlobalCXXDtors(CXAAtExitFn);
        Changed |= LocalChange;
      }
      // TODO: Move all global ctors functions to the end of the module for code
      // layout.
      return Changed;
    }

    Let's describe in words what this function does. For each global variable in the module, it requests its Comdat object.
    What is a Comdat?
    A comdat section is a section in an object file that holds objects which may be duplicated in other object files. Each such object carries information for the linker specifying what to do when duplicates are found. The options are: Any, pick any one; ExactMatch, duplicates must match exactly or an error is raised; Largest, take the largest object; NoDuplicates, there must be no duplicate; SameSize, duplicates must have the same size or an error is raised.

    In LLVM, Comdat data is represented by the enumeration:

    enum SelectionKind {
        Any,          ///< The linker may choose any COMDAT.
        ExactMatch,   ///< The data referenced by the COMDAT must be the same.
        Largest,      ///< The linker will choose the largest COMDAT.
        NoDuplicates, ///< No other Module may specify this COMDAT.
        SameSize,     ///< The data referenced by the COMDAT must be the same size.
      };

    and the Comdat class is essentially the pair (Name, SelectionKind). (In reality everything is more complicated.) All variables that for one reason or another cannot be removed are placed in the NotDiscardableComdats set. Functions and global aliases are handled in the same way: whatever cannot be removed goes into NotDiscardableComdats. Then the individual optimization functions for global constructors, functions, variables, aliases, and destructors are called. The loop keeps running until an iteration performs no optimization. At the start of each iteration, the NotDiscardableComdats set is cleared and recomputed.

    Let's see which of these kinds of objects our test source contains.

    Global Variables:

    1. @Do = internal global i32 (...)* null, align 8
    2. @.str = private unnamed_addr constant [13 x i8] c"hello world\0A\00", align 1

    (looking a little ahead: the optimizer will delete the first variable during the first iteration)

    Functions:

    define void @NeverCalled()
    define i32 @main()
    define internal i32 @PrintHello()
    declare i32 @printf(i8*, ...)

    Note that printf is only declared (declare), but not defined (define).
    There are no global aliases.

    Let's trace through this optimization pass to see how the result above came about. Dissecting every optimization case of even a single pass is a sizable task, since it handles many different special cases. So we will focus on our example, examining along the way the functions and data structures that matter for understanding how the pass works.

    First the optimizer performs various checks that are of little interest here, and then calls processInternalGlobal, which tries to optimize global variables. This function is also rather involved and does many things, but we are interested in one of them:

        
    if (GS.StoredType == GlobalStatus::StoredOnce && GS.StoredOnceValue) {
    ...
    // Try to optimize globals that are known to be assigned a value
    // exactly once, not counting the initializing value.
        if (optimizeOnceStoredGlobal(GV, GS.StoredOnceValue, GS.Ordering, DL, TLI))
          return true;
    ...
    }

    Information that a global variable is assigned a value once and only once is retrieved from the GS (GlobalStatus) structure. This structure is populated in the calling function:

    static bool
    processGlobal(GlobalValue &GV, TargetLibraryInfo *TLI,
                  function_ref<DominatorTree &(Function &)> LookupDomTree) {
      if (GV.getName().startswith("llvm."))
        return false;
      GlobalStatus GS;
      if (GlobalStatus::analyzeGlobal(&GV, GS))
        return false;
    ...

    Here we see another interesting fact: objects whose names begin with "llvm." are not optimized, since those names are reserved for LLVM's own intrinsics. Incidentally, variable names in LLVM IR may contain periods (and may even consist of a single period with an @ or % prefix). The analyzeGlobal function is part of the LLVM API, and we will not examine its internals here. The GlobalStatus structure, however, deserves a closer look, since it carries information that is very important for optimization passes:

    /// As we analyze each global, keep track of some information about it.  If we
    /// find that the address of the global is taken, none of this info will be
    /// accurate.
    struct GlobalStatus {
      /// True if the global's address is used in a comparison.
      bool IsCompared = false;
      /// True if the global is ever loaded.  If the global isn't ever loaded
      /// it can be deleted.
      bool IsLoaded = false;
      /// Keep track of what stores to the global look like.
      enum StoredType {
        /// There is no store to this global.  It can thus be marked constant.
        NotStored,
        /// This global is stored to, but the only thing stored is the constant
        /// it was initialized with.  This is only tracked for scalar globals.
        InitializerStored,
        /// This global is stored to, but only its initializer and one other
        /// value are ever stored to it.  If this global isStoredOnce, we record
        /// the value stored to it in StoredOnceValue below.  This is only
        /// tracked for scalar globals.
        StoredOnce,
        /// This global is stored to by multiple values or something else that
        /// we cannot track.
        Stored
      } StoredType = NotStored;
      /// If only one value (besides the initializer constant) is ever stored
      /// to this global, keep track of what value it is here.
      Value *StoredOnceValue = nullptr;
    ...
    };

    It is probably worth explaining why “if we find that the address of the global is taken, none of this info will be accurate.” Indeed, if we take the address of a global variable and then write through that address rather than by name, tracking the writes becomes extremely difficult, so it is better to leave such variables alone and not try to optimize them.
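    To see why, consider this contrived C sketch of my own (the names Counter and store_through are hypothetical): once the address escapes, a store can change the global without ever mentioning its name, so an analysis that only tracks stores by name would draw the wrong conclusion.

    ```c
    #include <assert.h>

    static int Counter;  /* a global, zero-initialized */

    /* The store below never names Counter, so an analysis that only
       tracks stores to the name "Counter" would wrongly conclude
       the variable is never written. */
    static void store_through(int *p, int v) {
        *p = v;
    }

    int main(void) {
        int *alias = &Counter;    /* the address of the global escapes here */
        store_through(alias, 42); /* Counter changes "behind the name's back" */
        assert(Counter == 42);
        return 0;
    }
    ```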

    So we reach the optimizeOnceStoredGlobal function, which receives the variable (GV) and the stored value (StoredOnceVal) as arguments. In our case these are:

    @Do = internal unnamed_addr global i32 (...)* null, align 8 ; the variable
    i32 (...)* bitcast (i32 ()* @PrintHello to i32 (...)*)      ; the value

    Next, the insignificant bitcast is stripped from the value, and the following condition is checked on the variable:

     if (GV->getInitializer()->getType()->isPointerTy() &&
          GV->getInitializer()->isNullValue()) {
    ...

    that is, the variable must be initialized with a null pointer. If so, a new constant SOVC is created, corresponding to StoredOnceVal cast to the type of GV:

    if (Constant *SOVC = dyn_cast<Constant>(StoredOnceVal)) {
      if (GV->getInitializer()->getType() != SOVC->getType())
        SOVC = ConstantExpr::getBitCast(SOVC, GV->getInitializer()->getType());

    Here getBitCast is a method that returns the bitcast instruction that casts between types in LLVM IR.
    After that, the OptimizeAwayTrappingUsesOfLoads function is called, receiving the global variable GV and the constant SOVC.
    The replacement itself is performed by the function OptimizeAwayTrappingUsesOfValue(Value *V, Constant *NewV),
    which iterates over every use of the variable:

    for (auto UI = V->user_begin(), E = V->user_end(); UI != E; ) {
      Instruction *I = cast<Instruction>(*UI++);

    if the use is a load instruction, its operand is replaced with the new value:

    if (LoadInst *LI = dyn_cast<LoadInst>(I)) {
      LI->setOperand(0, NewV);
      Changed = true;
    }

    If the variable is used in a call or invoke instruction (which is exactly what happens in our example) and it is the value being called, the call is turned into a direct call of the new value:

    if (isa<CallInst>(I) || isa<InvokeInst>(I)) {
      CallSite CS(I);
      if (CS.getCalledValue() == V) {
        // Calling through the pointer!  Turn into a direct call, but be careful
        // that the pointer is not also being passed as an argument.
        CS.setCalledFunction(NewV);
        Changed = true;
        bool PassedAsArg = false;
        for (unsigned i = 0, e = CS.arg_size(); i != e; ++i)
          if (CS.getArgument(i) == V) {
            PassedAsArg = true;
            CS.setArgument(i, NewV);
          }

    All other arguments of the call are simply copied over.
    Similar replacement algorithms exist for the cast and GEP instructions, but they are not involved in our example.

    The remaining steps are as follows: all uses of the global variable are examined, and everything except the value assignment is removed where possible. If that succeeds, the Do variable itself can be deleted.

    So, we have briefly walked through the work of an LLVM optimization pass on a concrete example. In principle there is nothing extraordinarily complicated here, but great care is required to handle all possible combinations of instructions and variable types, and all of it, of course, must be covered by tests. Studying the source code of the LLVM optimizers will help you write your own optimizations to improve code in specific cases of your own.

    Examples of some interesting optimizations in LLVM


    Here are a few more examples of how LLVM can optimize code. They are unrelated to the example we just examined and are performed in other optimization passes, but they are quite unusual and interesting.

    First example


    Consider code that sums the numbers from 1 to n-1:

    int sum(int n) {
      int s = 0;
      for(int i = 0; i < n; i++)
        s += i;
      return s;
    }


    Compile with -O1:

    define i32 @sum(i32 %n) local_unnamed_addr #0 {
    entry:
      %cmp6 = icmp sgt i32 %n, 0
      br i1 %cmp6, label %for.cond.cleanup.loopexit, label %for.cond.cleanup
    for.cond.cleanup.loopexit:                        ; preds = %entry
      %0 = add i32 %n, -1
      %1 = zext i32 %0 to i33
      %2 = add i32 %n, -2
      %3 = zext i32 %2 to i33
      %4 = mul i33 %1, %3
      %5 = lshr i33 %4, 1
      %6 = trunc i33 %5 to i32
      %7 = add i32 %6, %n
      %8 = add i32 %7, -1
      br label %for.cond.cleanup
    for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %entry
      %s.0.lcssa = phi i32 [ 0, %entry ], [ %8, %for.cond.cleanup.loopexit ]
      ret i32 %s.0.lcssa
    }

    Suddenly there is no loop, but there are curious variables of type i33 (a 33-bit integer). How did that happen? LLVM turned the summation of the series into a formula: (n-1)*(n-2)/2 + n - 1. Since the intermediate computations can overflow a 32-bit integer, LLVM inserted an i33 variable. Note that it derived this by analyzing the unoptimized IR, which is no small feat. Under the spoiler is the unoptimized code of the same function, which simply counts in a loop:

    unoptimized code
    define i32 @sum(i32 %n) #0 {
    entry:
      %n.addr = alloca i32, align 4
      %s = alloca i32, align 4
      %i = alloca i32, align 4
      store i32 %n, i32* %n.addr, align 4
      store i32 0, i32* %s, align 4
      store i32 0, i32* %i, align 4
      br label %for.cond
    for.cond:                                         ; preds = %for.inc, %entry
      %0 = load i32, i32* %i, align 4
      %1 = load i32, i32* %n.addr, align 4
      %cmp = icmp slt i32 %0, %1
      br i1 %cmp, label %for.body, label %for.end
    for.body:                                         ; preds = %for.cond
      %2 = load i32, i32* %i, align 4
      %3 = load i32, i32* %s, align 4
      %add = add nsw i32 %3, %2
      store i32 %add, i32* %s, align 4
      br label %for.inc
    for.inc:                                          ; preds = %for.body
      %4 = load i32, i32* %i, align 4
      %inc = add nsw i32 %4, 1
      store i32 %inc, i32* %i, align 4
      br label %for.cond
    for.end:                                          ; preds = %for.cond
      %5 = load i32, i32* %s, align 4
      ret i32 %5
    }


    It is even more interesting to see what happens to this example in the backend. The i33 is widened to i64, and if your processor is 32-bit, sequences of instructions are generated that multiply and add 64-bit numbers on a 32-bit machine. Even more curiously, if you change the data type in the original example to long, the argument and the return value become i64, and the intermediate variables become i65!
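    The closed form is easy to verify in plain C. Here is a sketch of my own (the names sum_loop and sum_formula are hypothetical) that mirrors LLVM's i33 widening by computing the product in a 64-bit intermediate so it cannot overflow:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* The straightforward loop: sums 0 + 1 + ... + (n-1). */
    static int sum_loop(int n) {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += i;
        return s;
    }

    /* The closed form LLVM derives: (n-1)*(n-2)/2 + n - 1, with the
       product computed in a wider type (LLVM uses i33, we use 64 bits)
       so the intermediate multiplication cannot overflow. */
    static int sum_formula(int n) {
        if (n <= 0)
            return 0;
        uint64_t a = (uint32_t)(n - 1);
        uint64_t b = (uint32_t)(n - 2);
        return (int)(a * b / 2 + (uint32_t)n - 1);
    }

    int main(void) {
        /* Algebra check: (n-1)(n-2)/2 + (n-1) = n(n-1)/2, the sum 0..n-1. */
        for (int n = -5; n < 1000; n++)
            assert(sum_loop(n) == sum_formula(n));
        return 0;
    }
    ```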

    Second example


    Suppose we decide to write a function that flips the sign of a float by toggling bit 31 of its binary representation:

    float invert(float x) {
      int val = *((int*) &x);
      int inv = val ^ (1 << 31);
      return *((float*)&inv);
    }

    When compiling for x86_64, nothing particularly interesting happens:

    .LCPI0_0:
    	.long	2147483648              # float -0
    	.long	2147483648              # float -0
    	.long	2147483648              # float -0
    	.long	2147483648              # float -0
    	.text
    	.globl	invert
    	.p2align	4, 0x90
    	.type	invert,@function
    invert:                                 # @invert
    	.cfi_startproc
    # BB#0:                                 # %entry
    	xorps	.LCPI0_0(%rip), %xmm0
    	retq

    But if we compile for ARM 64 (AARCH64):

    invert:                                 // @invert
    // BB#0:                                // %entry
    	fneg	s0, s0
    	ret

    LLVM recognized that toggling bit 31 negates a float and emitted the fneg instruction. GCC, for comparison, cannot do this and emits a “literal” translation.
    GCC 6.3 (ARM 64):

    invert(float):
      fmov w0, s0
      eor w0, w0, -2147483648
      fmov s0, w0
      ret

    This is an example of target-specific optimization, which is done in the backend, and not in the opt utility.

    A couple of words need to be said about this example. Such pointer games violate the strict aliasing rule, which can lead to undefined behavior when compiling with -fstrict-aliasing in effect on some compilers and platforms (in practice, in a vanishingly small number of cases). For this example, the problem shows up when compiling with gcc 4.4.7 -m32 -O2 and disappears in more recent versions of gcc. Nevertheless, I have included in the list of links at the end an interesting talk on aliasing.
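    For completeness, here is a strict-aliasing-safe way to write the same function (my own variant, using memcpy for the type pun; modern compilers optimize the copies away and still recognize the negation idiom):

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    /* Flip the sign of a float by toggling bit 31, without breaking
       the strict aliasing rule: copy the bytes instead of
       dereferencing a type-punned pointer. */
    float invert(float x) {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);  /* well-defined type pun */
        bits ^= UINT32_C(1) << 31;       /* toggle the sign bit */
        memcpy(&x, &bits, sizeof x);
        return x;
    }

    int main(void) {
        assert(invert(1.5f) == -1.5f);
        assert(invert(-0.25f) == 0.25f);
        return 0;
    }
    ```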

    Third example


    Another example of target-specific optimization, this time for x86-64, or more precisely, for the Haswell architecture.
    Let's write a function that counts the set bits in a word:

    int cntSetBits(int a) {
      int cnt = 0;
      while(a)
      {
        cnt++;
        a &= (a-1);
      }
      return cnt;
    }

    Compile with -O1 -march=haswell:

    cntSetBits(int): # @cntSetBits(int)
      popcnt eax, edi
      ret

    The whole function fits into a single popcnt instruction, which counts the number of set bits in a word.
    Let's look at the IR:

    ; Function Attrs: norecurse nounwind readnone uwtable
    define i32 @cntSetBits(i32 %a) local_unnamed_addr #0 {
    entry:
      %0 = call i32 @llvm.ctpop.i32(i32 %a), !range !2
      ret i32 %0
    }

    Here we see that the llvm.ctpop.i32 intrinsic is used. The frontend has already inserted it, using high-level information about the code, and the backend for this architecture knows this intrinsic and can replace it with a single special instruction.
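    The recognition is easy to cross-check in plain C. A small sketch of my own: the bit-clearing loop agrees with __builtin_popcount (a GCC/Clang extension that likewise lowers to llvm.ctpop and then to popcnt) on every input:

    ```c
    #include <assert.h>

    /* Kernighan's trick: a &= (a - 1) clears the lowest set bit,
       so the loop body runs once per set bit. */
    int cntSetBits(unsigned a) {
        int cnt = 0;
        while (a) {
            cnt++;
            a &= (a - 1);
        }
        return cnt;
    }

    int main(void) {
        /* Exhaustively compare against the compiler builtin on a small range. */
        for (unsigned v = 0; v < 4096; v++)
            assert(cntSetBits(v) == __builtin_popcount(v));
        return 0;
    }
    ```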

    Useful links


    http://www.drdobbs.com/architecture-and-design/the-design-of-llvm/240001128 - Chris Lattner, “The Design of LLVM”
    https://youtu.be/bSkpMdDe4g4 - Matt Godbolt, “What Has My Compiler Done for Me Lately?”
    https://youtu.be/ACW-7dktyDk - Dmitry Kashitsyn, “Bread tram: aliasing and vectorization in LLVM”
