atrosinenko May 20, 2019 at 12:57

Part 1. QInst: it’s better to lose a day, then fly in five minutes (writing instruments is trivial)

In the previous part, I roughly described how you can load eBPF functions from an ELF file. Now it’s time to move from fantasy to Soviet cartoons, and following wise advice, after spending a certain amount of effort once, make a universal instrumentation tool~~(or, in short, UII !!!)~~. In doing so, I will take advantage of the Golden Hammer antipattern design and build a tool from the relatively familiar QEMU . As a bonus for this, we get cross-architectural instrumentation, as well as instrumentation at the level of the whole virtual computer. Instrumentation will be of the form “a small native so-file + a small .o-file with eBPF”. In this case, eBPF functions will be substituted before the corresponding instructions of the internal representation of QEMU before optimization and code generation.

As a result, the instrumentation itself, which is added during code generation (that is, not counting a couple of kilobytes of normal system runtime), looks like this, and this is not a pseudo-code:

#include 
extern uint8_t *__afl_area_ptr;
extern uint64_t prev;
void inst_qemu_brcond_i64(uint64_t tag, uint64_t x, uint64_t y, uint64_t z, uint64_t u)
{
    __afl_area_ptr[((prev >> 1) ^ tag) & 0xFFFF] += 1;
    prev = tag;
}
void inst_qemu_brcond_i32(uint64_t tag, uint64_t x, uint64_t y, uint64_t z, uint64_t u)
{
    __afl_area_ptr[((prev >> 1) ^ tag) & 0xFFFF] += 1;
    prev = tag;
}

Well, it's time to load our elf into the Matrix. Well, how to download, rather~~smack~~ spray.

As already mentioned in the article about QEMU.js , one of the QEMU modes of operation is JIT generation of host machine code from guest (potentially, for a completely different architecture). If the last time I implemented my code generation backend, then this time I am going to process the internal representation by wedging right in front of the optimizer. Is this an arbitrary decision? Not. There is a hope that the optimizer will cut off excess corners, throw out unnecessary variables, etc. As far as I understand, he, in fact, does simple and quickly doable things: pushing constants, throwing out expressions like “x: = x + 0” and deleting unreachable code. And we can get a decent amount of it.

Assembly script configuration

First of all, let's add our source files: tcg/bpf-loader.cand tcg/instrument.cin the Makefile-s. Generally speaking, there is a desire to someday push this into upstream, so you will need to do it in the end wisely, but for now I will just unconditionally add these files to the assembly. And I will take the parameters in the best traditions of AFL - through environment variables. By the way, I will test this again on the instrumentation for AFL.

Just look for the mention of the "neighbor" - the file optimize.cwith the help grep -Rand find nothing. Because it was necessary to search optimize.o:

--- a/Makefile.target
+++ b/Makefile.target
@@ -110,7 +110,7 @@ obj-y += trace/
 obj-y += exec.o
 obj-y += accel/
 obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-op.o tcg/tcg-op-vec.o tcg/tcg-op-gvec.o
-obj-$(CONFIG_TCG) += tcg/tcg-common.o tcg/optimize.o
+obj-$(CONFIG_TCG) += tcg/tcg-common.o tcg/optimize.o tcg/instrument.o tcg/bpf-loader.o
 obj-$(CONFIG_TCG_INTERPRETER) += tcg/tci.o
 obj-$(CONFIG_TCG_INTERPRETER) += disas/tci.o
 obj-$(CONFIG_TCG) += fpu/softfloat.o

So here you are, metaprogramming in C ...

To begin with, we add bpf-loader.cfrom the last series a code that pulls out entry points corresponding to QEMU operations. And a mysterious file will help us with this tcg-opc.h. It looks like this:

/*
 * DEF(name, oargs, iargs, cargs, flags)
 */
/* predefined ops */
DEF(discard, 1, 0, 0, TCG_OPF_NOT_PRESENT)
DEF(set_label, 0, 0, 1, TCG_OPF_BB_END | TCG_OPF_NOT_PRESENT)
/* variable number of parameters */
DEF(call, 0, 0, 3, TCG_OPF_CALL_CLOBBER | TCG_OPF_NOT_PRESENT)
DEF(br, 0, 0, 1, TCG_OPF_BB_END)
// ...

What nonsense? And the thing is simply that it is not connected in the source header - you need to define a macro DEF, include this file, and immediately delete the macro. See, he doesn't even have guard.

static const char *inst_function_names[] = {
#define DEF(name, a, b, c, d) stringify(inst_qemu_##name),
#include "tcg-opc.h"
#undef DEF
  NULL
};

As a result, we get a neat array of target function names, indexed by opcodes and ending with NULL, which we can run for each character in the file. I understand that this is not effective. But it’s simple, which is important, given the one-time nature of this operation. Next, we just skip all the characters for which

ELF64_ST_BIND(sym->st_info) == STB_LOCAL || ELF64_ST_TYPE(sym->st_info) != STT_FUNC

The rest is checked against the list.

We are attached to a flow of execution

Now you need to get up somewhere on the flow of the code generation mechanism, and wait until the instruction of interest passes by. But first you need to define your functions instrumentation_init, tcg_instrumentand write their calls instrumentation_shutdownin the file tcg/tcg.h: initialization - after the backend is initialized, instrumentation - right before the call tcg_optimize. It would seem that instrumentation_shutdownyou can hang in instrumentation_initon atexitand not take a steam bath. I thought so too, and most likely it will work in the full system emulation mode, but in the usermode emulation mode QEMU translates system calls exit_groupand sometimes exitinto a function call _exitthat ignores all these atexit-handlers, so we’ll find it in linux-user/syscall.cand write before it the call of our code.

Interpreting Bytecode

So it’s time to read what the compiler generated for us. This is conveniently done with the help of llvm-objdumpthe parameter -x, but better immediately -d -t -r.

Output example

$ ./compile-bpf.sh 
test-bpf.o:     file format ELF64-BPF
Disassembly of section .text:
0000000000000000 inst_brcond_i64:
       0:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00         r2 = 0 ll
                0000000000000000:  R_BPF_64_64  prev
       2:       79 23 00 00 00 00 00 00         r3 = *(u64 *)(r2 + 0)
       3:       77 03 00 00 01 00 00 00         r3 >>= 1
       4:       7b 32 00 00 00 00 00 00         *(u64 *)(r2 + 0) = r3
       5:       af 13 00 00 00 00 00 00         r3 ^= r1
       6:       57 03 00 00 ff ff 00 00         r3 &= 65535
       7:       18 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00         r4 = 0 ll
                0000000000000038:  R_BPF_64_64  __afl_area_ptr
       9:       79 44 00 00 00 00 00 00         r4 = *(u64 *)(r4 + 0)
      10:       0f 34 00 00 00 00 00 00         r4 += r3
      11:       71 43 00 00 00 00 00 00         r3 = *(u8 *)(r4 + 0)
      12:       07 03 00 00 01 00 00 00         r3 += 1
      13:       73 34 00 00 00 00 00 00         *(u8 *)(r4 + 0) = r3
      14:       7b 12 00 00 00 00 00 00         *(u64 *)(r2 + 0) = r1
      15:       95 00 00 00 00 00 00 00         exit
0000000000000080 inst_brcond_i32:
      16:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00         r2 = 0 ll
                0000000000000080:  R_BPF_64_64  prev
      18:       79 23 00 00 00 00 00 00         r3 = *(u64 *)(r2 + 0)
      19:       77 03 00 00 01 00 00 00         r3 >>= 1
      20:       7b 32 00 00 00 00 00 00         *(u64 *)(r2 + 0) = r3
      21:       af 13 00 00 00 00 00 00         r3 ^= r1
      22:       57 03 00 00 ff ff 00 00         r3 &= 65535
      23:       18 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00         r4 = 0 ll
                00000000000000b8:  R_BPF_64_64  __afl_area_ptr
      25:       79 44 00 00 00 00 00 00         r4 = *(u64 *)(r4 + 0)
      26:       0f 34 00 00 00 00 00 00         r4 += r3
      27:       71 43 00 00 00 00 00 00         r3 = *(u8 *)(r4 + 0)
      28:       07 03 00 00 01 00 00 00         r3 += 1
      29:       73 34 00 00 00 00 00 00         *(u8 *)(r4 + 0) = r3
      30:       7b 12 00 00 00 00 00 00         *(u64 *)(r2 + 0) = r1
      31:       95 00 00 00 00 00 00 00         exit
SYMBOL TABLE:
0000000000000000 l    df *ABS*           00000000 test-bpf.c
0000000000000000 l    d  .text           00000000 .text
0000000000000000         *UND*           00000000 __afl_area_ptr
0000000000000080 g     F .text           00000080 inst_brcond_i32
0000000000000000 g     F .text           00000080 inst_brcond_i64
0000000000000008 g     O *COM*           00000008 prev

If you try to look for a description of the eBPF opcodes, it turns out that in obvious places (source and man-pages of the Linux kernel) there are descriptions of how to use it, how to compile, etc. Then you come across the iovisor tool team page with a convenient unofficial eBPF reference.

The instruction occupies one 64-bit word (some two) and has the form

struct {
  uint8_t opcode;
  uint8_t dst:4;
  uint8_t src:4;
  uint16_t offset;
  uint32_t imm;
};

Those that occupy two words simply consist of the first instruction with all the logic and a “trailer” with 32 more bits of immediate value and are very clearly visible on the objdump disassembler.

The opcodes themselves also have a regular structure: the lower three bits are the operation class: 32-bit ALU, 64-bit ALU, load / store, conditional branching. Therefore, it is very convenient to implement them on macros in the best traditions of QEMU. I will not conduct detailed instructions on the code base~~we are not on code review~~I’d better tell you about the pitfalls.

My first problem was that I made a lazy eBPF register allocator in the form of QEMUs local_temp, and began to thoughtlessly transfer the call of this function to the macro. It turned out like in a famous meme: "We inserted an abstraction into an abstraction so that you can generate an instruction while you generate an instruction." Post factum, I already do not understand very well what was broken then, but something strange was apparently happening with the order of the generated instructions. After that, I made analogs of functions tcg_gen_...to push new instructions into the middle of the list, taking operands as arguments to the function, and the order automatically became as it should (since the arguments are completely calculated exactly once before the call).

The second problem was trying to push the TCG const as the operand of an arbitrary instruction when looking at the immediate operand in eBPF. According to the aforementioned tcg-opc.h, the composition of the argument list of the operation is strictly fixed: ninput arguments, moutput and kconstant. By the way, when debugging such code, it really helps to pass QEMU a command line argument -d op,op_optor even -d op,op_opt,out_asm.

Possible Arguments

$ ./x86_64-linux-user/qemu-x86_64 -d help
Log items (comma separated):
out_asm         show generated host assembly code for each compiled TB
in_asm          show target assembly code for each compiled TB
op              show micro ops for each compiled TB
op_opt          show micro ops after optimization
op_ind          show micro ops before indirect lowering
int             show interrupts/exceptions in short format
exec            show trace before each executed TB (lots of logs)
cpu             show CPU registers before entering a TB (lots of logs)
fpu             include FPU registers in the 'cpu' logging
mmu             log MMU-related activities
pcall           x86 only: show protected mode far calls/returns/exceptions
cpu_reset       show CPU state before CPU resets
unimp           log unimplemented functionality
guest_errors    log when the guest OS does something invalid (eg accessing a
non-existent register)
page            dump pages at beginning of user mode emulation
nochain         do not chain compiled TBs so that "exec" and "cpu" show
complete traces
trace:PATTERN   enable trace events
Use "-d trace:help" to get a list of trace events.

Well, do not repeat my mistakes: the disassembler of internal instructions is quite advanced, and if you see something like it add_i64 loc15,loc15,$554412123213, then this thing after the dollar sign - this is not a pointer. More precisely, this, of course, is a pointer, but perhaps hung with flags and in the role of the literal value of the operand, and not the pointer. All this is applicable, of course, if you know that there must be some specific number, like $0or $ff, you don’t need to be afraid of pointers at all. :) How to deal with this - you just need to create a function that returns a fresh one temp, in which it moviwill put the desired constant.

By the way, if you tcg/tcg.ccomment out in the header #define USE_TCG_OPTIMIZATIONS, then, suddenly, optimization will turn off, and it will be easier to analyze code transformations.

For sim, I will send a reader interested in picking QEMU into the documentation , even the official one! For the rest, I will demonstrate the promised instrumentation for AFL.

The same and the rabbit

For the full text of runtime, I, again, will send the reader to the repository, since it (the text) is not of artistic value and honestly stolen from the qemu_modeAFL delivery, and in general, is just an ordinary piece of C code. But here is the instrumentation itself:

#include 
extern uint8_t *__afl_area_ptr;
extern uint64_t prev;
void inst_qemu_brcond_i64(uint64_t tag, uint64_t x, uint64_t y, uint64_t z, uint64_t u)
{
    __afl_area_ptr[((prev >> 1) ^ tag) & 0xFFFF] += 1;
    prev = tag;
}
void inst_qemu_brcond_i32(uint64_t tag, uint64_t x, uint64_t y, uint64_t z, uint64_t u)
{
    __afl_area_ptr[((prev >> 1) ^ tag) & 0xFFFF] += 1;
    prev = tag;
}

It is important that hook functions have as many arguments as iargsthe corresponding QEMU operation. Two externin the header will be linked to runtime during the relocation process. In principle, prevone could define it right here, but then it needs to be defined how static, otherwise it will fall into the COMMON section that I do not support. Actually, we, in fact, simply rewrote the pseudo-code from the documentation, but here it is machine-readable!

To check, create a file bug.c:

#include 
#include 
#include 
int main(int argc, char *argv[])
{
  char buf[16];
  int res = read(0, buf, 4);
  if (buf[0] == 'T' && buf[1] == 'E' && buf[2] == 'S' && buf[3] == 'T')
    abort();
  return res * 0;
}

And also - a file forksrvthat is convenient to feed AFL:

#!/bin/bash
export NATIVE_INST=./instrumentation-examples/afl/afl-native.so
export BPF_INST=./instrumentation-examples/afl/afl-bpf.c.o
exec ./x86_64-linux-user/qemu-x86_64 ./instrumentation-examples/afl/bug

And run fuzzing:

AFL_SKIP_BIN_CHECK=1 afl-fuzz -i ../input -o ../output -m none -- ./forksrv

American Fuzzy Lop

1234
T234
TE34
TES4
TEST <- а это уже в crashes, полторы минуты и 2200 запусков спустя

So far, speed is not so hot, but as an excuse I’ll say that here (so far) an important feature of the original is not used qemu_mode: sending addresses of executable code to the fork server. But there is nothing AFL in the QEMU codebase now, and there is hope that this generalized instrumentation will someday be crammed into upstream.

GitHub Project

Tags: