
Part 1. QInst: it’s better to lose a day, then fly in five minutes (writing instruments is trivial)
In the previous part, I roughly described how you can load eBPF functions from an ELF file. Now it’s time to move from fantasy to Soviet cartoons, and following wise advice, after spending a certain amount of effort once, make a universal instrumentation tool(or, in short, UII !!!). In doing so, I will take advantage of the Golden Hammer antipattern design and build a tool from the relatively familiar QEMU . As a bonus for this, we get cross-architectural instrumentation, as well as instrumentation at the level of the whole virtual computer. Instrumentation will be of the form “a small native so-file + a small .o-file with eBPF”. In this case, eBPF functions will be substituted before the corresponding instructions of the internal representation of QEMU before optimization and code generation.
As a result, the instrumentation itself, which is added during code generation (that is, not counting a couple of kilobytes of normal system runtime), looks like this, and this is not a pseudo-code:
#include
extern uint8_t *__afl_area_ptr;
extern uint64_t prev;
void inst_qemu_brcond_i64(uint64_t tag, uint64_t x, uint64_t y, uint64_t z, uint64_t u)
{
__afl_area_ptr[((prev >> 1) ^ tag) & 0xFFFF] += 1;
prev = tag;
}
void inst_qemu_brcond_i32(uint64_t tag, uint64_t x, uint64_t y, uint64_t z, uint64_t u)
{
__afl_area_ptr[((prev >> 1) ^ tag) & 0xFFFF] += 1;
prev = tag;
}
Well, it's time to load our elf into the Matrix. Well, how to download, rathersmack spray.
As already mentioned in the article about QEMU.js , one of the QEMU modes of operation is JIT generation of host machine code from guest (potentially, for a completely different architecture). If the last time I implemented my code generation backend, then this time I am going to process the internal representation by wedging right in front of the optimizer. Is this an arbitrary decision? Not. There is a hope that the optimizer will cut off excess corners, throw out unnecessary variables, etc. As far as I understand, he, in fact, does simple and quickly doable things: pushing constants, throwing out expressions like “x: = x + 0” and deleting unreachable code. And we can get a decent amount of it.
Assembly script configuration
First of all, let's add our source files: tcg/bpf-loader.c
and tcg/instrument.c
in the Makefile-s. Generally speaking, there is a desire to someday push this into upstream, so you will need to do it in the end wisely, but for now I will just unconditionally add these files to the assembly. And I will take the parameters in the best traditions of AFL - through environment variables. By the way, I will test this again on the instrumentation for AFL.
Just look for the mention of the "neighbor" - the file optimize.c
with the help grep -R
and find nothing. Because it was necessary to search optimize.o
:
--- a/Makefile.target
+++ b/Makefile.target
@@ -110,7 +110,7 @@ obj-y += trace/
obj-y += exec.o
obj-y += accel/
obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-op.o tcg/tcg-op-vec.o tcg/tcg-op-gvec.o
-obj-$(CONFIG_TCG) += tcg/tcg-common.o tcg/optimize.o
+obj-$(CONFIG_TCG) += tcg/tcg-common.o tcg/optimize.o tcg/instrument.o tcg/bpf-loader.o
obj-$(CONFIG_TCG_INTERPRETER) += tcg/tci.o
obj-$(CONFIG_TCG_INTERPRETER) += disas/tci.o
obj-$(CONFIG_TCG) += fpu/softfloat.o
So here you are, metaprogramming in C ...
To begin with, we add bpf-loader.c
from the last series a code that pulls out entry points corresponding to QEMU operations. And a mysterious file will help us with this tcg-opc.h
. It looks like this:
/*
* DEF(name, oargs, iargs, cargs, flags)
*/
/* predefined ops */
DEF(discard, 1, 0, 0, TCG_OPF_NOT_PRESENT)
DEF(set_label, 0, 0, 1, TCG_OPF_BB_END | TCG_OPF_NOT_PRESENT)
/* variable number of parameters */
DEF(call, 0, 0, 3, TCG_OPF_CALL_CLOBBER | TCG_OPF_NOT_PRESENT)
DEF(br, 0, 0, 1, TCG_OPF_BB_END)
// ...
What nonsense? And the thing is simply that it is not connected in the source header - you need to define a macro DEF
, include this file, and immediately delete the macro. See, he doesn't even have guard.
static const char *inst_function_names[] = {
#define DEF(name, a, b, c, d) stringify(inst_qemu_##name),
#include "tcg-opc.h"
#undef DEF
NULL
};
As a result, we get a neat array of target function names, indexed by opcodes and ending with NULL, which we can run for each character in the file. I understand that this is not effective. But it’s simple, which is important, given the one-time nature of this operation. Next, we just skip all the characters for which
ELF64_ST_BIND(sym->st_info) == STB_LOCAL || ELF64_ST_TYPE(sym->st_info) != STT_FUNC
The rest is checked against the list.
We are attached to a flow of execution
Now you need to get up somewhere on the flow of the code generation mechanism, and wait until the instruction of interest passes by. But first you need to define your functions instrumentation_init
, tcg_instrument
and write their calls instrumentation_shutdown
in the file tcg/tcg.h
: initialization - after the backend is initialized, instrumentation - right before the call tcg_optimize
. It would seem that instrumentation_shutdown
you can hang in instrumentation_init
on atexit
and not take a steam bath. I thought so too, and most likely it will work in the full system emulation mode, but in the usermode emulation mode QEMU translates system calls exit_group
and sometimes exit
into a function call _exit
that ignores all these atexit-handlers, so we’ll find it in linux-user/syscall.c
and write before it the call of our code.
Interpreting Bytecode
So it’s time to read what the compiler generated for us. This is conveniently done with the help of llvm-objdump
the parameter -x
, but better immediately -d -t -r
.
$ ./compile-bpf.sh
test-bpf.o: file format ELF64-BPF
Disassembly of section .text:
0000000000000000 inst_brcond_i64:
0: 18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0 ll
0000000000000000: R_BPF_64_64 prev
2: 79 23 00 00 00 00 00 00 r3 = *(u64 *)(r2 + 0)
3: 77 03 00 00 01 00 00 00 r3 >>= 1
4: 7b 32 00 00 00 00 00 00 *(u64 *)(r2 + 0) = r3
5: af 13 00 00 00 00 00 00 r3 ^= r1
6: 57 03 00 00 ff ff 00 00 r3 &= 65535
7: 18 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r4 = 0 ll
0000000000000038: R_BPF_64_64 __afl_area_ptr
9: 79 44 00 00 00 00 00 00 r4 = *(u64 *)(r4 + 0)
10: 0f 34 00 00 00 00 00 00 r4 += r3
11: 71 43 00 00 00 00 00 00 r3 = *(u8 *)(r4 + 0)
12: 07 03 00 00 01 00 00 00 r3 += 1
13: 73 34 00 00 00 00 00 00 *(u8 *)(r4 + 0) = r3
14: 7b 12 00 00 00 00 00 00 *(u64 *)(r2 + 0) = r1
15: 95 00 00 00 00 00 00 00 exit
0000000000000080 inst_brcond_i32:
16: 18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0 ll
0000000000000080: R_BPF_64_64 prev
18: 79 23 00 00 00 00 00 00 r3 = *(u64 *)(r2 + 0)
19: 77 03 00 00 01 00 00 00 r3 >>= 1
20: 7b 32 00 00 00 00 00 00 *(u64 *)(r2 + 0) = r3
21: af 13 00 00 00 00 00 00 r3 ^= r1
22: 57 03 00 00 ff ff 00 00 r3 &= 65535
23: 18 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r4 = 0 ll
00000000000000b8: R_BPF_64_64 __afl_area_ptr
25: 79 44 00 00 00 00 00 00 r4 = *(u64 *)(r4 + 0)
26: 0f 34 00 00 00 00 00 00 r4 += r3
27: 71 43 00 00 00 00 00 00 r3 = *(u8 *)(r4 + 0)
28: 07 03 00 00 01 00 00 00 r3 += 1
29: 73 34 00 00 00 00 00 00 *(u8 *)(r4 + 0) = r3
30: 7b 12 00 00 00 00 00 00 *(u64 *)(r2 + 0) = r1
31: 95 00 00 00 00 00 00 00 exit
SYMBOL TABLE:
0000000000000000 l df *ABS* 00000000 test-bpf.c
0000000000000000 l d .text 00000000 .text
0000000000000000 *UND* 00000000 __afl_area_ptr
0000000000000080 g F .text 00000080 inst_brcond_i32
0000000000000000 g F .text 00000080 inst_brcond_i64
0000000000000008 g O *COM* 00000008 prev
If you try to look for a description of the eBPF opcodes, it turns out that in obvious places (source and man-pages of the Linux kernel) there are descriptions of how to use it, how to compile, etc. Then you come across the iovisor tool team page with a convenient unofficial eBPF reference.
The instruction occupies one 64-bit word (some two) and has the form
struct {
uint8_t opcode;
uint8_t dst:4;
uint8_t src:4;
uint16_t offset;
uint32_t imm;
};
Those that occupy two words simply consist of the first instruction with all the logic and a “trailer” with 32 more bits of immediate value and are very clearly visible on the objdump disassembler.
The opcodes themselves also have a regular structure: the lower three bits are the operation class: 32-bit ALU, 64-bit ALU, load / store, conditional branching. Therefore, it is very convenient to implement them on macros in the best traditions of QEMU. I will not conduct detailed instructions on the code basewe are not on code reviewI’d better tell you about the pitfalls.
My first problem was that I made a lazy eBPF register allocator in the form of QEMUs local_temp
, and began to thoughtlessly transfer the call of this function to the macro. It turned out like in a famous meme: "We inserted an abstraction into an abstraction so that you can generate an instruction while you generate an instruction." Post factum, I already do not understand very well what was broken then, but something strange was apparently happening with the order of the generated instructions. After that, I made analogs of functions tcg_gen_...
to push new instructions into the middle of the list, taking operands as arguments to the function, and the order automatically became as it should (since the arguments are completely calculated exactly once before the call).
The second problem was trying to push the TCG const as the operand of an arbitrary instruction when looking at the immediate operand in eBPF. According to the aforementioned tcg-opc.h
, the composition of the argument list of the operation is strictly fixed: n
input arguments, m
output and k
constant. By the way, when debugging such code, it really helps to pass QEMU a command line argument -d op,op_opt
or even -d op,op_opt,out_asm
.
$ ./x86_64-linux-user/qemu-x86_64 -d help
Log items (comma separated):
out_asm show generated host assembly code for each compiled TB
in_asm show target assembly code for each compiled TB
op show micro ops for each compiled TB
op_opt show micro ops after optimization
op_ind show micro ops before indirect lowering
int show interrupts/exceptions in short format
exec show trace before each executed TB (lots of logs)
cpu show CPU registers before entering a TB (lots of logs)
fpu include FPU registers in the 'cpu' logging
mmu log MMU-related activities
pcall x86 only: show protected mode far calls/returns/exceptions
cpu_reset show CPU state before CPU resets
unimp log unimplemented functionality
guest_errors log when the guest OS does something invalid (eg accessing a
non-existent register)
page dump pages at beginning of user mode emulation
nochain do not chain compiled TBs so that "exec" and "cpu" show
complete traces
trace:PATTERN enable trace events
Use "-d trace:help" to get a list of trace events.
Well, do not repeat my mistakes: the disassembler of internal instructions is quite advanced, and if you see something like it add_i64 loc15,loc15,$554412123213
, then this thing after the dollar sign - this is not a pointer. More precisely, this, of course, is a pointer, but perhaps hung with flags and in the role of the literal value of the operand, and not the pointer. All this is applicable, of course, if you know that there must be some specific number, like $0
or $ff
, you don’t need to be afraid of pointers at all. :) How to deal with this - you just need to create a function that returns a fresh one temp
, in which it movi
will put the desired constant.
By the way, if you tcg/tcg.c
comment out in the header #define USE_TCG_OPTIMIZATIONS
, then, suddenly, optimization will turn off, and it will be easier to analyze code transformations.
For sim, I will send a reader interested in picking QEMU into the documentation , even the official one! For the rest, I will demonstrate the promised instrumentation for AFL.
The same and the rabbit
For the full text of runtime, I, again, will send the reader to the repository, since it (the text) is not of artistic value and honestly stolen from the qemu_mode
AFL delivery, and in general, is just an ordinary piece of C code. But here is the instrumentation itself:
#include
extern uint8_t *__afl_area_ptr;
extern uint64_t prev;
void inst_qemu_brcond_i64(uint64_t tag, uint64_t x, uint64_t y, uint64_t z, uint64_t u)
{
__afl_area_ptr[((prev >> 1) ^ tag) & 0xFFFF] += 1;
prev = tag;
}
void inst_qemu_brcond_i32(uint64_t tag, uint64_t x, uint64_t y, uint64_t z, uint64_t u)
{
__afl_area_ptr[((prev >> 1) ^ tag) & 0xFFFF] += 1;
prev = tag;
}
It is important that hook functions have as many arguments as iargs
the corresponding QEMU operation. Two extern
in the header will be linked to runtime during the relocation process. In principle, prev
one could define it right here, but then it needs to be defined how static
, otherwise it will fall into the COMMON section that I do not support. Actually, we, in fact, simply rewrote the pseudo-code from the documentation, but here it is machine-readable!
To check, create a file bug.c
:
#include
#include
#include
int main(int argc, char *argv[])
{
char buf[16];
int res = read(0, buf, 4);
if (buf[0] == 'T' && buf[1] == 'E' && buf[2] == 'S' && buf[3] == 'T')
abort();
return res * 0;
}
And also - a file forksrv
that is convenient to feed AFL:
#!/bin/bash
export NATIVE_INST=./instrumentation-examples/afl/afl-native.so
export BPF_INST=./instrumentation-examples/afl/afl-bpf.c.o
exec ./x86_64-linux-user/qemu-x86_64 ./instrumentation-examples/afl/bug
And run fuzzing:
AFL_SKIP_BIN_CHECK=1 afl-fuzz -i ../input -o ../output -m none -- ./forksrv
1234
T234
TE34
TES4
TEST <- а это уже в crashes, полторы минуты и 2200 запусков спустя
So far, speed is not so hot, but as an excuse I’ll say that here (so far) an important feature of the original is not used qemu_mode
: sending addresses of executable code to the fork server. But there is nothing AFL in the QEMU codebase now, and there is hope that this generalized instrumentation will someday be crammed into upstream.