Part 0. Wanted: an elf to work in the Matrix. Relocation possible

    Caution: contains systems programming. In fact, it contains nothing else.

    Imagine you have been asked to write a fantasy game. You know, one about elves. And about virtual reality. You have dreamed of writing something like that since childhood, so you agree without hesitation. Soon you realize that most of what you know about the world of elves comes from jokes on the old Bashorg and other scattered sources. Oops, a problem. Oh well, we have survived worse... Drawing on your rich programming experience, you go to Google, type in "Elf specification" and follow the links. Oh! This one leads to some kind of PDF... let's see what we have here... some Elf32_Sword, elven swords, looks like just what you need. 32 is apparently the character's level, and the two fours in the next columns are probably damage. Exactly what you need, and so well systematized too!..

    As one programming olympiad problem put it, after a couple of paragraphs of detailed text about Japan, samurai and geisha: "As you have already guessed, the problem is not about that at all." And yes, the contest was, of course, timed. Anyway, I declare the five minutes of silliness over.

    Today I will try to talk about parsing a file in the 64-bit ELF format. What don't they store in it: native programs, static libraries, dynamic libraries, all sorts of implementation-specific things such as crash dumps... It is used, for example, on Linux and many other Unix-like systems, and they say that support for it used to be actively added even to phones via patched firmware. It would seem that supporting a format for storing programs from serious operating systems should be hard. So I thought. And it probably is. But we will support one very specific use case: loading eBPF bytecode from .o files. Why? For further experiments I need a serious (that is, not homemade) cross-platform bytecode that can be produced from C rather than written by hand, and eBPF is simple and has an LLVM backend. All I need is to parse ELF as the container into which the compiler puts this bytecode.

    Just in case, let me clarify: this article is exploratory programming and does not claim to be an exhaustive guide. The ultimate goal is a loader that can read C programs compiled to eBPF with Clang (the ones I have) to an extent sufficient for continuing the experiments.


    The header lies at offset zero in the ELF file. It contains those very letters E, L, F, which you can see if you open the file in a text editor, and some global variables. In fact, the header is the only structure in the file located at a fixed offset, and it contains the information needed to find the other structures. (From here on I rely on the documentation for the 32-bit format and on elf.h, which knows about the 64-bit one. So if you notice errors, feel free to correct me.)

    The first thing that greets us in the file is the field unsigned char e_ident[16]. Remember those fun puzzles from the "all of the following statements are false" series? It is much the same here: an ELF file can contain 32- or 64-bit code, little- or big-endian, for any of dozens of processor architectures. You were going to read it as Elf64 on a little-endian machine? Well, good luck... This byte array is a kind of signature that says what is inside and how to parse it.

    The first four bytes are simple: they are [0x7f, 'E', 'L', 'F']. If they do not match, there is reason to believe these are some kind of wrong bees. The next byte holds the class of the file, ELFCLASS32 or ELFCLASS64, that is, the bitness. For simplicity we will work only with 64-bit files (is there even such a thing as 32-bit eBPF?). If the class turns out to be ELFCLASS32, we simply exit with an error: the structures would be misaligned anyway, and a sanity check does no harm. The last byte of this array that interests us indicates the endianness of the file; we will work only with the byte order native to our processor.

    Just in case, let me clarify: when working with ELF in C, there is no need to read each int at a cleverly computed offset. elf.h contains the necessary structures, and even the indices of the bytes in e_ident: EI_MAG0, EI_MAG1, EI_MAG2, EI_MAG3, EI_CLASS, EI_DATA... You just cast a pointer to the data read or memory-mapped from the file to a pointer to the structure and read it.

    Besides e_ident, the header has other fields; some we only check, and some we will use for further parsing, but later. Namely, we check that e_machine == EM_BPF (that is, the file is for the "eBPF processor architecture"), e_type == ET_REL, and e_shoff != 0. The last checks mean the following: a file can contain information for linking (the section table and sections), for execution (the program header table and segments), or both. With the last two checks we make sure that the information we need (the linking view) is present in the file. We also check that the format version equals EV_CURRENT.
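
    The checks described above could be sketched roughly as follows. This is only an illustration, not the article's actual code: the function names are made up, and the fallback definition of EM_BPF is an assumption for systems whose elf.h predates it.

```c
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifndef EM_BPF
#define EM_BPF 247 /* Linux BPF; absent from older copies of elf.h */
#endif

/* Byte order of the machine running the loader, as an EI_DATA value. */
static unsigned char native_elf_data(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1 ? ELFDATA2LSB : ELFDATA2MSB;
}

/* Hypothetical helper: true if the buffer starts with a header we accept. */
static bool bpf_elf_header_ok(const void *data, size_t size)
{
    const Elf64_Ehdr *eh = data;

    if (size < sizeof *eh)
        return false;
    if (memcmp(eh->e_ident, ELFMAG, SELFMAG) != 0)   /* 0x7f 'E' 'L' 'F' */
        return false;
    if (eh->e_ident[EI_CLASS] != ELFCLASS64)         /* 64-bit files only */
        return false;
    if (eh->e_ident[EI_DATA] != native_elf_data())   /* native byte order only */
        return false;
    return eh->e_machine == EM_BPF
        && eh->e_type == ET_REL
        && eh->e_shoff != 0
        && eh->e_version == EV_CURRENT;
}
```

    The buffer is assumed to be suitably aligned for Elf64_Ehdr, which holds for memory obtained from mmap or malloc.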

    Let me make a reservation right away: I will not validate the file, assuming that if we load it into our process, we trust it. In kernel code or in other programs that work with untrusted files, this is, of course, absolutely unacceptable.

    Section table

    As I said, we are interested in the linking view of the file, that is, the section table and the sections themselves. The header says where to find the section table. It also gives the table's size, as well as the size of one entry, which may be larger than sizeof(Elf64_Shdr) (how this affects the format version number, I honestly do not know). Some of the highest section numbers are reserved and are not actually present in the table; referencing them has a special meaning. Of these, we seem to be interested only in SHN_UNDEF (zero is reserved as well and means "no section"; its header, by the way, is still present in the table) and SHN_ABS. A symbol "defined in section SHN_UNDEF" is actually undefined, while one in SHN_ABS has an absolute value and is not relocated. However, I do not seem to need SHN_ABS yet either.
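
    A minimal sketch of walking the section table, stepping by e_shentsize rather than by sizeof(Elf64_Shdr). The helper name is hypothetical, the file is assumed to be trusted and fully in memory, and the extension for very large section counts (where e_shnum is 0 and the real count lives in section 0) is deliberately ignored.

```c
#include <elf.h>
#include <stddef.h>

/* Hypothetical helper: header of section number idx, or NULL if out of range. */
static const Elf64_Shdr *nth_section_header(const void *file, size_t idx)
{
    const Elf64_Ehdr *eh = file;
    const unsigned char *table = (const unsigned char *)file + eh->e_shoff;

    if (idx >= eh->e_shnum)
        return NULL;
    /* One table entry occupies e_shentsize bytes, which may exceed the struct. */
    return (const Elf64_Shdr *)(table + idx * eh->e_shentsize);
}
```

    Looping idx from 0 to e_shnum - 1 then visits every section, including the reserved zeroth one.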

    String table

    Here we meet string tables for the first time: tables of the strings used in the file. In fact, if const char *strtab is a string table, then the name for sh_name is simply strtab + sh_name. Yes, it is just a string that starts at a given index and runs up to a zero byte. Strings may overlap (more precisely, one may be a suffix of another). Sections can have names; in that case the e_shstrndx field of the ELF header points to a string table section (the one for section names, if there are several), and the sh_name field in the section header points to a particular string.

    The first (zeroth) and the last bytes of the string table contain null characters. The last one is understandable: it is a sentinel that terminates the last string. The zero offset, in turn, denotes an absent or an empty name, depending on the context.
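
    To make the "suffix" remark concrete, here is a small sketch of a string table lookup. The helper name and the sample table contents are made up for illustration.

```c
#include <elf.h>
#include <stddef.h>

/* Hypothetical helper: string at the given offset, or NULL if out of range. */
static const char *strtab_get(const char *strtab, size_t strtab_size, Elf64_Word offset)
{
    if (offset >= strtab_size)
        return NULL;
    return strtab + offset;   /* runs up to the next zero byte */
}

/* Sample table: ".rel.text" at offset 1 and its suffix ".text" at offset 5
 * share the same bytes; offset 0 is the empty name. */
static const char sample_strtab[] = "\0.rel.text";
```

    With this table, offsets 1 and 5 yield ".rel.text" and ".text" respectively, without ".text" being stored twice.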

    Loading sections

    The header of each section contains two addresses: sh_addr is the load address (where the section will be placed in memory), and sh_offset is the offset in the file at which the section is stored. I do not know about both at once, but each of these values can be 0 on its own: in one case the section "stays on disk", because it holds some service information. In the other, the section is not loaded from disk; for example, it just needs to be allocated and filled with zeros (.bss). Honestly, so far I have not had to handle the load address: wherever it got loaded, that is where it got loaded :) Then again, our programs are, frankly, rather specific too.
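
    One possible way to express this in code is sketched below. Note the assumption: instead of looking at sh_addr, the sketch uses the SHF_ALLOC flag to decide whether a section occupies memory and SHT_NOBITS to detect sections like .bss that have no bytes in the file. The function name is hypothetical, and the section simply lands wherever calloc puts it.

```c
#include <elf.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of the section, or NULL if the
 * section is not meant to occupy memory (or allocation failed). */
static void *load_section_sketch(const void *file, const Elf64_Shdr *sh)
{
    void *mem;

    if (!(sh->sh_flags & SHF_ALLOC) || sh->sh_size == 0)
        return NULL;                      /* service data: stays "on disk" */

    mem = calloc(1, sh->sh_size);         /* zero-filled, which is all .bss needs */
    if (mem != NULL && sh->sh_type != SHT_NOBITS)
        memcpy(mem, (const unsigned char *)file + sh->sh_offset, sh->sh_size);
    return mem;
}
```

    The caller owns the returned memory and releases it with free.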


    And now for the interesting part: for safety reasons, as you know, nobody goes into the Matrix without an operator staying behind at the base. And since this is still fantasy, the link with the operator will be telepathic. Oh yes, I did declare the five minutes of silliness over. In short, we will briefly discuss the linking process.

    For my experiment I need a piece of code compiled into an ordinary shared object (.so) and loaded by the ordinary libdl. I will not even describe this in detail: we just open it with dlopen, pull out symbols with dlsym, and close it with dlclose when the program ends. However, even these are implementation details unrelated to our ELF loader. There is simply some context: the ability to get a pointer by name.

    In general, the eBPF instruction set is a triumph of aligned machine code: an instruction always takes 8 bytes and has the structure

    struct {
      uint8_t opcode;
      uint8_t dst:4;
      uint8_t src:4;
      uint16_t offset;
      uint32_t imm;
    };

    Moreover, many fields may go unused in any particular instruction: saving space in "machine" code is not our thing.

    In fact, the first instruction can be immediately followed by a second one, which contains no opcode but simply extends the immediate field from 32 to 64 bits. The patch for such a compound instruction is what is called R_BPF_64_64.
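
    Patching such a pair might look like the sketch below. The struct mirrors the layout given above under a made-up name, 0x18 is the opcode byte of the 64-bit immediate load, and the assumption (matching a REL-style relocation) is that the value already stored in the instruction acts as the addend.

```c
#include <stdint.h>

/* Same layout as the structure in the text; the name is hypothetical. */
struct ebpf_insn_sketch {
    uint8_t  opcode;
    uint8_t  dst:4;
    uint8_t  src:4;
    uint16_t offset;
    uint32_t imm;
};

#define EBPF_OP_LDDW 0x18   /* first half of the 16-byte "load imm64" pair */

/* Low 32 bits live in the first instruction, high 32 bits in the second. */
static uint64_t read_imm64(const struct ebpf_insn_sketch *pair)
{
    return (uint64_t)pair[0].imm | ((uint64_t)pair[1].imm << 32);
}

static void write_imm64(struct ebpf_insn_sketch *pair, uint64_t value)
{
    pair[0].imm = (uint32_t)value;
    pair[1].imm = (uint32_t)(value >> 32);
}

/* Adds the symbol address to the stored value; -1 if this is not such a pair. */
static int relocate_imm64(struct ebpf_insn_sketch *pair, uint64_t symbol_addr)
{
    if (pair[0].opcode != EBPF_OP_LDDW)
        return -1;
    write_imm64(pair, read_imm64(pair) + symbol_addr);
    return 0;
}
```

    The split into read and write helpers keeps the addition itself in ordinary 64-bit arithmetic, so carries between the two halves need no special handling.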

    To perform relocation, we once again scan the section table, this time for entries with sh_type == SHT_REL. The sh_info field of the header indicates which section we are patching, and sh_link indicates which table to take the symbol descriptions from.

    typedef struct {
      Elf64_Addr    r_offset;
      Elf64_Xword   r_info;
    } Elf64_Rel;
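
    The r_info field packs two things, the symbol index and the relocation type, and elf.h provides macros to take it apart. A small sketch follows; the struct and function names are made up, and the fallback definition of R_BPF_64_64 is an assumption for older copies of elf.h.

```c
#include <elf.h>
#include <stdint.h>

#ifndef R_BPF_64_64
#define R_BPF_64_64 1   /* absent from older copies of elf.h */
#endif

/* Hypothetical unpacked form of one relocation entry. */
struct rel_fields {
    Elf64_Addr offset;     /* byte offset inside the section being patched */
    uint32_t   sym_index;  /* index into the symbol table named by sh_link */
    uint32_t   type;       /* e.g. R_BPF_64_64 */
};

static struct rel_fields decode_rel(const Elf64_Rel *rel)
{
    struct rel_fields out;

    out.offset = rel->r_offset;
    out.sym_index = (uint32_t)ELF64_R_SYM(rel->r_info);
    out.type = (uint32_t)ELF64_R_TYPE(rel->r_info);
    return out;
}
```

    Since every eBPF instruction is 8 bytes long, offset / 8 gives the index of the instruction to patch.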

    There are actually two types of relocation sections, REL and RELA. The second one explicitly contains an additional term (the addend), but I have not come across it yet, so we simply add an assertion that it really does not occur, and process only REL. Next, I add the address of the symbol to the value already written in the instruction. And where do we get that address? Here, as we already know, there are options:

    • The symbol refers to section SHN_ABS. Then we simply take st_value.
    • The symbol refers to section SHN_UNDEF. Then we pull in an external symbol.
    • In all other cases, we simply patch in a reference to another section of the same file.
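
    The three cases above can be sketched as a single switch. Everything named here is hypothetical: section_base stands for an array of addresses where each section was loaded, and example_lookup is a stand-in for the dlsym-backed context that happens to know only the name "z".

```c
#include <elf.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the "pointer by name" context; knows a single symbol. */
static uint64_t example_z = 7;
static void *example_lookup(const char *name)
{
    return strcmp(name, "z") == 0 ? (void *)&example_z : NULL;
}

typedef void *(*extern_lookup_fn)(const char *name);

/* Returns the run-time address (or absolute value) of a symbol. */
static uintptr_t resolve_symbol_sketch(const Elf64_Sym *sym, const char *strtab,
                                       void *const *section_base,
                                       extern_lookup_fn lookup)
{
    switch (sym->st_shndx) {
    case SHN_ABS:     /* absolute value, not relocated */
        return (uintptr_t)sym->st_value;
    case SHN_UNDEF:   /* external symbol, looked up by name */
        return (uintptr_t)lookup(strtab + sym->st_name);
    default:          /* defined in this file: section address plus offset */
        return (uintptr_t)section_base[sym->st_shndx] + (uintptr_t)sym->st_value;
    }
}
```

    Other reserved section indices (for example SHN_COMMON) are not handled here and would need their own cases.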

    How to try it yourself

    First, what to read? In addition to the specification already mentioned, it makes sense to read this file, in which the iovisor team collects information about eBPF extracted from the Linux kernel.

    Second, how do you actually work with all this? First you need to get an ELF file from somewhere. As stated on StackOverflow, the following command will help us:

    clang -O2 -emit-llvm -c bpf.c -o - | llc -march=bpf -filetype=obj -o bpf.o

    Next, you need to somehow get a reference breakdown of the file into its parts. Normally the objdump command would help us:

    $ objdump
    Usage: objdump <option(s)> <file(s)>
     Display information from object <file(s)>.
     At least one of the following switches must be given:
      -a, --archive-headers    Display archive header information
      -f, --file-headers       Display the contents of the overall file header
      -p, --private-headers    Display object format specific file header contents
      -P, --private=OPT,OPT... Display object format specific contents
      -h, --[section-]headers  Display the contents of the section headers
      -x, --all-headers        Display the contents of all headers
      -d, --disassemble        Display assembler contents of executable sections
      -D, --disassemble-all    Display assembler contents of all sections
          --disassemble=<sym>  Display assembler contents from <sym>
      -S, --source             Intermix source code with disassembly
      -s, --full-contents      Display the full contents of all sections requested
      -g, --debugging          Display debug information in object file
      -e, --debugging-tags     Display debug information using ctags style
      -G, --stabs              Display (in raw form) any STABS info in the file
      -W[lLiaprmfFsoRtUuTgAckK] or
                               Display DWARF info in the file
      -t, --syms               Display the contents of the symbol table(s)
      -T, --dynamic-syms       Display the contents of the dynamic symbol table
      -r, --reloc              Display the relocation entries in the file
      -R, --dynamic-reloc      Display the dynamic relocation entries in the file
      @<file>                  Read options from <file>
      -v, --version            Display this program's version number
      -i, --info               List object formats and architectures supported
      -H, --help               Display this information

    But in this case, it is powerless:

    $ objdump -d test-bpf.o 
    test-bpf.o:     file format elf64-little
    objdump: can't disassemble for architecture UNKNOWN!

    More precisely, it will show the sections, but disassembly is a problem. Here we recall that we built the file using LLVM. And LLVM has its own extended counterparts of the binutils utilities, with names of the form llvm-<command name>. They understand LLVM bitcode, for example. And they also understand eBPF; this surely depends on build options, but since the file compiled, it should probably always be parsable too. So, for convenience, I recommend creating a script:

    vim test-bpf.c # substitute your editor of choice
    clang -Oz -emit-llvm -c test-bpf.c -o - | llc -march=bpf -filetype=obj -o test-bpf.o
    llvm-objdump -d -t -r test-bpf.o

    Then for such a source:

    #include <stdint.h>
    extern uint64_t z;
    uint64_t func(uint64_t x, uint64_t y)
    {
        return x + y + z;
    }

    There will be such a result:

    $ ./ 
    test-bpf.o:     file format ELF64-BPF
    Disassembly of section .text:
    0000000000000000 func:
           0:       bf 20 00 00 00 00 00 00         r0 = r2
           1:       0f 10 00 00 00 00 00 00         r0 += r1
           2:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00         r1 = 0 ll
                    0000000000000010:  R_BPF_64_64  z
           4:       79 11 00 00 00 00 00 00         r1 = *(u64 *)(r1 + 0)
           5:       0f 10 00 00 00 00 00 00         r0 += r1
           6:       95 00 00 00 00 00 00 00         exit
    0000000000000000 l    df *ABS*           00000000 test-bpf.c
    0000000000000000 l    d  .text           00000000 .text
    0000000000000000 g     F .text           00000038 func
    0000000000000000         *UND*           00000000 z

    Code .

    Part 1. QInst: it is better to lose a day, then fly in five minutes (we write the instrumentation trivially)
