tyomitch September 5, 2010 at 22:32

Compilation. 9: executable code

I remind you that we are writing a compiler for the toy language j-creak. We started by compiling into p-code , spent a lot of effort on its optimization , and prepared for the final stage of compilation - to output machine-dependent executable code.
There are no intricate algorithms here: by and large, only replacing one system of commands with another.

Further in the post:

Code selection
Loader
Changes to p-code
Generation
What happened?

Code selection

I would venture to assume that you are reading this post on a computer with an x86 or x64 processor inside. I apologize to mobile device users: it would be interesting to generate code for a fashionable gadget, but it’s too difficult for me to get such a copy for experiments.

We will try to generate a "cross-platform" code that is equally functional on x86 and x64. We focus on machine code, and not on the internals of the ELF format and interaction with the bootloader; therefore, we will create the code in a “solid piece”, similar to the dos format COM. Read it from disk and run will be our own bootloader.

We will use only 32-bit registers, so the same program, even if overflows occur during calculations, will give the same results on x64 and x86.

To communicate with the outside world (commands ECHOand INPUT), the generated code will have to somehow call the functions of the operating system. How does he know their addresses?
Our loader will create for the loaded program a “communication area” - an array of pointers to “standard functions”, and pass the address of this array to the program as a parameter. We have three standard functions: input of a number, output of a number, output of a line; therefore, the communication area will consist of three pointers. For the convenience of addressing, immediately at the entrance to the program, save the transmitted address in EBP/ RBP.

Another interesting point is access to memory (to poured variables). Instead of the typical x86 absolute addressing, x64 introduced "relocatable absolute", relative RIP. To really get the code cross-platform, you must use explicit relative addressing.
Without further ado, we will address the cells of the poured variables with respect to the same EBP/ RBP- i.e. immediately after the communication region, by offset 0x18, the variable region begins.

The generated p-code uses four “abstract physical registers” R01..R04, which naturally match real registersEAX,ECX,EDX,EBX: i.e. the number of this register is obtained from the number of "abstract" by subtracting one. If we wanted to use the rest of the processor registers, the correspondence between the numbers would be more complicated.

The way the parameters are passed to the function on x86 and on x64 is very different. On x86, the default parameters are passed through the stack ( cdecl ), but you can ask to gccpass up to three parameters in registers ECX,EDX,EAX( fastcall ). On x64, six parameters are transferred in registers RDI,RSI,RDX,RCX,R8,R9, and it gccdoes not allow changing the way of calling. Due to this incompatibility, it will be necessary to implement both transmission mechanisms in the code generator, and select the necessary one. Fortunately, passing parameters will be the only platform-specific place in our code.
On both processors, the called function has the right to spoil the contents EAX,ECX,EDX, so they will need to be saved before the call (if they are alive), and restored after. Thus, the information about the vitality of physical registers, which we used during their assignment, is also needed when generating the code.

What code will be generated, sorted out; it’s up to the small thing - to realize the plan.

Loader

We take as a basis the long-standing p-code interpreter, leave all the husks (checking input, mapping the file to memory), and replace the gut (the cycle of execution and implementation of commands) with a call to the read code, as if it were a normal c-function. Note that the one that defined the command definitions disappeared : the new bootloader knows nothing about the j-script, nor about our p-code.

#include 

#include 

#include 

#include 

#include 


const char* fdata = NULL; // весь прочитанный код


int input() { int d; scanf(" %d", &d); return d; }

void echoi(int i) { printf("%d", i); }

void echos(int offset) { printf("%s", fdata+offset); }


void* linkarea[1000] = {(void*)input, (void*)echoi, (void*) echos};


int main(int argc, char** argv) {


    if(argc!=2) {

        printf("Missing code file name.\n");

        exit(1);

    }


    int fd = open(argv[1], O_RDONLY);

    if (fd<0) {

        printf("Cannot open code file.\n");

        exit(1);

    }

    struct stat finfo;

    fstat(fd, &finfo);

    fdata = (const char*)mmap(0, finfo.st_size, PROT_READ|PROT_EXEC, MAP_PRIVATE, fd, 0);

    if (!fdata) {

        printf("Cannot read code file.\n");

        exit(1);

    }


    ((void (*)(void**)) fdata)(linkarea); // запуск


    munmap((void*)fdata, finfo.st_size);

    close(fd);

    return 0;

}

#include "jsk.h"

Changes to p-code

Now that the p-code has become an internal representation, invisible outside the compiler, there is no sense in compressing each command into 4 bytes; therefore, no tricky unionor field size restrictions are needed . Replace jsk.hwith the declaration of the usual structure with fields int: A feature of the x86 / x64 instruction set is its diversity and non-orthogonality. Almost any operation can be encoded in three to four different ways. We will try for each n-team to choose the most compact of the possible implementations. (This is the limiting case of “optimization through the peephole” - peephole optimization , which deals not with the program as a whole, but with short blocks of adjacent teams. In general, each team is processed independently of those around it.) For example, the p-command

struct command {

    enum opcodes {

        hlt, store, jz, echo, mov, load, input, add, sub, mul, div,

        eq, ne, ge, le, gt, lt

    };

    opcodes opcode;

    regnum dest, src1, src2;

    int imm;

    command(opcodes opcode, regnum dest, regnum src1, regnum src2) :

            opcode(opcode), dest(dest), src1(src1), src2(src2) {}

    command(opcodes opcode, regnum dest, int imm) :

            opcode(opcode), dest(dest), imm(imm) {}

};

addwe can compile in MOV, in INC, in DEC, in one of three forms ADDor in one of three forms LEA- depending on the combination of arguments.

Some local optimizations, such as replacing mul r, r, 2with add r, r, r, are conveniently performed at the p-code level, before being translated into executable code. At the same stage - after choosing physical registers, but before the final translation - we will get rid of the registers, the value of which is known in all places where they are used: during translation we will use the known value.

// в подстановке констант:

//      ...

        if(i->cmd.opcode==command::mul) { // несколько замен

            if(i->known.count(i->cmd.src1)) std::swap(i->cmd.src1, i->cmd.src2);

            if(i->known.count(i->cmd.src2)) switch(i->known[i->cmd.src2]) {

                case -1: i->cmd = command(command::sub, i->cmd.dest, 0, i->cmd.src1); break;

                case 0:  i->cmd = command(command::mov, i->cmd.dest, 0); break;

                case 1:  nopOut(i); break;

                case 2:  i->cmd = command(command::add, i->cmd.dest, i->cmd.src1, i->cmd.src1); break;

            }

        }


// удалить регистры, значение которых известно всюду, где используется

void postalloc() {

        std::vector needed(lastreg+1);

        foreach(i, pcode) {

            if(i->has2src()) {

                if(!i->known.count(i->cmd.src1)) needed[i->cmd.src1]=true;

                if(!i->known.count(i->cmd.src2)) needed[i->cmd.src2]=true;

                else // src2 известен: есть один особый случай

                    if(i->cmd.opcode==command::div && i->known[i->cmd.src2]!=2)

                        needed[i->cmd.src2]=true; // под делитель нужен регистр

            }

            if(i->readsdest() && !i->known.count(i->cmd.dest))

                needed[i->cmd.dest]=true;

        }

        foreach(i, pcode)

            if(i->writesdest() && !needed[i->cmd.dest])

                nopOut(i);

}


// после успешного выбора всех физических регистров:

            // ...

            postalloc();

            // вычистить NOP-ы

            simpleopt();

            // подставляем физические регистры: и в командах, и в known

            foreach(i, pcode) {

                std::map known;

                if(i->known.count(i->cmd.dest))

                    known[physmap[i->cmd.dest]] = i->known[i->cmd.dest];

                i->cmd.dest = physmap[i->cmd.dest];

                if (i->has2src()) {

                    if(i->known.count(i->cmd.src1))

                        known[physmap[i->cmd.src1]] = i->known[i->cmd.src1];

                    if(i->known.count(i->cmd.src2))

                        known[physmap[i->cmd.src2]] = i->known[i->cmd.src2];

                    i->cmd.src1 = physmap[i->cmd.src1];

                    i->cmd.src2 = physmap[i->cmd.src2];

                }

                i->known = known;

            }

            break;

Generation

All the generated code will be stored in one large vector . We will need several new fields. will determine the position of the team in the vector . In addition, for the and commands , the offset value of which is unknown until the end of the code generation, we will store the index of the unfilled offset in the field . So, on the first pass through the p-code, we generate all the executable code in the vector ; then go through the code a second time and fill in the missing offsets. After that, we print the result of all the code and all the lines. I tried to remove the most uninteresting pieces from the listing, in order to at least slightly reduce it. The full compiler code is available at tyomitch.net.ru/jsk.y.natv.htmlstd::vector code;
struct commandnint offset, length;codeJZECHOint needfixat;

code

// последний этап: генерация кода для x86/x64

const char regsize = sizeof(void*); // 4 либо 8

// пролог

if(regsize==4) { // PUSH EBP / MOV EBP, [ESP+8] / PUSH EDI / PUSH ESI / PUSH EBX

    code.push_back(0x55); code.push_back(0x8b); code.push_back(0x6c); code.push_back(0x24);

    code.push_back(0x08); code.push_back(0x57); code.push_back(0x56); code.push_back(0x53);

} else {         // PUSH EBP / MOV RBP, RDI / PUSH EBX

    code.push_back(0x55); code.push_back(0x48); code.push_back(0x8b); code.push_back(0xef);

    code.push_back(0x53);

}

foreach(i, pcode) {

    int imm, doffset, joffset;

    bool jshort, saveAX, saveDX;

    i->offset = code.size();

    switch(i->cmd.opcode) {

    case command::hlt:

        if(regsize==4) {

            i->emit(0x5b, 0x5e, 0x5f, 0x5d); // POP EBX / POP ESI / POP EDI / POP EBP

            i->emit(0xc2, 2, 0);             // RET 2

        } else

            i->emit(0x5b, 0x5d, 0xc3);       // POP EBX / POP EBP / RET

        break;

    case command::store: // MOV [EBP+18+(imm-1)*4], dst

        doffset = 0x18+(i->cmd.imm-1)*4;

        if(i->known.count(i->cmd.dest)) {

            i->emit(0xc7);

            if(doffset<128)

                i->emit(0x45, (char)doffset);

            else {

                i->emit14(0x85, doffset);

            }

            i->emit4(i->known[i->cmd.dest]);

        } else {

            i->emit(0x89);

            if(doffset<128)

                i->emit(0x45|(i->cmd.dest-1)<<3, (char)doffset);

            else

                i->emit14(0x85|(i->cmd.dest-1)<<3, doffset);

        }

        break;

    case command::jz:

        joffset = i->tgt->offset - (i->offset+2); // предполагаем JMP SHORT

        jshort = i->tgt->offset && (joffset>=-128); // прыжок назад и близко

        if(!i->cmd.dest) { // JMP off

            if(jshort)

                i->emit(0xeb, (char)joffset);

            else {

                i->emit(0xe9);

                if(i->tgt->offset) // прыжок назад

                    i->emit4(joffset-3);

                else {

                    i->needfixat = code.size();

                    i->emit4(0);

                }

            }

            break;

        }

        if(jshort && (i->cmd.dest==2)) { // JECXZ

            i->emit(0xe3, (char)joffset);

            break;

        }

        // OR dst, dst / JZ off

        i->emit(0x0b, 0xc0|(i->cmd.dest-1)<<3|(i->cmd.dest-1));

        if(jshort && (joffset>=-126))

            i->emit(0x74, (char)(joffset-2));

        else {

            i->emit(0x0f, 0x84);

            if(i->tgt->offset) // прыжок назад

                i->emit4(joffset-6);

            else {

                i->needfixat = code.size();

                i->emit4(0);

            }

        }

        break;

    case command::echo:  // PUSH live / PUSH dst / CALL [EBP+?] / ADD ESP, 4 / POP live

        foreach(rp, i->onexitp) if(*rp!=4) i->emit(0x50|(*rp-1));

        if(!i->cmd.dest) { // imm, [EBP+8]

            if(regsize==4) i->emit(0x68); else i->emit(0xbf);

            i->needfixat = code.size();

            i->emit4(0);

            i->emit(0xff, 0x55, 2*regsize);

        } else {

            if(i->known.count(i->cmd.dest)) { // imm, [EBP+4]

                imm = i->known[i->cmd.dest];

                if(regsize==8)

                    i->emit14(0xbf, imm);

                else if((imm>=-128)&&(imm<128))

                    i->emit(0x6a, (char)imm);

                else

                    i->emit14(0x68, imm);

            } else if(regsize==4) // dst, [EBP+4]

                i->emit(0x50|(i->cmd.dest-1));

            else

                i->emit(0x8b, 0xf8|(i->cmd.dest-1));

            i->emit(0xff, 0x55, regsize);

        }

        if(regsize==4) i->emit(0x83, 0xc4, 4);

        foreachr(rp, i->onexitp) if(*rp!=4) i->emit(0x58|(*rp-1));

        break;

    case command::mov:   // MOV dst, imm

        if(i->cmd.imm)

            i->emit14(0xb8|(i->cmd.dest-1), i->cmd.imm);

        else             // XOR dst, dst

            i->emit(0x33, 0xc0|(i->cmd.dest-1)<<3|(i->cmd.dest-1));

        break;

    case command::load:  // MOV dst, [EBP+(3+i)*4]

        doffset = 0x18+(i->cmd.imm-1)*4;

        i->emit(0x8b);

        if(doffset<128)

            i->emit(0x45|(i->cmd.dest-1)<<3, (char)doffset);

        else

            i->emit14(0x85|(i->cmd.dest-1)<<3, doffset);

        break;

    case command::input: // PUSH live / CALL [EBP+0] / XCHG EAX, dst / POP live

        foreach(rp, i->onenterp) if(*rp!=4) i->emit(0x50|(*rp-1));

        i->emit(0xff, 0x55, 0);

        if(i->cmd.dest!=1) i->emit(0x90|(i->cmd.dest-1));

        foreachr(rp, i->onenterp) if(*rp!=4) i->emit(0x58|(*rp-1));

        break;

    case command::add:

        // максимум 1 операнд известен; поставим его в src2

        if(i->known.count(i->cmd.src1) || (i->cmd.src2==i->cmd.dest))

            std::swap(i->cmd.src1, i->cmd.src2);

        if(!i->cmd.src2) // MOV dst, src1

            i->emit(0x8b, 0xc0|(i->cmd.dest-1)<<3|(i->cmd.src1-1));

        else if(i->known.count(i->cmd.src2)) {

            imm = i->known[i->cmd.src2];

    addimm:

            if(i->cmd.dest==i->cmd.src1) // ADD dst, imm

                if(imm==1) // INC dst

                    i->emit(0xff, 0xc0|(i->cmd.dest-1));

                else if(imm==-1) // DEC dst

                    i->emit(0xff, 0xc8|(i->cmd.dest-1));

                else if((imm>=-128)&&(imm<128))

                    i->emit(0x83, 0xc0|(i->cmd.dest-1), (char)imm);

                else // for imm=128 we might use SUB Edst, -128

                    i->emit114(0x81, 0xc0|(i->cmd.dest-1), imm);

            else // LEA dst, [src1+imm]

                if((imm>=-128)&&(imm<128))

                    i->emit(0x8d, 0x40|(i->cmd.dest-1)<<3|(i->cmd.src1-1), (char)imm);

                else

                    i->emit114(0x8d, 0x80|(i->cmd.dest-1)<<3|(i->cmd.src1-1), imm);

        } else if(i->cmd.dest==i->cmd.src1) // ADD dst, src2

            i->emit(0x03, 0xc0|(i->cmd.dest-1)<<3|(i->cmd.src2-1));

        else // LEA dst, [src1+src2]

            i->emit(0x8d, (i->cmd.dest-1)<<3|4, (i->cmd.src1-1)<<3|(i->cmd.src2-1));

        break;

    case command::sub: // ...

    case command::mul: // ...

    case command::div:

        if(i->known.count(i->cmd.src2) && i->known[i->cmd.src2]==2) {

            if(i->cmd.dest!=i->cmd.src1) // MOV dst, src1 / SAR dst, 1

                i->emit(0x8b, 0xc0|(i->cmd.dest-1)<<3|(i->cmd.src1-1));

            i->emit(0xd1, 0xf8|(i->cmd.dest-1));

            break;

        }

        // для деления нужны именно EAX,EDX;

        // сохраним их текущее содержимое в EDI,ESI, а после деления восстановим

        // MOV EDI, EAX / MOV EAX, src1 / MOV ESI, EDX / XOR EDX, EDX

        // IDIV src2 / XCHG dst, EAX / MOV EDX, ESI / MOV EAX, EDI

        saveAX = i->onexitp.count((physreg)1) && (i->cmd.dest!=1);

        saveDX = i->onexitp.count((physreg)3) && (i->cmd.dest!=3);

        if(saveAX || (i->cmd.src2==1)) i->emit(0x8b, 0xf8);

        if(i->cmd.src1!=1)

            if(i->known.count(i->cmd.src1))

                i->emit14(0xb8, i->known[i->cmd.src1]);

            else

                i->emit(0x8b, 0xc0|(i->cmd.src1-1));

        if(saveDX || (i->cmd.src2==3)) i->emit(0x8b, 0xf2);

        i->emit(0x33, 0xd2);

        if(i->cmd.src2==1) // переехал в EDI

            i->emit(0xf7, 0xff);

        else if(i->cmd.src2==3) // переехал ESI

            i->emit(0xf7, 0xfe);

        else

            i->emit(0xf7, 0xf8|(i->cmd.src2-1));

        if(i->cmd.dest!=1)

            i->emit(0x90|(i->cmd.dest-1));

        if(saveDX) i->emit(0x8b, 0xd6);

        if(saveAX) i->emit(0x8b, 0xc7);

        break;

    case command::eq: case command::ne: case command::ge:

    case command::le: case command::gt: case command::lt:

        // максимум 1 операнд известен; поставим его в src2

        if(i->known.count(i->cmd.src1)) {

            std::swap(i->cmd.src1, i->cmd.src2);

            switch(i->cmd.opcode) {

            case command::ge: i->cmd.opcode = command::le; break;

            case command::le: i->cmd.opcode = command::ge; break;

            case command::gt: i->cmd.opcode = command::lt; break;

            case command::lt: i->cmd.opcode = command::gt; break;

            }

        }

        // CMP src1, src2 / SETcc dstL / MOVZX dst, dstL

        if(i->known.count(i->cmd.src2)) {

            imm = i->known[i->cmd.src2];

            if((imm>=-128)&&(imm<128))

                i->emit(0x83, 0xf8|(i->cmd.src1-1), (char)imm);

            else if(i->cmd.src1==1)

                i->emit14(0x3d, imm);

            else

                i->emit114(0x81, 0xf8|(i->cmd.src1-1), imm);

        } else

            i->emit(0x3b, 0xc0|(i->cmd.src1-1)<<3|(i->cmd.src2-1));

        i->emit(0x0f);

        switch(i->cmd.opcode) {

        case command::eq: i->emit(0x94); break;

        case command::ne: i->emit(0x95); break;

        // ...

        }

        i->emit(0xc0|(i->cmd.dest-1));

        i->emit(0x0f, 0xb6, 0xc0|(i->cmd.dest-1)<<3|(i->cmd.dest-1));

        break;

    default:

        yyerror("unknown opcode");

    }

}

// после всего кода -- строки

int offset = code.size();

std::vector offsets(strings.size()+1);

foreach(i, strings) {

    offsets[i->second] = offset;

    offset += i->first.length();

    offset++;

}

// второй проход: подставляем смещения

foreach(i, pcode) if(i->needfixat)

    if(i->cmd.opcode==command::jz)

        i->fix4(i->tgt->offset-(i->offset+i->length));

    else if (i->cmd.opcode==command::echo)

        i->fix4(offsets[i->cmd.imm]);


write(1, &*code.begin(), code.size()); // вывод кода

foreach(i, strings)                     // вывод строк

    write(1, i->first.c_str(), i->first.length()+1);

What happened?

We compile the compiler, compile a test program with it, compile the bootloader, and run: I wonder what our program compiled into?

[tyomitch@home ~]$ bison jsk.y

[tyomitch@home ~]$ c++ jsk.tab.c lex.yy.c -o jskc

[tyomitch@home ~]$ ./jskc < test.jsk > code

[tyomitch@home ~]$ cc jskld.c -o jskld

[tyomitch@home ~]$ ./jskld code

Задумай число от 0 до 1000, а я буду угадывать

Это 500? (1=меньше, 2=больше, 3=попал) 1

Это 249? (1=меньше, 2=больше, 3=попал) 3

Ура! Я молодец!

00:55 push rbp
01: 48 8b ef mov rbp, rdi
04:39 push rbx
05:33 c0 xor eax, eax
07: b9 e8 03 00 00 mov ecx, 0x3e8
0c: 50 push rax
0d: 51 push rcx
0e: bf 80 01 00 00 mov edi, 0x180
13: ff 55 10 call qword ptr [rbp + 0x10]
16: 59 pop rcx
17: 58 pop rax
18: 50 push rax
19: 51 push rcx
1a: bf 00 00 00 00 mov edi, 0x0
1f: ff 55 08 call qword ptr [rbp + 0x8]
22: 59 pop rcx
23: 58 pop rax
24: 50 push rax
25: 51 push rcx
26: bf fa 00 00 00 mov edi, 0xfa
2b: ff 55 10 call qword ptr [rbp + 0x10]
2e: 59 pop rcx
2f: 58 pop rax
30: 50 push rax
31: 51 push rcx
32: bf e8 03 00 00 mov edi, 0x3e8
37: ff 55 08 call qword ptr [rbp + 0x8]
3a: 59 pop rcx
3b: 58 pop rax
3c: 50 push rax
3d: 51 push rcx
3e: bf 01 01 00 00 mov edi, 0x101
43: ff 55 10 call qword ptr [rbp + 0x10]
46: 59 pop rcx
47: 58 pop rax
48: 3b c1 cmp eax, ecx
4a: 0f 9e c2 setle dl
4d: 0f b6 d2 movzx edx, dl
50: 0b d2 or edx, edx
52: 0f 84 97 00 00 00 je 0xef
58: 8d 14 01 lea edx, [rcx + rax * 1]
5b: d1 fa sar edx, 1
5d: 50 push rax
5e: 51 push rcx
5f: 52 push rdx
60: bf bc 01 00 00 mov edi, 0x1bc
65: ff 55 10 call qword ptr [rbp + 0x10]
68: 5a pop rdx
69: 59 pop rcx
6a: 58 pop rax
6b: 50 push rax
6c: 51 push rcx
6d: 52 push rdx
6e: 8b fa mov edi, edx
70: ff 55 08 call qword ptr [rbp + 0x8]
73: 5a pop rdx
74: 59 pop rcx
75: 58 pop rax
76: 50 push rax
77: 51 push rcx
78: 52 push rdx
79: bf 26 01 00 00 mov edi, 0x126
7e: ff 55 10 call qword ptr [rbp + 0x10]
81: 5a pop rdx
82: 59 pop rcx
83: 58 pop rax
84: 50 push rax
85: 51 push rcx
86: 52 push rdx
87: ff 55 00 call qword ptr [rbp + 0x0]
8a: 93 xchg ebx, eax
8b: 5a pop rdx
8c: 59 pop rcx
8d: 58 pop rax
8e: 89 45 18 mov dword ptr [rbp + 0x18], eax
91: 83 fb 01 cmp ebx, 0x1
94: 0f 94 c0 sete al
97: 0f b6 c0 movzx eax, al
9a: 0b c0 or eax, eax
9c: 0f 84 08 00 00 00 je 0xaa
a2: 8b 45 18 mov eax, dword ptr [rbp + 0x18]
a5: 8d 4a ff lea ecx, [rdx-0x1]
a8: eb 9e jmp 0x48
aa: 83 fb 02 cmp ebx, 0x2
ad: 0f 94 c0 sete al
b0: 0f b6 c0 movzx eax, al
b3: 0b c0 or eax, eax
b5: 0f 84 04 00 00 00 je 0xbf
bb: 03 c2 add eax, edx
bd: eb 89 jmp 0x48
bf: 8b 45 18 mov eax, dword ptr [rbp + 0x18]
c2: 83 fb 03 cmp ebx, 0x3
c5: 0f 94 c2 sete dl
c8: 0f b6 d2 movzx edx, dl
cb: 0b d2 or edx, edx
cd: 0f 84 0b 00 00 00 je 0xde
d3: bf a0 01 00 00 mov edi, 0x1a0
d8: ff 55 10 call qword ptr [rbp + 0x10]
db: 5b pop rbx
dc: 5d pop rbp
dd: c3 ret
de: 50 push rax
df: 51 push rcx
e0: bf c4 01 00 00 mov edi, 0x1c4
e5: ff 55 10 call qword ptr [rbp + 0x10]
e8: 59 pop rcx
e9: 58 pop rax
ea: e9 59 ff ff ff jmp 0x48
ef: bf 59 01 00 00 mov edi, 0x159
f4: ff 55 10 call qword ptr [rbp + 0x10]
f7: 5b pop rbx
f8: 5d pop rbp
f9: c3 ret

With the naked eye you can see at least two relevant optimizations that need an “eye” for more than one team:

when the command ends on POP reg, and the next starts on PUSH reg, both of these instructions can be deleted;
the combination SETcc regl / MOVZX reg, regl / OR reg, reg / JZ offsetcan be replaced with one instruction JNcc offset, if regnot alive at the exit JZ.

This “big peephole optimization” is what we’ll do next time. The more important remaining task is to generate a real ELF so that you don't need a makeshift bootloader.

Tags: