Compilation. 9: executable code
I remind you that we are writing a compiler for the toy language j-creak. We started by compiling into p-code , spent a lot of effort on its optimization , and prepared for the final stage of compilation - to output machine-dependent executable code.
There are no intricate algorithms here: by and large, only replacing one system of commands with another.
I would venture to assume that you are reading this post on a computer with an x86 or x64 processor inside. I apologize to mobile device users: it would be interesting to generate code for a fashionable gadget, but it’s too difficult for me to get such a copy for experiments.
We will try to generate a "cross-platform" code that is equally functional on x86 and x64. We focus on machine code, and not on the internals of the ELF format and interaction with the bootloader; therefore, we will create the code in a “solid piece”, similar to the dos format COM. Read it from disk and run will be our own bootloader.
We will use only 32-bit registers, so the same program, even if overflows occur during calculations, will give the same results on x64 and x86.
To communicate with the outside world (commands
Our loader will create for the loaded program a “communication area” - an array of pointers to “standard functions”, and pass the address of this array to the program as a parameter. We have three standard functions: input of a number, output of a number, output of a line; therefore, the communication area will consist of three pointers. For the convenience of addressing, immediately at the entrance to the program, save the transmitted address in
Another interesting point is access to memory (to poured variables). Instead of the typical x86 absolute addressing, x64 introduced "relocatable absolute", relative
Without further ado, we will address the cells of the poured variables with respect to the same
The generated p-code uses four “abstract physical registers”
The way the parameters are passed to the function on x86 and on x64 is very different. On x86, the default parameters are passed through the stack ( cdecl ), but you can ask to
On both processors, the called function has the right to spoil the contents
What code will be generated, sorted out; it’s up to the small thing - to realize the plan.
We take as a basis the long-standing p-code interpreter, leave all the husks (checking input, mapping the file to memory), and replace the gut (the cycle of execution and implementation of commands) with a call to the read code, as if it were a normal c-function. Note that the one that defined the command definitions disappeared : the new bootloader knows nothing about the j-script, nor about our p-code.
Now that the p-code has become an internal representation, invisible outside the compiler, there is no sense in compressing each command into 4 bytes; therefore, no tricky
Some local optimizations, such as replacing
All the generated code will be stored in one large vector . We will need several new fields. will determine the position of the team in the vector . In addition, for the and commands , the offset value of which is unknown until the end of the code generation, we will store the index of the unfilled offset in the field . So, on the first pass through the p-code, we generate all the executable code in the vector ; then go through the code a second time and fill in the missing offsets. After that, we print the result of all the code and all the lines. I tried to remove the most uninteresting pieces from the listing, in order to at least slightly reduce it. The full compiler code is available at tyomitch.net.ru/jsk.y.natv.html
We compile the compiler, compile a test program with it, compile the bootloader, and run: I wonder what our program compiled into?
With the naked eye you can see at least two relevant optimizations that need an “eye” for more than one team:
There are no intricate algorithms here: by and large, only replacing one system of commands with another.
Further in the post:
- Code selection
- Loader
- Changes to p-code
- Generation
- What happened?
Code selection
I would venture to assume that you are reading this post on a computer with an x86 or x64 processor inside. I apologize to mobile device users: it would be interesting to generate code for a fashionable gadget, but it’s too difficult for me to get such a copy for experiments.
We will try to generate a "cross-platform" code that is equally functional on x86 and x64. We focus on machine code, and not on the internals of the ELF format and interaction with the bootloader; therefore, we will create the code in a “solid piece”, similar to the dos format COM. Read it from disk and run will be our own bootloader.
We will use only 32-bit registers, so the same program, even if overflows occur during calculations, will give the same results on x64 and x86.
To communicate with the outside world (commands
ECHO
and INPUT
), the generated code will have to somehow call the functions of the operating system. How does he know their addresses? Our loader will create for the loaded program a “communication area” - an array of pointers to “standard functions”, and pass the address of this array to the program as a parameter. We have three standard functions: input of a number, output of a number, output of a line; therefore, the communication area will consist of three pointers. For the convenience of addressing, immediately at the entrance to the program, save the transmitted address in
EBP
/ RBP
.Another interesting point is access to memory (to poured variables). Instead of the typical x86 absolute addressing, x64 introduced "relocatable absolute", relative
RIP
. To really get the code cross-platform, you must use explicit relative addressing. Without further ado, we will address the cells of the poured variables with respect to the same
EBP
/ RBP
- i.e. immediately after the communication region, by offset 0x18
, the variable region begins. The generated p-code uses four “abstract physical registers”
R01..R04
, which naturally match real registersEAX,ECX,EDX,EBX
: i.e. the number of this register is obtained from the number of "abstract" by subtracting one. If we wanted to use the rest of the processor registers, the correspondence between the numbers would be more complicated. The way the parameters are passed to the function on x86 and on x64 is very different. On x86, the default parameters are passed through the stack ( cdecl ), but you can ask to
gcc
pass up to three parameters in registers ECX,EDX,EAX
( fastcall ). On x64, six parameters are transferred in registers RDI,RSI,RDX,RCX,R8,R9
, and it gcc
does not allow changing the way of calling. Due to this incompatibility, it will be necessary to implement both transmission mechanisms in the code generator, and select the necessary one. Fortunately, passing parameters will be the only platform-specific place in our code.On both processors, the called function has the right to spoil the contents
EAX,ECX,EDX
, so they will need to be saved before the call (if they are alive), and restored after. Thus, the information about the vitality of physical registers, which we used during their assignment, is also needed when generating the code. What code will be generated, sorted out; it’s up to the small thing - to realize the plan.
Loader
We take as a basis the long-standing p-code interpreter, leave all the husks (checking input, mapping the file to memory), and replace the gut (the cycle of execution and implementation of commands) with a call to the read code, as if it were a normal c-function. Note that the one that defined the command definitions disappeared : the new bootloader knows nothing about the j-script, nor about our p-code.
#include
#include
#include
#include
#include
const char* fdata = NULL; // весь прочитанный код
int input() { int d; scanf(" %d", &d); return d; }
void echoi(int i) { printf("%d", i); }
void echos(int offset) { printf("%s", fdata+offset); }
void* linkarea[1000] = {(void*)input, (void*)echoi, (void*) echos};
int main(int argc, char** argv) {
if(argc!=2) {
printf("Missing code file name.\n");
exit(1);
}
int fd = open(argv[1], O_RDONLY);
if (fd<0) {
printf("Cannot open code file.\n");
exit(1);
}
struct stat finfo;
fstat(fd, &finfo);
fdata = (const char*)mmap(0, finfo.st_size, PROT_READ|PROT_EXEC, MAP_PRIVATE, fd, 0);
if (!fdata) {
printf("Cannot read code file.\n");
exit(1);
}
((void (*)(void**)) fdata)(linkarea); // запуск
munmap((void*)fdata, finfo.st_size);
close(fd);
return 0;
}
#include "jsk.h"
Changes to p-code
Now that the p-code has become an internal representation, invisible outside the compiler, there is no sense in compressing each command into 4 bytes; therefore, no tricky
union
or field size restrictions are needed . Replace jsk.h
with the declaration of the usual structure with fields int
:
A feature of the x86 / x64 instruction set is its diversity and non-orthogonality. Almost any operation can be encoded in three to four different ways. We will try for each n-team to choose the most compact of the possible implementations. (This is the limiting case of “optimization through the peephole” - peephole optimization , which deals not with the program as a whole, but with short blocks of adjacent teams. In general, each team is processed independently of those around it.)
For example, the p-commandstruct command {
enum opcodes {
hlt, store, jz, echo, mov, load, input, add, sub, mul, div,
eq, ne, ge, le, gt, lt
};
opcodes opcode;
regnum dest, src1, src2;
int imm;
command(opcodes opcode, regnum dest, regnum src1, regnum src2) :
opcode(opcode), dest(dest), src1(src1), src2(src2) {}
command(opcodes opcode, regnum dest, int imm) :
opcode(opcode), dest(dest), imm(imm) {}
};
add
we can compile in MOV
, in INC
, in DEC
, in one of three forms ADD
or in one of three forms LEA
- depending on the combination of arguments. Some local optimizations, such as replacing
mul r, r, 2
with add r, r, r
, are conveniently performed at the p-code level, before being translated into executable code. At the same stage - after choosing physical registers, but before the final translation - we will get rid of the registers, the value of which is known in all places where they are used: during translation we will use the known value.// в подстановке констант:
// ...
if(i->cmd.opcode==command::mul) { // несколько замен
if(i->known.count(i->cmd.src1)) std::swap(i->cmd.src1, i->cmd.src2);
if(i->known.count(i->cmd.src2)) switch(i->known[i->cmd.src2]) {
case -1: i->cmd = command(command::sub, i->cmd.dest, 0, i->cmd.src1); break;
case 0: i->cmd = command(command::mov, i->cmd.dest, 0); break;
case 1: nopOut(i); break;
case 2: i->cmd = command(command::add, i->cmd.dest, i->cmd.src1, i->cmd.src1); break;
}
}
// удалить регистры, значение которых известно всюду, где используется
void postalloc() {
std::vector needed(lastreg+1);
foreach(i, pcode) {
if(i->has2src()) {
if(!i->known.count(i->cmd.src1)) needed[i->cmd.src1]=true;
if(!i->known.count(i->cmd.src2)) needed[i->cmd.src2]=true;
else // src2 известен: есть один особый случай
if(i->cmd.opcode==command::div && i->known[i->cmd.src2]!=2)
needed[i->cmd.src2]=true; // под делитель нужен регистр
}
if(i->readsdest() && !i->known.count(i->cmd.dest))
needed[i->cmd.dest]=true;
}
foreach(i, pcode)
if(i->writesdest() && !needed[i->cmd.dest])
nopOut(i);
}
// после успешного выбора всех физических регистров:
// ...
postalloc();
// вычистить NOP-ы
simpleopt();
// подставляем физические регистры: и в командах, и в known
foreach(i, pcode) {
std::map known;
if(i->known.count(i->cmd.dest))
known[physmap[i->cmd.dest]] = i->known[i->cmd.dest];
i->cmd.dest = physmap[i->cmd.dest];
if (i->has2src()) {
if(i->known.count(i->cmd.src1))
known[physmap[i->cmd.src1]] = i->known[i->cmd.src1];
if(i->known.count(i->cmd.src2))
known[physmap[i->cmd.src2]] = i->known[i->cmd.src2];
i->cmd.src1 = physmap[i->cmd.src1];
i->cmd.src2 = physmap[i->cmd.src2];
}
i->known = known;
}
break;
Generation
All the generated code will be stored in one large vector . We will need several new fields. will determine the position of the team in the vector . In addition, for the and commands , the offset value of which is unknown until the end of the code generation, we will store the index of the unfilled offset in the field . So, on the first pass through the p-code, we generate all the executable code in the vector ; then go through the code a second time and fill in the missing offsets. After that, we print the result of all the code and all the lines. I tried to remove the most uninteresting pieces from the listing, in order to at least slightly reduce it. The full compiler code is available at tyomitch.net.ru/jsk.y.natv.html
std::vector code;
struct commandn
int offset, length;
code
JZ
ECHO
int needfixat;
code
// последний этап: генерация кода для x86/x64
const char regsize = sizeof(void*); // 4 либо 8
// пролог
if(regsize==4) { // PUSH EBP / MOV EBP, [ESP+8] / PUSH EDI / PUSH ESI / PUSH EBX
code.push_back(0x55); code.push_back(0x8b); code.push_back(0x6c); code.push_back(0x24);
code.push_back(0x08); code.push_back(0x57); code.push_back(0x56); code.push_back(0x53);
} else { // PUSH EBP / MOV RBP, RDI / PUSH EBX
code.push_back(0x55); code.push_back(0x48); code.push_back(0x8b); code.push_back(0xef);
code.push_back(0x53);
}
foreach(i, pcode) {
int imm, doffset, joffset;
bool jshort, saveAX, saveDX;
i->offset = code.size();
switch(i->cmd.opcode) {
case command::hlt:
if(regsize==4) {
i->emit(0x5b, 0x5e, 0x5f, 0x5d); // POP EBX / POP ESI / POP EDI / POP EBP
i->emit(0xc2, 2, 0); // RET 2
} else
i->emit(0x5b, 0x5d, 0xc3); // POP EBX / POP EBP / RET
break;
case command::store: // MOV [EBP+18+(imm-1)*4], dst
doffset = 0x18+(i->cmd.imm-1)*4;
if(i->known.count(i->cmd.dest)) {
i->emit(0xc7);
if(doffset<128)
i->emit(0x45, (char)doffset);
else {
i->emit14(0x85, doffset);
}
i->emit4(i->known[i->cmd.dest]);
} else {
i->emit(0x89);
if(doffset<128)
i->emit(0x45|(i->cmd.dest-1)<<3, (char)doffset);
else
i->emit14(0x85|(i->cmd.dest-1)<<3, doffset);
}
break;
case command::jz:
joffset = i->tgt->offset - (i->offset+2); // предполагаем JMP SHORT
jshort = i->tgt->offset && (joffset>=-128); // прыжок назад и близко
if(!i->cmd.dest) { // JMP off
if(jshort)
i->emit(0xeb, (char)joffset);
else {
i->emit(0xe9);
if(i->tgt->offset) // прыжок назад
i->emit4(joffset-3);
else {
i->needfixat = code.size();
i->emit4(0);
}
}
break;
}
if(jshort && (i->cmd.dest==2)) { // JECXZ
i->emit(0xe3, (char)joffset);
break;
}
// OR dst, dst / JZ off
i->emit(0x0b, 0xc0|(i->cmd.dest-1)<<3|(i->cmd.dest-1));
if(jshort && (joffset>=-126))
i->emit(0x74, (char)(joffset-2));
else {
i->emit(0x0f, 0x84);
if(i->tgt->offset) // прыжок назад
i->emit4(joffset-6);
else {
i->needfixat = code.size();
i->emit4(0);
}
}
break;
case command::echo: // PUSH live / PUSH dst / CALL [EBP+?] / ADD ESP, 4 / POP live
foreach(rp, i->onexitp) if(*rp!=4) i->emit(0x50|(*rp-1));
if(!i->cmd.dest) { // imm, [EBP+8]
if(regsize==4) i->emit(0x68); else i->emit(0xbf);
i->needfixat = code.size();
i->emit4(0);
i->emit(0xff, 0x55, 2*regsize);
} else {
if(i->known.count(i->cmd.dest)) { // imm, [EBP+4]
imm = i->known[i->cmd.dest];
if(regsize==8)
i->emit14(0xbf, imm);
else if((imm>=-128)&&(imm<128))
i->emit(0x6a, (char)imm);
else
i->emit14(0x68, imm);
} else if(regsize==4) // dst, [EBP+4]
i->emit(0x50|(i->cmd.dest-1));
else
i->emit(0x8b, 0xf8|(i->cmd.dest-1));
i->emit(0xff, 0x55, regsize);
}
if(regsize==4) i->emit(0x83, 0xc4, 4);
foreachr(rp, i->onexitp) if(*rp!=4) i->emit(0x58|(*rp-1));
break;
case command::mov: // MOV dst, imm
if(i->cmd.imm)
i->emit14(0xb8|(i->cmd.dest-1), i->cmd.imm);
else // XOR dst, dst
i->emit(0x33, 0xc0|(i->cmd.dest-1)<<3|(i->cmd.dest-1));
break;
case command::load: // MOV dst, [EBP+(3+i)*4]
doffset = 0x18+(i->cmd.imm-1)*4;
i->emit(0x8b);
if(doffset<128)
i->emit(0x45|(i->cmd.dest-1)<<3, (char)doffset);
else
i->emit14(0x85|(i->cmd.dest-1)<<3, doffset);
break;
case command::input: // PUSH live / CALL [EBP+0] / XCHG EAX, dst / POP live
foreach(rp, i->onenterp) if(*rp!=4) i->emit(0x50|(*rp-1));
i->emit(0xff, 0x55, 0);
if(i->cmd.dest!=1) i->emit(0x90|(i->cmd.dest-1));
foreachr(rp, i->onenterp) if(*rp!=4) i->emit(0x58|(*rp-1));
break;
case command::add:
// максимум 1 операнд известен; поставим его в src2
if(i->known.count(i->cmd.src1) || (i->cmd.src2==i->cmd.dest))
std::swap(i->cmd.src1, i->cmd.src2);
if(!i->cmd.src2) // MOV dst, src1
i->emit(0x8b, 0xc0|(i->cmd.dest-1)<<3|(i->cmd.src1-1));
else if(i->known.count(i->cmd.src2)) {
imm = i->known[i->cmd.src2];
addimm:
if(i->cmd.dest==i->cmd.src1) // ADD dst, imm
if(imm==1) // INC dst
i->emit(0xff, 0xc0|(i->cmd.dest-1));
else if(imm==-1) // DEC dst
i->emit(0xff, 0xc8|(i->cmd.dest-1));
else if((imm>=-128)&&(imm<128))
i->emit(0x83, 0xc0|(i->cmd.dest-1), (char)imm);
else // for imm=128 we might use SUB Edst, -128
i->emit114(0x81, 0xc0|(i->cmd.dest-1), imm);
else // LEA dst, [src1+imm]
if((imm>=-128)&&(imm<128))
i->emit(0x8d, 0x40|(i->cmd.dest-1)<<3|(i->cmd.src1-1), (char)imm);
else
i->emit114(0x8d, 0x80|(i->cmd.dest-1)<<3|(i->cmd.src1-1), imm);
} else if(i->cmd.dest==i->cmd.src1) // ADD dst, src2
i->emit(0x03, 0xc0|(i->cmd.dest-1)<<3|(i->cmd.src2-1));
else // LEA dst, [src1+src2]
i->emit(0x8d, (i->cmd.dest-1)<<3|4, (i->cmd.src1-1)<<3|(i->cmd.src2-1));
break;
case command::sub: // ...
case command::mul: // ...
case command::div:
if(i->known.count(i->cmd.src2) && i->known[i->cmd.src2]==2) {
if(i->cmd.dest!=i->cmd.src1) // MOV dst, src1 / SAR dst, 1
i->emit(0x8b, 0xc0|(i->cmd.dest-1)<<3|(i->cmd.src1-1));
i->emit(0xd1, 0xf8|(i->cmd.dest-1));
break;
}
// для деления нужны именно EAX,EDX;
// сохраним их текущее содержимое в EDI,ESI, а после деления восстановим
// MOV EDI, EAX / MOV EAX, src1 / MOV ESI, EDX / XOR EDX, EDX
// IDIV src2 / XCHG dst, EAX / MOV EDX, ESI / MOV EAX, EDI
saveAX = i->onexitp.count((physreg)1) && (i->cmd.dest!=1);
saveDX = i->onexitp.count((physreg)3) && (i->cmd.dest!=3);
if(saveAX || (i->cmd.src2==1)) i->emit(0x8b, 0xf8);
if(i->cmd.src1!=1)
if(i->known.count(i->cmd.src1))
i->emit14(0xb8, i->known[i->cmd.src1]);
else
i->emit(0x8b, 0xc0|(i->cmd.src1-1));
if(saveDX || (i->cmd.src2==3)) i->emit(0x8b, 0xf2);
i->emit(0x33, 0xd2);
if(i->cmd.src2==1) // переехал в EDI
i->emit(0xf7, 0xff);
else if(i->cmd.src2==3) // переехал ESI
i->emit(0xf7, 0xfe);
else
i->emit(0xf7, 0xf8|(i->cmd.src2-1));
if(i->cmd.dest!=1)
i->emit(0x90|(i->cmd.dest-1));
if(saveDX) i->emit(0x8b, 0xd6);
if(saveAX) i->emit(0x8b, 0xc7);
break;
case command::eq: case command::ne: case command::ge:
case command::le: case command::gt: case command::lt:
// максимум 1 операнд известен; поставим его в src2
if(i->known.count(i->cmd.src1)) {
std::swap(i->cmd.src1, i->cmd.src2);
switch(i->cmd.opcode) {
case command::ge: i->cmd.opcode = command::le; break;
case command::le: i->cmd.opcode = command::ge; break;
case command::gt: i->cmd.opcode = command::lt; break;
case command::lt: i->cmd.opcode = command::gt; break;
}
}
// CMP src1, src2 / SETcc dstL / MOVZX dst, dstL
if(i->known.count(i->cmd.src2)) {
imm = i->known[i->cmd.src2];
if((imm>=-128)&&(imm<128))
i->emit(0x83, 0xf8|(i->cmd.src1-1), (char)imm);
else if(i->cmd.src1==1)
i->emit14(0x3d, imm);
else
i->emit114(0x81, 0xf8|(i->cmd.src1-1), imm);
} else
i->emit(0x3b, 0xc0|(i->cmd.src1-1)<<3|(i->cmd.src2-1));
i->emit(0x0f);
switch(i->cmd.opcode) {
case command::eq: i->emit(0x94); break;
case command::ne: i->emit(0x95); break;
// ...
}
i->emit(0xc0|(i->cmd.dest-1));
i->emit(0x0f, 0xb6, 0xc0|(i->cmd.dest-1)<<3|(i->cmd.dest-1));
break;
default:
yyerror("unknown opcode");
}
}
// после всего кода -- строки
int offset = code.size();
std::vector offsets(strings.size()+1);
foreach(i, strings) {
offsets[i->second] = offset;
offset += i->first.length();
offset++;
}
// второй проход: подставляем смещения
foreach(i, pcode) if(i->needfixat)
if(i->cmd.opcode==command::jz)
i->fix4(i->tgt->offset-(i->offset+i->length));
else if (i->cmd.opcode==command::echo)
i->fix4(offsets[i->cmd.imm]);
write(1, &*code.begin(), code.size()); // вывод кода
foreach(i, strings) // вывод строк
write(1, i->first.c_str(), i->first.length()+1);
What happened?
We compile the compiler, compile a test program with it, compile the bootloader, and run: I wonder what our program compiled into?
[tyomitch@home ~]$ bison jsk.y
[tyomitch@home ~]$ c++ jsk.tab.c lex.yy.c -o jskc
[tyomitch@home ~]$ ./jskc < test.jsk > code
[tyomitch@home ~]$ cc jskld.c -o jskld
[tyomitch@home ~]$ ./jskld code
Задумай число от 0 до 1000, а я буду угадывать
Это 500? (1=меньше, 2=больше, 3=попал) 1
Это 249? (1=меньше, 2=больше, 3=попал) 3
Ура! Я молодец!
00:55 push rbp 01: 48 8b ef mov rbp, rdi 04:39 push rbx 05:33 c0 xor eax, eax 07: b9 e8 03 00 00 mov ecx, 0x3e8 0c: 50 push rax 0d: 51 push rcx 0e: bf 80 01 00 00 mov edi, 0x180 13: ff 55 10 call qword ptr [rbp + 0x10] 16: 59 pop rcx 17: 58 pop rax 18: 50 push rax 19: 51 push rcx 1a: bf 00 00 00 00 mov edi, 0x0 1f: ff 55 08 call qword ptr [rbp + 0x8] 22: 59 pop rcx 23: 58 pop rax 24: 50 push rax 25: 51 push rcx 26: bf fa 00 00 00 mov edi, 0xfa 2b: ff 55 10 call qword ptr [rbp + 0x10] 2e: 59 pop rcx 2f: 58 pop rax 30: 50 push rax 31: 51 push rcx 32: bf e8 03 00 00 mov edi, 0x3e8 37: ff 55 08 call qword ptr [rbp + 0x8] 3a: 59 pop rcx 3b: 58 pop rax 3c: 50 push rax 3d: 51 push rcx 3e: bf 01 01 00 00 mov edi, 0x101 43: ff 55 10 call qword ptr [rbp + 0x10] 46: 59 pop rcx 47: 58 pop rax 48: 3b c1 cmp eax, ecx 4a: 0f 9e c2 setle dl 4d: 0f b6 d2 movzx edx, dl 50: 0b d2 or edx, edx 52: 0f 84 97 00 00 00 je 0xef 58: 8d 14 01 lea edx, [rcx + rax * 1] 5b: d1 fa sar edx, 1 5d: 50 push rax 5e: 51 push rcx 5f: 52 push rdx 60: bf bc 01 00 00 mov edi, 0x1bc 65: ff 55 10 call qword ptr [rbp + 0x10] 68: 5a pop rdx 69: 59 pop rcx 6a: 58 pop rax 6b: 50 push rax 6c: 51 push rcx 6d: 52 push rdx 6e: 8b fa mov edi, edx 70: ff 55 08 call qword ptr [rbp + 0x8] 73: 5a pop rdx 74: 59 pop rcx 75: 58 pop rax 76: 50 push rax 77: 51 push rcx 78: 52 push rdx 79: bf 26 01 00 00 mov edi, 0x126 7e: ff 55 10 call qword ptr [rbp + 0x10] 81: 5a pop rdx 82: 59 pop rcx 83: 58 pop rax 84: 50 push rax 85: 51 push rcx 86: 52 push rdx 87: ff 55 00 call qword ptr [rbp + 0x0] 8a: 93 xchg ebx, eax 8b: 5a pop rdx 8c: 59 pop rcx 8d: 58 pop rax 8e: 89 45 18 mov dword ptr [rbp + 0x18], eax 91: 83 fb 01 cmp ebx, 0x1 94: 0f 94 c0 sete al 97: 0f b6 c0 movzx eax, al 9a: 0b c0 or eax, eax 9c: 0f 84 08 00 00 00 je 0xaa a2: 8b 45 18 mov eax, dword ptr [rbp + 0x18] a5: 8d 4a ff lea ecx, [rdx-0x1] a8: eb 9e jmp 0x48 aa: 83 fb 02 cmp ebx, 0x2 ad: 0f 94 c0 sete al b0: 0f b6 c0 movzx eax, al b3: 0b c0 or eax, eax b5: 0f 84 04 00 00 00 je 0xbf bb: 03 c2 add eax, edx bd: eb 89 jmp 0x48 bf: 8b 45 18 mov eax, dword ptr [rbp + 0x18] c2: 83 fb 03 cmp ebx, 0x3 c5: 0f 94 c2 sete dl c8: 0f b6 d2 movzx edx, dl cb: 0b d2 or edx, edx cd: 0f 84 0b 00 00 00 je 0xde d3: bf a0 01 00 00 mov edi, 0x1a0 d8: ff 55 10 call qword ptr [rbp + 0x10] db: 5b pop rbx dc: 5d pop rbp dd: c3 ret de: 50 push rax df: 51 push rcx e0: bf c4 01 00 00 mov edi, 0x1c4 e5: ff 55 10 call qword ptr [rbp + 0x10] e8: 59 pop rcx e9: 58 pop rax ea: e9 59 ff ff ff jmp 0x48 ef: bf 59 01 00 00 mov edi, 0x159 f4: ff 55 10 call qword ptr [rbp + 0x10] f7: 5b pop rbx f8: 5d pop rbp f9: c3 ret
With the naked eye you can see at least two relevant optimizations that need an “eye” for more than one team:
- when the command ends on
POP reg
, and the next starts onPUSH reg
, both of these instructions can be deleted; - the combination
SETcc regl / MOVZX reg, regl / OR reg, reg / JZ offset
can be replaced with one instructionJNcc offset
, ifreg
not alive at the exitJZ
.