Compilation. 8: optimization
After a pleasant stay, we continue to write a compiler for our j-script.
In a previous post, we implemented a heuristic taken from the ceiling for register assignment , and at the same time began to optimize the code. And before that, readers found a bug in the implementation of assignment.
The fact is that we decided to cheat, and when assigning the value to a variable for the first time, do not actually copy, but simply declare the register with the intermediate value as the storage location for the variable: Then when compiling an operation of the type we get one command (from convolution 2) and remember (from convolution assignment). Everything is true, simple and natural. The problem arose when, on the right side of the assignment, not an intermediate value was used, but something long-lived: for example, another variable.
In the second convolution of the assignment, a new code is not generated - we only remember
Both variables were in the same register, and when one of them changes, the second will also change.
The solution lies on the surface: for each new variable we start a new register. Then from get already a couple of teams
if it is followed
other words, is now in the generated code will be a lot of extra copy from register to register, and the register assignment will run more slowly - because of the fact that the number of registers used is growing.
We will try to find those cases where copying is unnecessary (for example, if the variable from the right side of the assignment does not change further), and we will correct the commands that use the copy so that they use the original.
Copy elimination is one of the simple optimizations that I promised from the very first post of the series. As for the optimizations performed during register assignment, data-flow analysis is convenient for cleaning. An important difference from the two previous applications of DFA (going back to detect live registers, going forward to find saved registers) will be that at each point we store not one set of registers, but the division of all registers into sets of the same ones. You can look at this as a more general case of DFA than the two previously discussed. (Before, registers were always divided into two sets - “included” and “excluded.”)
How will we split the registers?
And one more subtlety: since during the transitions from team to team we do not build up sets, but truncate (we intersect all incoming sets), before starting DFA, you need to initialize sets not into empty, but into comprehensive ones - and as the algorithm works, the sets are truncated, as necessary. In order not to spend money and not really hold in every team the set of all existing registers, we will agree to consider the “missing iterator” pointing to such a comprehensive set.
For convenience, we format the three operations we need on partitions into a class. In the partition, we store the list of sets into which the registers are divided (except for the “global” set, in which the registers are all together initially), and for each register (except those in the “global” set), an iterator of the set in which it is located . I had some strange problems
with wood
At the same time,
Compared to the last time, the code was reduced by a couple more commands:
It might seem that about the disappeared teams (
Another standard optimization, which, like the previous ones, is implemented using DFA, is constant folding . The principle is very simple: if the values of the operands are known, then the operation can be performed immediately at compilation. For example, instead of code
Manipulating constants are not necessarily indicative of programmer negligence POLENOV calculate beforehand known result: for example,
this time, in each instruction store a plurality of registers for which values are known, - together with the values: as As usual, we formulate three rules for calculating sets:
When the sets stabilize, we can replace each operation, both operands of which are known, with
The same data will allow us to perform another optimization - constant propagation (substituting a known value instead of a register reference). This optimization is not possible with the p-code format chosen by us, because there are no operations on the register and constant in it; such operations, however, are present in many real processors, so it will be possible to perform full-fledged “constant substitution” when generating executable code. Now we restrict ourselves to replacing the zero value with R0.
For example, a type construct
If at the same time we replace the zero value with R0:
Similarly, you can remove from code
Another benefit of folding constants is that negative values can now be used in our commands. Due to the fact that in the lexer we set the token with
In a previous post, we implemented a heuristic taken from the ceiling for register assignment , and at the same time began to optimize the code. And before that, readers found a bug in the implementation of assignment.
Further in the post:
- Fixing a bug
- Cleaning copies
- What happened?
- Folding constants
- Implementation
Fixing a bug
The fact is that we decided to cheat, and when assigning the value to a variable for the first time, do not actually copy, but simply declare the register with the intermediate value as the storage location for the variable: Then when compiling an operation of the type we get one command (from convolution 2) and remember (from convolution assignment). Everything is true, simple and natural. The problem arose when, on the right side of the assignment, not an intermediate value was used, but something long-lived: for example, another variable.
ID '=' EXPR { $$ = $3;
if(vars[$1])
emit(command::add, vars[$1], $3, 0);
else
vars[$1] = $3; // new var
}
a=2;
MOV R1, 2
vars["a"]=R1
a = 2; b = a;
In the second convolution of the assignment, a new code is not generated - we only remember
vars["b"]=R1
Both variables were in the same register, and when one of them changes, the second will also change.
The solution lies on the surface: for each new variable we start a new register. Then from get already a couple of teams
ID '=' EXPR { $$ = $3;
if(!vars[$1]) vars[$1] = newreg();
emit(command::add, vars[$1], $3, 0);
}
a=2;
MOV R1, 2 ADD R2, R1, 0and remember
vars["a"]=R2
if it is followed
b = a;
- it will be added to the code MOV R3, R2, 0
, and vars["b"]=R3
other words, is now in the generated code will be a lot of extra copy from register to register, and the register assignment will run more slowly - because of the fact that the number of registers used is growing.
We will try to find those cases where copying is unnecessary (for example, if the variable from the right side of the assignment does not change further), and we will correct the commands that use the copy so that they use the original.
Cleaning copies
Copy elimination is one of the simple optimizations that I promised from the very first post of the series. As for the optimizations performed during register assignment, data-flow analysis is convenient for cleaning. An important difference from the two previous applications of DFA (going back to detect live registers, going forward to find saved registers) will be that at each point we store not one set of registers, but the division of all registers into sets of the same ones. You can look at this as a more general case of DFA than the two previously discussed. (Before, registers were always divided into two sets - “included” and “excluded.”)
How will we split the registers?
- the command
ADD RA, RB, 0
carries overRA
into the set kRB
; - any other case-altering command puts it into a singleton set;
RA
andRB
they will be together in team C if they are together in all teams from which you can go to C (directly or by jump).
RA
you can use instead RB
if they are together in all the commands where it is used RA
. And one more subtlety: since during the transitions from team to team we do not build up sets, but truncate (we intersect all incoming sets), before starting DFA, you need to initialize sets not into empty, but into comprehensive ones - and as the algorithm works, the sets are truncated, as necessary. In order not to spend money and not really hold in every team the set of all existing registers, we will agree to consider the “missing iterator” pointing to such a comprehensive set.
For convenience, we format the three operations we need on partitions into a class. In the partition, we store the list of sets into which the registers are divided (except for the “global” set, in which the registers are all together initially), and for each register (except those in the “global” set), an iterator of the set in which it is located . I had some strange problems
with wood
std::set
, so I wrote myself a container bit::set
with a similar interface, but with the std::bitset
inside. By the template parameter, it takes the maximum key value in the set. At the same time,
bit::set
standard operations on sets ( &=
, |=
) are taken out .
In adding a new field
is now the usual way implement DFA, using the ready-operation:typedef bit::set regset;
class regPartition {
typedef std::list regsets;
regsets sets;
std::map byreg;
// изначально: все регистры в "глобальном" разбиении
public:
// возвращает: изменилось ли разбиение
bool add(regnum copy, regnum orig) {
if (byreg.count(copy)) {
if(byreg[copy]==byreg[orig]) // уже вместе
return false;
byreg[copy]->erase(copy);
// был последним?
if(!byreg[copy]->size())
sets.erase(byreg[copy]);
}
assert(byreg.count(orig));
byreg[copy] = byreg[orig];
byreg[copy]->insert(copy);
return true;
}
void remove(regnum r) {
if (byreg.count(r)) {
if(byreg[r]->size()==1) return; // уже один
byreg[r]->erase(r);
}
byreg[r] = sets.insert(sets.end(), regset());
byreg[r]->insert(r);
}
// возвращает: изменилось ли разбиение
bool isect(/*const*/regPartition& p, const command& self) {
bool changed = false;
// оставить вместе только тех, кто вместе и в p
foreach(i, byreg) {
regnum r = i->first;
// split by p.byreg[r]
regsets::iterator withR = i->second,
withoutR = sets.insert(sets.end(), regset());
foreach2(j, (*withR), next)
// не разделяем, если текущая команда -- mov с теми же регистрами:
// всё равно на следующем шагу объединятся
if(!(self.opcode==command::add && !self.src2 &&
((self.src1==r && self.dest==*j)||(self.dest==r && self.src1==*j)))
&&((!p.byreg.count(r) && p.byreg.count(*j)) || // R in global, J isn't
(p.byreg.count(r) && !p.byreg[r]->count(*j)))) { // R not in global
withR->erase(*j);
withoutR->insert(*j);
byreg[*j] = withoutR;
}
if(!withoutR->size()) // ничего не отделилось
sets.erase(withoutR);
else // разделили
changed = true;
}
// теперь те, что в глобальном в this, но не в p
foreach(i, p.sets) {
regset inP;
foreach(j, (*i))
if(!byreg.count(*j)) inP.insert(*j);
if(inP.size()) {
regsets::iterator newSet = sets.insert(sets.end(), inP);
foreach(j, inP) byreg[*j] = newSet;
changed = true;
}
}
return changed;
}
// применение разбиения: формируем для r множество возможных замен (во всём коде)
void apply(regnum r, regset& global) {
if (!r) return; // may not be replaced
assert(byreg.count(r));
if(!global.size()) // uninitialized set
global = *byreg[r];
else // initialized; intersect
global &= *byreg[r];
}
}
struct commandn
regPartition copies;
void copies() {
// а) вычисляем разбиения для каждой команды
bool changed = false;
foreach(i, pcode) {
i->copies = regPartition();
// rule 2
if (writesdest(i)) {
i->copies.remove(i->cmd.dest);
changed = true;
}
}
while (changed) {
changed = false;
foreach2(i, pcode, next) {
// rule 1
if (i->cmd.opcode==command::add && !i->cmd.src2)
changed |= i->copies.add(i->cmd.dest, i->cmd.src1);
// rule 3 (next command)
if (hasnext(i))
changed |= next->copies.isect(i->copies, next->cmd);
// rule 3 (jmp target)
if (i->cmd.opcode==command::jz)
changed |= i->tgt->copies.isect(i->copies, i->tgt->cmd);
}
}
// б) вычисляем возможные замены во всём коде
std::vector copies(lastreg+1);
foreach(i, pcode) {
if(readsdest(i))
i->copies.apply(i->cmd.dest, copies[i->cmd.dest]);
if(has2src(i)) {
i->copies.apply(i->cmd.src1, copies[i->cmd.src1]);
i->copies.apply(i->cmd.src2, copies[i->cmd.src2]);
}
}
// в) объединяем копии
for(regnum r=1; r<=lastreg; r++) {
copies[r].erase(r);
if(copies[r].size()) { // остались возможные замены?
regnum s = *(copies[r].begin()); // заменим r на s
foreach(i, pcode) { // во всём коде
if(i->cmd.dest==r)
i->cmd.dest = s;
if(has2src(i)) {
if(i->cmd.src1==r) i->cmd.src1 = s;
if(i->cmd.src2==r) i->cmd.src2 = s;
}
if(i->cmd.opcode==command::add && i->cmd.src1==i->cmd.dest && !i->cmd.src2) // self-mov
nopOut(i);
}
foreach(c, copies) // и в векторе замен
if(c->count(r)) {
c->erase(r);
c->insert(s);
}
}
}
}
We copies();
insert the call at the very beginning of the optimization cycle, before checking liveliness.What happened?
Compared to the last time, the code was reduced by a couple more commands:
00 mov r01,0 01 mov r02, 0x3e8 02 echo 0x126 03 echo r01 04 echo 0xa0 05 echo r02 06 echo 0xa7 07 le r03, r01, r02 08 jz r03, + 0x1d (= 0x26) 09 add r03, r01, r02 0a mov r04, 2 0b div r03, r03, r04 0c echo 0x162 0d echo r03 0e echo 0xcc 0f input r04 10 store r01, 1 11 mov r01, 1 12 eq r01, r04, r01 13 jz r01, +4 (= 0x18) 14 load r01, 1 15 mov r02, 1 16 sub r02, r03, r02 17 jz 0, -0x11 (= 0x7) 18 mov r01, 2 19 eq r01, r04, r01 1a jz r01, +3 (= 0x1e) 1b mov r01, 1 1c add r01, r03, r01 1d jz 0, -0x17 (= 0x7) 1e load r01, 1 1f mov r03, 3 20 eq r03, r04, r03 21 jz r03, +2 (= 0x24) 22 echo 0x146 23 hlt 24 echo 0x16a 25 jz 0, -0x1f (= 7) 26 echo 0xff 27 hlt
It might seem that about the disappeared teams (
add r01, r01, 0
and add r02, r02, 0
) it was immediately obvious that they were meaningless. In fact, these commands took a meaningless form only after the appointment of physical registers, i.e. at the very last stage before outputting the finished p-code. Until then, the numbers of the n-registers of the operands were different; only the analysis that we just performed allowed us to combine them and delete the copying that made senseless — all this long before the appointment of physical registers.Folding constants
Another standard optimization, which, like the previous ones, is implemented using DFA, is constant folding . The principle is very simple: if the values of the operands are known, then the operation can be performed immediately at compilation. For example, instead of code
MOV R1, 2 MOV R2, 3 ADD R3, R1, R2we can generate immediately
MOV R3.5
Manipulating constants are not necessarily indicative of programmer negligence POLENOV calculate beforehand known result: for example,
pixels=1024*768;
it is easier to read and maintain than pixels=786432;
this time, in each instruction store a plurality of registers for which values are known, - together with the values: as As usual, we formulate three rules for calculating sets:
std::map
- in a command, the
MOV R, X
value of R is known, and this is X ; - in any other command that sets the value of R, this value is unknown;
- in team C, the value of R is known if it is known and the same in all teams from which you can go to C (directly or by jump).
When the sets stabilize, we can replace each operation, both operands of which are known, with
MOV
. The same data will allow us to perform another optimization - constant propagation (substituting a known value instead of a register reference). This optimization is not possible with the p-code format chosen by us, because there are no operations on the register and constant in it; such operations, however, are present in many real processors, so it will be possible to perform full-fledged “constant substitution” when generating executable code. Now we restrict ourselves to replacing the zero value with R0.
For example, a type construct
if (1>2) { echo("unreachable"); }
that compiles toMOV R1, 1 MOV R2, 2 GT R3, R1, R2 JZ R3, label ECHO "unreachable" label:will turn at the stage of folding constants in
MOV R1, 1 MOV R2, 2 MOV R3, 0 JZ R3, label ECHO "unreachable" label:and the optimization “destruction of non-living code” already implemented by us last time will remove the first two teams
MOV
. If at the same time we replace the zero value with R0:
MOV R3, 0 JZ R0, label ECHO "unreachable" label:then along with the non-living code the last one will be deleted
MOV
, and the “destruction of the unreachable code” will also delete ECHO
, turning it JZ
into NOP
. Similarly, you can remove from code
JZ
with a known non-zero value. The second implemented “special case” is the replacement of commands ADD RX, (0), RY
by ADD RX, RY, R0
so that the copy cleaning algorithm recognizes copying from register to register in this command. Another benefit of folding constants is that negative values can now be used in our commands. Due to the fact that in the lexer we set the token with
NUM
regexp [0-9]+
, strings like "-123" were interpreted as the unary minus and then the literal 123; so they compiled into p-code likeMOV R1, 123 SUB R2, R0, R1Now in the p-code there will be an honest team
MOV R1, -123
.Implementation
struct commandn
It is supplemented by a couple of fields:
The basis of optimization, as in previous cases, is DFA:
The procedure implements the union of sets of unknown registers: a register with several known values is declared unknown.
The last thing left is the actual calculation of the value when the operands are known. In the same way as in the j-script executor, we start a function for each opcode:
Insert a call in front of it and finish with optimization.
In the next post - collect from p-code-set of physical registers real executable code for x86 / x64.std::map known; regset unknown;
void constp() {
bool changed = false;
foreach(i, pcode) {
i->known.clear(); i->unknown.clear();
if (i->cmd.opcode==command::mov) { // rule 1
i->known[i->cmd.dest] = i->cmd.imm;
changed = true;
} else if(writesdest(i)) { // rule 2
i->unknown.insert(i->cmd.dest);
changed = true;
}
}
while(changed) {
changed = false;
foreach2(i, pcode, next) {
// rule 3 (next command)
if (hasnext(i))
changed |= propagate(i, next);
// rule 3 (jmp target)
if (i->cmd.opcode==command::jz)
changed |= propagate(i, i->tgt);
}
}
// заменим известные значения
foreach(i, pcode) {
i->known[0] = 0; // R0 известен всегда
if(has2src(i) && i->known.count(i->cmd.src1) && i->known.count(i->cmd.src2))
i->cmd = command(command::mov, i->cmd.dest, ops[i->cmd.opcode](i->known[i->cmd.src1],i->known[i->cmd.src2]));
// подставляем 0
if(has2src(i)) {
if(i->known.count(i->cmd.src1) && !i->known[i->cmd.src1])
i->cmd.src1 = 0;
if(i->known.count(i->cmd.src2) && !i->known[i->cmd.src2])
i->cmd.src2 = 0;
if(i->cmd.opcode==command::add && !i->cmd.src1) { // чтоб распознавалось как копирование
i->cmd.src1 = i->cmd.src2;
i->cmd.src2 = 0;
}
}
if(readsdest(i) && i->known.count(i->cmd.dest))
if(!i->known[i->cmd.dest])
i->cmd.dest = 0;
else // значение известно, но это не 0
if(i->cmd.opcode==command::jz) nopOut(i);
}
}
propagate()
bool propagate(pcommandn from, pcommandn to) {
bool changed = false; // возвращает: изменились ли множества
// проверяем известные значения
foreach(i, from->known) {
regnum r = i->first;
if(to->known.count(r))
if((to->known[r]!=i->second) // другое, и не заданное правилом 1
&&!((to->cmd.opcode==command::mov) && (to->cmd.dest==r))) {
to->known.erase(r);
to->unknown.insert(r);
changed = true;
} else; // значение известное и верное
else if(!to->unknown.count(r)) { // по умолчанию, известно
to->known[r]=i->second;
changed = true;
}
}
// объединяем неизвестные
foreach(r, from->unknown)
if(!to->unknown.count(*r)) {
to->unknown.insert(*r);
to->known.erase(*r);
changed = true;
}
return changed;
}
int hlt(int src1, int src2) { assert(false); return 0; }
int add(int src1, int src2) { return src1+src2; }
int sub(int src1, int src2) { return src1-src2; }
...
int lt(int src1, int src2) { return src1
int (*ops[])(int, int) = {hlt, hlt, hlt, hlt, hlt, hlt, hlt, add, sub, mul, div, eq, ne, ge, le, gt, lt};
constp();
copies();