tyomitch August 31, 2010 at 00:53

Compilation. 8: optimization

After a pleasant stay, we continue to write a compiler for our j-script.
In a previous post, we implemented a heuristic taken from the ceiling for register assignment , and at the same time began to optimize the code. And before that, readers found a bug in the implementation of assignment.

Further in the post:

Fixing a bug
Cleaning copies
What happened?
Folding constants
Implementation

Fixing a bug

The fact is that we decided to cheat, and when assigning the value to a variable for the first time, do not actually copy, but simply declare the register with the intermediate value as the storage location for the variable: Then when compiling an operation of the type we get one command (from convolution 2) and remember (from convolution assignment). Everything is true, simple and natural. The problem arose when, on the right side of the assignment, not an intermediate value was used, but something long-lived: for example, another variable.

    ID '=' EXPR       { $$ = $3;

                        if(vars[$1])

                            emit(command::add, vars[$1], $3, 0);

                        else

                            vars[$1] = $3; // new var

                      }

a=2;MOV R1, 2vars["a"]=R1

a = 2;
b = a;

In the second convolution of the assignment, a new code is not generated - we only remember vars["b"]=R1
Both variables were in the same register, and when one of them changes, the second will also change.

The solution lies on the surface: for each new variable we start a new register. Then from get already a couple of teams

    ID '=' EXPR       { $$ = $3;

                        if(!vars[$1]) vars[$1] = newreg();

                        emit(command::add, vars[$1], $3, 0);

                      }

a=2;

MOV R1, 2
ADD R2, R1, 0

and remember vars["a"]=R2
if it is followed b = a;- it will be added to the code MOV R3, R2, 0, and vars["b"]=R3

other words, is now in the generated code will be a lot of extra copy from register to register, and the register assignment will run more slowly - because of the fact that the number of registers used is growing.
We will try to find those cases where copying is unnecessary (for example, if the variable from the right side of the assignment does not change further), and we will correct the commands that use the copy so that they use the original.

Cleaning copies

Copy elimination is one of the simple optimizations that I promised from the very first post of the series. As for the optimizations performed during register assignment, data-flow analysis is convenient for cleaning. An important difference from the two previous applications of DFA (going back to detect live registers, going forward to find saved registers) will be that at each point we store not one set of registers, but the division of all registers into sets of the same ones. You can look at this as a more general case of DFA than the two previously discussed. (Before, registers were always divided into two sets - “included” and “excluded.”)

How will we split the registers?

the command ADD RA, RB, 0carries over RAinto the set k RB;
any other case-altering command puts it into a singleton set;
RAand RBthey will be together in team C if they are together in all teams from which you can go to C (directly or by jump).

After receiving the final splitting, it will be seen where the original can be used instead of the copy: RAyou can use instead RBif they are together in all the commands where it is used RA.

And one more subtlety: since during the transitions from team to team we do not build up sets, but truncate (we intersect all incoming sets), before starting DFA, you need to initialize sets not into empty, but into comprehensive ones - and as the algorithm works, the sets are truncated, as necessary. In order not to spend money and not really hold in every team the set of all existing registers, we will agree to consider the “missing iterator” pointing to such a comprehensive set.

For convenience, we format the three operations we need on partitions into a class. In the partition, we store the list of sets into which the registers are divided (except for the “global” set, in which the registers are all together initially), and for each register (except those in the “global” set), an iterator of the set in which it is located . I had some strange problems

with wood std::set, so I wrote myself a container bit::setwith a similar interface, but with the std::bitsetinside. By the template parameter, it takes the maximum key value in the set.
At the same time, bit::setstandard operations on sets ( &=, |=) are taken out . In adding a new field is now the usual way implement DFA, using the ready-operation:

typedef bit::set regset;


class regPartition {

    typedef std::list regsets;

    regsets sets;

    std::map byreg;

// изначально: все регистры в "глобальном" разбиении


public:

    // возвращает: изменилось ли разбиение

    bool add(regnum copy, regnum orig) {

        if (byreg.count(copy)) {

            if(byreg[copy]==byreg[orig]) // уже вместе

                return false;

            byreg[copy]->erase(copy);

            // был последним?

            if(!byreg[copy]->size())

                sets.erase(byreg[copy]);

        }

        assert(byreg.count(orig));

        byreg[copy] = byreg[orig];

        byreg[copy]->insert(copy);

        return true;

    }


    void remove(regnum r) {

        if (byreg.count(‌r)) {

            if(byreg[r]->size()==1) return; // уже один

            byreg[r]->erase(‌r);

        }

        byreg[r] = sets.insert(sets.end(), regset());

        byreg[r]->insert(‌r);

    }


    // возвращает: изменилось ли разбиение

    bool isect(/*const*/regPartition& p, const command& self) {

        bool changed = false;

        // оставить вместе только тех, кто вместе и в p

        foreach(i, byreg) {

            regnum r = i->first;

            // split by p.byreg[r]

            regsets::iterator withR = i->second,

                         withoutR = sets.insert(sets.end(), regset());

            foreach2(j, (*withR), next)

                // не разделяем, если текущая команда -- mov с теми же регистрами:

                //   всё равно на следующем шагу объединятся

                if(!(self.opcode==command::add && !self.src2 &&

                    ((self.src1==r && self.dest==*j)||(self.dest==r && self.src1==*j)))

                &&((!p.byreg.count(‌r) && p.byreg.count(*j)) ||  // R in global, J isn't

                    (p.byreg.count(‌r) && !p.byreg[r]->count(*j)))) { // R not in global

                        withR->erase(*j);

                        withoutR->insert(*j);

                        byreg[*j] = withoutR;

                }

            if(!withoutR->size()) // ничего не отделилось

                sets.erase(withoutR);

            else // разделили

                changed = true;

        }

        // теперь те, что в глобальном в this, но не в p

        foreach(i, p.sets) {

            regset inP;

            foreach(j, (*i))

                if(!byreg.count(*j)) inP.insert(*j);

            if(inP.size()) {

                regsets::iterator newSet = sets.insert(sets.end(), inP);

                foreach(j, inP) byreg[*j] = newSet;

                changed = true;

            }

        }

        return changed;

    }


// применение разбиения: формируем для r множество возможных замен (во всём коде)

    void apply(regnum r, regset& global) {

        if (!r) return; // may not be replaced

        assert(byreg.count(‌r));

        if(!global.size()) // uninitialized set

            global = *byreg[r];

        else // initialized; intersect

            global &= *byreg[r];

    }

}

struct commandnregPartition copies;

void copies() {

        // а) вычисляем разбиения для каждой команды

        bool changed = false;

        foreach(i, pcode) {

            i->copies = regPartition();

            // rule 2

            if (writesdest(i)) {

                i->copies.remove(i->cmd.dest);

                changed = true;

            }

        }

        while (changed) {

            changed = false;

            foreach2(i, pcode, next) {

                // rule 1

                if (i->cmd.opcode==command::add && !i->cmd.src2)

                    changed |= i->copies.add(i->cmd.dest, i->cmd.src1);

                // rule 3 (next command)

                if (hasnext(i))

                    changed |= next->copies.isect(i->copies, next->cmd);

                // rule 3 (jmp target)

                if (i->cmd.opcode==command::jz)

                    changed |= i->tgt->copies.isect(i->copies, i->tgt->cmd);

            }

        }

        // б) вычисляем возможные замены во всём коде

        std::vector copies(lastreg+1);

        foreach(i, pcode) {

            if(readsdest(i))

                i->copies.apply(i->cmd.dest, copies[i->cmd.dest]);

            if(has2src(i)) {

                i->copies.apply(i->cmd.src1, copies[i->cmd.src1]);

                i->copies.apply(i->cmd.src2, copies[i->cmd.src2]);

            }

        }

        // в) объединяем копии

        for(regnum r=1; r<=lastreg; r++) {

            copies[r].erase(‌r);

            if(copies[r].size()) {  // остались возможные замены?

                regnum s = *(copies[r].begin()); // заменим r на s

                foreach(i, pcode) { // во всём коде

                    if(i->cmd.dest==r)

                        i->cmd.dest = s;

                    if(has2src(i)) {

                        if(i->cmd.src1==r) i->cmd.src1 = s;

                        if(i->cmd.src2==r) i->cmd.src2 = s;

                    }

                    if(i->cmd.opcode==command::add && i->cmd.src1==i->cmd.dest && !i->cmd.src2) // self-mov

                        nopOut(i);

                }

                foreach(c, copies)  // и в векторе замен

                    if(c->count(‌r)) {

                        c->erase(‌r);

                        c->insert(s);

                    }

            }

        }

}

We copies();insert the call at the very beginning of the optimization cycle, before checking liveliness.

What happened?

Compared to the last time, the code was reduced by a couple more commands:

00 mov r01,0
01 mov r02, 0x3e8
02 echo 0x126
03 echo r01
04 echo 0xa0
05 echo r02
06 echo 0xa7
07 le r03, r01, r02
08 jz r03, + 0x1d (= 0x26)
09 add r03, r01, r02
0a mov r04, 2
0b div r03, r03, r04
0c echo 0x162
0d echo r03
0e echo 0xcc
0f input r04
10 store r01, 1
11 mov r01, 1
12 eq r01, r04, r01
13 jz r01, +4 (= 0x18)
14 load r01, 1
15 mov r02, 1
16 sub r02, r03, r02
17 jz 0, -0x11 (= 0x7)
18 mov r01, 2
19 eq r01, r04, r01
1a jz r01, +3 (= 0x1e)
1b mov r01, 1
1c add r01, r03, r01
1d jz 0, -0x17 (= 0x7)
1e load r01, 1
1f mov r03, 3
20 eq r03, r04, r03
21 jz r03, +2 (= 0x24)
22 echo 0x146
23 hlt
24 echo 0x16a
25 jz 0, -0x1f (= 7)
26 echo 0xff
27 hlt

It might seem that about the disappeared teams ( add r01, r01, 0and add r02, r02, 0) it was immediately obvious that they were meaningless. In fact, these commands took a meaningless form only after the appointment of physical registers, i.e. at the very last stage before outputting the finished p-code. Until then, the numbers of the n-registers of the operands were different; only the analysis that we just performed allowed us to combine them and delete the copying that made senseless — all this long before the appointment of physical registers.

Folding constants

Another standard optimization, which, like the previous ones, is implemented using DFA, is constant folding . The principle is very simple: if the values of the operands are known, then the operation can be performed immediately at compilation. For example, instead of code

MOV R1, 2
MOV R2, 3
ADD R3, R1, R2

we can generate immediately

MOV R3.5

Manipulating constants are not necessarily indicative of programmer negligence POLENOV calculate beforehand known result: for example, pixels=1024*768;it is easier to read and maintain than pixels=786432;

this time, in each instruction store a plurality of registers for which values are known, - together with the values: as As usual, we formulate three rules for calculating sets:std::map

in a command, the MOV R, Xvalue of R is known, and this is X ;
in any other command that sets the value of R, this value is unknown;
in team C, the value of R is known if it is known and the same in all teams from which you can go to C (directly or by jump).

We see again: the direction of the passage is forward (its value in the next depends on the value of the register in the previous command); operation in nodes - a union of unknown registers.

When the sets stabilize, we can replace each operation, both operands of which are known, with MOV.

The same data will allow us to perform another optimization - constant propagation (substituting a known value instead of a register reference). This optimization is not possible with the p-code format chosen by us, because there are no operations on the register and constant in it; such operations, however, are present in many real processors, so it will be possible to perform full-fledged “constant substitution” when generating executable code. Now we restrict ourselves to replacing the zero value with R0.

For example, a type construct if (1>2) { echo("unreachable"); }that compiles to

MOV R1, 1
MOV R2, 2
GT R3, R1, R2
JZ R3, label
ECHO "unreachable"
label:

will turn at the stage of folding constants in

MOV R1, 1
MOV R2, 2
MOV R3, 0
JZ R3, label
ECHO "unreachable"
label:

and the optimization “destruction of non-living code” already implemented by us last time will remove the first two teams MOV.
If at the same time we replace the zero value with R0:

MOV R3, 0
JZ R0, label
ECHO "unreachable"
label:

then along with the non-living code the last one will be deleted MOV, and the “destruction of the unreachable code” will also delete ECHO, turning it JZinto NOP.

Similarly, you can remove from code JZwith a known non-zero value. The second implemented “special case” is the replacement of commands ADD RX, (0), RYby ADD RX, RY, R0so that the copy cleaning algorithm recognizes copying from register to register in this command.

Another benefit of folding constants is that negative values can now be used in our commands. Due to the fact that in the lexer we set the token with NUMregexp [0-9]+, strings like "-123" were interpreted as the unary minus and then the literal 123; so they compiled into p-code like

MOV R1, 123
SUB R2, R0, R1

Now in the p-code there will be an honest team MOV R1, -123.

Implementation

struct commandnIt is supplemented by a couple of fields: The basis of optimization, as in previous cases, is DFA: The procedure implements the union of sets of unknown registers: a register with several known values is declared unknown. The last thing left is the actual calculation of the value when the operands are known. In the same way as in the j-script executor, we start a function for each opcode: Insert a call in front of it and finish with optimization. In the next post - collect from p-code-set of physical registers real executable code for x86 / x64.
std::map known; regset unknown;

void constp() {

        bool changed = false;

        foreach(i, pcode) {

            i->known.clear(); i->unknown.clear();

            if (i->cmd.opcode==command::mov) { // rule 1

                i->known[i->cmd.dest] = i->cmd.imm;

                changed = true;

            } else if(writesdest(i)) { // rule 2

                i->unknown.insert(i->cmd.dest);

                changed = true;

            }

        }

        while(changed) {

            changed = false;

            foreach2(i, pcode, next) {

                // rule 3 (next command)

                if (hasnext(i))

                    changed |= propagate(i, next);

                // rule 3 (jmp target)

                if (i->cmd.opcode==command::jz)

                    changed |= propagate(i, i->tgt);

            }

        }

        // заменим известные значения

        foreach(i, pcode) {

            i->known[0] = 0; // R0 известен всегда

            if(has2src(i) && i->known.count(i->cmd.src1) && i->known.count(i->cmd.src2))

                i->cmd = command(command::mov, i->cmd.dest, ops[i->cmd.opcode](i->known[i->cmd.src1],i->known[i->cmd.src2]));

            // подставляем 0

            if(has2src(i)) {

                if(i->known.count(i->cmd.src1) && !i->known[i->cmd.src1])

                    i->cmd.src1 = 0;

                if(i->known.count(i->cmd.src2) && !i->known[i->cmd.src2])

                    i->cmd.src2 = 0;

                if(i->cmd.opcode==command::add && !i->cmd.src1) { // чтоб распознавалось как копирование

                    i->cmd.src1 = i->cmd.src2;

                    i->cmd.src2 = 0;

                }

            }

            if(readsdest(i) && i->known.count(i->cmd.dest))

                if(!i->known[i->cmd.dest])

                    i->cmd.dest = 0;

                else // значение известно, но это не 0

                    if(i->cmd.opcode==command::jz) nopOut(i);

        }

}

propagate()

bool propagate(pcommandn from, pcommandn to) {

        bool changed = false; // возвращает: изменились ли множества

        // проверяем известные значения

        foreach(i, from->known) {

            regnum r = i->first;

            if(to->known.count(‌r))

                if((to->known[r]!=i->second) // другое, и не заданное правилом 1 

                    &&!((to->cmd.opcode==command::mov) && (to->cmd.dest==r))) {

                        to->known.erase(‌r);

                        to->unknown.insert(‌r);

                        changed = true;

                } else; // значение известное и верное

            else if(!to->unknown.count(‌r)) { // по умолчанию, известно

                to->known[r]=i->second;

                changed = true;

            }

        }

        // объединяем неизвестные

        foreach(r, from->unknown)

            if(!to->unknown.count(*r)) {

                to->unknown.insert(*r);

                to->known.erase(*r);

                changed = true;

            }

        return changed;

}

int hlt(int src1, int src2) { assert(false); return 0; }

int add(int src1, int src2) { return src1+src2; }

int sub(int src1, int src2) { return src1-src2; }

...

int lt(int src1, int src2) { return src1
int (*ops[])(int, int) = {hlt, hlt, hlt, hlt, hlt, hlt, hlt, add, sub, mul, div, eq, ne, ge, le, gt, lt};

constp();copies();

Tags: