Compilation. 8: optimization

    After a pleasant stay, we continue to write a compiler for our j-script.
    In a previous post, we implemented a heuristic taken from the ceiling for register assignment , and at the same time began to optimize the code. And before that, readers found a bug in the implementation of assignment.

    Further in the post:

    1. Fixing a bug
    2. Cleaning copies
    3. What happened?
    4. Folding constants
    5. Implementation

    Fixing a bug

    The fact is that we decided to cheat, and when assigning the value to a variable for the first time, do not actually copy, but simply declare the register with the intermediate value as the storage location for the variable: Then when compiling an operation of the type we get one command (from convolution 2) and remember (from convolution assignment). Everything is true, simple and natural. The problem arose when, on the right side of the assignment, not an intermediate value was used, but something long-lived: for example, another variable.
        ID '=' EXPR       { $$ = $3;
                                emit(command::add, vars[$1], $3, 0);
                                vars[$1] = $3; // new var

    a=2;MOV R1, 2vars["a"]=R1

    a = 2;
    b = a;

    In the second convolution of the assignment, a new code is not generated - we only remember vars["b"]=R1
    Both variables were in the same register, and when one of them changes, the second will also change.

    The solution lies on the surface: for each new variable we start a new register. Then from get already a couple of teams
        ID '=' EXPR       { $$ = $3;
                            if(!vars[$1]) vars[$1] = newreg();
                            emit(command::add, vars[$1], $3, 0);

    MOV R1, 2
    ADD R2, R1, 0
    and remember vars["a"]=R2
    if it is followed b = a;- it will be added to the code MOV R3, R2, 0, and vars["b"]=R3

    other words, is now in the generated code will be a lot of extra copy from register to register, and the register assignment will run more slowly - because of the fact that the number of registers used is growing.
    We will try to find those cases where copying is unnecessary (for example, if the variable from the right side of the assignment does not change further), and we will correct the commands that use the copy so that they use the original.

    Cleaning copies

    Copy elimination is one of the simple optimizations that I promised from the very first post of the series. As for the optimizations performed during register assignment, data-flow analysis is convenient for cleaning. An important difference from the two previous applications of DFA (going back to detect live registers, going forward to find saved registers) will be that at each point we store not one set of registers, but the division of all registers into sets of the same ones. You can look at this as a more general case of DFA than the two previously discussed. (Before, registers were always divided into two sets - “included” and “excluded.”)

    How will we split the registers?
    • the command ADD RA, RB, 0carries over RAinto the set k RB;
    • any other case-altering command puts it into a singleton set;
    • RAand RBthey will be together in team C if they are together in all teams from which you can go to C (directly or by jump).
    After receiving the final splitting, it will be seen where the original can be used instead of the copy: RAyou can use instead RBif they are together in all the commands where it is used RA.

    And one more subtlety: since during the transitions from team to team we do not build up sets, but truncate (we intersect all incoming sets), before starting DFA, you need to initialize sets not into empty, but into comprehensive ones - and as the algorithm works, the sets are truncated, as necessary. In order not to spend money and not really hold in every team the set of all existing registers, we will agree to consider the “missing iterator” pointing to such a comprehensive set.

    For convenience, we format the three operations we need on partitions into a class. In the partition, we store the list of sets into which the registers are divided (except for the “global” set, in which the registers are all together initially), and for each register (except those in the “global” set), an iterator of the set in which it is located . I had some strange problems

    with wood std::set, so I wrote myself a container bit::setwith a similar interface, but with the std::bitsetinside. By the template parameter, it takes the maximum key value in the set.
    At the same time, bit::setstandard operations on sets ( &=, |=) are taken out . In adding a new field is now the usual way implement DFA, using the ready-operation:

    typedef bit::set regset;

    class regPartition {
        typedef std::list regsets;
        regsets sets;
        std::map byreg;
    // изначально: все регистры в "глобальном" разбиении

        // возвращает: изменилось ли разбиение
        bool add(regnum copy, regnum orig) {
            if (byreg.count(copy)) {
                if(byreg[copy]==byreg[orig]) // уже вместе
                    return false;
                // был последним?
            byreg[copy] = byreg[orig];
            return true;

        void remove(regnum r) {
            if (byreg.count(‌r)) {
                if(byreg[r]->size()==1) return; // уже один
            byreg[r] = sets.insert(sets.end(), regset());

        // возвращает: изменилось ли разбиение
        bool isect(/*const*/regPartition& p, const command& self) {
            bool changed = false;
            // оставить вместе только тех, кто вместе и в p
            foreach(i, byreg) {
                regnum r = i->first;
                // split by p.byreg[r]
                regsets::iterator withR = i->second,
                             withoutR = sets.insert(sets.end(), regset());
                foreach2(j, (*withR), next)
                    // не разделяем, если текущая команда -- mov с теми же регистрами:
                    //   всё равно на следующем шагу объединятся
                    if(!(self.opcode==command::add && !self.src2 &&
                        ((self.src1==r && self.dest==*j)||(self.dest==r && self.src1==*j)))
                    &&((!p.byreg.count(‌r) && p.byreg.count(*j)) ||  // R in global, J isn't
                        (p.byreg.count(‌r) && !p.byreg[r]->count(*j)))) { // R not in global
                            byreg[*j] = withoutR;
                if(!withoutR->size()) // ничего не отделилось
                else // разделили
                    changed = true;
            // теперь те, что в глобальном в this, но не в p
            foreach(i, p.sets) {
                regset inP;
                foreach(j, (*i))
                    if(!byreg.count(*j)) inP.insert(*j);
                if(inP.size()) {
                    regsets::iterator newSet = sets.insert(sets.end(), inP);
                    foreach(j, inP) byreg[*j] = newSet;
                    changed = true;
            return changed;

    // применение разбиения: формируем для r множество возможных замен (во всём коде)
        void apply(regnum r, regset& global) {
            if (!r) return; // may not be replaced
            if(!global.size()) // uninitialized set
                global = *byreg[r];
            else // initialized; intersect
                global &= *byreg[r];

    struct commandnregPartition copies;

    void copies() {
            // а) вычисляем разбиения для каждой команды
            bool changed = false;
            foreach(i, pcode) {
                i->copies = regPartition();
                // rule 2
                if (writesdest(i)) {
                    changed = true;
            while (changed) {
                changed = false;
                foreach2(i, pcode, next) {
                    // rule 1
                    if (i->cmd.opcode==command::add && !i->cmd.src2)
                        changed |= i->copies.add(i->cmd.dest, i->cmd.src1);
                    // rule 3 (next command)
                    if (hasnext(i))
                        changed |= next->copies.isect(i->copies, next->cmd);
                    // rule 3 (jmp target)
                    if (i->cmd.opcode==command::jz)
                        changed |= i->tgt->copies.isect(i->copies, i->tgt->cmd);
            // б) вычисляем возможные замены во всём коде
            std::vector copies(lastreg+1);
            foreach(i, pcode) {
                    i->copies.apply(i->cmd.dest, copies[i->cmd.dest]);
                if(has2src(i)) {
                    i->copies.apply(i->cmd.src1, copies[i->cmd.src1]);
                    i->copies.apply(i->cmd.src2, copies[i->cmd.src2]);
            // в) объединяем копии
            for(regnum r=1; r<=lastreg; r++) {
                if(copies[r].size()) {  // остались возможные замены?
                    regnum s = *(copies[r].begin()); // заменим r на s
                    foreach(i, pcode) { // во всём коде
                            i->cmd.dest = s;
                        if(has2src(i)) {
                            if(i->cmd.src1==r) i->cmd.src1 = s;
                            if(i->cmd.src2==r) i->cmd.src2 = s;
                        if(i->cmd.opcode==command::add && i->cmd.src1==i->cmd.dest && !i->cmd.src2) // self-mov
                    foreach(c, copies)  // и в векторе замен
                        if(c->count(‌r)) {
    We copies();insert the call at the very beginning of the optimization cycle, before checking liveliness.

    What happened?

    Compared to the last time, the code was reduced by a couple more commands:
    00 mov r01,0
    01 mov r02, 0x3e8
    02 echo 0x126
    03 echo r01
    04 echo 0xa0
    05 echo r02
    06 echo 0xa7
    07 le r03, r01, r02
    08 jz r03, + 0x1d (= 0x26)
    09 add r03, r01, r02
    0a mov r04, 2
    0b div r03, r03, r04
    0c echo 0x162
    0d echo r03
    0e echo 0xcc
    0f input r04
    10 store r01, 1
    11 mov r01, 1
    12 eq r01, r04, r01
    13 jz r01, +4 (= 0x18)
    14 load r01, 1
    15 mov r02, 1
    16 sub r02, r03, r02
    17 jz 0, -0x11 (= 0x7)
    18 mov r01, 2
    19 eq r01, r04, r01
    1a jz r01, +3 (= 0x1e)
    1b mov r01, 1
    1c add r01, r03, r01
    1d jz 0, -0x17 (= 0x7)
    1e load r01, 1
    1f mov r03, 3
    20 eq r03, r04, r03
    21 jz r03, +2 (= 0x24)
    22 echo 0x146
    23 hlt
    24 echo 0x16a
    25 jz 0, -0x1f (= 7)
    26 echo 0xff
    27 hlt

    It might seem that about the disappeared teams ( add r01, r01, 0and add r02, r02, 0) it was immediately obvious that they were meaningless. In fact, these commands took a meaningless form only after the appointment of physical registers, i.e. at the very last stage before outputting the finished p-code. Until then, the numbers of the n-registers of the operands were different; only the analysis that we just performed allowed us to combine them and delete the copying that made senseless — all this long before the appointment of physical registers.

    Folding constants

    Another standard optimization, which, like the previous ones, is implemented using DFA, is constant folding . The principle is very simple: if the values ​​of the operands are known, then the operation can be performed immediately at compilation. For example, instead of code
    MOV R1, 2
    MOV R2, 3
    ADD R3, R1, R2
    we can generate immediately
    MOV R3.5

    Manipulating constants are not necessarily indicative of programmer negligence POLENOV calculate beforehand known result: for example, pixels=1024*768;it is easier to read and maintain than pixels=786432;

    this time, in each instruction store a plurality of registers for which values are known, - together with the values: as As usual, we formulate three rules for calculating sets:std::map
    • in a command, the MOV R, Xvalue of R is known, and this is X ;
    • in any other command that sets the value of R, this value is unknown;
    • in team C, the value of R is known if it is known and the same in all teams from which you can go to C (directly or by jump).
    We see again: the direction of the passage is forward (its value in the next depends on the value of the register in the previous command); operation in nodes - a union of unknown registers.

    When the sets stabilize, we can replace each operation, both operands of which are known, with MOV.

    The same data will allow us to perform another optimization - constant propagation (substituting a known value instead of a register reference). This optimization is not possible with the p-code format chosen by us, because there are no operations on the register and constant in it; such operations, however, are present in many real processors, so it will be possible to perform full-fledged “constant substitution” when generating executable code. Now we restrict ourselves to replacing the zero value with R0.

    For example, a type construct if (1>2) { echo("unreachable"); }that compiles to
    MOV R1, 1
    MOV R2, 2
    GT R3, R1, R2
    JZ R3, label
    ECHO "unreachable"
    will turn at the stage of folding constants in
    MOV R1, 1
    MOV R2, 2
    MOV R3, 0
    JZ R3, label
    ECHO "unreachable"
    and the optimization “destruction of non-living code” already implemented by us last time will remove the first two teams MOV.
    If at the same time we replace the zero value with R0:
    MOV R3, 0
    JZ R0, label
    ECHO "unreachable"
    then along with the non-living code the last one will be deleted MOV, and the “destruction of the unreachable code” will also delete ECHO, turning it JZinto NOP.

    Similarly, you can remove from code JZwith a known non-zero value. The second implemented “special case” is the replacement of commands ADD RX, (0), RYby ADD RX, RY, R0so that the copy cleaning algorithm recognizes copying from register to register in this command.

    Another benefit of folding constants is that negative values ​​can now be used in our commands. Due to the fact that in the lexer we set the token with NUMregexp [0-9]+, strings like "-123" were interpreted as the unary minus and then the literal 123; so they compiled into p-code like
    MOV R1, 123
    SUB R2, R0, R1
    Now in the p-code there will be an honest team MOV R1, -123.


    struct commandnIt is supplemented by a couple of fields: The basis of optimization, as in previous cases, is DFA: The procedure implements the union of sets of unknown registers: a register with several known values ​​is declared unknown. The last thing left is the actual calculation of the value when the operands are known. In the same way as in the j-script executor, we start a function for each opcode: Insert a call in front of it and finish with optimization. In the next post - collect from p-code-set of physical registers real executable code for x86 / x64.
    std::map known; regset unknown;

    void constp() {
            bool changed = false;
            foreach(i, pcode) {
                i->known.clear(); i->unknown.clear();
                if (i->cmd.opcode==command::mov) { // rule 1
                    i->known[i->cmd.dest] = i->cmd.imm;
                    changed = true;
                } else if(writesdest(i)) { // rule 2
                    changed = true;
            while(changed) {
                changed = false;
                foreach2(i, pcode, next) {
                    // rule 3 (next command)
                    if (hasnext(i))
                        changed |= propagate(i, next);
                    // rule 3 (jmp target)
                    if (i->cmd.opcode==command::jz)
                        changed |= propagate(i, i->tgt);
            // заменим известные значения
            foreach(i, pcode) {
                i->known[0] = 0; // R0 известен всегда
                if(has2src(i) && i->known.count(i->cmd.src1) && i->known.count(i->cmd.src2))
                    i->cmd = command(command::mov, i->cmd.dest, ops[i->cmd.opcode](i->known[i->cmd.src1],i->known[i->cmd.src2]));
                // подставляем 0
                if(has2src(i)) {
                    if(i->known.count(i->cmd.src1) && !i->known[i->cmd.src1])
                        i->cmd.src1 = 0;
                    if(i->known.count(i->cmd.src2) && !i->known[i->cmd.src2])
                        i->cmd.src2 = 0;
                    if(i->cmd.opcode==command::add && !i->cmd.src1) { // чтоб распознавалось как копирование
                        i->cmd.src1 = i->cmd.src2;
                        i->cmd.src2 = 0;
                if(readsdest(i) && i->known.count(i->cmd.dest))
                        i->cmd.dest = 0;
                    else // значение известно, но это не 0
                        if(i->cmd.opcode==command::jz) nopOut(i);

    bool propagate(pcommandn from, pcommandn to) {
            bool changed = false; // возвращает: изменились ли множества
            // проверяем известные значения
            foreach(i, from->known) {
                regnum r = i->first;
                    if((to->known[r]!=i->second) // другое, и не заданное правилом 1
                        &&!((to->cmd.opcode==command::mov) && (to->cmd.dest==r))) {
                            changed = true;
                    } else; // значение известное и верное
                else if(!to->unknown.count(‌r)) { // по умолчанию, известно
                    changed = true;
            // объединяем неизвестные
            foreach(r, from->unknown)
                if(!to->unknown.count(*r)) {
                    changed = true;
            return changed;

    int hlt(int src1, int src2) { assert(false); return 0; }
    int add(int src1, int src2) { return src1+src2; }
    int sub(int src1, int src2) { return src1-src2; }
    int lt(int src1, int src2) { return src1 int (*ops[])(int, int) = {hlt, hlt, hlt, hlt, hlt, hlt, hlt, add, sub, mul, div, eq, ne, ge, le, gt, lt};


    Also popular now: