Part 1/3. The ideal VM compiler for ICFPC 2009, on Haskell, with popularizing comments

    Here we will spell a certain program (VM compiler) on Haskell. This compiler is given a binary file with instructions from a certain processor, where some calculations are described in these instructions. The output of our compiler is the text of the program, also on Haskell, which performs the same calculations at high speed. Perhaps this is not a compiler, but a decompiler, I do not know. A comparison of the work of the resulting Haskell / Java programs is given in a previous post .



    Warning: people with formal mathematical thinking are forbidden to read ! You already know everything. This post is for people who are interested in something else beyond the usual C / C ++ / C # / Java, because in the last 5 years they are tired of writing the same thing in these wonderful languages.

    About the initial task of the ICFP contest again, with technical details: we are given a binary file in which the program (calculating formulas) written in the machine instructions of some processor. There are several formulas at once, therefore, when calculating formulas, a special pool is used for the returned values. There is a pool for input values. And there is a pool for storing the values ​​required during the calculation of formulas, and for values ​​that live from iteration to iteration (i.e. when the program is run many times, it can use data from the last run). The input pool in terms of the problem is called a set of input ports, similarly, there are output ports, everything else is called simply “memory”.

    The calculations are performed as follows: through additions, subtractions and other arithmetic operations on data from the input ports and memory, a value is obtained that is written to one of the output ports. According to the authors, the binary file from the task is a program for the “black box”, which is able to calculate something for us.

    Implementing this black box so that it can be calculated quickly generally is our first goal in ICFPC 2009, and the rest (to calculate the control actions on the satellite so that it flies correctly in the world described by these formulas) is the main goal of the contest, which we We will not consider this post, or maybe in another.

    In this binary, the original data was mixed up with the code as follows:
    Displacement (dec)
    Length
    Content
    A type
    O
    8
    data (cell 0)Double
    8
    4
    opcode (cell 0)UInt32
    12
    4
    opcode (cell 1)UInt32
    16
    8
    data (cell 1)Double
    then the same order is repeated for cells 2,3, then 4,5, etc. The specification says: for even cells, data first goes, then an opcode, for odd cells, vice versa. Why is there data here, why do data other than zero exist at all? Very simple, 1) it is impossible to hammer a constant in opcodes, a variable (memory cell) is used, which is not modified by anyone 2) some variables are initialized, and they also lie in the data.

    In order to compile such a binary into a beautiful text in any language, you must:

    1) read the binary into the buffer
    2) separate the data from the opcodes
    3) build AST from opcodes (AST in these terms is a disassembled opcode decomposed into a form convenient for manipulation, in the original it means Abstract Syntax Tree, an abstract tree because it contains the essence of the program, and not its textual or binary representation)
    4) transform AST in another form (higher level)
    5) display AST in text form in the required language, as well as derive the data definition with its original value.

    Next is a program with comments for people completely far from Haskell, so some points in the explanations are deliberately simplified and even slightly distorted (marked separately), for ease of perception.

    Connection of some standard modules:

    import System.IO

    import Foreign.Marshal.Alloc

    import Foreign.Ptr

    import Data.Word

    import Foreign.Storable

    import Debug.Trace

    import Data.Bits

    import Data.List

    import Data.Maybe

    Description of type synonyms, for details of the following data definitions: it says that the source address in the command is of type Int, as well as the destination address.

    type DestAddr = Int

    type SrcAddr = Int

    Now (below) comes the AST element, that is, the decoded command from the specification. Commands usually contain addresses of input data (or several pieces, for example, an addition command) - the type SrcAddr is mentioned, and one address to which they write the result (DstAddr is mentioned).

    In our task with AST, different things are done: several teams are replaced by one that is not in the specification, then all commands are replaced by some kind of tree in general, whose elements are the same commands, but from a different set (more on that later). It would be better to divide the sets of commands into different types, but in this case we have a dump here, and only comments help to separate the set from the set (see below).

    In Haskell, there is a so-called algebraic data type, which is generally everywhere, and it is very convenient for use in our case. We create an Op type (from the word opcode) and assign several constructors to it (these are not OOP constructors, but if you think in terms of OOP, then an empty Op class is declared here, a lot of classes derived from it, and each has its own data).

    data Op = 

    -- это конструкторы использующиеся в дереве. В частности, листья дерева:

     Const Double |

     ReadExp SrcAddr |  -- read main memory

     ReadVarExp SrcAddr | -- read temporary variable

     ReadPortExp SrcAddr | -- read port

                   -- узлы дерева:

              SqrtExp Op | SubExp Op Op | AddExp Op Op | MulExp Op Op | DivExp Op Op | -- math

     IfExp CmdOp Op Op Op  | -- branch: compare 1st op with zero, choose from other ops.

    -- плоские операции (из спецификации+дополнительные)

     Add DestAddr SrcAddr SrcAddr |

     Sub DestAddr SrcAddr SrcAddr |

     Mul DestAddr SrcAddr SrcAddr |

     Div DestAddr SrcAddr SrcAddr |

     Output SrcAddr SrcAddr |         -- copy from memory to out port.

     If CmdOp SrcAddr DestAddr SrcAddr SrcAddr |  --  compare mem with 0,choose,store

     Input DestAddr SrcAddr |   -- read src port into memory

     Sqrt DestAddr SrcAddr |

     Copy DestAddr SrcAddr |

    -- низкоуровневые плоские операции, от которых мы избавляемся практически сразу

     Cmpz CmdOp SrcAddr |   -- first part of "If"

     Phi DestAddr SrcAddr SrcAddr |    -- second part of "If"

     Noop DestAddr  

                                    deriving Show

    We see that the options we have are listed through the symbol "|", traditionally meaning "or." There is also a “deriving Show", meaning that if we want to translate some value of this type to a string, we will get a string on the Haskell, approximately close to the original. This construction means literally the following: “compiler, make the indicated type belong to the Show class, and show the method of the Show class, which translates from value to string, generate as you can.”

    Now about what we will have about the tree. Suppose in the resulting code the expression is
    generated : o32 = (i2 * 37) + (m75 * (m76 * t4))
    It means: put in the output port the value of the input port times 37 plus the product of two memory cells and one temporary variable. The expression to the right of the brackets is written in terms of our AST as follows: If you count the brackets, it should converge 8). What do we have in the leaves of a tree? Constants, input ports, memory, and temporary variables. How do we distinguish one from the other? After all, according to the specification, we only have input ports and memory? Where do the constants come from, for example?

         AddExp (MulExp (ReadPortExp 2) (ConstExp 37)) ...

             ... (MulExp (ReadExp 75) (MulExp (ReadExp 76) (ReadVarExp 4)))   -- lisp? ;)




    Very simple. If no one writes to our cell, but everyone reads from it, then this is a constant. It makes sense to remove it from “memory” and hammer it with a constant in the resulting code. Similarly, temporary variables are separated from constant variables, as well as variables that should not exist at all, that is, they must be converted directly into expressions with brackets (see a few below).

    We continue. The following is a piece of AST for conditional operations (comparison with zero, from the specification). There are five comparisons; according to the comparison results, a choice is made from one or another memory address.

    data CmdOp = Ltz | Lez | Eqz | Gez | Gtz deriving Enum

    Of interest is the deriving enum design. It automatically assigns this type to the Enum class, and the types of this class have a method that allows you to translate an instance of the type into an integer, and vice versa. Will be used in the next paragraph.

    Now let's define that the CmdOp type belongs to the Show class. This class is used to translate values ​​of some type to a string type, as mentioned above in the "deriving Show". In our case, we do not want the compiler to generate the conversion methods by default, but we will implement our own method (function).

    instance Show CmdOp where

    show cmdOp = ["<0","<=0","==0",">=0",">0"] !! fromEnum cmdOp

    Actually, the function is show, it takes the cmdOp parameter, returns a string, one from the list. The expression to the right of the equal sign means: transfer from the value to an integer (fromEnum, recall the deriving Enum), then use this integer as an index (function "!!") to select from the list (list in square brackets). Briefly and clearly, but O (N), so what. O (N) here means that to get the fifth element of the list, 5 iterations are required (because the list is used here, not the array), but for the Nth, therefore, N iterations, which is long. There is a simple opportunity to speed up by writing case or using an array, but I saved the lines of the program 8-)

    Next is a function that translates Op into the language we are compiling into. We compile to Haskell, and therefore the following introduction is required:

    Our black box at the input has input ports, and its previous state, and the output has its own state and output ports. And the binary file compiled in Haskell will contain one function that will have the same data at the input and output. We will return output ports as part of the structure. Thus, the compiled file will contain:

    1) data description: 2) the main function (discussed in detail below): 3) the function that returns the state of the black box at time 0 - that is, the data that we will feed it iterate over iteration:

    data VMState = VMState { m13, m55, m77, o11 :: Double }  -- выходные порты и persistent mem



    nextState :: VMState -> (Double,Double,Double...-> VMState

    getNextState x (i2,i3,i16000...=                                                                                       (1)

       let t1 = i2+i3 + m55 x;

            t2 = if (i-16000 > O) then t1*(i16000+1) else O

       in VMState { m13 = t1+t2*2, m55 = 2*i3..... o10 = (i2+22)*576000.0 }




    initState = VMState { m13=O, m55=123.456, ...o90 = O}

    Let's consider each row in points. The first declares the "structure" of the data. Its difference from C / Java structures is that its fields cannot be changed, it is only possible to get a new structure with some (or one, or all) changed values ​​of the current structure. It is written like this: Then, if desired, you can throw out the old data. The syntax used to read one field from the structure is: Why is it so strange? Because when you declare a structure, all functions for accessing its fields are automatically declared. In particular, a function appears implicitly (for example): This means that the function receives the VMState value as an input (in our example, we passed the whole newStruct at the top), and returns a numerical value with a floating point.

         let newStruct = oldStruct { field5 = 7, field10 = O }




         let myFieldValue = field5 newStruct



    field5 :: VMState -> Double 


    Along the way, we realized what the method signatures look like (method name :: type). We look at the main function. There are no ellipses in the Haskell, I added them to illustrate that we will declare a tuple of as many doubles as we need. The function takes 2 parameters, the first is VMState, the second consists of several values ​​(in our case, these are the values ​​of all ports), and this function returns the new VMState value. The main function of an ideal VM has a syntax similar to the following:

    nextState :: VMState -> (Double,Double,Double...-> VMState





    fun arg1 arg2 = 

      let bind1 = expr1; bind2 = expr2

      in exprResult

      where bind3 a1 a2 a3 = expr3

                 bind4 a5 a6 = expr4


    That is, we can define some reusable expressions in the "let" section, and then they are free to use in the resulting expression (which is in the "in"). We can also define the same expressions in the “where” section, only there they usually describe not variables, but functions that are local (apparently) to the enclosing function (fun). Functions can also be described in the let section, but this is less convenient. I add that from where the values ​​from the "let" section are not visible.

    Further, I note that the “if” construct in Haskell is an expression. There is no "for" construct in the language itself; it is in a separate module and is a normal function. As well as "while" and the like.

    In light of the foregoing, we continue to consider the binary compiler.

    The following function translates the AST element (of type Op) into a string representation in the language of the result (in Haskell). The first parameter is a line containing the name of the parameter with the initial state (because we will generate read operations from it), in the formula (1) it will be “x”. In fact, I could hardcode this name, because it occurs only once.

    ast2haskell :: String -> Op -> String

    ast2haskell x op = s op where 

    s (Const d) = show d  -- константа показывается как есть.

    s (ReadExp r1) = "m"++(show r1)++" "++x
    -- generate read access to memory

    s (ReadVarExp r1) = "t"++(show r1)
                             -- to temporary variable

    s (ReadPortExp r1) ="i"++(show r1)
                            --  to input port

    s (SqrtExp op') = "(sqrt "++s op'++")"

    s (AddExp op1 op2) = "("++s op1++"+"++s op2++")"

    s (SubExp op1 op2) = "("++s op1++"-"++s op2++")"

    s (MulExp op1 op2) = "("++s op1++"*"++s op2++")"

    s (DivExp op1 (Const x)) = "("++s op1++"/"++(show x)++")"

    s (DivExp op1 op2) = "(if "++s op2++" /= 0 then "++s op1++"/"++s op2++" else 0)"

    s (IfExp cond opc op1 op2) = "(if "++s opc++show cond++" then "++s op1++" else "++s op2++")"

    What do we have here? The familiar construction is fun param = expr where ...., but only one of the local functions is called right away. Why? Now let's figure it out. First, we just note that the function “s” (from “string”) has several definitions, one definition for each constructor of type Op. As a result, of course, only one function will be called, and immediately a specific object of type Op will break into its components, which will be given names:

    :: Op -> String

    s (AddExp op1 op2) = "("++s op1++"+"++s op2++")"


    This is an implementation of the s function for an Op type constructor named AddExp that has two parameters: two AST branches that add up. And the function s itself has one argument, so there are brackets around AddExp. The result of this function will be a string glued from pieces (operation ++ connects two lists; strings are also lists). Here brackets are glued on the sides, and between them are string representations of both branches, between which a plus is glued. The string representation of each branch, therefore, is calculated recursively; this is a common practice in such tasks. And finally, the show function translates the number into a string, and it is also used for CmdOp (a condition for comparing with zero), which also belongs to the Show class (described above).

    Note the following definition:

    s (DivExp op1 (Const x)) = "("++s op1++"/"++(show x)++")"

    s (DivExp op1 op2) = "(if "++s op2++" /= 0 then "++s op1++"/"++s op2++" else 0)"


    It is remarkable in that both functions work on DivExp data, but to select the first one it is necessary that the divisor in this node be a constant! A special, simpler code is generated for this case, because there is no need to check for 0.

    It follows that determining which function is called does not follow the “constructor type”, but is much more flexible: the input value takes turns trying “Crawl” into each implementation of the function with the required name, and the function into which it creeps is called. In this case, DivExp does not crawl into the first function, in which the second parameter is initialized with something other than Const. This is called Pattern Matching (not to be confused with it for strings).

    Now about what the "s" function was made for. Exclusively for convenience. Compare 2 pieces, with it: and without it: Conclusion: the code with the function “s” is shorter, and you do not need to constantly drag this “x”, passing it on.

    ast2haskell x op = s op where 

    s (SubExp op1 op2) = "("++s op1++"-"++s op2++")"

    s (MulExp op1 op2) = "("++s op1++"*"++s op2++")"




    ast2haskell x (SubExp op1 op2) = "("++ast2haskell x op1++"-"++ast2haskell x op2++")"
    ast2haskell x (MulExp op1 op2) = "("++ast2haskell x op1++"*"++ast2haskell x op2++")"




    Let us now consider a new topic - how decoding of an opcode into an operation is performed, which is represented by the Op type in our work. Each operation initially lies at some address in the binary file. For most operations, the same address is the destination address for the result of the operation (the result of addition and other operations - everything is stored in the “memory” at this address, for each operation its own). That is why we pass a pair of “opcode, its address” into the function, and at the output we have an already initialized operation, in which, in addition to the destination address, additional arguments are initialized, for each operation its own.

    In the specification of the opcodes, they are decoded in cascade, that is, if the highest part of the opcode is zero, then the next piece looks, otherwise it is the finished opcode. The highest part is the four high bits, i.e. 4 bits up, starting from the 28th bit (28, 29, 30 , 31).

    To bite bits out of Word32, we have defined a function (.%.) Here that can be used as an operation (infix, that is, stand between two operands, like + or *). The Haskell syntax allows you to define a function with a name of any kind of character, for example, "++++" It is defined this way: or otherwise: it is used like this:

    (++++) a b = sqrt(a*+ b*b)



    ++++ b = sqrt(a*+ b*b)



    let q = 3 ++++ 4

    let p = (++++) 3 4


    Thus, it turns out very convenient. Also, you can make any of the “ordinary” functions of the two arguments infix, see the example below with 'shiftL' (left shift). For these purposes, the function name is taken in inverted commas and placed between operands. There are all sorts of priorities for infix operators, but this is not our topic now, but the function itself (.%.) Will be defined below.

    disasm :: (Word32, Int) -> Op

    disasm (opcode, addr) = disasm1 (opcode .%. (28, 4))  -- декодируется первый "каскад"
                                             -- заданный четырьмя старшими битами

       where

           r1 = recast $ opcode .%. (14,14)  -- а это выкусываем из опкода сразу

           r2 = recast $ opcode .%. (O,14)  nbsp; -- два аргумента

           r1' =    -- а это для красоты для второго каскада

           disasm1 i = [disasm2 $ opcode .%. (24,4) , Add addr r1 r2,Sub addr r1 r2,Mul addr r1 r2, Div addr r1 r2,Output r1 r2,Phi addr r1 r2] !! recast i

    What happens in disasm1? Again definition of the list (square brackets) with selection by index (operation "!!"). The recast function is described below, it converts the sign to unsigned, because we contacted Word32, and not just Int - and Haskell doesn’t really like to interfere with them, this is not C. Well, what about the list in question? In the zero index, we are sent to decode the second stage (the operation code is encoded with 4 bits starting from 24 bits up). Next are the operations to add, subtract, and further according to the specification. Operation Phi will be discussed separately later.

    disasm2 i = [Noop addr, Cmpz (decodeImm (opcode .%. (21,3))) r1',

    Sqrt addr r1',Copy addr r1',Input addr r1'] !! recast i

    The second cascade, which is also the final one, consists of the same meat. In the specification, r1 for these operations lies in the same place as r2 for the first stage, which is why there is a synonym for r1 ', for beauty. There is also a selection from the list by index. The first (starting from zero) contains a complex expression - the Cmpz constructor, for which the argument is calculated by special decoding from an opcode piece (type of comparison with zero):

    decodeImm i = [Ltz,Lez,Eqz,Gez,Gtz] !! recast i -- а могли бы написать ту же функцию и через enum (благо он определен для этого типа).

    And here are the bit-biting operations themselves - a live broadcast from C.

            mask x n = x .&. ((1 `shiftL` n) - 1)  -- маскирование = отрезание n младших битов из x

            (.%.) w32 (start,len) = (w32 `shiftR` start) `mask` len  -- сдвиг, затем маскирование

    And this function (recast) converts from signed to unsigned and vice versa, depending on the context. It is not very effective, but it works fine in our task. Types are not explicitly specified here, only their class (Integral) is indicated. Here, the Haskell will perfectly substitute the necessary types in each case (Int or Word32) and for each of them there are operations fromInteger / toInteger. Note that types are deduced during compilation, and not in runtime (I recall this just in case).

    recast :: (Integral a, Integral b) => a -> b

    recast x = (fromInteger . toInteger) x  -- то же, что "fromInteger (toInteger x)", только заумнее

    and you can also write like this:

    recast x = fromInteger $ toInteger x -- это совсем на заумно, а придумано специально

       -- для тех, кто не любит скобки: выражение справа от

       -- знака $ скармливается целиком как последний

       --  аргумент (или как единственный) функции слева.

    and you can also write like this:

    recast = fromInteger . toInteger   -- совсем заумно, зато жуть как коротко.
      -- Ну, а затем проходит время, и такие конструкции
      -- читаются на ура


    End of 1 part (size limit), the second part will follow .

    Also popular now: