The shortest introduction to compiler creation

  • Tutorial

Here I tried to show in practice what are some important concepts from the field of compiler creation. It is likely that such 15-minute completed stories may be a good way to dive into complex topics. Only it would be good not to passively read what is presented below, but also to check the code in the work.


If the first experience proves successful, then in the future you can expect other 15-minute "sketches" on the subject of compilers.


What will be discussed


Let's make an arithmetic expressions compiler. One that will translate the source text in the reverse Polish form of the record (it is also called RPN or POLIZ) into the intermediate code working with the stack. But we will do without interpreters. Then we immediately translate the result into a representation in the C language. That is, we will have a compiler from RPN to C.


By the way, we will write the compiler in Python. But let it not stop those who prefer some other programming language. Here is a useful exercise for you: translate the code in your favorite language. Or use the already prepared translation:


Implementation on F # (by gsomix ):
first version of
SSA version


Let's start with parsing


def scan(source):
  tokens = source.split()
  return [("Push", int(x)) if x[0].isdigit() else ("Op", x) for x in tokens]

What did we do here? The scan function receives a string from the user in the reverse Polish notation ("2 2 +").


And at the output we get its intermediate presentation. Here it is, for example:


[
  ('Push', 2),
  ('Push', 2),
  ('Op', '+')
]

So, we already got the compiler. But it is very frivolous. Recall that initially it was about the C code.


Let's do the broadcast in C


def trans(ir):
  code = []
  for tag, val in ir:
    if tag == "Push":
      code.append("st[sp] = %d;" % val)
      code.append("sp += 1;")
    elif tag == "Op":
      code.append("st[sp - 2] = st[sp - 2] %s st[sp - 1];" % val)
      code.append("sp -= 1;")
  return "\n".join(code)

What's going on here? Let's look at the output of this function (in the same example with "2 2 +").


st[sp] = 2;
sp += 1;
st[sp] = 2;
sp += 1;
st[sp - 2] = st[sp - 2] + st[sp - 1];
sp -= 1;

Yes, it already looks like C code. The st array plays the role of a stack, and sp its pointer. Usually virtual stack machines work with these things.


That's just the machine itself - we don’t have an interpreter. There is a compiler. What is left for us? You need to add the necessary framing for the C program.


Our first off-the-shelf compiler


ST_SIZE = 100
C_CODE = r"""#include <stdio.h>
int main(int argc, char** argv) {
int st[%d], sp = 0;
%s
printf("%%d\n", st[sp - 1]);
return 0;
}"""
def scan(source):
  tokens = source.split()
  return [("Push", int(x)) if x[0].isdigit() else ("Op", x) for x in tokens]
def trans(ir):
  code = []
  for tag, val in ir:
    if tag == "Push":
      code.append("st[sp] = %d;" % val)
      code.append("sp += 1;")
    elif tag == "Op":
      code.append("st[sp - 2] = st[sp - 2] %s st[sp - 1];" % val)
      code.append("sp -= 1;")
  return "\n".join(code)
def rpn_to_c(source):
  return C_CODE % (ST_SIZE, trans(scan(source)))
print(rpn_to_c("2 2 +"))

It remains to compile the output of this program by the C compiler.


Are you still ready to continue? Then let's discuss what we did. There is one dubious point - our compiler translates constant expressions, and in fact they can be calculated simply at the compilation stage. It makes no sense to translate them into code. But let's assume for now that some arguments may fall into the stack from the outside. Let us dwell on the fact that the practical meaning of our development can be given later. Now it is important to get a general idea about the construction of simple compilers, right?


Compiler using the SSA form


Do you like the headline? SSA - it sounds very solid for any compiler. And we will now use this very SSA. What is it? Let's move in order.


We are currently generating C code without any virtual machines. But why do we need a rudiment in the form of operations with the stack? Let's replace these operations with ordinary variables from C. Moreover, we will not save variables - for each expression we will get a new name. Let the C compiler deal with all of this. It turns out that each variable is assigned a value only once. And this, by the way, is the form of SSA .


Here is our new compiler.


C_CODE = r"""#include <stdio.h>
int main(int argc, char** argv) {
%s
printf("%%d\n", %s);
return 0;
}"""
def scan(source):
  tokens = source.split()
  return [("Push", int(x)) if x[0].isdigit() else ("Op", x) for x in tokens]
def trans(ir):
  stack, code = [], []
  name_cnt = 0
  for tag, val in ir:
    if tag == "Push":
      code.append("int t%d = %d;" % (name_cnt, val))
      stack.append("t%d" % name_cnt)
      name_cnt += 1
    elif tag == "Op":
      a, b = stack.pop(), stack.pop()
      code.append("int t%d = %s %s %s;" % (name_cnt, b, val, a))
      stack.append("t%d" % name_cnt)
      name_cnt += 1
  return "\n".join(code), stack.pop()
def rpn_to_c(source):
  return C_CODE % trans(scan(source))
print(rpn_to_c("2 2 +"))

Please note - the stack in the C code is no longer there, and working with it is simulated during the translation process. The stack used in the compilation process does not contain values, but variable names.


Here is the final result:


#include <stdio.h>
int main(int argc, char** argv) {
int t0 = 2;
int t1 = 2;
int t2 = t0 + t1;
printf("%d\n", t2);
return 0;
}

Results


It seems that the time of our joint activities has expired. We were committed to translating the program from one language to another. This is called a source-to-source broadcast. Or - just a translation, which can be considered synonymous with compilation, but usually the compiler translates the program from a high-level representation to a low-level one. There is another buzzword "transpilator" to refer to the source-to-source translator. But the mention of a "transpiler" can be annoying for compilers, be careful!


Experiment with the code. Waiting for feedback!


Also popular now: