zserge December 1, 2011 at 14:49

We write a primitive and useless compiler

I believe that every programmer should write his own compiler.

For a long time I myself believed that the creation of compilers is the lot of the elite, and a simple mortal programmer can not comprehend this science. I will try to prove that this is not so.

In a post we will look at how you can write your own C-like language compiler in less than an hour, writing down just 300 lines of code. As a bonus, this also includes the virtual machine code, in the bytecode of which the source will be compiled.

The compiler will be written in Python. I really love this language, but the code can be clumsy in places. If something is wrong - write in a personal.

Yes, the compiler is completely blatantly based on Tiny-C.

Grammar

Before you begin, let's decide what exactly our language will be able to do.
He will be able to do very little:

- the only data type is int
- all variables are global. There are 26 variables in total (az)
- from mathematical operations, only "+" and "-"
are supported - the only comparison operator is "<"
- from language constructs - conditional statements if, if / else, while, do / while

All. No arrays, functions, structures. Here is the grammar obtained in our language: This is a grammar record in the form of EBNF . This is what this entry roughly means. A program is a single statement.

 ::=  ::= "if"  |
                 "if"  "else"  |
                 "while"  |
                 "do"  "while"  |
                 "{" {  } "}" |
                  ";" |
                 ";"
 ::= "("  ")"
 ::=  |  "="  ::=  |  "<"   ::=  |  "+"  |  "-"  ::=  |  |    ::= "a" | "b" | ... | "z"
  ::= , {  }
 ::= "0" | "1" | ... | "9"

Operators are conditional (if..else ...), cyclic (while ...), and simply operators (eg, “a = 2 + 3”).
Conditional and cyclic contain a condition expression and a body (in the form of an operator). Normal operators end with a semicolon. You can group the operators in braces, then we get a compound operator.
Expressions are either the sum or the assignment of a variable to a value.
Here, the sum is a sequence of addition-subtraction of terms.
A term is a number, variable, or expression in parentheses.
Variables are characters a through z. Numbers are a set of numbers from 0 to 9.

All these difficulties are needed in order to tell the compiler how to automatically analyze the source code. For example, we met the word “if” - it means that the expression in parentheses follows, followed by the operator.

Lexical analyzer

The compiler receives a text file (source). A lexical analyzer is needed in order to extract tokens in this file. Those. the line “a = 3 + 42;” the lexical analyzer should present in the form “identifier: a”, “operator =”, “number 3”, “operator +”, “number 42”, “character;”. The lexical analyzer only works with a sequence of letters, i.e. for him, the line “ab if =” also makes sense and is absolutely correct.

So, our lexical analyzer should recognize the following tokens:

- numbers
- variable identifiers
- keywords: if, else, while, do
- +, -, <, =, {,}, (,) ,;
- end of file

Here is what its source code looks like:

class Lexer:
	NUM, ID, IF, ELSE, WHILE, DO, LBRA, RBRA, LPAR, RPAR, PLUS, MINUS, LESS, \
	EQUAL, SEMICOLON, EOF = range(16)
	# специальные символы языка
	SYMBOLS = { '{': LBRA, '}': RBRA, '=': EQUAL, ';': SEMICOLON, '(': LPAR,
		')': RPAR, '+': PLUS, '-': MINUS, '<': LESS }
	# ключевые слова
	WORDS = { 'if': IF, 'else': ELSE, 'do': DO, 'while': WHILE }
	# текущий символ, считанный из исходника
	ch = ' ' # допустим, первый символ - это пробел
	def error(self, msg):
		print 'Lexer error: ', msg
		sys.exit(1)
	def getc(self):
		self.ch = sys.stdin.read(1)
	def next_tok(self):
		self.value = None
		self.sym = None
		while self.sym == None:
			if len(self.ch) == 0:
				self.sym = Lexer.EOF
			elif self.ch.isspace():
				self.getc()
			elif self.ch in Lexer.SYMBOLS:
				self.sym = Lexer.SYMBOLS[self.ch]
				self.getc()
			elif self.ch.isdigit():
				intval = 0
				while self.ch.isdigit():
					intval = intval * 10 + int(self.ch)
					self.getc()
				self.value = intval
				self.sym = Lexer.NUM
			elif self.ch.isalpha():
				ident = ''
				while self.ch.isalpha():
					ident = ident + self.ch.lower()
					self.getc()
				if ident in Lexer.WORDS:
					self.sym = Lexer.WORDS[ident]
				elif len(ident) == 1:
					self.sym = Lexer.ID
					self.value = ord(ident) - ord('a')
				else:
					self.error('Unknown identifier: ' + ident)
			else:
				self.error('Unexpected symbol: ' + self.ch)

In the next_tok () method, the analyzer receives the next token. The type of token can be obtained from the sym attribute, and its value (if it is a variable or number) from the value attribute.

The analyzer ignores spaces, checks whether the current character is a special character of the language, if not, checks whether it is a number or an identifier. Those. having met digit 1, the analyzer will continue to subtract characters until it encounters a non-numeric character. Identifiers are checked in the same way (instead of letter numbers there).

Parser

This is probably the most complex component of our compiler. His task, using tokens received from the lexical analyzer, is to form a kind of tree in which the hierarchy and relations display the structure of the code. For example:

"if (a < 0) a = 5;"
IF
+-LESS
|  +-VAR(a)
|  +-NUM(0)
+-SET
  +-VAR(a)
  +-NUM(5)

A tree that is built by the parser consists of nodes. A node has a type (IF, LESS, SET, VAR ...), a value (if it is a number or a variable), and several child operand nodes (in the code - op1, op2, op3). For if, the condition and then / else branches are stored in them; for loops, the condition and body of the loop are stored.

To describe the nodes, we introduce the class Node. Here is the code for the parser class and Node class:

class Node:
	def __init__(self, kind, value = None, op1 = None, op2 = None, op3 = None):
		self.kind = kind
		self.value = value
		self.op1 = op1
		self.op2 = op2
		self.op3 = op3
class Parser:
	VAR, CONST, ADD, SUB, LT, SET, IF1, IF2, WHILE, DO, EMPTY, SEQ, EXPR, PROG = range(14)
	def __init__(self, lexer):
		self.lexer = lexer
	def error(self, msg):
		print 'Parser error:', msg
		sys.exit(1)
	def term(self):
		if self.lexer.sym == Lexer.ID:
			n = Node(Parser.VAR, self.lexer.value)
			self.lexer.next_tok()
			return n
		elif self.lexer.sym == Lexer.NUM:
			n = Node(Parser.CONST, self.lexer.value)
			self.lexer.next_tok()
			return n
		else:
			return self.paren_expr()
	def summa(self):
		n = self.term()
		while self.lexer.sym == Lexer.PLUS or self.lexer.sym == Lexer.MINUS:
			if self.lexer.sym == Lexer.PLUS:
				kind = Parser.ADD
			else:
				kind = Parser.SUB
			self.lexer.next_tok()
			n = Node(kind, op1 = n, op2 = self.term())
		return n
	def test(self):
		n = self.summa()
		if self.lexer.sym == Lexer.LESS:
			self.lexer.next_tok()
			n = Node(Parser.LT, op1 = n, op2 = self.summa())
		return n
	def expr(self):
		if self.lexer.sym != Lexer.ID:
			return self.test()
		n = self.test()
		if n.kind == Parser.VAR and self.lexer.sym == Lexer.EQUAL:
			self.lexer.next_tok()
			n = Node(Parser.SET, op1 = n, op2 = self.expr())
		return n
	def paren_expr(self):
		if self.lexer.sym != Lexer.LPAR:
			self.error('"(" expected')
		self.lexer.next_tok()
		n = self.expr()
		if self.lexer.sym != Lexer.RPAR:
			self.error('")" expected')
		self.lexer.next_tok()
		return n
	def statement(self):
		if self.lexer.sym == Lexer.IF:
			n = Node(Parser.IF1)
			self.lexer.next_tok()
			n.op1 = self.paren_expr()
			n.op2 = self.statement()
			if self.lexer.sym == Lexer.ELSE:
				n.kind = Parser.IF2
				self.lexer.next_tok()
				n.op3 = self.statement()
		elif self.lexer.sym == Lexer.WHILE:
			n = Node(Parser.WHILE)
			self.lexer.next_tok()
			n.op1 = self.paren_expr()
			n.op2 = self.statement();
		elif self.lexer.sym == Lexer.DO:
			n = Node(Parser.DO)
			self.lexer.next_tok()
			n.op1 = self.statement()
			if self.lexer.sym != Lexer.WHILE:
				self.error('"while" expected')
			self.lexer.next_tok()
			n.op2 = self.paren_expr()
			if self.lexer.sym != Lexer.SEMICOLON:
				self.error('";" expected')
		elif self.lexer.sym == Lexer.SEMICOLON:
			n = Node(Parser.EMPTY)
			self.lexer.next_tok()
		elif self.lexer.sym == Lexer.LBRA:
			n = Node(Parser.EMPTY)
			self.lexer.next_tok()
			while self.lexer.sym != Lexer.RBRA:
				n = Node(Parser.SEQ, op1 = n, op2 = self.statement())
			self.lexer.next_tok()
		else:
			n = Node(Parser.EXPR, op1 = self.expr())
			if self.lexer.sym != Lexer.SEMICOLON:
				self.error('";" expected')
			self.lexer.next_tok()
		return n
	def parse(self):
		self.lexer.next_tok()
		node = Node(Parser.PROG, op1 = self.statement())
		if (self.lexer.sym != Lexer.EOF):
			self.error("Invalid statement syntax")
		return node

This parser works recursively, starting with the parse () method.
First, he creates the “Program” node, the child node of which becomes the main operator of the program.

Operators are generated in the statement () method. This method checks the first token and, depending on it, forms an if, if / else, while, do / while, a compound statement (if it starts with a curly brace), or just a single statement ending with a semicolon.

When constructing operators, the expr () methods are used - to get the expression and paren_expr () - to get the expression in brackets.

Expressions are built from checks that are built from sums that consist of terms. And the terms, in turn, consist of numbers, variables, and expressions in parentheses. In the house that Jack built.

Such a recursion. As you can see, the methods correspond to the concepts of the grammar described above.

At the output of the parse () method, we get an object of class Node, which contains the tree of our program. This tree can now be converted to machine code.

Machine code

We will compile into a simple bytecode of our special virtual machine. The virtual machine will be stacked because they are much simpler than register. Here are its possible instructions:

FETCH x - положить на стек значение переменной x
STORE x - сохранить в переменной x значение с вершины стека
PUSH  n - положить число n на вершину стека
POP     - удалить число с вершины стека
ADD     - сложить два числа на вершине стека
SUB     - вычесть два числа на вершине стека
LT      - сравнить два числа с вершины стека (a < b). Результат - 0 или 1
JZ    a - если на вершине стека 0 - перейти к адресу a.
JNZ   a - если на вершине стека не 0 - перейти к адресу a.
JMP   a - перейти к адресу a
HALT    - завершить работу

For example, the operators “a = 2; b = 5; ”is converted to the following sequence of instructions:

PUSH 2 STORE 0 PUSH 5 STORE 1

The virtual machine code is very simple. It is all described in the run method:

IFETCH, ISTORE, IPUSH, IPOP, IADD, ISUB, ILT, JZ, JNZ, JMP, HALT = range(11)
class VirtualMachine:
	def run(self, program):
		var = [0 for i in xrange(26)]
		stack = []
		pc = 0
		while True:
			op = program[pc]
			if pc < len(program) - 1:
				arg = program[pc+1]
			if op == IFETCH: stack.append(var[arg]); pc += 2
			elif op == ISTORE: var[arg] = stack.pop(); pc += 2
			elif op == IPUSH: stack.append(arg); pc += 2
			elif op == IPOP: stack.append(arg); stack.pop(); pc += 1
			elif op == IADD: stack[-2] += stack[-1]; stack.pop(); pc += 1
			elif op == ISUB: stack[-2] -= stack[-1]; stack.pop(); pc += 1
			elif op == ILT: 
				if stack[-2] < stack[-1]:
					stack[-2] = 1
				else:
					stack[-2] = 0
				stack.pop(); pc += 1
			elif op == JZ: 
				if stack.pop() == 0:
					pc = arg
				else:
					pc += 2
			elif op == JNZ: 
				if stack.pop() != 0:
					pc = arg
				else:
					pc += 2
			elif op == JMP: pc = arg;
			elif op == HALT: break
		print 'Execution finished.'
		for i in xrange(26):
			if var[i] != 0:
				print '%c = %d' % (chr(i+ord('a')), var[i])

It remains to write the actual compiler - the code generator.

Compiler

The crown of our creation. Here is its source code:

class Compiler:
	program = []
	pc = 0
	def gen(self, command):
		self.program.append(command)
		self.pc = self.pc + 1
	def compile(self, node):
		if node.kind == Parser.VAR:
			self.gen(IFETCH)
			self.gen(node.value)
		elif node.kind == Parser.CONST:
			self.gen(IPUSH)
			self.gen(node.value)
		elif node.kind == Parser.ADD:
			self.compile(node.op1)
			self.compile(node.op2)
			self.gen(IADD)
		elif node.kind == Parser.SUB:
			self.compile(node.op1)
			self.compile(node.op2)
			self.gen(ISUB)
		elif node.kind == Parser.LT:
			self.compile(node.op1)
			self.compile(node.op2)
			self.gen(ILT)
		elif node.kind == Parser.SET:
			self.compile(node.op2)
			self.gen(ISTORE)
			self.gen(node.op1.value)
		elif node.kind == Parser.IF1:
			self.compile(node.op1)
			self.gen(JZ); addr = self.pc; self.gen(0);
			self.compile(node.op2)
			self.program[addr] = self.pc
		elif node.kind == Parser.IF2:
			self.compile(node.op1)
			self.gen(JZ); addr1 = self.pc; self.gen(0)
			self.compile(node.op2)
			self.gen(JMP); addr2 = self.pc; self.gen(0)
			self.program[addr1] = self.pc
			self.compile(node.op3)
			self.program[addr2] = self.pc
		elif node.kind == Parser.WHILE:
			addr1 = self.pc
			self.compile(node.op1)
			self.gen(JZ); addr2 = self.pc; self.gen(0)
			self.compile(node.op2)
			self.gen(JMP); self.gen(addr1);
			self.program[addr2] = self.pc
		elif node.kind == Parser.DO:
			addr = self.pc
			self.compile(node.op1)
			self.compile(node.op2)
			self.gen(JNZ); 
			self.gen(addr);
		elif node.kind == Parser.SEQ:
			self.compile(node.op1)
			self.compile(node.op2)
		elif node.kind == Parser.EXPR:
			self.compile(node.op1);
			self.gen(IPOP)
		elif node.kind == Parser.PROG:
			self.compile(node.op1);
			self.gen(HALT)
		return self.program

The gen () method adds a new byte (command or argument) to the program (list of bytes).
The compile () method compiles the entire program. In fact, this method recursively traverses the tree of nodes. and for each type of node generates the corresponding code.

Pay attention to the trick in conditional statements and loops. After JMP / JZ we first write 0, and when the branch itself is compiled and the address to which we need to go is known, so as not to execute this branch - the value 0 is changed to the actual address.

Testing

Here lies the complete compiler source. I used the script to run and check (I have Lexer read from stdin):

echo " i =	3;"  | ./cc.py
echo " { a=3; b=5; }"  | ./cc.py
echo " { a = 1; b = 2; c = a + b; }"  | ./cc.py
echo " { a = 5; b = 2; c = a - b; }"  | ./cc.py
echo " { a = 5; b = 2; c = b < a; }"  | ./cc.py
echo " { a = 5; if (a < 10) a = 33; }"  | ./cc.py
echo " { a = 5; if (10 < a) a = 33; else { a = 1; b = 2; } }"  | ./cc.py
echo " { a = 10; do { a = a - 2;}  while (3 < a); }" | ./cc.py
echo " { a = 1; b = 5; while (a < b) { a = a + 3; }}" | ./cc.py

The car seemed to give the correct answers.

What's next?

Our compiler has no applications. But experience is gained.
I hope you want to write your compiler even more. Then it’s better to start with languages with simple syntax, e.g. TinyBasic or PL / 0 .
If you want to read the source code for other compilers, I advise you to pay attention to the work Bellarda ( otcc ), tinyc , tcc , smallC , lcc .

Tags: