xkrt March 10, 2011 at 11:25

(How to write (Lisp) interpreter (in Python))

Transfer

Translation of the article (How to Write a (Lisp) Interpreter (in Python)) by Peter Norwig. In this article, Peter quite briefly but succinctly describes how interpreters work and shows how you can write a very tiny (only 90 lines in Python) interpreter of a subset of the Scheme language. The translation is published with permission of the author.

Peter Norvig is an American computer scientist. He currently works as Research Director (formerly Director of Search Quality) at Google. He was previously Head of the Computer Engineering Division at NASA's Ames Research Center, where he led the staff of two hundred NASA researchers in the fields of anatomy and robotics, computer-aided software development and data analysis, neuroengineering, collective system design, and simulation-based decision making.

This article has two purposes: to describe in general terms the implementation of interpreters of programming languages, as well as to demonstrate the implementation of a subset of the Scheme language , a Lisp dialect using Python. I named my interpreter Lispy ( lis.py ). A year ago, I showed how to write a Scheme interpreter in Java , as well as in Common Lisp . This time, the aim is to demonstrate, concise and accessible as possible, that Alan Kay called "Maxwell's Equations Software» (Maxwell's Equations of Software).

Syntax and semantics of Lispy

Most programming languages have different syntactic conventions (keywords, infix operators, parentheses, operator precedence, dot notation, semicolons, etc.), but as a member of the Lisp language family, all Scheme syntax is based on lists in parenthesized prefix notation. This form may seem unfamiliar, but it has advantages in simplicity and consistency. Some joke that “Lisp” means “Lots of Irritating Silly Parentheses,” I think it means “ Lisp Is Syntactically Pure.” See:

Java		Scheme
`if (x.val() > 0) { z = f(a * x.val() + b); }`		`(if (> (val x) 0) (set! z (f (+ (* a (val x)) b))))`

Note that the exclamation mark is not a special character in Scheme, it is just part of the name " set!". Only brackets are special characters. A list, such as (set! x y), with a special keyword at the beginning is called a special form in Scheme; the beauty of the language lies in the fact that we need only 6 special forms, plus 3 syntactic constructions - variables, constants and procedure calls:

The form	Syntax	Semantics and Example
variable reference	var	The symbol is interpreted as the name of the variable; its value is the value of the variable. Example:`x`
constant literal	number	A number means itself. Examples: `12`or`-3.45e+6`
quote	`(quote` exp`)`	Returns exp literally without evaluating it. Example:`(quote (a b c)) ⇒ (a b c)`
conditional	`(if`test conseq alt`)`	Calculate test ; if true, calculate and return conseq ; otherwise compute and return alt . Example:`(if (< 10 20) (+ 1 1) (+ 3 3)) ⇒ 2`
assignment	`(set!`var exp`)`	Compute exp and assign the value to the var variable , which must be previously defined (using `define`or as a procedure parameter). Example:`(set! x2 (* x x))`
definition	`(define`var exp`)`	Define a new variable in the nested environment itself and give it the value of evaluating the expression exp . Examples: `(define r 3)`or`(define square (lambda (x) (* x x)))` .
function	`(lambda (`var ... `)`exp`)`	Create a function with parameter (s) with the names var ... and an expression in the body. Example:`(lambda ( r ) (* 3.141592653 (* r r)))`
sequencing	`(begin` exp ...`)`	Compute each expression from left to right and return the last value. Example:`(begin (set! x 1) (set! x (+ x 1)) (* x 2)) ⇒ 4`
procedure call	`(`proc exp ...`)`	If proc is not `if, set!, define, lambda, begin,`or `quote`then it is considered as a procedure. It is calculated according to the same rules that are defined here. All expressions are evaluated, and then the function is called with a list of expressions in the argument. Example:`(square 12) ⇒ 144`

In this table, var must be an identifier character, such as xor square; number - must be an integer or a floating-point number, while the rest of the words in italics can be any expression. The designation exp ... means zero or more exp repetitions .

To learn more about the scheme please refer to the following excellent books ( Friedman and Fellesein , Dybvig , Queinnec , Harvey and Wright or Sussman and Abelson ), video ( Abelson and Sussman ), introductions ( Dorai , the PLT , orNeller ), or the reference guide .

What the language interpreter does

The interpreter consists of two parts:

Parsing: the parsing component receives the program input as a sequence of characters, checks it in accordance with the syntax rules of the language, and translates the program into the internal representation. In a simple interpreter, the internal representation looks like a tree structure, which accurately reflects the structure of nested operators and expressions in the program. In a language translator called a compiler , an internal representation is a sequence of instructions that can be directly executed by a computer. As Steve Yegge said , “If you don’t know how compilers work, then you don’t know how computers work”. Egg described 8 scenarios that can be solved using compilers (or, equivalently, interpreters). The Lispy parser is implemented using a function parse.
Execution: the internal representation is processed in accordance with the semantic rules of the language, thereby performing calculations. Execution is performed by a function eval(note that it hides the Python built-in function).

Here is the picture of the process of interpretation and interactive session showing how parseand evaltreated with a short program:

>> program = "(begin (define r 3) (* 3.141592653 (* r r)))"

>>> parse (program)
['begin', ['define', 'r', 3], ['*' , 3.141592653, ['*', 'r', 'r']]]

>>> eval (parse (program))
28.274333877

We use the simplest internal representation, where lists, numbers, and Scheme characters are represented by Python lists, numbers, and strings, respectively.

Performance: `eval`

Each of the nine cases in the table above has one, two, or three lines of code. evalNothing more is needed to determine :

def eval (x, env = global_env):
    "Evaluate an expression in an environment."
    if isa (x, Symbol): # variable reference
        return env.find (x) [x]
    elif not isa (x, list): # constant literal
        return x
    elif x [0] == 'quote': # (quote exp )
        (_, exp) = x
        return exp
    elif x [0] == 'if': # (if test conseq alt)
        (_, test, conseq, alt) = x
        return eval ((conseq if eval (test, env ) else alt), env)
    elif x [0] == 'set!': # (set! var exp)
        (_, var, exp) = x
        env.find (var) [var] = eval (exp, env )
    elif x [0] == 'define': # (define var exp)
        (_, var, exp) = x
        env [var] = eval (exp, env)
    elif x [0] == 'lambda': # ( lambda (var *) exp)
        (_, vars, exp) = x
        return lambda * args: eval (exp, Env (vars, args, env))
    elif x [0] == 'begin': # (begin exp * )
        for exp in x [1:]:
            val = eval (exp, env)
        return val
    else: # (proc exp *)
        exps = [eval (exp, env) for exp in x]
        proc = exps.pop (0)
        return proc (* exps)

isa = isinstance

Symbol = str

That's all you need eval! .. well, except for environments. Environment is simply a mapping of characters to values belonging to them. New symbol / value relationships are added using defineor ( lambdaexpression).

Let's take a look at an example of what happens when we declare and call a function in Scheme (a hint lis.py>means that we are interacting with the Lisp interpreter, not Python):

lis.py> (define area (lambda (r) (* 3.141592653 (* r r))))
lis.py> (area 3)
28.274333877

When we execute (lambda ( r ) (* 3.141592653 (* r r))), we get to the branch elif x[0] == 'lambda'in eval, assign the (_, vars, exp)corresponding list elements to the three variables x(and signal an error if the length is xnot 3). We create a new function, which, when called, will calculate the expression ['*', 3.141592653 ['*', 'r', 'r']]in the environment formed by a bunch of formal parameters of the function (in this case, only one parameter r) with arguments in the function call, and also using the current environment for any variables that are not parameters (in example, variable *). The value of this newly-created function is then assigned as value areain global_env.

Now what happens when we calculate(area 3)? Because area is not a special form of characters, so it is a function call (last else:in eval), and the entire list of expressions is executed at a time. The calculation areareturns the function we just created; calculation 3returns 3. Then (in accordance with the last line eval), the newly created function with the argument list is [3]called. This means execution exp, which ['*', 3.141592653 ['*', 'r', 'r']]in an environment where the external environment is requal 3and is global, and therefore *is a function of multiplication.

Now we are ready to explain the details of the class Env:

class Env (dict):
    "An environment: a dict of {'var': val} pairs, with an outer Env."
    def __init __ (self, parms = (), args = (), outer = None):
        self.update (zip (parms, args))
        self.outer = outer
    def find (self, var):
        "Find the innermost Env where var appears. "
        return self if var in self else self.outer.find (var)

Note that the class Envis a subclass dict, which means that normal dictionary operations work on it. In addition, there are two methods, the constructor __init__and findfor finding the right environment for the variable. The key to understanding this class (as well as to the reason that we do not use it simply dict) is the concept of the external environment. Consider the following program:

Each rectangle represents an environment, and its color corresponds to the color of the variables that were defined in this environment. In the last two lines, we define a1and invoke (a1 -20.00); these two lines represent the creation of a bank account with a balance of $ 100 and the subsequent withdrawal of $ 20. In progress(a1 -20.00)we evaluate the expression highlighted in yellow. There are 3 variables involved in this expression. A variable amtcan be found immediately in the most nested (green) environment. But balancenot defined here, we will look in the external environment for a relatively green, i.e. in the blue. And finally, the variable was +not found in any of these environments, we need to take another step into the global (red) environment. This search process, from internal to external media pacing called the lexical definition of context (lexical scoping). Procedure.findfinds the correct environment in accordance with the lexical scope.

All that remains is to define the global environment. Must have+and all other functions built into Scheme. We will not worry about implementing them all, we will get a bunch of the necessary ones by importing the module mathinto Python and then explicitly adding 20 popular ones:

def add_globals (env):
    "Add some Scheme standard procedures to an environment."
    import math, operator as op
    env.update (vars (math)) # sin, sqrt, ...
    env.update (
     {'+': op.add, '-': op.sub, '*': op. mul, '/':op.div,' not ': op.not_,
      '> ': op.gt,' <': op.lt,'> = ': op.ge,' <= ': op. le, '=': op.eq,
      'equal?': op.eq, 'eq?': op.is_, 'length': len, 'cons': lambda x, y: [x] + y,
      ' car ': lambda x: x [0],' cdr ': lambda x: x [1:],' append ': op.add,
      ' list ': lambda * x: list (x),' list? ': lambda x: isa (x, list),
      'null?': lambda x: x == [], '

Parsing: `read`and`parse`

Now about the function parse. The analysis is traditionally divided into two parts: lexical analysis , during which the input string of characters is divided into a sequence of tokens , and parsing , during which the tokens are collected in an internal representation. Lispy tokens are brackets, characters (such as set!or x), and numbers. Here's how it works:

>>> program = "(set! x * 2 (* x 2))"

>>> tokenize (program)
['(', 'set!', 'x * 2', '(', '*', 'x', '2', ')', ')']

>>> parse (program)
['set!', 'x * 2', ['*', 'x', 2]]

There are a bunch of tools for lexical analysis (for example, lex by Mike Lesk and Eric Schmidt), but we will go the simple way: we use Python str.split. We just add spaces around each bracket, and then we get a list of tokens by calling str.split.

Now about the parsing. We saw that the Lisp syntax is very simple, but some Lisp interpreters have made parsing even easier by accepting any string of characters that is a list as a program . In other words, the string (set! 1 2)will be accepted by the syntactically correct program, and only at run time will the interpreter swear that it set!expects a character, not a number, as the first argument. In Java or Python, equivalent assignment1 = 2, will be recognized as a compile time error. On the other hand, Java and Python do not require detection of errors in the expression during compilation x/0, therefore, as you can see, it is not always strictly defined when errors should be recognized. Lispy implements parseusing read, a function that reads any expression (number, character or nested list).

The function readworks by transmitting read_fromtokens obtained after lexical analysis. Having a list of tokens, we look at the first token; if it ')'is a syntax error. If this is the case '(', then we begin to build a list of expressions until we get to the corresponding')'. Everything else should be characters or numbers, which are complete expressions in themselves. The final trick is to know what '2'represents an integer, '2.0'a floating-point number, and xrepresents a character. We will let Python make these differences: for each non-bracketed token, first try to interpret it as an integer, and then as a floating-point number, and finally as a character. Following these instructions we get:

def read (s):
    "Read a Scheme expression from a string."
    return read_from (tokenize (s))

parse = read

def tokenize (s):
    "Convert a string into a list of tokens."
    return s.replace ('(', '(') .replace (')', ')') .split ()

def read_from (tokens):
    "Read an expression from a sequence of tokens."
    if len (tokens) == 0:
        raise SyntaxError ('unexpected EOF while reading')
    token = tokens.pop (0)
    if '(' == token:
        L = []
        while tokens [0]! = ')':
            L.append (read_from (tokens))
        tokens.pop (0) # pop off ')'

        raise SyntaxError ('unexpected)')
    else:
        return atom (token)

def atom (token):
    "Numbers become numbers; every other token is a symbol."
    try: return int (token)
    except ValueError:
        try: return float (token)
        except ValueError:
            return Symbol (token)

And finally, we will add a function to_stringin order to convert the expression back to a readable Lisp string, as well as the function replthat is needed for the read-eval-print cycle in the form of an interactive interpreter:

def to_string (exp):
    "Convert a Python object back into a Lisp-readable string."
    return '(' + '' .join (map (to_string, exp)) + ')' if isa (exp, list) else str (exp)

def repl (prompt = 'lis.py>'):
    "A prompt- read-eval-print loop. "
    while True:
        val = eval (parse (raw_input (prompt)))
        if val is not None: print to_string (val)

Here is our code in work:

>>> repl()
lis.py> (define area (lambda (r) (* 3.141592653 (* r r))))
lis.py> (area 3)
28.274333877
lis.py> (define fact (lambda (n) (if (<= n 1) 1 (* n (fact (- n 1))))))
lis.py> (fact 10)
3628800
lis.py> (fact 100)
9332621544394415268169923885626670049071596826438162146859296389521759999322991
5608941463976156518286253697920827223758251185210916864000000000000000000000000
lis.py> (area (fact 10))
4.1369087198e+13
lis.py> (define first car)
lis.py> (define rest cdr)
lis.py> (define count (lambda (item L) (if L (+ (equal? item (first L)) (count item (rest L))) 0)))
lis.py> (count 0 (list 0 1 2 3 0 0))
3
lis.py> (count (quote the) (quote (the more the merrier the bigger the better)))
4

How small / fast / complete / good is Lispy?

We will judge Lispy by the following criteria:

Size: Lispy tiny, 90 lines without comment, less than 4K source code. There are fewer lines compared to the first version, which had 96 lines - I accepted Eric Cooper's offer to remove the class definitionProcedureand use the Python one insteadlambda. The smallest Java version of my Scheme ( Jscheme ) consisted of 1664 lines and 57K of source code. Jsceheme was originally called SILK (Scheme in Fifty Kilobytes, 50 kilobyte Scheme), but I met this limit only with respect to the resulting bytecode. I think this satisfies the1972 Alan Kay requirement that you could define “the most powerful language in the world” in the “code page” .
bash $ grep "^ \ s * [^ # \ s]" lis.py | wc
90 398 3423
Speed: Lispy calculates(fact 100)in 0.004 seconds. For me, this is fast enough (although much slower than most other ways to count it).
Completion: Lispy does not fully comply with the Scheme standard. Here are the main disadvantages:
- Syntax: missing comments, quoting / quasi-quoting, # literals, derived expression types (such ascondderived fromif, orletderived fromlambda), and dotted list notation.
- Semantics: missing call / cc and tail recursion.
- Data types: no strings, single characters, boolean type, ports, vectors, exact / inaccurate numbers. Lists in Python are actually closer to vectors in Scheme than to pairs and lists.
- Functions: there are no more than 100 primitive functions, most of which are designed for missing data types, plus some others (such asset-car!andset-cdr!, because we can notset-cdr!fully implement Python lists).
- Error handling: Lispy does not try to detect errors, make an acceptable report of them, or recover. Lispy expects programmer excellence.
Good: it's up to readers. I find it good for my purpose to explain Lisp interpreters.

True story

Returning to the idea that knowing how interpreters work can be very useful, I’ll tell a story. In 1984, I wrote a doctoral dissertation. This was before LaTeX, before Microsoft Word - we used troff. Unfortunately, troff had no way of referencing symbolic labels: I wanted to be able to write “As we will see on the @theoremx page” and then write something like “@ (set theoremx \ n%)” in the appropriate place ( troff reserved \ n% as page number). My friend Tony Deros felt the same need, and together we sketched a simple Lisp program that worked as a preprocessor. However, it turned out that the Lisp we had at that time was good for reading Lisp expressions, but was so slow for reading everything else that it was annoying.

Tony and I went in different ways. He believed that the interpreter for expressions was the difficult part, he needed Lisp for this, but he knew how to write a small C subroutine to repeat non-Lisp characters and associate them with a Lisp program. I did not know how to make this binding, but I believed that writing an interpreter for this trivial language (all that was in it - setting a variable value, getting a variable value and concatenating strings) was a simple task, so I wrote this interpreter in C. It's funny that Tony wrote a program in Lisp because I was a C programmer, and I wrote a program in C because I was a Lisp programmer.

In the end, we both achieved a result.

Together

The full Lispy code is available for download: lis.py

Tags: