Writing an (under) interpreter in Haskell using alex and happy

Hello, Habr! In this article we will look at how to make a do-it-yourself (under) interpreter on Haskell. Interested please ask for a cut!

One day I got the idea to write my own interpreter, and certainly on Haskell. Writing it from scratch is not an activity for the faint of heart, and why, if everything has already been written for this by other, (possibly) more experienced people!

Step 1: wording TK for ourselves


Our (under) interpreter will work like this: Only let-in, only Int, from operations: addition, subtraction, multiplication, division (integer). Go!

let a = 2 in a*2
4
let a = 8 in (let b = a - 1 in a*b)
56



Step 2: alex


The first thing we need is a lexer (a program that will divide the code into particles - tokens). Those who wrote their translators in the great and mighty C probably remembered flex, the great and powerful lexer generator. We will use no less great and powerful alex - also a lexer generator, but for Haskell. Download alex from here or

$ sudo apt-get install alex

then open your favorite text editor and write:

{
module Lex where
}
%wrapper "basic"
$digit = 0-9
$alpha = [a-zA-Z]
tokens :-
  $white                                       ;
  let                                              { \s -> TLet }
  in                                               { \s -> TIn }
  $digit+                                     { \s -> TNum (read s)}
  [\=\+\-\*\/\(\)]                          { \s -> TSym (head s)}
  $alpha [$alpha $digit \_ \']*  { \s -> TVar s}
{
data Token = TLet | TIn | TNum Int | TSym Char | TVar String deriving (Eq, Show)
}

Scary.

In fact, everything is simple. First, we declare the Lex module (the code in curly brackets is pure Haskell), then we say that we want to use basic wrapper, that is, without any bells and whistles - cheap and cheerful, then go to the definition of character sets (charsets) - the name of charset should start with $, the value is an (almost) regular expression. $ digit is all numbers, $ alpha is all letters. Now the most important thing is tokens. after: - their definitions should go, regex on the left, token on the right. We ignore spaces (charset for whitespace on the left, semicolon on the right), let is the TLet token, in - TIn, one or more digits - TNum, all the pluses and minuses - TSym, letters, underline and '- TVar. Further we see that all the scary words with the letter T are Token values ​​- nothing complicated.

Now is the time for magic.

Save the file as Lex.x, then

$ alex Lex.x

Finish! Unas is our lexer module - Lex.hs!

Step 3: happy


Now we need a parser generator - happy. Its counterparts for C are bison and yacc. Download it from here or

$ sudo apt-get install happy

Create Synt.y file

{
module Synt where
import Lex
}
%name synt
%tokentype { Token }
%error { parseError }
%token
  let         { TLet }
  in          { TIn }
  num         { TNum $$ }
  var         { TVar $$ }
  '='         { TSym '=' }
  '+'         { TSym '+' }
  '-'         { TSym '-' }
  '*'         { TSym '*' }
  '/'         { TSym '/' }
  '('         { TSym '(' }
  ')'         { TSym ')' }
%%
Exp:
  let var '=' Exp in Exp        { Let $2 $4 $6 }
  | Exp1                        { Exp1 $1 }
Exp1:
  Exp1 '+' Term                 { Plus $1 $3 }
  | Exp1 '-' Term               { Minus $1 $3 }
  | Term                        { Term $1 }
Term:
  Term '*' Factor               { Mul $1 $3 }
  | Term '/' Factor             { Div $1 $3 }
  | Factor                      { Factor $1 }
Factor:
  num                           { Num $1 }
  | var                         { Var $1 }
  | '(' Exp ')'                 { Brack $2 }
{
parseError :: [Token] -> a
parseError _ = error "Parse error"
data Exp = Let String Exp Exp | Exp1 Exp1 deriving (Show)
data Exp1 = Plus Exp1 Term | Minus Exp1 Term | Term Term deriving (Show)
data Term = Mul Term Factor | Div Term Factor | Factor Factor deriving (Show)
data Factor = Num Int | Var String | Brack Exp deriving (Show)
}

Even worse.

Here, too, the Haskell code is taken in braces, so first we declare the Synt module, then import Lex (we need the Token type from there). "% name synt" means that our parser function will be called synt, "% tokentype {Token}" - that we use the type Token as the type of tokens would never have guessed , "% error {parseError}" defines a function that will handle errors.

Next come the tokens matching their aliases. But now it will be scary. We say that expression (Exp) is let variable = expression in expression or subexpression (Exp1). In the first case, you need to create the Let construct (its type is Exp, see the declaration below), and take the second, fourth, and sixth words (i.e. var, Exp, and Exp) as arguments, and in the second case, create the Exp1 construct with the first word as an argument. Exp1 can be an addition or subtraction operation with the first and third words as arguments or Term. Term is similar in structure, but for addition and multiplication operations. The reader will ask: “Why divide into two types, is it really impossible to cram multiplication and division in Exp1?” The fact is that the parser’s job is to build the so-called “parsing tree”. The most deeply nested operations are performed first. Term will be deeper than Exp1, which means that the operations of multiplication will be performed earlier than additions! Factor, in turn, can be a number, a variable, or an expression in parentheses - again for the order of actions. Next, we will declare a parseError function to “handle” errors and all data types like Exp or Factor.

Everything, the parser is ready!

$ happy Synt.y

Now we have the Synt.hs file

The final breakthrough: the interpreter


The interpreter code is presented below:

module Main where
import qualified Data.Map as M
import Lex
import Synt
newtype Context = Context {getContext :: M.Map String Int} deriving (Show)
pull :: Maybe a -> a
pull (Just m) = m
pull Nothing = error "Undefined variable"
createContext :: Context
createContext = Context {getContext = M.empty}
getValue :: Context -> String -> Maybe Int
getValue ctx name = M.lookup name $ getContext ctx
solveExp :: Context -> Exp -> Maybe Int
solveExp ctx exp = case exp of (Let name expl rexp) -> solveExp newCtx rexp where newCtx = Context {getContext = M.insert name (pull (solveExp ctx expl)) (getContext ctx)}
                               (Exp1 exp1) -> solveExp1 ctx exp1
solveExp1 :: Context -> Exp1 -> Maybe Int
solveExp1 ctx exp1 = case exp1 of (Plus lexp1 rterm) -> (+) <$> (solveExp1 ctx lexp1) <*> (solveTerm ctx rterm)
                                  (Minus lexp1 rterm) -> (-) <$> (solveExp1 ctx lexp1) <*> (solveTerm ctx rterm)
                                  (Term term) -> solveTerm ctx term
solveTerm :: Context -> Term -> Maybe Int
solveTerm ctx term = case term of (Mul lterm rfactor) -> (*) <$> (solveTerm ctx lterm) <*> (solveFactor ctx rfactor)
                                  (Div lterm rfactor) -> (div) <$> (solveTerm ctx lterm) <*> (solveFactor ctx rfactor)
                                  (Factor factor) -> solveFactor ctx factor
solveFactor :: Context -> Factor -> Maybe Int
solveFactor ctx factor = case factor of (Num n) -> (Just n)
                                        (Var s) -> getValue ctx s
                                        (Brack exp) -> solveExp ctx exp
main = do
       s <- getContents
       mapM putStrLn $ (map (show . pull . (solveExp createContext) . synt . alexScanTokens) . lines) s

Here we declare a Context type that contains an associative array (Map) with variable names and their values, a pull function that returns the value of the variable, if any, raises an error. The createContext function simply creates an empty context, getValue looks for the value of the variable in the context. Now the fun part! To understand what is happening here, let’s imagine that the line of our code is as follows:

8

Then the parsing tree will be like this:

let res = Exp (Exp1 (Term (Num 8)))

and the result (i.e. 8) will be after

((solveFactor ctx) <- (solveTerm ctx) <- (solveExp1 ctx) <- (solveExp ctx)) res

This is not Haskell code, but the scheme: solveExp will pass res solveExp1, etc.
ctx here is just some context.

That is, the work of the solveExp function is to add the variable to the context and calculate Exp after in, if it goes to the input of the let-in function, otherwise, simply calculate Exp1.
The function solveExp1 adds or subtracts, solveTerm - multiplies and divides. solveFactor takes out the values ​​of the variables, returns the numbers, and if Exp in parentheses goes to it, it passes solveExp (not again but again) .

The main function takes stdin strings to EOF, splits them into a list of strings, selects tokens in each (alexScanTokens), runs it through a parser (synt), calculates the value of an expression (solveExp) with an empty context (createContext), and makes Maybe Int Int. and then String, after which it displays the result.

All! Now for sure! We compile everything, and our interpreter is ready! Comments, comments, suggestions - please comment.

Also popular now: