Haskell applicative parsers

Tutorial

Motivation

When I first started to learn Haskell, I was very annoyed by the widespread use of complex abstractions instead of some specific solutions. It seemed to me that it would be much better to always follow the KISS principle and write bicycles using elementary language constructions than to sort through all these types of classes in order to write one supposedly convenient structure somewhere in the end.

I lacked a good example where the efforts spent on mastering the "materiel" would pay off. For me, one of the most successful such examples was the parsers. Now I often talk about them when they ask me for which common tasks you can beautifully use Haskell.

I want to suggest that beginners also go through this path and create from scratch a small base of functions for conveniently implementing parsers, and then use it to write your own parser, the code of which will almost literally repeat the grammar that is being analyzed.

I hope someone can help it to overcome the fear of abstractions and teach appropriate to use them (yes, I still believe that it is sometimes more effective to write a bicycle).

I have no goal and no desire to make a Haskell course out of the article from scratch, so I assume that the reader is familiar with the syntax and independently developed simple programs. Just in case, I will briefly talk about the type classes before proceeding to the description of the implementation.

For those who have never written on Haskell, but want to understand what is going on here, I recommend that you first look at the corresponding page on Learn X in Y minutes . As an excellent Russian-language book for beginners, I advise Denis Shevchenko, "About Haskell Humanly" .

I will try to use the most simple language constructs that beginners can understand. At the end of the article there is a link to the source repository, where in some parts of the code a more convenient and short record is used, which may be less clear at a glance.

And yes, gentlemen, Haskelists, many things are explained very simply and clumsily, for particular cases, not very abstract, without the use of terms from category theory and other scary words. I am glad that you know them and of course easily mastered them. I also know them, but I do not consider it necessary to throw out such a volume of information in this context to unprepared readers.

Type Classes

Type classes in Haskell have nothing to do with classes in C ++ and other object-oriented languages. If we draw an analogy with OOP, the types of classes are more like the overloading of methods and functions.

Classes define what actions can be performed with objects of the types that are included in the class. For example, all numbers can be compared by equality, but everything can be ordered except complex ones, and functions in general cannot be compared at all. The class of types that can be compared, called Eq, ordered - Ord(types do not have to be numeric). What can be printed by translating into a string belongs to a class Show, it has an “opposite” class Readthat determines how to convert strings to objects of the desired type.

For a set of standard types of classes (such as Eq, Show, Read...) you can ask the compiler to implement the desired functionality in the standard way using a keyword derivingafter determining the type:

dataPoint = Point
    { xCoord :: Float
    , yCoord :: Float
    } deriving (Eq, Show)

You can define your own type classes:

classPrettyPrint a where
  pPrint :: a -> String

Here PrettyPrintis the class name, ais the type variable. The keyword is wherefollowed by a list of the so-called class methods, i.e. functions that can be applied to objects of type from this class.

In order to denote the belonging of a data type to a class, the following construction is used:

instancePrettyPrintPointwhere
  pPrint (Point x y) = "(" ++ show x ++ ", " ++ show y ++ ")"

The language allows you to specify restrictions on the types of classes to which the function arguments should refer:

showVsPretty :: (Show a, PrettyPrint a) => a -> (String, String)
showVsPretty x = (show x, pPrint x)

For each function call, the compiler checks whether these requirements for the type are met, and if it fails, it displays an error (of course, this happens at the compilation stage).

>>> showVsPretty (Point23)
("Point {xCoord = 2.0, yCoord = 3.0}","(2.0, 3.0)")
>>> showVsPretty "str"error:
    Noinstance for (PrettyPrint [Char]) arising from a use of ‘showVsPretty’

Implementation

The parser receives as input a string that should be parsed according to predefined rules and get the value of the type we need (for example, an integer). In this case, the input line may not end, and the remainder will serve as an input for further parsing. In addition, our parser will generally be non-deterministic, i.e. will return several possible parse results as a list.

For the description of one result of the parser, a tuple of two elements is suitable (String, a), where ais a type variable that can denote any custom type.

Since the parser parses the string according to some rules, we describe it as a function that takes a string as input and returns a list of results:

newtypeParser a = Parser { unParser :: String -> [(String, a)] }

We will consider the parsing successful if the list of results consists of one element and the input string has been completely processed. We implement an auxiliary function that attempts to perform an unambiguous parsing of the entire string:

parseString :: String -> Parser a -> Maybe a
parseString s (Parser p) = case (p s) of
    [("", val)] -> Just val
    _           -> Nothing

Simple parsers

We implement several simple parsers, which will then come in handy in building more complex combinations.

We begin by parsing a single character that must satisfy the predicate. If the input string is empty, then the result of the work is an empty list. Otherwise, check the value of the predicate on the first character of the string. If the value is returned True, the parsing result is this character; return it along with the rest of the string. Otherwise, parsing also ends in failure.

predP :: (Char -> Bool) -> ParserCharpredP p = Parser f
  where
    f "" = []
    f (c : cs) | p c = [(cs, c)]
               | otherwise = []

Now we can write a parser that accepts a specific character at the beginning of a line. To do this, use the newly written one predPand pass it to it as an argument a function that compares its argument with the symbol we need:

charP :: Char -> ParserCharcharP char = predP (\c -> c == char)

The following simplest case: a parser that accepts only a specific string entirely. Let's call him stringP. The function inside the parser compares the input string with the required one and, if the lines are equal, returns a list of one element: a pair of empty lines (nothing left at the input) and the original one. Otherwise, the parsing failed, and an empty list of results is returned.

stringP :: String -> ParserStringstringP s = Parser f
  where
    f s' | s == s' = [("", s)]
         | otherwise = []

Quite often, you need to skip characters that have a certain property while they are at the beginning of a line (for example, whitespace characters). At the same time, the result of the analysis is not important to us and will not be useful in the future. Let's write a function skipthat skips the initial characters of the string, while the true value of the predicate is preserved. As a result of the analysis we use an empty tuple.

skip :: (Char -> Bool) -> Parser ()
skip p = Parser (\s -> [(dropWhile p s, ())])

The following two parsers are very similar to each other. Both check the input string prefix, only the first returns the prefix if successful, and the second returns an empty tuple, i.e. allows you to skip an arbitrary string at the beginning of the entry. For implementation, the function isPrefixOfdefined in the module is used Data.List.

prefixP :: String -> ParserStringprefixP s = Parser f
  where
    f input = if s `isPrefixOf` input
                then [(drop (length s) input, s)]
                else []
skipString :: String -> Parser ()
skipString s = Parser f
  where
    f input = if s `isPrefixOf` input
                then [(drop (length s) input, ())]
                else []

A little later, we will look at a simpler implementation of the last function and get rid of code duplication.

Parser as a functor

We can distinguish a whole class of container types for which the following is true: if you know how to transform objects inside a container, then you can convert the containers themselves. The simplest example is a list as a container and a function mapthat exists in almost all high-level languages. Indeed, you can go through all the elements of a type list [a], apply a function to each a -> band get a type list [b].

This type class is called Functor; the class has one method fmap:

classFunctor f where
  fmap :: (a -> b) -> f a -> f b

Suppose that we already know how to parse strings into objects of a certain type a, and, moreover, we know how to convert objects of a type ainto objects of a type b. Is it possible to say that then there is a parser for type objects b?

If we express this in the form of a function, then it will have the following type:

(a -> b) -> Parser a -> Parser b

This type coincides with the type of the function fmap, so we will try to make the parser a functor. Create a parser of type values from scratch bthat will first call the first parser (we already have one), and then apply the function to the results of its parsing.

instanceFunctorParserwhere
  fmap :: (a -> b) -> Parser a -> Parser b
  fmap f (Parser p1) = Parser p2
    where
      p2 :: String -> [(String, b)]
      p2 s = convert (p1 s)
      convert :: [(String, a)] -> [(String, b)]
      convert results = map (\(s, val) -> (s, f val)) results

At the function fmapthere is a convenient infix synonym: fmap f x == f <$> x.

If, as an argument, we fmapuse a function that simply replaces its first argument with a new value, we will get another useful operation, which has already been implemented for all functors even in two copies (they differ only in the order of the arguments):

(<$) :: Functor f => a -> f b -> f a
($>) :: Functor f => f a -> b -> f b

Remember the parser that skips a certain string ( skipString)? Now you can implement it as follows:

skipString :: String -> Parser ()
skipString s = () <$ prefixP s

Parser Combinations

In Haskell, all functions are curried by default and allow partial use. This means that the function of the narguments is actually a function of one argument, which returns a function of the n-1arguments:

cons :: Int -> [Int] -> [Int]
cons = (:)
cons1 :: [Int] -> [Int]
cons1 = cons 1-- функция cons применена частично

Apply the function of three arguments to some value inside the parser, using fmap. Types will be as follows:

f :: c -> a -> b
p :: Parser c
(fmap f p) :: Parser (a -> b)

The parser of function has turned out ?! Of course, it is possible that the input line actually contains the function representation, but I would like to be able to use this function, or rather combine parsers, Parser (a -> b)and Parser ato get Parser b:

applyP :: Parser (a -> b) -> Parser a -> Parser b

The type of this function is very similar to the type fmap, only the function itself that needs to be applied is also in the container. This gives an intuitive understanding of how the implementation of the function should look like applyP: get the function from the container (as a result of applying the first parser), get the values to which the function should be applied (the result of applying the second parser) and “pack” the values converted using this function back to container (create a new parser). In the implementation we will use list comprehension:

applyP :: Parser (a -> b) -> Parser a -> Parser b
applyP (Parser p1) (Parser p2) = Parser f
    where f s = [ (sx, f x) | (sf, f) <- p1 s,  -- p1 применяется к исходной строке
                              (sx, x) <- p2 sf] -- p2 применяется к строке, оставшейся после предыдущего разбора

There is a class Applicativethat has a method with the same prototype. The second class method is called pureand is used to “wrap” or “lift” ( lift ) a value, including a functional one. In the case of implementation for the parser, the function pureadds its argument to the result of the parser, without changing the input string.

classFunctor f => Applicative f where
  pure :: a -> f a
  (<*>) :: f (a -> b) -> f a -> f b
instanceApplicativeParserwhere
  pure x = Parser (\s -> [(s, x)])
  pf <*> px = Parser (\s -> [ (sx, f x) | (sf, f) <- unParser pf $ s,
                                          (sx, x) <- unParser px $ sf])

The function applyPis <*>from the class Applicative. Types belonging to this class are called applicative functors.

For applicative functors, two auxiliary functions are implemented that will be useful to us:

(*>) :: f a -> f b -> f b
(<*) :: f a -> f b -> f a

These functions perform two consecutive actions and return the result of only one of them. For parsers, they can be used, for example, in order to skip leading gaps before parsing the part of the line that carries the meaning.

By combining <$>and <*>creating very convenient designs. Consider the following data type:

dataMyStructType = MyStruct
  { field1 :: Type1
  , field2 :: Type2
  , field3 :: Type3
  }

The value constructor MyStructis also a function, in this case it has a type Type1 -> Type2 -> Type3 -> MyStructType. You can work with the constructor as with any other function. Suppose that parsers are already written for the structure field types:

parser1 :: ParserType1parser2 :: ParserType2parser3 :: ParserType3

Using the function, fmapyou can partially apply MyStructto the first of these parsers:

parserStruct' :: Parser (Type2 -> Type3 -> MyStructType)
parserStruct' = MyStruct <$> parser1

Let's try to continue to use the function, which is now "inside" the parser. For this you need to use <*>:

parserStruct'' :: Parser (Type3 -> MyStructType)
parserStruct'' = parserStruct' <*> parser2
parserStruct :: ParserMyStructTypeparserStruct = parserStruct'' <*> parser3

As a result, we got a parser for the whole structure (of course, here we use the assumption that in the initial line of the presentation of its fields go in a row). The same can be done in one line:

parserStruct :: ParserMyStructTypeparserStruct = MyStruct <$> parser1 <*> parser2 <*> parser3

Such designs will often be found in the example of use.

Now suppose that we are trying to write a parser that parses simple arithmetic expressions in which integers and identifiers can be present as operands. Create a separate type for them Operand:

dataOperand
  = IntOpInt
  | IdentOpString

If we can already parse integers and identifiers (for example, as in C), then we need one parser for operands that can parse one or the other. This parser is an alternative of the other two, so we need a function that can combine parsers so that the results of their work are combined. The result of the parser is a list, and the union of lists is their concatenation. We implement the function altPcombining two parsers:

altP :: Parser a -> Parser a -> Parser a
altP (Parser p1) (Parser p2) = Parser (\s -> p1 s ++ p2 s)

Then the operand parser can be implemented using this function (here it is assumed that parserIntthey are parserIdentalready described somewhere:

parserOperand :: ParserOperandparserOperand = altP parserIntOp parserIdentOp
  where
    parserIntOp = IntOp <$> parserInt
    parserIdentOp = IdentOp <$> parserIdent

Of course, for the alternatives have already come up with a separate class, which is called Alternative. It has another method empty, describing a neutral element for an alternative operation. In our case, this is a parser that never parses anything, i.e. always returns an empty list of results. For the parser, the implementation of the class methods Alternativelooks like this:

classApplicative f => Alternative f where
  empty :: f a
  (<|>) :: f a -> f a -> f a
instanceAlternativeParserwhere
  empty = Parser (const [])
  px <|> py = Parser (\s -> unParser px s ++ unParser py s)

An operation <|>is a function altP, only in the infix notation, which is more convenient to use, combining several parsers in a row.

For all types in this class two functions are implemented, someand manytypes f a -> f [a]. Each of them can be expressed through the other:

some v = (:) <$> v <*> many v
many v = some v <|> pure []

In terms of parsers, these functions allow you to parse data sequences, if you know how to parse one data element. If used, the somesequence must be non-empty.

Usage example

Now we are ready to write our parser, for example, for simple arithmetic expressions with the following grammar:

 expr      ::= constExpr | binOpExpr | negExpr
 const     ::= int
 int       ::= digit{digit}
 digit     ::= '0' | ... | '9'
 binOpExpr ::= '(' expr ' ' binOp ' ' expr ')'
 binOp     ::= '+' | '*'
 negExpr   ::= '-' expr

The expression consists of integer constants, unary minus and two infix binary operations: addition and multiplication. Brackets are required around an expression with a binary operation, the operation symbol is separated from the operands by exactly one space, leading and trailing spaces are not allowed.

Examples of correct expression writing:

"123"
"-(10 + 42)"
"(1 + ((2 + 3) * (4 + 5)))"

Examples of incorrect entries:

" 666 "
"2 + 3"
"(10  * 10)"

We declare the necessary data types (the expression itself and the binary operation):

dataExpr = ConstExprInt
          | BinaryExprExprOperatorExpr
          | NegateExprExprdataOperator = Add | Mul

You can start parsing! The expression itself consists of three alternatives. So we write:

-- expr ::= constExpr | binOpExpr | negExprexprParser :: ParserExprexprParser = constParser <|> binParser <|> negParser

The constant is a positive integer. In our data type, it is "wrapped" in the constructor, so we cannot use the parser for an integer directly, but we can use fmapit to get the value of the desired type.

-- const ::= intconstParser :: ParserExprconstParser = ConstExpr <$> intParser

The integer, according to the grammar, is represented as a non-empty sequence of numbers. To parse a single digit, we use an auxiliary function predPand a predicate isDigitfrom a module Data.Char. Now, to build a parser to parse a sequence of numbers, we use the function some(not many, because there must be at least one digit). The result of this parser returns a list of all possible parsing options, starting with the longest entry. For example, if the input string "123ab", the results list will be as follows: [("ab", "123"), ("3ab", "12"), ("23ab", "1")]. We need to parse the longest sequence of numbers and convert it to a type Int. The whole implementation is as follows:

-- int   ::= digit{digit}-- digit ::= '0' | ... | '9'intParser :: ParserIntintParser = Parser $ \s -> let res = unParser (some digitParser) s incase res of
      [] -> []
      ((rest, i) : xs) -> [(rest, read i)]
  where
    digitParser = predP isDigit

The next variant of the expression is the use of a binary operation. According to the grammar, the input string must first include an opening bracket, the first operand, a space, an operation symbol, another space, the second operand, and a closing bracket. To parse individual characters (brackets and spaces) use the function charP. Operands are expressions, and there is already a parser to parse them ( exprParser). To parse the binary operation symbol, we will describe the auxiliary parser just below. It remains to carefully combine this set of parsers. At the beginning and at the end of the expression there should be brackets: you need to check this, but discard the result itself. For this we use *>and <*:

binParser :: ParserExprbinParser = charP '(' *> ??? <* charP ')'

Between these parsers for brackets, the construction of the expression with the help of the constructor BinaryExprand the parsers for the expression and operation should occur . Let's not forget about the spaces around the operation symbol, using the same method as for parentheses. This part is as follows:

BinaryExpr <$> exprParser -- первый операнд
           <*> (charP ' ' *> binOpParser <* charP ' ') -- операция, окружённая пробелами
           <*> exprParser -- второй операнд

Substitute this expression instead of question marks:

-- binOpExpr ::= '(' expr ' ' binOp ' ' expr ')'binParser :: ParserExprbinParser =
  charP '(' *>
    (BinaryExpr <$> exprParser
                <*> (charP ' ' *> binOpParser <* charP ' ')
                <*> exprParser
    )
  <* charP ')'

A binary operation is either a symbol +that understands the value Add, or *that understands Mul:

-- binOp ::= '+' | '*'binOpParser :: ParserOperatorbinOpParser = plusParser <|> multParser
  where
    plusParser = charP '+' $> Add
    multParser = charP '*' $> Mul

Remained the simplest part of the grammar, the negation of expression. We -do the same with the symbol as with brackets and spaces. Next, apply the constructor NegateExprto the result of recursive parsing:

-- negExpr ::= '-' exprnegParser = charP '-' *> (NegateExpr <$> exprParser)

So, all parts of the parser are implemented. The code in many ways resembles a grammar and completely coincides with it in structure.

The source code is available on GitLab: https://gitlab.com/fierce-katie/applicative-parsers-demo .

It is easier to estimate its volume and degree of expressiveness there, since there are far fewer comments. The project can be assembled with the Stack utility and you can launch a primitive interpreter using the parser we wrote:

$ stack build
$ stack exec demo-parser

For those who want to practice further on their own, I can advise the following:

Grammar can be improved in every way, for example, to allow leading and trailing spaces, add new operations, etc.
The parser translates the string into the internal representation of the expression. This expression can be calculated and converted by the interpreter so that it prints not the result of the parsing but the result of the calculation.
To explore the possibilities of libraries parsec, attoparsec, applicative-parsecand optparse-applicativetry to apply them.

Thanks for attention!

Useful materials

Tags: