Haskell applicative parsers
- Tutorial
Motivation
When I first started to learn Haskell, I was very annoyed by the widespread use of complex abstractions instead of some specific solutions. It seemed to me that it would be much better to always follow the KISS principle and write bicycles using elementary language constructions than to sort through all these types of classes in order to write one supposedly convenient structure somewhere in the end.
I lacked a good example where the efforts spent on mastering the "materiel" would pay off. For me, one of the most successful such examples was the parsers. Now I often talk about them when they ask me for which common tasks you can beautifully use Haskell.
I want to suggest that beginners also go through this path and create from scratch a small base of functions for conveniently implementing parsers, and then use it to write your own parser, the code of which will almost literally repeat the grammar that is being analyzed.
I hope someone can help it to overcome the fear of abstractions and teach appropriate to use them (yes, I still believe that it is sometimes more effective to write a bicycle).
I have no goal and no desire to make a Haskell course out of the article from scratch, so I assume that the reader is familiar with the syntax and independently developed simple programs. Just in case, I will briefly talk about the type classes before proceeding to the description of the implementation.
For those who have never written on Haskell, but want to understand what is going on here, I recommend that you first look at the corresponding page on Learn X in Y minutes . As an excellent Russian-language book for beginners, I advise Denis Shevchenko, "About Haskell Humanly" .
I will try to use the most simple language constructs that beginners can understand. At the end of the article there is a link to the source repository, where in some parts of the code a more convenient and short record is used, which may be less clear at a glance.
And yes, gentlemen, Haskelists, many things are explained very simply and clumsily, for particular cases, not very abstract, without the use of terms from category theory and other scary words. I am glad that you know them and of course easily mastered them. I also know them, but I do not consider it necessary to throw out such a volume of information in this context to unprepared readers.
Type Classes
Type classes in Haskell have nothing to do with classes in C ++ and other object-oriented languages. If we draw an analogy with OOP, the types of classes are more like the overloading of methods and functions.
Classes define what actions can be performed with objects of the types that are included in the class. For example, all numbers can be compared by equality, but everything can be ordered except complex ones, and functions in general cannot be compared at all. The class of types that can be compared, called Eq
, ordered - Ord
(types do not have to be numeric). What can be printed by translating into a string belongs to a class Show
, it has an “opposite” class Read
that determines how to convert strings to objects of the desired type.
For a set of standard types of classes (such as Eq
, Show
, Read
...) you can ask the compiler to implement the desired functionality in the standard way using a keyword deriving
after determining the type:
dataPoint = Point
{ xCoord :: Float
, yCoord :: Float
} deriving (Eq, Show)
You can define your own type classes:
classPrettyPrint a where
pPrint :: a -> String
Here PrettyPrint
is the class name, a
is the type variable. The keyword is where
followed by a list of the so-called class methods, i.e. functions that can be applied to objects of type from this class.
In order to denote the belonging of a data type to a class, the following construction is used:
instancePrettyPrintPointwhere
pPrint (Point x y) = "(" ++ show x ++ ", " ++ show y ++ ")"
The language allows you to specify restrictions on the types of classes to which the function arguments should refer:
showVsPretty :: (Show a, PrettyPrint a) => a -> (String, String)
showVsPretty x = (show x, pPrint x)
For each function call, the compiler checks whether these requirements for the type are met, and if it fails, it displays an error (of course, this happens at the compilation stage).
>>> showVsPretty (Point23)
("Point {xCoord = 2.0, yCoord = 3.0}","(2.0, 3.0)")
>>> showVsPretty "str"error:
Noinstance for (PrettyPrint [Char]) arising from a use of ‘showVsPretty’
Implementation
The parser receives as input a string that should be parsed according to predefined rules and get the value of the type we need (for example, an integer). In this case, the input line may not end, and the remainder will serve as an input for further parsing. In addition, our parser will generally be non-deterministic, i.e. will return several possible parse results as a list.
For the description of one result of the parser, a tuple of two elements is suitable (String, a)
, where a
is a type variable that can denote any custom type.
Since the parser parses the string according to some rules, we describe it as a function that takes a string as input and returns a list of results:
newtypeParser a = Parser { unParser :: String -> [(String, a)] }
We will consider the parsing successful if the list of results consists of one element and the input string has been completely processed. We implement an auxiliary function that attempts to perform an unambiguous parsing of the entire string:
parseString :: String -> Parser a -> Maybe a
parseString s (Parser p) = case (p s) of
[("", val)] -> Just val
_ -> Nothing
Simple parsers
We implement several simple parsers, which will then come in handy in building more complex combinations.
We begin by parsing a single character that must satisfy the predicate. If the input string is empty, then the result of the work is an empty list. Otherwise, check the value of the predicate on the first character of the string. If the value is returned True
, the parsing result is this character; return it along with the rest of the string. Otherwise, parsing also ends in failure.
predP :: (Char -> Bool) -> ParserCharpredP p = Parser f
where
f "" = []
f (c : cs) | p c = [(cs, c)]
| otherwise = []
Now we can write a parser that accepts a specific character at the beginning of a line. To do this, use the newly written one predP
and pass it to it as an argument a function that compares its argument with the symbol we need:
charP :: Char -> ParserCharcharP char = predP (\c -> c == char)
The following simplest case: a parser that accepts only a specific string entirely. Let's call him stringP
. The function inside the parser compares the input string with the required one and, if the lines are equal, returns a list of one element: a pair of empty lines (nothing left at the input) and the original one. Otherwise, the parsing failed, and an empty list of results is returned.
stringP :: String -> ParserStringstringP s = Parser f
where
f s' | s == s' = [("", s)]
| otherwise = []
Quite often, you need to skip characters that have a certain property while they are at the beginning of a line (for example, whitespace characters). At the same time, the result of the analysis is not important to us and will not be useful in the future. Let's write a function skip
that skips the initial characters of the string, while the true value of the predicate is preserved. As a result of the analysis we use an empty tuple.
skip :: (Char -> Bool) -> Parser ()
skip p = Parser (\s -> [(dropWhile p s, ())])
The following two parsers are very similar to each other. Both check the input string prefix, only the first returns the prefix if successful, and the second returns an empty tuple, i.e. allows you to skip an arbitrary string at the beginning of the entry. For implementation, the function isPrefixOf
defined in the module is used Data.List
.
prefixP :: String -> ParserStringprefixP s = Parser f
where
f input = if s `isPrefixOf` input
then [(drop (length s) input, s)]
else []
skipString :: String -> Parser ()
skipString s = Parser f
where
f input = if s `isPrefixOf` input
then [(drop (length s) input, ())]
else []
A little later, we will look at a simpler implementation of the last function and get rid of code duplication.
Parser as a functor
We can distinguish a whole class of container types for which the following is true: if you know how to transform objects inside a container, then you can convert the containers themselves. The simplest example is a list as a container and a function map
that exists in almost all high-level languages. Indeed, you can go through all the elements of a type list [a]
, apply a function to each a -> b
and get a type list [b]
.
This type class is called Functor
; the class has one method fmap
:
classFunctor f where
fmap :: (a -> b) -> f a -> f b
Suppose that we already know how to parse strings into objects of a certain type a
, and, moreover, we know how to convert objects of a type a
into objects of a type b
. Is it possible to say that then there is a parser for type objects b
?
If we express this in the form of a function, then it will have the following type:
(a -> b) -> Parser a -> Parser b
This type coincides with the type of the function fmap
, so we will try to make the parser a functor. Create a parser of type values from scratch b
that will first call the first parser (we already have one), and then apply the function to the results of its parsing.
instanceFunctorParserwhere
fmap :: (a -> b) -> Parser a -> Parser b
fmap f (Parser p1) = Parser p2
where
p2 :: String -> [(String, b)]
p2 s = convert (p1 s)
convert :: [(String, a)] -> [(String, b)]
convert results = map (\(s, val) -> (s, f val)) results
At the function fmap
there is a convenient infix synonym: fmap f x == f <$> x
.
If, as an argument, we fmap
use a function that simply replaces its first argument with a new value, we will get another useful operation, which has already been implemented for all functors even in two copies (they differ only in the order of the arguments):
(<$) :: Functor f => a -> f b -> f a
($>) :: Functor f => f a -> b -> f b
Remember the parser that skips a certain string ( skipString
)? Now you can implement it as follows:
skipString :: String -> Parser ()
skipString s = () <$ prefixP s
Parser Combinations
In Haskell, all functions are curried by default and allow partial use. This means that the function of the n
arguments is actually a function of one argument, which returns a function of the n-1
arguments:
cons :: Int -> [Int] -> [Int]
cons = (:)
cons1 :: [Int] -> [Int]
cons1 = cons 1-- функция cons применена частично
Apply the function of three arguments to some value inside the parser, using fmap
. Types will be as follows:
f :: c -> a -> b
p :: Parser c
(fmap f p) :: Parser (a -> b)
The parser of function has turned out ?! Of course, it is possible that the input line actually contains the function representation, but I would like to be able to use this function, or rather combine parsers, Parser (a -> b)
and Parser a
to get Parser b
:
applyP :: Parser (a -> b) -> Parser a -> Parser b
The type of this function is very similar to the type fmap
, only the function itself that needs to be applied is also in the container. This gives an intuitive understanding of how the implementation of the function should look like applyP
: get the function from the container (as a result of applying the first parser), get the values to which the function should be applied (the result of applying the second parser) and “pack” the values converted using this function back to container (create a new parser). In the implementation we will use list comprehension:
applyP :: Parser (a -> b) -> Parser a -> Parser b
applyP (Parser p1) (Parser p2) = Parser f
where f s = [ (sx, f x) | (sf, f) <- p1 s, -- p1 применяется к исходной строке
(sx, x) <- p2 sf] -- p2 применяется к строке, оставшейся после предыдущего разбора
There is a class Applicative
that has a method with the same prototype. The second class method is called pure
and is used to “wrap” or “lift” ( lift ) a value, including a functional one. In the case of implementation for the parser, the function pure
adds its argument to the result of the parser, without changing the input string.
classFunctor f => Applicative f where
pure :: a -> f a
(<*>) :: f (a -> b) -> f a -> f b
instanceApplicativeParserwhere
pure x = Parser (\s -> [(s, x)])
pf <*> px = Parser (\s -> [ (sx, f x) | (sf, f) <- unParser pf $ s,
(sx, x) <- unParser px $ sf])
The function applyP
is <*>
from the class Applicative
. Types belonging to this class are called applicative functors.
For applicative functors, two auxiliary functions are implemented that will be useful to us:
(*>) :: f a -> f b -> f b
(<*) :: f a -> f b -> f a
These functions perform two consecutive actions and return the result of only one of them. For parsers, they can be used, for example, in order to skip leading gaps before parsing the part of the line that carries the meaning.
By combining <$>
and <*>
creating very convenient designs. Consider the following data type:
dataMyStructType = MyStruct
{ field1 :: Type1
, field2 :: Type2
, field3 :: Type3
}
The value constructor MyStruct
is also a function, in this case it has a type Type1 -> Type2 -> Type3 -> MyStructType
. You can work with the constructor as with any other function. Suppose that parsers are already written for the structure field types:
parser1 :: ParserType1parser2 :: ParserType2parser3 :: ParserType3
Using the function, fmap
you can partially apply MyStruct
to the first of these parsers:
parserStruct' :: Parser (Type2 -> Type3 -> MyStructType)
parserStruct' = MyStruct <$> parser1
Let's try to continue to use the function, which is now "inside" the parser. For this you need to use <*>
:
parserStruct'' :: Parser (Type3 -> MyStructType)
parserStruct'' = parserStruct' <*> parser2
parserStruct :: ParserMyStructTypeparserStruct = parserStruct'' <*> parser3
As a result, we got a parser for the whole structure (of course, here we use the assumption that in the initial line of the presentation of its fields go in a row). The same can be done in one line:
parserStruct :: ParserMyStructTypeparserStruct = MyStruct <$> parser1 <*> parser2 <*> parser3
Such designs will often be found in the example of use.
Now suppose that we are trying to write a parser that parses simple arithmetic expressions in which integers and identifiers can be present as operands. Create a separate type for them Operand
:
dataOperand
= IntOpInt
| IdentOpString
If we can already parse integers and identifiers (for example, as in C), then we need one parser for operands that can parse one or the other. This parser is an alternative of the other two, so we need a function that can combine parsers so that the results of their work are combined. The result of the parser is a list, and the union of lists is their concatenation. We implement the function altP
combining two parsers:
altP :: Parser a -> Parser a -> Parser a
altP (Parser p1) (Parser p2) = Parser (\s -> p1 s ++ p2 s)
Then the operand parser can be implemented using this function (here it is assumed that parserInt
they are parserIdent
already described somewhere:
parserOperand :: ParserOperandparserOperand = altP parserIntOp parserIdentOp
where
parserIntOp = IntOp <$> parserInt
parserIdentOp = IdentOp <$> parserIdent
Of course, for the alternatives have already come up with a separate class, which is called Alternative
. It has another method empty
, describing a neutral element for an alternative operation. In our case, this is a parser that never parses anything, i.e. always returns an empty list of results. For the parser, the implementation of the class methods Alternative
looks like this:
classApplicative f => Alternative f where
empty :: f a
(<|>) :: f a -> f a -> f a
instanceAlternativeParserwhere
empty = Parser (const [])
px <|> py = Parser (\s -> unParser px s ++ unParser py s)
An operation <|>
is a function altP
, only in the infix notation, which is more convenient to use, combining several parsers in a row.
For all types in this class two functions are implemented, some
and many
types f a -> f [a]
. Each of them can be expressed through the other:
some v = (:) <$> v <*> many v
many v = some v <|> pure []
In terms of parsers, these functions allow you to parse data sequences, if you know how to parse one data element. If used, the some
sequence must be non-empty.
Usage example
Now we are ready to write our parser, for example, for simple arithmetic expressions with the following grammar:
expr ::= constExpr | binOpExpr | negExpr
const ::= int
int ::= digit{digit}
digit ::= '0' | ... | '9'
binOpExpr ::= '(' expr ' ' binOp ' ' expr ')'
binOp ::= '+' | '*'
negExpr ::= '-' expr
The expression consists of integer constants, unary minus and two infix binary operations: addition and multiplication. Brackets are required around an expression with a binary operation, the operation symbol is separated from the operands by exactly one space, leading and trailing spaces are not allowed.
Examples of correct expression writing:
"123"
"-(10 + 42)"
"(1 + ((2 + 3) * (4 + 5)))"
Examples of incorrect entries:
" 666 "
"2 + 3"
"(10 * 10)"
We declare the necessary data types (the expression itself and the binary operation):
dataExpr = ConstExprInt
| BinaryExprExprOperatorExpr
| NegateExprExprdataOperator = Add | Mul
You can start parsing! The expression itself consists of three alternatives. So we write:
-- expr ::= constExpr | binOpExpr | negExprexprParser :: ParserExprexprParser = constParser <|> binParser <|> negParser
The constant is a positive integer. In our data type, it is "wrapped" in the constructor, so we cannot use the parser for an integer directly, but we can use fmap
it to get the value of the desired type.
-- const ::= intconstParser :: ParserExprconstParser = ConstExpr <$> intParser
The integer, according to the grammar, is represented as a non-empty sequence of numbers. To parse a single digit, we use an auxiliary function predP
and a predicate isDigit
from a module Data.Char
. Now, to build a parser to parse a sequence of numbers, we use the function some
(not many
, because there must be at least one digit). The result of this parser returns a list of all possible parsing options, starting with the longest entry. For example, if the input string "123ab", the results list will be as follows: [("ab", "123"), ("3ab", "12"), ("23ab", "1")]
. We need to parse the longest sequence of numbers and convert it to a type Int
. The whole implementation is as follows:
-- int ::= digit{digit}-- digit ::= '0' | ... | '9'intParser :: ParserIntintParser = Parser $ \s -> let res = unParser (some digitParser) s incase res of
[] -> []
((rest, i) : xs) -> [(rest, read i)]
where
digitParser = predP isDigit
The next variant of the expression is the use of a binary operation. According to the grammar, the input string must first include an opening bracket, the first operand, a space, an operation symbol, another space, the second operand, and a closing bracket. To parse individual characters (brackets and spaces) use the function charP
. Operands are expressions, and there is already a parser to parse them ( exprParser
). To parse the binary operation symbol, we will describe the auxiliary parser just below. It remains to carefully combine this set of parsers. At the beginning and at the end of the expression there should be brackets: you need to check this, but discard the result itself. For this we use *>
and <*
:
binParser :: ParserExprbinParser = charP '(' *> ??? <* charP ')'
Between these parsers for brackets, the construction of the expression with the help of the constructor BinaryExpr
and the parsers for the expression and operation should occur . Let's not forget about the spaces around the operation symbol, using the same method as for parentheses. This part is as follows:
BinaryExpr <$> exprParser -- первый операнд
<*> (charP ' ' *> binOpParser <* charP ' ') -- операция, окружённая пробелами
<*> exprParser -- второй операнд
Substitute this expression instead of question marks:
-- binOpExpr ::= '(' expr ' ' binOp ' ' expr ')'binParser :: ParserExprbinParser =
charP '(' *>
(BinaryExpr <$> exprParser
<*> (charP ' ' *> binOpParser <* charP ' ')
<*> exprParser
)
<* charP ')'
A binary operation is either a symbol +
that understands the value Add
, or *
that understands Mul
:
-- binOp ::= '+' | '*'binOpParser :: ParserOperatorbinOpParser = plusParser <|> multParser
where
plusParser = charP '+' $> Add
multParser = charP '*' $> Mul
Remained the simplest part of the grammar, the negation of expression. We -
do the same with the symbol as with brackets and spaces. Next, apply the constructor NegateExpr
to the result of recursive parsing:
-- negExpr ::= '-' exprnegParser = charP '-' *> (NegateExpr <$> exprParser)
So, all parts of the parser are implemented. The code in many ways resembles a grammar and completely coincides with it in structure.
The source code is available on GitLab: https://gitlab.com/fierce-katie/applicative-parsers-demo .
It is easier to estimate its volume and degree of expressiveness there, since there are far fewer comments. The project can be assembled with the Stack utility and you can launch a primitive interpreter using the parser we wrote:
$ stack build
$ stack exec demo-parser
For those who want to practice further on their own, I can advise the following:
- Grammar can be improved in every way, for example, to allow leading and trailing spaces, add new operations, etc.
- The parser translates the string into the internal representation of the expression. This expression can be calculated and converted by the interpreter so that it prints not the result of the parsing but the result of the calculation.
- To explore the possibilities of libraries
parsec
,attoparsec
,applicative-parsec
andoptparse-applicative
try to apply them.
Thanks for attention!