Monadic Parser Combinator in Nemerle
Recently I discovered a great Monadic Parser Combinator article about creating parsers. The main idea is to describe the parser as an object of the first class, so that it would be possible to decompose the original problem and present the desired parser in the form of a synthesis of simpler parsers. To facilitate the combination of parsers, the replace error with list of success approach is used. In the haskell language, it is very convenient to work with the creation of parsers, since this task naturally falls on the concept of monads, unfortunately monads are not supported in nemerle, but nevertheless the language is powerful enough to handle this task.
The parser can be represented in more detail as a mapping from a string to a list of tuples, where the first element of the tuple is the parsed part of the string and the second element is the rest of the string. In case of unsuccessful parsing, an empty list is returned. In different cases, what I called the parsed part of the string can have a different type, so it is natural to consider the parser type as polymorphic, for example, string -> list ['a * string]. Below I will designate such construction as parser ['a].
Next you need to consider a combination of parsers. Suppose you need to parse and calculate "sin42", and we already have a parser that can parse mathematical functions - math: parser [double-> double], and a parser that can parse numbers - number: parser [double] (through: the type of parsers is described )
The desired parser can be written as follows:
It turned out quite capacious, but only the red-eyed fan of FP can agree that this is a beautiful code. Below I will define an operator on parsers and this code can be rewritten in the following form:
In simple terms, the action of the '*' operator can be described as creating a new parser that acts as follows: it applies the first parser, then the second parser parameterizes the parsed result of the first and continues parsing. The input operator '*' has a parser ['a] and a function that performs parameterization, such as' a-> parser [' b].
Below, for an example of the simplicity and power of this approach, I wrote a parser that checks the correctness of arithmetic expressions consisting of positive numbers, brackets, and the "+" operation. The code is written in nemerle, but I will try to comment on the syntax, try to understand the semantics yourself, but carefully - do not break your brain! And do not be alarmed at the size of the code, since half of it can be moved to a separate library for creating parsers, in the case of haskell, the library based on this approach is parsec. Go!
A tuple type, in which the first element is of type int, and the second string, in nemerle is written as int * string, the tuple itself, if you have not written about it yet, is written, for example, like this: def tuple = (45, "qwerty"). Now, I hope the list ['a * string] entry is clear.
@ * is not a mate, as my friend thought, but a synonym for operator * in C #. Another syntax feature that you have not seen on this blog yet is fun. This is nothing more than a lambda expression, the other possible syntax for a lambda matches the syntax from C #, for example, the following code examples are equivalent: fun (x) {x + 1} and x => x + 1; optionally you can specify the type, for example, fun (x: int): int {x + 1} or x: int => x + 1. In the case where the lambda can be written and one line I use the second option, otherwise the first.
The operation + for lists is defined in the nonmerl, in the code above it was used inside the body of the lambda.
In non-Merle, as in Java, the call to the base class constructor occurs inside the subclass constructor. The two classes that I defined are fundamental parsers, and many other parsers and parser generators are based on them. The result class can be compared with the monad constructor in haskell. I called a parser generator a function that returns a parser, for example, the symbol (x: char) generator is defined below: Parser [char], which creates a parser that expects the specified symbol at the beginning of the parsing line.
Below I define three parser combinators. Since all this code is written inside a module, which is a static class, I can easily declare static methods of this class (the static keyword is not necessary in this case, since this is a module) and use them as global functions. I called a parser combinator a function that takes a parser or parsers to the input, and returns a parser at the output. The trivial combinator is the identity function, but it is not interesting, so I defined the combinators many, many1 and oneOf. The first one repeats the parser that has input, while the line that it parses allows it (analog * in regular expressions), the second is analog + in the regular expression, and the third tries to apply the parsers passed to it sequentially,
In the last generator, the lambda expression body uses the abbreviated syntax for pattern matching. It can be used when the arguments of the function participate in pattern matching and the body of the function consists of one pattern matching expression. Let me give you equivalent code examples: def foo (x) {match (x) {| [] => "Empty" | _ => "! empty"}} and def foo (x) {| [] => "Empty" | _ => "! empty"}.
Below is the already significant code that the checking parser creates. All the code before this can be taken to the library, since it is the most generalized and not tied to a specific parser.
This code should already be understandable (syntactically), the only thing I missed was not explaining what the word def does. This word comes from the English define what it means to declare. This is exactly what it does: it allows you to give a name to a value and declare local functions (relative to the class method and other local functions). The first one is similar to the behavior of the word var in C #, only in non-measure it is no longer possible to change the value of the name, it is easier to understand this using an example:
One more note to the parser code: expr is declared as a local function, since its description refers to itself.
PS I hope I was able to adequately retell the ideas of the article and better introduce readers to the nemerle language.
The parser can be represented in more detail as a mapping from a string to a list of tuples, where the first element of the tuple is the parsed part of the string and the second element is the rest of the string. In case of unsuccessful parsing, an empty list is returned. In different cases, what I called the parsed part of the string can have a different type, so it is natural to consider the parser type as polymorphic, for example, string -> list ['a * string]. Below I will designate such construction as parser ['a].
Next you need to consider a combination of parsers. Suppose you need to parse and calculate "sin42", and we already have a parser that can parse mathematical functions - math: parser [double-> double], and a parser that can parse numbers - number: parser [double] (through: the type of parsers is described )
The desired parser can be written as follows:
def eval = text => math (text) .Map ((f, rest) => number (rest) .Map ((a, b) => (f (a), b))). Fold ([], (e, acc) => e + acc)
It turned out quite capacious, but only the red-eyed fan of FP can agree that this is a beautiful code. Below I will define an operator on parsers and this code can be rewritten in the following form:
def eval = math * (f => number * (x => result (f (x))))
In simple terms, the action of the '*' operator can be described as creating a new parser that acts as follows: it applies the first parser, then the second parser parameterizes the parsed result of the first and continues parsing. The input operator '*' has a parser ['a] and a function that performs parameterization, such as' a-> parser [' b].
Below, for an example of the simplicity and power of this approach, I wrote a parser that checks the correctness of arithmetic expressions consisting of positive numbers, brackets, and the "+" operation. The code is written in nemerle, but I will try to comment on the syntax, try to understand the semantics yourself, but carefully - do not break your brain! And do not be alarmed at the size of the code, since half of it can be moved to a separate library for creating parsers, in the case of haskell, the library based on this approach is parsec. Go!
using System;
using System.Console;
I connect the necessary namespaces and expand the System.Console class - from now all of its static methods become global functions / procedures. About the module, functional types instead of delegates, the lack of new before calling the constructor and the description of the constructor through the word this, I probably wrote earlier. I only note now that the generic parameters in nemerle are specified through square brackets, and not through angle brackets, that the type name can contain apostrophes and that in nemerl it is considered good practice to give generic parameters names starting with an apostrophe.module Program
{
class Parser['a]
{
public this(core : string -> list['a*string]) { this.core = core }
core : string -> list['a*string];
A tuple type, in which the first element is of type int, and the second string, in nemerle is written as int * string, the tuple itself, if you have not written about it yet, is written, for example, like this: def tuple = (45, "qwerty"). Now, I hope the list ['a * string] entry is clear.
public Parse(text : string) : list['a*string] { core(text) }
public static @*['b](self : Parser['a], parser : 'a->Parser['b]) : Parser['b]
{
Parser(fun(text : string) : list['b*string]
{
def data = self.Parse(text);
data.Map((ast,rest) => parser(ast).Parse(rest)).FoldRight([], (e,acc) => e+acc)
})
}
@ * is not a mate, as my friend thought, but a synonym for operator * in C #. Another syntax feature that you have not seen on this blog yet is fun. This is nothing more than a lambda expression, the other possible syntax for a lambda matches the syntax from C #, for example, the following code examples are equivalent: fun (x) {x + 1} and x => x + 1; optionally you can specify the type, for example, fun (x: int): int {x + 1} or x: int => x + 1. In the case where the lambda can be written and one line I use the second option, otherwise the first.
public static @+(a : Parser['a], b : Parser['a]) : Parser['a]
{
Parser(text => a.Parse(text) + b.Parse(text) )
}
}
The operation + for lists is defined in the nonmerl, in the code above it was used inside the body of the lambda.
class zero['a] : Parser['a] { public this() { base(x=>[]) } }
class result['a] : Parser['a]
{
public this(value : 'a) { base(text => [(value,text)]) }
}
In non-Merle, as in Java, the call to the base class constructor occurs inside the subclass constructor. The two classes that I defined are fundamental parsers, and many other parsers and parser generators are based on them. The result class can be compared with the monad constructor in haskell. I called a parser generator a function that returns a parser, for example, the symbol (x: char) generator is defined below: Parser [char], which creates a parser that expects the specified symbol at the beginning of the parsing line.
Below I define three parser combinators. Since all this code is written inside a module, which is a static class, I can easily declare static methods of this class (the static keyword is not necessary in this case, since this is a module) and use them as global functions. I called a parser combinator a function that takes a parser or parsers to the input, and returns a parser at the output. The trivial combinator is the identity function, but it is not interesting, so I defined the combinators many, many1 and oneOf. The first one repeats the parser that has input, while the line that it parses allows it (analog * in regular expressions), the second is analog + in the regular expression, and the third tries to apply the parsers passed to it sequentially,
many['a] (parser : Parser['a]) : Parser[list['a]]
{
parser * (x => many(parser) * (xs => result(x::xs))) + result([])
}
many1['a] (parser : Parser['a]) : Parser[list['a]]
{
parser * (x => many(parser) * (xs => result(x::xs)))
}
oneOf['a] (params parsers : array[Parser['a]]) : Parser['a]
{
Parser(fun(text){
parsers.FoldLeft([], fun(e,acc)
{
| (_, []) => e.Parse(text)
| (_, _) => acc
})
})
}
In the last generator, the lambda expression body uses the abbreviated syntax for pattern matching. It can be used when the arguments of the function participate in pattern matching and the body of the function consists of one pattern matching expression. Let me give you equivalent code examples: def foo (x) {match (x) {| [] => "Empty" | _ => "! empty"}} and def foo (x) {| [] => "Empty" | _ => "! empty"}.
Attention!
Below is the already significant code that the checking parser creates. All the code before this can be taken to the library, since it is the most generalized and not tied to a specific parser.
Main() : void
{
def item = Parser(x => if (x.Length==0) [] else [(x[0],x.Substring(1))] );
def sat(predicate) { item * (x => if (predicate(x)) result(x) else zero()) }
def symbol(x : char) { sat(y => x==y) }
def digit = sat(x => '0' <= x && x <= '9');
def number = many1(digit);
def expr()
{
def parentheses =
symbol('(') * (_ =>
expr() * ( _=>
symbol(')') * (_ => result([]))
)
);
def trivial = oneOf(number, parentheses);
def mul =
trivial * ( _ =>
many1(symbol('*') * ( _ => trivial)) * (_ => result([]))
);
def sarg = oneOf(mul, trivial);
def sum =
sarg * ( _ =>
many1(symbol('+') * ( _ => sarg)) * (_ => result([]))
);
oneOf(sum,mul,trivial)
}
WriteLine(expr().Parse("5+42*(8+1)"));
_ = ReadLine();
}
}
This code should already be understandable (syntactically), the only thing I missed was not explaining what the word def does. This word comes from the English define what it means to declare. This is exactly what it does: it allows you to give a name to a value and declare local functions (relative to the class method and other local functions). The first one is similar to the behavior of the word var in C #, only in non-measure it is no longer possible to change the value of the name, it is easier to understand this using an example:
def x = 1; // correct
x = x + 1; // compilation error
def x = x + 1; // compilation was successful
mutable y = 1; // mutable - the full analogue of var in the language C #
y = y +1; // all is well.
One more note to the parser code: expr is declared as a local function, since its description refers to itself.
PS I hope I was able to adequately retell the ideas of the article and better introduce readers to the nemerle language.