About Parboiled (Part 2)

Part 2. Matching text

In this second part of the series, we will cover the basic rules for matching characters in Parboiled. We will not touch on every rule there is (the documentation exists for that); I just want you to feel confident with the basic rule syntax used in Parboiled.

To consolidate this knowledge, we will write a simple recognizer for a simple grammar. It is a recognizer, not a full-fledged parser, since it will only match the input text against the rules we describe (also called productions) and will not extract any values from the matched text. A recognizer can be useful on its own, since it can work as a validator: if the input turns out to be invalid, the recognizer will let you know and tell you what went wrong and where. And our recognizer will become really impressive once we learn how to extract the parsed values and find out what some "value stack" has to do with it. Shall we begin?

Preparatory work


Before you start working with the library, add it to the classpath. In Maven, for example, this is done like this:

<dependency>
  <groupId>org.parboiled</groupId>
  <artifactId>parboiled_2.11</artifactId>
  <version>2.1.0</version>
</dependency>

I use Scala 2.11, but there are artifacts for 2.10 as well.

Rule Description Language (Rule DSL)


All Parboiled functionality is implemented on top of Scala syntax as a specialized DSL. A parser description is therefore nothing more than the declaration of a class derived from org.parboiled2.Parser. As an example, let's write a parser that does nothing, which does not prevent it from existing and enjoying life:

import org.parboiled2._
class MyParser(val input: ParserInput) extends Parser {
  // Your grammar is described here
}

DSL constructs and a number of useful classes are brought into scope with this single import directive. Note that the input parameter in the constructor is mandatory: it means that for each new set of input data you need to create a new parser object. At first this really scared me, but I stopped being afraid when I saw how fast it works.

Rules for individual characters


So, now that we have a do-nothing parser, we need to add a few rules telling it how to process its input. If you have worked with Parboiled1, feel free to skim this section, as my explanations may seem overly detailed to you.

Let's start with terminals. This term will come up again later, so let's try to give it a definition here (not an entirely strict one, admittedly):

A terminal is a simple atomic rule that does not require additional definitions.

Let's describe two simple rules: the first should recognize a single character known in advance, the second a string:

def MyCharRule   = rule { ch('a') }
def MyStringRule = rule { str("string") }

Spelling out your intentions this way every time is tiresome. Here the implicit conversion mechanism comes to our aid, allowing us to write the rules more concisely:

def MyCharRule   = rule { 'a' }
def MyStringRule = rule { "string" }

Strings are case-sensitive. However, there are many case-insensitive languages (such as SQL). For them there is the ignoreCase rule, which matches the input string regardless of its case. The string passed to it must be lowercase:

def StringWithCaseIgnored = rule { ignoreCase("string") }
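A quick check of this rule in action; a minimal sketch, assuming the parboiled2 artifact is on the classpath (the class and object names here are made up for illustration):

```scala
import org.parboiled2._

class CaseDemoParser(val input: ParserInput) extends Parser {
  // EOI is appended so that only the exact word is accepted
  def StringWithCaseIgnored: Rule0 = rule { ignoreCase("string") ~ EOI }
}

object CaseDemo extends App {
  // ignoreCase matches the input regardless of its case
  println(new CaseDemoParser("STRING").StringWithCaseIgnored.run().isSuccess) // true
  println(new CaseDemoParser("sTrInG").StringWithCaseIgnored.run().isSuccess) // true
  println(new CaseDemoParser("strong").StringWithCaseIgnored.run().isSuccess) // false
}
```

Note that a fresh parser instance is created for every input string, as mentioned above.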

More will be said about rules (or "productions", if you prefer) in the next article. All of the rules above (and below) are of type Rule0. There are several types of rules, but for now we only need to know that Rule0 means the rule matches the input and reports whether the match succeeded or not. We did not specify the type because the language's type inference handles it easily on its own. However, nothing prevents us from specifying the type explicitly:

def StringWithCaseIgnored: Rule0 = rule { ignoreCase("string") }

Parboiled also has some special terminals:

  • ANY - any character except EOI.
  • EOI (End of Input) - a virtual marker character for the end of the input, which you will definitely want to append to the main rule of your parser. EOI is defined as follows:

    val EOI = '\uFFFF'

Although the U+FFFF code point is reserved for internal use by the Unicode standard, in practice it can easily occur in user input and change the parser's behavior. So be careful with the text that reaches the input.

Moreover, if you do not add EOI at the end of your main rule and the match goes wrong, you will never find out, because the parser will assume the input simply has not ended yet and will wait for more data to arrive. So no matter what you feed it, a meaningless Success awaits you at the output.
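This effect is easy to demonstrate; a sketch, assuming the parboiled2 dependency (rule and class names are illustrative):

```scala
import org.parboiled2._

class GreedyDemo(val input: ParserInput) extends Parser {
  // Without EOI: silently accepts any input that merely starts with "match"
  def WithoutEoi: Rule0 = rule { "match" }
  // With EOI: demands that the input end right after "match"
  def WithEoi: Rule0 = rule { "match" ~ EOI }
}

object GreedyDemoApp extends App {
  println(new GreedyDemo("match, then garbage").WithoutEoi.run().isSuccess) // true: only the prefix was checked
  println(new GreedyDemo("match, then garbage").WithEoi.run().isSuccess)    // false: EOI was not reached
}
```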

You can hardly build a useful parser out of ch and str alone, so the first step toward understanding is the ability to define a range of valid characters. In Parboiled2 this is done very easily:

def Digit      = rule { '0' - '9' }
def AlphaLower = rule { 'a' - 'z' }

Both of these rules will match at most one character from the range (or none at all). Although writing these two rules in PB2 is very simple, there is no need to: they are already defined in the CharPredicate object. Parboiled1, by contrast, forced you to create such rules by hand almost every time you wrote another parser, so I carried my own library of primitives from project to project (and I am sure I was not the only one). With the appearance of CharPredicate, my library has slimmed down noticeably. It includes, for example, the following predicates (I think the names make it clear which character categories they correspond to):

  • CharPredicate.All (works almost the same as ANY, but shows worse performance over large character ranges);
  • CharPredicate.Digit;
  • CharPredicate.Digit19;
  • CharPredicate.HexDigit and many other rules.

If the existing predicates do not suit you, you can always define your own character predicate using the from method:

CharPredicate from (_.isSpaceChar)
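Such a custom predicate can be used inside a rule just like the built-in ones; a sketch (the names SpaceChar and Spaces are made up for illustration):

```scala
import org.parboiled2._

class PredicateDemo(val input: ParserInput) extends Parser {
  // A predicate built with `from`: matches Unicode space characters
  val SpaceChar = CharPredicate.from(_.isSpaceChar)
  // A single-character predicate is implicitly usable as a rule
  def Spaces: Rule0 = rule { oneOrMore(SpaceChar) ~ EOI }
}

object PredicateDemoApp extends App {
  println(new PredicateDemo("   ").Spaces.run().isSuccess) // true
  println(new PredicateDemo(" x ").Spaces.run().isSuccess) // false
}
```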

In addition, the except (--) and union (++) operators are defined for character predicates; these did not exist in PB1. Personally, I suffered a lot from their absence: I had to approach each rule "from the other side", listing an entire black or white list of characters depending on the situation. The -- operator can also be called a difference, since it plays the same role as the difference of two sets.

// Matches any printable character that is not a quote.
def AllButQuotes = rule { CharPredicate.Visible -- "\"" -- "'" }
// Works nicely for defining an identifier. Notice how
// AlphaNum is united with the underscore.
def ValidIdentifier = rule {
  CharPredicate.Alpha ~ zeroOrMore(CharPredicate.AlphaNum ++ "_") }

Two more rules are worth knowing: anyOf and noneOf. They are very similar to except and union, but operate on the entire ANY character space. Most importantly, within this space they work faster. These functions take a string that enumerates the characters. For instance:

// Determines whether a character is one of the arithmetic operators.
def ArithmeticOperation = rule { anyOf("+-*/^") }
// Matches anything except the listed whitespace characters and EOI.
def WhiteSpaceChar = rule { noneOf(" \t\n") }

Sometimes the question arises: which should you choose, anyOf/noneOf or CharPredicate? A predefined character predicate will work faster for 7-bit ASCII characters. "Predefined" is written here for a reason, and the "Best Practices" section in Part 4 will explain why. Over very large character ranges, however, CharPredicate behaves frankly badly, and that is when anyOf and noneOf should come to the rescue.

Rule chains


N.times


Matching single characters is not that interesting, so let's move on to more complex rules. We'll start with times, which matches one rule several times in a row. The number of repetitions must be exact and known in advance.

def BartLearningParboiled = rule {
  100 times "I will never write a parser again. "
}

Some grammars require a bounded range of repetitions, for example, two to five. In the new Parboiled this is easily arranged:

def FutureOfCxx = rule { 'C' ~ (2 to 5).times('+') }
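A quick check of the bounded repetition; a sketch, assuming the parboiled2 dependency (class names are illustrative):

```scala
import org.parboiled2._

class TimesDemo(val input: ParserInput) extends Parser {
  // 'C' followed by two to five '+' characters, and nothing else
  def FutureOfCxx: Rule0 = rule { 'C' ~ (2 to 5).times('+') ~ EOI }
}

object TimesDemoApp extends App {
  println(new TimesDemo("C++").FutureOfCxx.run().isSuccess)     // true: two pluses
  println(new TimesDemo("C+++++").FutureOfCxx.run().isSuccess)  // true: five pluses
  println(new TimesDemo("C+").FutureOfCxx.run().isSuccess)      // false: only one
  println(new TimesDemo("C++++++").FutureOfCxx.run().isSuccess) // false: six is too many
}
```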

In the old one, there is an nTimes rule that requires the exact number of repetitions. If the exact number of repetitions is not known in advance, the following pair of rules will help you.

zeroOrMore


As you probably guessed from the name, zeroOrMore matches a sequence of zero or more occurrences of the given rule. The attentive reader has already spotted this rule in the examples, and it most likely looked familiar: in regular expressions the very same operation is denoted by an asterisk, and lovers of academic terminology also know it as the Kleene star. In any case, using this rule is very simple:

def Whitespace = rule { anyOf(" \n\t") }
def OptWs      = rule { zeroOrMore(Whitespace) }

oneOrMore


A rule similar to the previous one. It does almost the same thing as zeroOrMore, but requires at least one occurrence to be present in the input. It is the counterpart of the Kleene plus from regular grammars.

def UnsignedInteger = rule { oneOrMore(CharPredicate.Digit) }

Chain Separator: separatedBy


You often have to deal with a series of elements written one after another through some separator: CSV files, list and array definitions, comma-separated function arguments, and much more. In Parboiled2, parsing such sequences takes no effort at all:

def CommaSeparatedNumbers = rule { oneOrMore(UnsignedInteger).separatedBy(",") }

However, the first version uses a less elegant syntax for this:

def CommaSeparatedNumbers = rule { oneOrMore(UnsignedInteger, separator = ",") }
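Here is how the Parboiled2 variant behaves on a few inputs; a sketch, assuming the parboiled2 dependency:

```scala
import org.parboiled2._

class CsvDemo(val input: ParserInput) extends Parser {
  def UnsignedInteger: Rule0 = rule { oneOrMore(CharPredicate.Digit) }
  def CommaSeparatedNumbers: Rule0 = rule {
    oneOrMore(UnsignedInteger).separatedBy(",") ~ EOI
  }
}

object CsvDemoApp extends App {
  println(new CsvDemo("1,20,300").CommaSeparatedNumbers.run().isSuccess) // true
  println(new CsvDemo("1,,300").CommaSeparatedNumbers.run().isSuccess)   // false: empty element
  println(new CsvDemo("1,20,").CommaSeparatedNumbers.run().isSuccess)    // false: trailing separator
}
```

Note that a separator matched without a following element is backtracked, so a trailing comma makes the overall match fail at EOI.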

Sequence Operator (~)


The ~ operator is used to denote a sequence of rules. Regular expressions do not need such an operator: there, sequencing is written down directly, just as in some variants of BNF. As an example, let's write an (extremely simplified) rule that matches a date in a certain format:

import CharPredicate.Digit
// The date must have the following format: "yyyy-mm-dd"
def SimplifiedRuleForDate = rule { Year ~ "-" ~ Month ~ "-" ~ Day }
def Year  = rule { Digit ~ Digit ~ Digit ~ Digit }
def Month = rule { Digit ~ Digit }
def Day   = rule { Digit ~ Digit }


As you can see, the rule is simplified to the limit, and I am well aware that it admits 99 days and 99 months. It makes no sense to push every check down to the parser level: we will pass the matched string to some date-and-time class anyway, which will take care of validation and return the result wrapped in an Option. This, in turn, keeps the grammar simple. Trying to force the parser to perform every possible and impossible check often leads to unfortunate results.
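Running the date rule confirms that it checks only the shape of the input, not its semantics; a sketch, assuming the parboiled2 dependency:

```scala
import org.parboiled2._

class DateRecognizer(val input: ParserInput) extends Parser {
  import CharPredicate.Digit
  def SimplifiedRuleForDate: Rule0 = rule { Year ~ "-" ~ Month ~ "-" ~ Day ~ EOI }
  def Year: Rule0  = rule { Digit ~ Digit ~ Digit ~ Digit }
  def Month: Rule0 = rule { Digit ~ Digit }
  def Day: Rule0   = rule { Digit ~ Digit }
}

object DateDemo extends App {
  println(new DateRecognizer("2015-03-27").SimplifiedRuleForDate.run().isSuccess) // true
  println(new DateRecognizer("2015-99-99").SimplifiedRuleForDate.run().isSuccess) // true: shape only!
  println(new DateRecognizer("2015-3-27").SimplifiedRuleForDate.run().isSuccess)  // false: month needs two digits
}
```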

"Optional" rule (optional)


If there were a rule called zeroOrOne, it would be optional: either there is exactly one occurrence, or there are none at all. Consider the following example: different families of operating systems encode the end-of-line marker differently. On Unix-like systems a single \n character suffices, while Windows has historically used a sequence of two characters: \r and \n. If we want to process text created on any of these systems, we can use the following rule for the end of a line:

def Newline = rule { optional('\r') ~ '\n' }

Ordered Choice (|)


The analogue of the | operator in regular expressions, called an ordered choice for a reason. Suppose we need to recognize a number that may or may not have a sign. The sign, if present, can be one of two kinds, positive or negative, so let's deal with it first:

def Signum = rule { '+' | '-' }

A positive number may be written without any sign at all:

def MaybeSign = rule { optional(Signum) }

Then the number itself, in any case, is the sequence of a possible sign occurrence and its modulus, the unsigned number:

def Integer = rule { MaybeSign ~ UnsignedInteger }
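Putting the three rules together in one parser; a sketch, assuming the parboiled2 dependency:

```scala
import org.parboiled2._

class IntRecognizer(val input: ParserInput) extends Parser {
  def Signum: Rule0          = rule { '+' | '-' }
  def MaybeSign: Rule0       = rule { optional(Signum) }
  def UnsignedInteger: Rule0 = rule { oneOrMore(CharPredicate.Digit) }
  def Integer: Rule0         = rule { MaybeSign ~ UnsignedInteger ~ EOI }
}

object IntDemo extends App {
  println(new IntRecognizer("-42").Integer.run().isSuccess)  // true
  println(new IntRecognizer("42").Integer.run().isSuccess)   // true
  println(new IntRecognizer("+-42").Integer.run().isSuccess) // false: at most one sign
}
```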

The order of the alternatives in the Signum rule matters: the first alternative that fits is selected, which eliminates any possibility of ambiguity in the grammar. And yes, this is how all PEG parsers work, without exception. So, if you need to parse an expression in C, you must begin the enumeration with the longest operators so that they are tried first, as the standard requires. A simplified rule might look, for example, like this:

def Operator = rule {
  "+=" | "-=" | "*=" | "/=" | "%=" | "&=" | "^=" | "|=" | "<<=" | ">>=" |
  "<<" | ">>" | "<=" | ">=" | "==" | "!=" |
  "||" | "&&" | "->" | "++" | "--" |
  "<"  | ">"  | "+"  | "-"  | "&"  | "|" | "." |
  "*"  | "/"  | "!"  | "~"  | "^"  | "=" | ","
}

The exact order can vary, but you need to make sure that + always comes after += and ++, and < after <= and << (and <<, in turn, after <<=). Otherwise the compound assignment operator <<= may end up parsed as the sequence [<<, =], or even as [<, <, =].
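The effect of ordering can be demonstrated directly. A PEG choice commits to the first alternative that succeeds and never reconsiders it, so putting "<" first breaks "<<="; a sketch (rule names are made up):

```scala
import org.parboiled2._

class OpOrderDemo(val input: ParserInput) extends Parser {
  // Wrong: "<" succeeds first and the choice commits to it,
  // so the remaining "<=" can never satisfy EOI
  def BadShift: Rule0  = rule { ("<" | "<<" | "<<=") ~ EOI }
  // Right: longest alternatives come first
  def GoodShift: Rule0 = rule { ("<<=" | "<<" | "<") ~ EOI }
}

object OpOrderApp extends App {
  println(new OpOrderDemo("<<=").GoodShift.run().isSuccess) // true
  println(new OpOrderDemo("<<=").BadShift.run().isSuccess)  // false
}
```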

If the choice rule becomes excessively complex and we do not want to rely on the order of its alternatives, we can group them by common prefixes (that is, factor the grammar):

def Operators = rule {
  ("+" ~ optional("=" | "+")) |
  ("<" ~ optional("=" | ("<" ~ optional("=")))) | ...
}

Note, however, that none of our examples will automatically account for operator precedence; for that we will have to resort to more sophisticated rules.

A little sugar


For optional, oneOrMore and zeroOrMore there is syntactic sugar that makes the definitions even shorter: .?, .+ and .*. Please use it wisely: if you abuse it, your rules will read little better than regular expressions. Using these shortcuts, we can make our rule descriptions less verbose:

import CharPredicate.Digit
def SignedInteger   = rule { ("+" | "-").? ~ Digit.+ }
def Newline = rule { '\r'.? ~ '\n' }
def OptWs = rule { Whitespace.* }


Launch parser


To make the parser we have written do anything useful at all, you need to call the run method on its main (root) rule. If you are writing unit tests for the parser, it may make sense to call this method on other rules as well. The parentheses after the method are required.

Let's teach our useless parser to match a single string constant. Our parser is then defined as follows (do not forget about EOI):

import org.parboiled2._
class MyParser(val input: ParserInput) extends Parser {
  def MyStringRule: Rule0 = rule { ignoreCase("match") ~ EOI }
}

Now somewhere in another place we will create several instances of parsers and give them different data as input:

val p1 = new MyParser("match")
val p2 = new MyParser("Match")
val p3 = new MyParser("much")
// by default these return scala.util.Try
p1.MyStringRule.run()      // Success
p2.MyStringRule.run()      // Success
p3.MyStringRule.run()      // Failure

Running rules in Parboiled2 is much simpler than in Parboiled1, which has a whole zoo of parser runners that must be invoked separately. For more information, see the "Error Reporting" section in Part 4.

Nested Data Structures


Parsing recursive structures is something Parboiled can do and regular expressions cannot. In Parboiled it comes out simply and naturally, as we will demonstrate in the following examples. The only extra effort required of you is to explicitly declare the type of any rule involved in the recursion.

Parsing recursive structures is usually illustrated with an arithmetic expression calculator. In my opinion, that example is thoroughly worn out. Instead, we will consider a fictional configuration file format consisting of named blocks that contain key-value pairs.

BKV format (Block-Key-Value)


As an example, we will use the "BKV" format, invented specifically for this tutorial. It was inspired by the HOCON format and is, in fact, a subset of it. BKV consists of key-value pairs and of blocks within which pairs can be nested. It looks something like this:

server.name = "webserver"
server {
  port    = "8080"
  address = "192.168.88.88"
  settings {
    greeting_message = "Hello!\n It's me!"
  }
}

As you can see, the format is simple and straightforward, although string escaping may scare those who have never written parsers. Escaping comes up all the time in parsing, so we will certainly look at it in more detail.

Escaped strings


To avoid problems with whitespace and unprintable characters during parsing, most grammars enclose strings in double or single quotes (or something similar; for example, opening and closing angle brackets can be used). Unprintable characters and quotation marks are escaped.

In order to write a recognizer of escaped strings, you need to decide on the following syntax elements:

  • The characters that open and close the string (in our case, this is the same character - a double quote).
  • The escape character (in our case, it is the backslash character).
  • A set of mnemonics for unprintable characters (we will support, at a minimum, '\n', '\t' and '\v').

First, let's try to describe a rule for a quoted string without escaping:

def OverlySimplifiedQuotedString = rule {
  '"' ~ zeroOrMore(AllowedChar) ~ '"'
}

Since empty strings are also possible, we put zeroOrMore between the quotation marks. Obviously, the double quote itself is not on the list of valid characters. What, then, is allowed? Everything that is not forbidden. So for our case the list of allowed characters looks like this:

def AllowedChar = rule { noneOf("\"") }

You can live without the double quote, but it is hard. And what happens if a quotation mark does appear inside the string? On meeting it, the parser will decide that the string has ended and blow up with an error message at the very next character.

The escape character warns the parser that the next character is special. The scheme looks like this: the parser expects either one of the allowed characters or an escape sequence, and an escape sequence consists of the escape character followed by an ordered choice over the escapable characters:

def AllowedChar = rule {
  noneOf("\"\\") | EscapeSequence
}
// Supported sequences: \", \\, \n, \a, \f, \v.
def EscapeSequence = rule {
  '\\' ~ anyOf("\"\\nafv")
}

Having figured out how this works, we can proceed to the final version of the escaping rules. For this, I propose extracting them into a dedicated trait:

import org.parboiled2._
object QuotedStringSupport {
  val CharsToBeEscaped = "abfnrtv\\\""
  val Backslash = '\\'
  val AllowedChars = CharPredicate.Printable -- Backslash -- "\""
}
trait QuotedStringSupport { this: Parser =>
  import QuotedStringSupport._
  def QuotedString: Rule0 = rule {
    '"' ~ QuotedStringContent  ~ '"'
  }
  def QuotedStringContent: Rule0 = rule {
    oneOrMore(AllowedChars | DoubleQuotedStringEscapeSequence)
  }
  def DoubleQuotedStringEscapeSequence = rule {
    '\\' ~ anyOf(CharsToBeEscaped)
  }
}
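A sketch of using the trait in a concrete parser (assuming the QuotedStringSupport object and trait above are in scope; the class name is made up):

```scala
import org.parboiled2._

// Mixes in the escaped-string rules defined above
class StringParser(val input: ParserInput) extends Parser with QuotedStringSupport {
  def Root: Rule0 = rule { QuotedString ~ EOI }
}

object StringParserApp extends App {
  // The first input contains an escaped \n inside the quotes
  println(new StringParser("\"Hello!\\n It's me!\"").Root.run().isSuccess) // true
  println(new StringParser("\"unterminated").Root.run().isSuccess)        // false
}
```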

Now that we have dealt with strings and extracted the corresponding functionality into a separate trait, let's move on to the format itself.

Auxiliary terminals


There are two approaches to writing parsers: "from the general to the specific" and "from the specific to the general". Grammars are usually described the first way, but this is just a tutorial, so let's start with the smaller details and then generalize.

We begin the description with the auxiliary elements, namely whitespace. In our case the whitespace characters will be ' ', '\n' and '\t'. Of course, there are far more whitespace characters in nature, but for this example we will limit ourselves to three. There are several ways to deal with whitespace:

  • list the characters through the ordered choice operator;
  • declare your own CharPredicate containing these three characters;
  • use anyOf.

We will use the latter. At the same time, keep in mind that in some places several whitespace characters may appear, in others there must be none at all, and in still others whitespace is mandatory (although our format has no place where whitespace is required):

val WhitespaceChars = "\n\t "
def WhiteSpace = rule { anyOf(WhitespaceChars) }
def OptWs      = rule { zeroOrMore(WhiteSpace) }

The rule describing the line break we already declared earlier:

def Newline = rule { optional('\r') ~ '\n' }

A key or block name is an identifier similar to those found in various programming languages. The identifier must begin either with a letter of the English alphabet (case does not matter) or with an underscore. In the middle it may contain digits as well as English letters (lowercase and uppercase); a dot is also valid in the middle of an identifier. Before declaring the key, let's declare a rule describing the identifier (similar rules will apply to the block name). We need two character predicates: one for the first character and one for the subsequent ones.

// The first character of an identifier
val IdentifierFirstChar = CharPredicate.Alpha ++ '_'
// Subsequent characters
val IdentifierChar      = CharPredicate.AlphaNum ++ '.' ++ '_'

We also declare the beginning and end characters of the block:

val BlockBeginning  = '{'
val BlockEnding     = '}'

Now that we have all the necessary auxiliary terminals, we’ll deal with larger blocks.

Key-Value Pairs


Now let's move on to the syntax of key-value pairs. We require the key to be a valid identifier, as described above, and the value to be a quoted string, also as described above. So let's start by defining the identifier:

def Identifier = rule {
  IdentifierFirstChar ~ zeroOrMore(IdentifierChar)
}

Perhaps we did not have to constrain the identifier with such a strict rule, but most grammars you are likely to encounter use similar ones. For example, identifiers are usually forbidden to start with a digit because of integer literals, and various special characters may be valid operators. The rule describing the key will look like this:

def Key = rule { Identifier }

To describe the value, we will use the already existing rule (we only need to mix in the trait we wrote earlier):

def Value = rule { QuotedString }

Now let's describe the rule for the whole pair. It is worth recalling once again that Parboiled is a PEG, which means we constantly have to keep whitespace in mind and tell the rule about every place where it may occur.

def KeyValuePair = rule { Key ~ OptWs ~ "=" ~ OptWs ~ Value }

Nested blocks


A block is delimited by curly braces and can contain both key-value pairs and other blocks. So, to begin with, we need to erase the difference between blocks and key-value pairs by calling both of them nodes of the syntax tree. This is done with the following code:

// The rule type must be specified here, otherwise it won't fly!
def Node: Rule0 = rule { KeyValuePair | Block }

Since both the block and the root structure consist of a list of nodes, we need to declare a rule for this list:

def Nodes = rule {
  OptWs ~ zeroOrMore(Node).separatedBy(Newline ~ OptWs) ~ OptWs
}

Optional whitespace may appear before the node list, after it, and between its individual elements, which is why the rule contains so many occurrences of OptWs. Now we define the block name; it is the same kind of identifier that is used for key names:

def BlockName = rule { Identifier }

Finally, we have everything we need to declare the entire block, so let's declare it:

def Block = rule { BlockName ~ "{" ~ Nodes ~ "}" }

Remember that we defined BlockBeginning and BlockEnding? Let's use them in the declaration:

def Block = rule { BlockName ~ BlockBeginning ~ Nodes ~ BlockEnding }

Notice that Block refers to the Nodes rule, which refers to Node, which in turn may refer back to Block, creating a cycle. That is why we need to specify the rule type explicitly, reassuring Parboiled. Since we are writing a recognizer, the rule type will always be Rule0 (more on rule types in the next article).

So, all we need now is the entry point, or root, which is itself nothing more than a list of nodes, and we already have a rule for that. We use it, not forgetting possible whitespace, and finish the rule with the EOI marker:

def Root: Rule0 = rule { Nodes ~ EOI }
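To tie everything together, here is a sketch of the assembled recognizer. It assumes the QuotedStringSupport trait from above is in scope; the OptWs between the block name and the opening brace is my own addition, so that the sample config with a space before '{' is accepted:

```scala
import org.parboiled2._
import scala.util.{Failure, Success}

class BkvRecognizer(val input: ParserInput) extends Parser with QuotedStringSupport {
  val WhitespaceChars     = "\n\t "
  val IdentifierFirstChar = CharPredicate.Alpha ++ '_'
  val IdentifierChar      = CharPredicate.AlphaNum ++ '.' ++ '_'

  def WhiteSpace: Rule0   = rule { anyOf(WhitespaceChars) }
  def OptWs: Rule0        = rule { zeroOrMore(WhiteSpace) }
  def Newline: Rule0      = rule { optional('\r') ~ '\n' }
  def Identifier: Rule0   = rule { IdentifierFirstChar ~ zeroOrMore(IdentifierChar) }
  def Key: Rule0          = rule { Identifier }
  def Value: Rule0        = rule { QuotedString }
  def KeyValuePair: Rule0 = rule { Key ~ OptWs ~ "=" ~ OptWs ~ Value }
  def Node: Rule0         = rule { KeyValuePair | Block }
  def BlockName: Rule0    = rule { Identifier }
  // OptWs before '{' is added so that "server {" (with a space) passes
  def Block: Rule0        = rule { BlockName ~ OptWs ~ '{' ~ Nodes ~ '}' }
  def Nodes: Rule0        = rule { OptWs ~ zeroOrMore(Node).separatedBy(Newline ~ OptWs) ~ OptWs }
  def Root: Rule0         = rule { Nodes ~ EOI }
}

object BkvRecognizerApp extends App {
  val config = "server {\n  port = \"8080\"\n  address = \"192.168.88.88\"\n}"
  val parser = new BkvRecognizer(config)
  parser.Root.run() match {
    case Success(_)             => println("Valid BKV")
    case Failure(e: ParseError) => println(parser.formatError(e))
    case Failure(e)             => throw e
  }
}
```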

So we have written a recognizer. You can read its full source code here.

Since in practice you very rarely just match values and constantly need to extract them from the text, in the next article I will tell you fascinating stories about how that is done, as well as about the types of rules. There we will bring our recognizer up to the state of a full-fledged parser.
