ppopoff November 16, 2015 at 22:34

Pro Parboiled (Part 4 Final)

Part 4. Harsh reality

How to make Parboiled work even faster? What mistakes are best avoided? What to do with the inheritance in the form of Parboiled1? The concluding article of the series is called upon to answer these, as well as other questions.

Cycle structure:

Performance

Parboiled2 is fast, but sometimes it can work even faster. In this section, we will talk about the available microoptimizations. The main thing when performing optimizations is timeliness. But if you can immediately write a slightly more optimal code without losing expressiveness, you should definitely use this opportunity.

Expand `n.times`for small n <= 4

You can gain in performance if, for small n, instead of the repetition operator, n.timessimply chain several repeating rules into a chain. How many repetitions it makes sense to deploy - depends on the circumstances, but this number is hardly more than four.

// Медленно
rule { 4 times Digit }
// Быстро
rule { Digit ~ Digit ~ Digit ~ Digit }

The relevance of this optimization was announced by Matthias himself, although, hypothetically, the operator n.timescould have performed it himself.

Speeding up stack operations for `n.times`

Using a similar technique will allow you to squeeze out a bit of performance when extracting data from the value stack. For example, so it can be applied to the previous rule:

def Digit4 = rule {
  Digit ~ Digit ~ Digit ~ Digit ~
    push(
      #(charAt(-4))*1000 +
      #(charAt(-3))*100 +
      #(charAt(-2))*10 +
      #(lastChar)
    )
}

Do not recreate `CharPredicate`

It is perfectly normal to rejoice at the new features of the class CharPredicate, but you should not create your own instances of the type CharPredicateinside the block rule: your predicate will be recreated every time a rule is fulfilled that will dramatically ruin the performance of your parser. Therefore, instead of creating symbolic predicates every time, define them inside your parser as a constant:

class MyParser(val input: ParserInput) extends Parser {
  val Uppercase = CharPredicate.from(_.isUpper)
  ...
}

or, even better, send this declaration to the companion object of your parser:

class MyParser(val input: ParserInput) extends Parser {
  ...
}
object MyParser {
  val Uppercase = CharPredicate.from(_.isUpper)
}

Use semantic predicates

The peculiarity of these rules is that they do not interact with the value stack. In detail, they are described in the documentation, but here is the most important thing that you should know about them:

When using semantic predicates, the parser does not make progress, that is, it does not move its cursor to the next character. Therefore, when they are thoughtlessly used, the parser can loop.

Remember the symbolic predicate example for uppercase characters? You can do the same using a semantic predicate test:

def JavaUpperCase = rule { oneOrMore(test(currentChar.isUpper) ~ ANY) }

Use `ANY`where you would like to see`CharPredicate.All`

Alas, it CharPredicate.Allworks slowly for large ranges of characters, it ANYworks faster. Take advantage of this knowledge.

Use an invert predicate

Imagine that your parser should capture all characters before the line feed (for definiteness, in Unix style). Of course, this can be done using noneOf, but the inverting predicate will be faster:

def foo = rule { capture(zeroOrMore(noneOf("\n"))) }
// Быстрее?
def foo = rule { capture(zeroOrMore(!'\n')) }

Unfortunately, this great looking example will loop, because the parser will not make progress. To fix this, you need a rule that moves the parser cursor, but does not change the stack. For example, here is this:

def foo = rule { capture(zeroOrMore( !'\n' ~ ANY )) }

Now the rule foowill absorb absolutely everything except EOIline feed.

Bug reports

I don’t think that you will want to work with a parser that produces meaningless messages for any incorrect input. Parboiled2 is able to tell quite clearly about errors if you help him with this.

Formatting

So, if something is screwed up, the parser will pass at your disposal an object of the type ParseErrorthat can be brought into readable form using the method formatError:

val errorMessage = parser formatError error

If the default formatting does not suit you for some reason, you should pass your wishes to the parser explicitly:

val errorMessage parser.formatError(error, new ErrorFormatter(showTraces = true))

If you want to write your own ErrorFormatter, you will have to deal with the structure of the class ParseError, which is declared in the depths of Parboiled this way:

case class ParseError(position: Position, charCount: Int, traces: Seq[RuleTrace]) extends RuntimeException

It is also worth noting the presence of several schemes for delivering error messages to the user: at your request, it ParseErrorcan be represented not only as an object Try, but, for example, as a polymorphic type or Either. More details can be found here .

def Foo = rule { "foo" | fail("Я упаль!") }

Fine tuning

There is an option to circumvent the built-in error reporting mechanism. To do this, use the rule failwith the message that you want to see in case of an error:

def Goldfinger = rule { "talk" | fail("to die") }

Then, when the opportunity arises, you will receive back your error message in approximately the following form:

Invalid input 'Bond', expected to die. (line 1, column 1):

Named Rules

Using this type of rule can be very useful not only for catching errors. This mechanism is described in detail in the "Best Practices" section.

atomic

Parboiled2 generates PEG-based parsers. This means that parsers operate on characters, not strings (as many might think), therefore errors will also be shown to you at the character level. Agree - a message like “You have X here, we expected Y or Z” will require more mental effort than “You have XX here, and we expected to see XY or XZ”. In order to see the entire line in the error reports, there is a marker atomiс; for this, you just need to wrap the rule in it:

def AtomicRuleTest = rule { atomic("foo") | atomic("fob") | atomic("bar") }

To get chanterelles ( foxes) at the input

Invalid input "fox", expected "foo", "fob" or "bar" (line 1, column 1):
foxes
^

quiet

When there are too many options to choose, you don’t always want to notify the user of all possible alternatives. For example, in a certain place, your parser expects a lot of whitespace in conjunction with a certain rule. To eliminate redundancy in a report, you might want to keep silent about spaces. Using a marker quietis very simple:

def OptionalWhitespaces = rule { quiet(zeroOrMore(anyOf(" \t\n"))) }

Honestly, I have not encountered situations that encourage the use of this rule. Just like atomic, it is described in detail in the documentation .

Error recovery

Almost the only episode where Parboiled1 wins, while Parboiled2 is not doing very well: the parser falls already only from the sight of the first error it encounters. For most scenarios, this is great: for example, it does not interfere with parsing logs, text protocols, configuration files (for a number of cases), however, developers of DSL or IDE-like tools will not like this situation. Matthias promises to fix this , so if you really need this functionality today - write to the bug tracker, maybe this will speed up the development process.

In Parboiled1 there is a huge number of ParserRunners for all occasions. Look to the side RecoveringParserRunnerif you need to continue parsing in case of errors.

Testing

Parboiled developers use to test the framework specs2 , which they supplemented their helper class TestParserSpec . It will seem inconvenient to those who use scalatest, but its basic idea can be adopted. In secret from Matthias, his decision is not particularly accurate, as it relies on a mutable state. Perhaps in the future we will be waiting for something similar to a full-fledged framework for testing.

Rules can be tested both individually and together. Personally, I prefer to write tests not for each rule, but only check the main rule in “special” cases:

In many formats, even standardized ones, very interesting points may occur. For example, in a BSD-like RFC 3164 message format, two positions are always assigned to the day of the month , even if the number itself has one bit. Here is an example from the RFC itself:

If the day of the month is less than 10, then it MUST be represented as a space and then the number. For example, the 7th day of August would be represented as "Aug 7", with two spaces between the "g"and the "7".

In addition to this kind of "interesting points", you can feed the parser strings with open brackets, invalid characters, check the order of operations with the value stack.

In testing, there is another subtlety that you will immediately encounter. Suppose you want to test the following rule:

def Decimal: Rule0 = rule {
  ("+" | "-").? ~ Digit.+ ~ "." ~ Digit.+
}

To do this, we will send the obviously incorrect input to the parser and we will wait for the error at the output:

// Я еще не видел десятичных дробей с двумя разделителями.
val p = new MyParser("12.3.456").Decimal.run()  // Success(())
p.isFailure shouldBe true  // тест упадет

But when running the test, it turns out that the parser returned a successful result. Why is that? There is no rule in our rule EOI, but if we add to it EOI, we will ruin all the rules that they use Decimal. Therefore, you will have to create a special testing rule, for example, using the cunning meta-rules mechanism . Let's add an EOI at the end of our previous example, and make sure that the parser crashes with an error:

Failure(ParseError(Position(5,1,6), Position(5,1,6), <2 traces>))

Disadvantages of Parboiled

Parboiled2

If people have flaws, why don't libraries have them? Here, Parboiled2 is no exception.

Long, too general and completely incomprehensible compiler error messages, in the best traditions of C ++. A good example is shown in the figure below (an operator is accidentally omitted in the rule ~). The reason is related to performing advanced checks on types that promise to be removed in future versions.

The compiler swears dirty

This problem no longer applies to Parboiled2, but to scalac. The compiler can be torn down if the types of arguments are explicitly (not) defined for a lambda that grabs values from the stack:
```
// Может не сработать
def MyRule = rule { oneOrMore(Visible) ~> {s => "[" + s + "]"} }
// Скорее всего сработает
def MyRule = rule { oneOrMore(Visible) ~> {s: String => "[" + s + "]"} }
```
What works and what doesn't depends on the version of your compiler.
Многие IDE еще не научились поддерживать макровыражения, а Parboiled2 был построен не без их помощи. Поэтому не стоит верить подчеркиваниям вашей среды разработки. Однажды я, забыв об этом, потратил целый день на поиск несуществующей ошибки буквально на ровном месте.
Отсутствие механизма восстановления при неудачном разборе. Проектирующих предметно-ориентированные языки, или же тех, кто хочет использовать Parboiled2 в качестве фронтэнда к своему компилятору, это сильно разочарует. Но над этим работают. Если вы хотите видеть эту возможность — пишите, это ускорит разработку.

I think that many developers of their small IDEs and text editors would like to see more flexible error messages than the ones that are provided now. At the moment, there are only two ways to influence them:
- named rules
- named nested rules.

Parboiled1

Most of the projects are still written in Parboiled1, and it is unlikely that anything will change dramatically and dramatically (in the enterprise), so it may be useful to know how to learn to put up with its shortcomings, which Parboiled1 has a lot of. In addition to the very limited DSL, Parboiled has a “Rule8” problem, which complicates writing a parser for logs. Parboiled1 constructed so that each rule with N elements has a class, by analogy with skalovskimi tuples (tuples): there Rule0, Rule1until theRule7. This is quite enough to parse complex programming languages such as Java, and indeed does not cause significant problems when parsing tree structures. But if you need to extract data from a linear structure, for example, log file messages, then this restriction is very easy to run into. This is solved by using a tuple instead of a single resulting rule. Here is an example:

def Event: Rule1[LogEvent] = rule {
  Header ~ " " ~ UserData ~ " " ~ Message ~~> {
    (header, data, message) => SyslogEvent (
      header._1, header._2, header._3, header._4, header._5, data._1, data._2, message
    )
  }
}

Let it look wretched, but the problem is solved.

Best practices

In this section, I will talk about common truths that work for any parser combinator, as well as about the nuances specific to Parboiled2.

Charutils

There is one useful object not covered in the documentation: CharUtils . It contains a number of static methods that can make your life easier, for example: changing the case of characters, escaping, converting integer values to their corresponding characters (strings). etc. Using it may save your time.

Write unit tests

One small, unsuccessful change can break your grammar and cause acute rectal pain. This is a banal advice that many neglect. The parser is not as difficult to test as, say, IO: you do not need Mock objects and other tricks for this routine, but very valuable work. We had a whole infrastructure of parsers. And believe me, the first thing I did when looking for errors was sitting down and writing tests, if there were none.

Make parsers and rules small

Separate your parsers into sub-parsers. Each component must do something very specific. For example, if you are parsing a LogEvent that has a Timestamp field defined (especially if this Timestamp corresponds to some Rfc), then do not be lazy and take it out separately.

Firstly, it will reduce the code of your main praser and make it more visual.
Secondly, it will greatly facilitate testing. You will cover your sub-parser with unit tests. And after that, start developing the main parser

There are different approaches:

Break the parser into traits and use the self-typed reference (I prefer this method).
Declare parsers as separate entities and use composition.
Use the built-in mechanism to create subParsers.

Rules should be as compact as possible, but not as compact. The smaller your rules, the easier it is to find a mistake in the grammar. It is very difficult to understand the logic of the developer if he makes the rules long, and at the same time reuses them capture. Implicit capture can aggravate a situation. Specifying the type of rule also helps with support.

Send case objects instead of strings in Value stack

This advice can be attributed to optimizations, as this will make the parser work faster. Send meaningful objects, not strings, to the Value stack. This will make your parser faster, and the code more clear.

Poorly:

def logLevel = rule {
  capture("info" | "warning" | "error") ~ ':’
}

Good:

def logLevel = rule {
    “info:”   ~ push(LogLevel.Info)
  | “warning" ~ push(LogLevel.Warning)
  | “error"   ~ push(LogLevel.Error)
}

Use simplified syntax to assemble an object

This beautiful way appeared in Parboiled1. No magic, just the case class constructor is invoked implicitly. The main thing is that the number and type of arguments placed on the Value Stack match the signature of the case class constructor.

Poorly:

def charsAST: Rule1[AST] = rule { capture(Characters) ~> ((s: String) => AText(s)) }

Good:

def charsAST = rule { capture(Characters) ~> AText }

Named rules

Named rules significantly simplify life when receiving error reports, as they make it possible to use an alias instead of an indistinct rule name. Or mark the rules with a specific tag - “This expression” or “Modifies the stack”. In any case, it will be useful to know about this function.

Many Parboiled1 users already love this feature. For example, Neo4J developers using Parboiled to parse Cypher .

How it looks in Parboiled1:

def Header: Rule1[Header] = rule("I am header") { ... }

In Parboiled2:

def Header: Rule1[Header] = namedRule("header is here") { ... }

It is also possible to give names to nested rules:

def UserName = rule { Prefix ~ oneOrMore(NameChar).named("username") ~ PostFix }

Migration

Migration is most often a simple process, but it takes a lot of time. Therefore, I will try to save at least a little precious hours of your life and describe the main pitfalls.

Classpath

In order to avoid conflicts with the first version, Parboiled2 uses the classpath org.parboiled2(whereas the classpath for the first version org.parboiled). Mavenovsky groupId, however, was old: org.parboiled. Due to this, it is possible to have both dependencies in one project and carry out a gradual transition to a new version. Which, by the way, works very well with several autonomous parsers. If your parsers consist of many modules reused in different places (as it was in my case) - you will have to do the migration immediately and for all modules.

Test verification

Verify the availability and performance of unit tests. Do you have them? Not? Write them. During the migration process, I had to refine some grammars because the new DSL became more powerful, and inadvertent changes broke the grammars. Falling tests saved a lot of time. I did not encounter serious problems, such as breaking the entire grammar, when migrating. Maybe someone will share their experience if this happened to him.

Code around the parser

Now the parser will be recreated every time, which is not always convenient. With PB1, I really liked to create a parser once, and then use it repeatedly. Now this number will not work. Therefore, you will have to change the constructor of the parser and rewrite the code that uses it a little, and do not be afraid that this will degrade performance.

Warning Parboiled1 allows you to generate rules at runtime. Therefore, if you have such a parser, then you most likely will have to rewrite it: Parboiled2 uses macro expressions that make the dynamics very difficult, giving better performance in return.

Composition

The approach to the composition of the parser elements has not changed, this is good news for migrants. However, Parsernow it’s not a trait, but an abstract class. Traits are the most convenient means of composing software components, in PB1 this allowed mixing Parserinto any modules, mixing the modules together. The change in favor of the abstract class did not affect this feature, but now you need to use the self-typed reference :

trait Numbers { this: Parser =>
  // Ваш код
}

Those who have not used this feature of the language and mixed trait each time Parserwill have to change their taste preferences.

As an alternative way, you can make full parsers from your traits and import the necessary rules from them (as methods) into your main parser. True, I still prefer to use trait compositions, because I find them more visual: I can more clearly see a parser assembled from pieces, instead of multiple imports.

Getting rid of primitives

During the migration process, be sure to arrange an audit of your personal library of primitive rules: delete everything that is in CharPredicate. Your library will lose weight, but it will not disappear at all. Many would like to add support for various date formats, a grammar describing email, HTTP headers to parboiled. Parboiled is simply a parser combinator: it was, it will remain so. However, agree that throwing away old code is very nice.

Conclusion

In this series of articles, I tried to tell you about the most progressive and promising parsing tool that exists for the scala language. I made a short tutorial and talked about the problems that I had to face in practice. I hope that this article will be useful for you in the worst case, and at best it will become a guide to action.

Used sources

Благодарности

I want to express my gratitude to Alexander and Matthias for having a reason for the article, as well as a convenient tool. Yana, thank you for the proofreading and editing of my many mistakes, I promise I will write more competently. Thanks to firegurafiku and Too Taboo for their help with the first article in Vertsk, proofreading, numerous corrections, and ideas for examples in subsequent ones. Thanks to Vlad Ledovsky for proofreading and corrections in the last article in the series. Thanks to nehaev , for the error found in the article’s code, and to Igor Kustov for the idea of breaking up a huge article into parts (I didn’t want to do this for a long time). Special thanks to Arseniy Alesandrovich primetalk for the inaccuracies found and
useful suggestions. Thanks to all those who followed the cycle of articles and reached the last. I hope the work was not done in vain.

Tags: