ppopoff November 5, 2015 at 15:34

About Parboiled

From the sandbox

Part 1. Why Parboiled?

Today, in the light of the booming popularity of functional programming languages, parser combinators are increasingly finding use - tools that make it easy for simple mortals to parse text. Libraries such as Parsec (Haskell) and Planck (OCaml) have already proven their worth in their ecosystems. Their convenience and capabilities at one time prompted the creator of the Scala language, Martin Odersky, to add their counterpart, Scala Parser Combinators (now placed in scala-modules ), to the standard library , and the knowledge and ability to use such tools should be attributed to the mandatory requirements for Scala developers A3 level .

This series of articles is dedicated to the Parboiled library .- A powerful alternative and possible replacement for Scala Parser Combinators. In it, we will consider in detail the work with the current version of the library - Parboiled2, and also pay attention to Parboiled1, since most of the existing code still uses it.

Cycle structure:

Introduction

Parboiled is a library that makes it easy to parse (parse) markup languages (such as HTML, XML or JSON), programming languages, configuration files, logs, text protocols and generally anything text. Parboiled will come in handy if you want to develop your own domain-specific language ( DSL ): with its help you can quickly get an abstract syntax tree and, remembering the pattern of the interpreter , execute the commands of your domain language.

At the moment there are several versions of this library:

Parboiled for Java is the very first version of the library. Written by Matthias Doeniz in Java and for Java. It is still popular, although it is in the “end of life” state. If you accidentally inherited it, or if you deliberately start a Java project, I advise you to consider grappa as an alternative to the fork Parboiled1, which is carefully maintained by the user with the nickname fge .
Parboiled - the library, now better known as Parboiled1, was born after Matthias penetrated the rock. He made the Scala frontend for Parboiled, at the same time abandoning support for the Java version. With the release of Parboiled2, the Scala version of Parboiled1 is no longer supported, but in spite of this, it is not worth writing off from the accounts so far:
- Parboiled2 has not yet learned all the features of Parboiled1;
- Parboiled1 is still used much more widely than Parboiled2, so if you are suddenly transferred to some old Scala project, there is a high chance of encountering it.
Parboiled2 - the latest version of the library, eliminating a number of shortcomings PB1. It works faster and, most importantly, is supported by developers.

I wrote this article with a focus on Parboiled2 (by the way, I will continue to write about it masculine, without the word “library”), but sometimes I will be distracted to talk about the important differences between the first and second versions.

Key features

Brief description of Parboiled2:

Follows the principles of PEG .
Generates single-pass parsers. A separate lexer is not required.
Uses type-safe DSL, which is a subset of the Scala language.
Optimizations are performed at the compilation stage.

In practice, this means:

You do not need to write a parser with your bare hands.
Readability comparable to the best BNF grades (in my opinion, PB is even cooler).
You can use the full power of PEG and freely parse recursive data structures, while regular expressions cannot by definition . Yes, with regular expressions you will not parse either JSON or even the simplest arithmetic expression, let alone programming languages. On StackOverflow there is a notorious quote in the subject :
Asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system.
Even if you need to parse a linear structure, Parboiled2 (when using proper optimizations) will run faster than regular expressions. Evidence is given in the next section.
Unlike parser generators such as ANTLR , you are freed from the troubles with separate code generation and its subsequent compilation. All Parboiled code is written in Scala, so you get syntax highlighting and type checking out of the box, as well as the absence of additional operations on grammar files, while the parser generated by ANTLR will have two phases of parsing. However, despite this, ANTLR is still more powerful, documented and more stable, and therefore may be preferable in many ( very non-trivial) cases.
Skalov parser-combinators are slow. Very slow. It’s indecently slow. Matthias ran parser performance comparisons for Jackson and JSON written with Parboiled, Parboiled2, and Scala Parser Combinators. Disappointing results for the latter can be found further in the text.
Unlike Language Workbenches , Parboiled is a small and easy-to-use library. You don’t need to download a poorly documented braking monster and spend precious hours of life exhausting the search for the right menus and buttons to describe a small DSL. On the other hand, you will not get a ready-made text editor with highlighting your DSL out of the box, instead you will have to write a plug-in for Vim, Emacs or your IDE yourself, but this does not make Parboiled a less worthy alternative for developing small subject-oriented languages.
Parboiled has successfully established itself in many projects , including the bloody enterprise.

New in version two

This section will mainly be useful and understandable to those who have already worked with the first version of the library. Beginners are most likely to return to this list after reading the entire series of articles.

First of all, Parboiled2 successfully eliminates a number of childhood diseases of the first version:

Now you can use the rules more capacious than Rule7. For this, the shapeless library with its famous was used HListами: now one rule can operate with a large number of values on the stack. It also means that Parboiled2 has an additional dependency that PB1 did not have - the shapeless library itself.
Missing designs added. So, in Parboiled1 it was impossible to specify a dynamic number of repetitions for a rule, nTimesand it was necessary to use a “softer” rule oneOrMore, which did not give the necessary accuracy of the grammar description.
Added built-in primitive terminals. A new class CharPredicatethat contains field such as AlphaNumeric, Hex, Printable, Visibleand others.
Added the ability to expand and narrow the predicate. The need to exclude several characters from the rule arose earlier, but only now it can be easily taken and done, rather than creating a white list of characters.

Moreover:

Parboiled2 uses macros, which allows you to generate grammar at compile time, rather than at run time, as it was in Parboiled1. This greatly increases the performance of your parser, as well as increases the number of checks. In this regard, the block rulebecame mandatory, although Parboiled1 allowed in some cases to do without it. You will notice this innovation first of all when you will migrate old code.
Improved error reporting system.
Support for scala.js has appeared . The demo project can be viewed here .

Performance comparisons

Parboiled1 is known for its slowness (in any case, with respect to the parsers generated by ANTLR), due to the fact that all actions for matching rules were performed in runtime and the compiler could not perform any significant optimizations on such a parser. In Parboiled2, productivity was paramount and many things were redone on macros, so the compiler got freedom of action during optimization, and the user got the long-awaited performance. Below we will demonstrate what good results the developers have achieved.

Parboiled versus straight-forward JSON parsers

Parboiled is a generalized tool for creating parsers, and as you know, a specialized tool is always better than a generalized one in solving its specialized task. In the Java world, there are a small number of JSON parsers, hand-written by ancient elven masters, and Alexander Myltsev (one of the developers of Parboiled2) checked how much Parboiled loses performance to these artifacts. The results were quite optimistic, especially in the case of Parboiled2.

  Тест-кейс                           │ Время, мс │
──────────────────────────────────────┼───────────┼─────────────────────────────────
  Parboiled1JsonParser                │     85.64 │ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
  Parboiled2JsonParser                │     13.17 │ ▇▇▇▇
  Json4SNative                        │      8.06 │ ██▍
  Argonaut                            │      7.01 │ ▇▇
  Json4SJackson                       │      4.09 │ ▇

Parboiled vs. regex

Thanks to the use of static optimizations, Parboiled2 is able to work much faster than regular expressions (at least those that come with the Java class library). Here is some evidence from the mailing list :

  Тест-кейс                           │ Время, мс │
──────────────────────────────────────┼───────────┼───────────────────────────────────
  Parboiled2 (warmup)                 │   1621.21 │ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
  Parboiled2                          │    409.16 │ ▇▇▇▇▇▇▇▇
  Parboiled2 w/ better types (warmup) │    488.92 │ ▇▇▇▇▇▇▇▇▇▇
  Parboiled2 w/ better types          │    134.68 │ ▇▇▇
  Regex (warmup)                      │    621.95 │ ▇▇▇▇▇▇▇▇▇▇▇▇
  Regex                               │    620.38 │ ▇▇▇▇▇▇▇▇▇▇▇▇

Parboiled vs Scala Parser Combinators

On the mailing list you can find another performance test , which is in good agreement with the first (about JSON) and contains data for comparison with Scala Parser Combinators. Everything is very, very sad.

  Тест-кейс                           │ Время, мс │
──────────────────────────────────────┼───────────┼─────────────────────────────────
  Parboiled1JsonParser                |     73.81 | ▇
  Parboiled2JsonParser                |     10.49 | ▎
  ParserCombinators                   |   2385.78 | ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇

What Parboiled Can't

Most articles about parser combinators start with exhausting explanations of what PEG is, what it is with and why you need to be afraid of it. In order to parse configs, thorough understanding of this is not necessary, but it is still worth knowing about the limitations of this type of grammar. So, Parboiled basically can not:

Parse left recursive grammars. This is beyond the power of all top-down parsers, which include PEG. However, left recursive grammar can be adapted .
Parse indentation-based grammars, such as Python or YAML. I can’t do this due to the fact that the generated parser is single-pass, without a separate lexer. Indentation analysis is performed at the stage of lexical analysis. This problem has a simple solution: write a preprocessor that will place virtual markers before ( INDENT) and after ( DEDENT) indentation. Parboiled1 has standard tools for this , but for Parboiled2 a similar procedure has yet to be performed independently.
Use streaming input. PEGs use return search; it's backtracking . Theoretically, this drawback can be eliminated by stream buffering, but nothing prevents you from writing a grammar in which you return to the very beginning. Therefore, for this idea to work in practice, it is necessary to learn to determine the grammar boundaries of chunks between which return is impossible. Matthias is very interested in developing this feature, so it may appear in future releases.

In the next part, I will talk about how custom grammar is described in Parboiled, and we will also write a simple recognizer for the tree-like format of configuration files.

Tags: