How to write parsers in JavaScript

Published on May 26, 2014

How to write parsers in JavaScript

    ... namely, how to write LL parsers for not very complex structures by constructing a complex parser from simpler ones. Occasionally, there is a need to parse something simple, say some XML-like structure or some kind of data URL, and then usually there is either a sheet of cunning hard-to-read code or dependence on some even more complex and tricky library for parsing. Here I am going to combine several well-known ideas (some of which came across on Habré) and show how to write quite complex parsers as simple and concise as possible while keeping in very few lines of code. For example, I will write an XML-like structure parser. And yes, I will not insert a picture here to attract attention. There are no pictures at all in the article, so it will be difficult to read.



    main idea



    It lies in the fact that each piece of input text is parsed by a separate function (let's call it a “pattern”) and by combining these functions you can get more complex functions that can parse more complex texts. So, a pattern is an object that has an exec method that does parsing. This function indicates what and where to parse from and it returns the parsed and the place where the parsing ended:

    var digit = {
        exec: function (str, pos) {
            var chr = str.charAt(pos);
            if (chr >= "0" && chr <= "9") 
                return { res: +chr, end: pos + 1};
        }
    };
    


    Now digit is a parsing digit pattern and can be used like this:

    assert.deepEqual(digit.exec("5", 0), { res: 5, end: 1 });
    assert.deepEqual(digit.exec("Q", 0), void 0);
    


    Why such an interface? Because JS has a built-in RegExp class with a very similar interface. For convenience, we introduce the Pattern class (look at it as an analog of RegExp) whose instances will be these patterns:

    function Pattern(exec) {
        this.exec = exec;
    }
    


    Then we introduce some simple patterns that are useful in a more or less complex parser.

    Simple patterns



    The simplest pattern is txt - it parses a fixed, predetermined line of text:

    function txt(text) {
        return new Pattern(function (str, pos) {
            if (str.substr(pos, text.length) == text)
                return { res: text, end: pos + text.length };
        });
    }
    


    It is applied as follows:

    assert.deepEqual(txt("abc").exec("abc", 0), { res: "abc", end: 3 });
    assert.deepEqual(txt("abc").exec("def", 0), void 0);
    


    If JS had default constructors (as in C ++), then the txt (“abc”) entry could be shortened to “abc” in the context of constructing a more complex pattern.

    Then we introduce its analogue for regular expressions and call it rgx:

    function rgx(regexp) {
        return new Pattern(function (str, pos) {
            var m = regexp.exec(str.slice(pos));
            if (m && m.index === 0)
                return { res: m[0], end: pos + m[0].length };
        });
    }
    


    Apply it like this:

    assert.deepEqual(rgx(/\d+/).exec("123", 0), { res: "123", end: 3 });
    assert.deepEqual(rgx(/\d+/).exec("abc", 0), void 0);
    


    Again, if JS had default constructors, then the rgx (/ abc /) entry could be shortened to / abc /.

    Combinator Patterns



    Now you need to enter several patterns that combine existing patterns. The simplest of these “combinators” is opt - it makes any pattern optional, i.e. if the original pattern p cannot parse the text, then opt (p) on the same text will say that everything is parsed, only the parsing result is empty:

    function opt(pattern) {
        return new Pattern(function (str, pos) {
            return pattern.exec(str, pos) || { res: void 0, end: pos };
        });
    }
    


    Usage example:

    assert.deepEqual(opt(txt("abc")).exec("abc"), { res: "abc", end: 3 });
    assert.deepEqual(opt(txt("abc")).exec("123"), { res: void 0, end: 0 });
    


    If operator overloading and default constructors were possible in JS, then the record opt (p) could be reduced to p || void 0 (this is clearly seen from how opt is implemented).

    The next most complex combinator is exc - it parses only what can parse the first pattern and cannot parse the second:

    function exc(pattern, except) {
        return new Pattern(function (str, pos) {
            return !except.exec(str, pos) && pattern.exec(str, pos);
        });
    }
    


    If W (p) is the set of texts that p pattern pars, then W (exc (p, q)) = W (p) \ W (q). This is convenient, for example, when you need to parse all capital letters except the letter H:

    var p = exc(rgx(/[A-Z]/), txt("H"));
    assert.deepEqual(p.exec("R", 0), { res: "R", end: 1 });
    assert.deepEqual(p.exec("H", 0), void 0);
    


    If there was operator overloading in JS, then exc (p1, p2) could be reduced to p1 - p2 or to! P2 && p1 (for this, however, it would be necessary to introduce a combinatorial pattern all / and which will work as the && operator) .

    Then comes the any combinator pattern - it takes several patterns and constructs a new one that parses what the first of these patterns parses. We can say that W (any (p1, p2, p3, ...)) = W (p1) v W (p2) v W (p3) v ...

    function any(...patterns) {
        return new Pattern(function (str, pos) {
            for (var r, i = 0; i < patterns.length; i++)
                if (r = patterns[i].exec(str, pos))
                    return r;
        });
    }
    


    I used the ... patterns (harmony: rest_parameters) construct to avoid clumsy code like [] .slice.call (arguments, 0). An example of using any:

    var p = any(txt("abc"), txt("def"));
    assert.deepEqual(p.exec("abc", 0), { res: "abc", end: 3 });
    assert.deepEqual(p.exec("def", 0), { res: "def", end: 3 });
    assert.deepEqual(p.exec("ABC", 0), void 0);
    


    If there were operator overloading in JS, then any (p1, p2) could be reduced to p1 || p2.

    The next combinator pattern is seq - it sequentially parses the text given to it by a sequence of patterns and produces an array of results:

    function seq(...patterns) {
        return new Pattern(function (str, pos) {
            var i, r, end = pos, res = [];
            for (i = 0; i < patterns.length; i++) {
                r = patterns[i].exec(str, end);
                if (!r) return;
                res.push(r.res);
                end = r.end;
            }
            return { res: res, end: end };
        });
    }
    


    It is applied as follows:

    var p = seq(txt("abc"), txt("def"));
    assert.deepEqual(p.exec("abcdef"), { res: ["abc", "def"], end: 6 });
    assert.deepEqual(p.exec("abcde7"), void 0);
    


    If there were operator overloading in JS, then seq (p1, p2) could be reduced to p1, p2 (the comma operator is overloaded).

    And finally, the rep combinator pattern - it applies the well-known pattern to the text many times and produces an array of results. As a rule, this is used for parsing certain structures of the same type, separated by, say, commas, therefore rep takes two arguments: the main pattern whose results are interesting to us and the separating pattern whose results are discarded:

    function rep(pattern, separator) {
        var separated = !separator ? pattern :
            seq(separator, pattern).then(r => r[1]);
        return new Pattern(function (str, pos) {
            var res = [], end = pos, r = pattern.exec(str, end);
            while (r && r.end > end) {
                res.push(r.res);
                end = r.end;
                r = separated.exec(str, end);
            }
            return { res: res, end: end };
        });
    }
    


    You can add a couple more parameters min and max that will control how many repetitions are allowed. Here I used the arrow function r => r [1] (harmony: arrow_functions) to not write function (z) {return z [1]}. Note how rep reduces to seq using Pattern # then (idea taken from Promise # then):

    function Pattern(exec) {
        ...
        this.then = function (transform) {
            return new Pattern(function (str, pos) {
                var r = exec(str, pos);
                return r && { res: transform(r.res), end: r.end };
            });
        };
    }
    


    This method allows you to infer from another pattern by applying an arbitrary transformation to the results of the first. By the way, experts in Haskell, is it possible to say that this Pattern # then makes a monad from a pattern?

    Well, rep is applied this way:

    var p = rep(rgx(/\d+/), txt(","));
    assert.deepEqual(p.exec("1,23,456", 0), { res: ["1", "23", "456"], end: 8 });
    assert.deepEqual(p.exec("123ABC", 0), { res: ["123"], end: 3 });
    assert.deepEqual(p.exec("ABC", 0), void 0);
    


    Any clear analogy with operator overloading for rep does not occur to me.

    The result is about 70 lines for all of these rep / seq / any. This completes the list of combinator patterns, and we can proceed to the actual construction of a pattern that recognizes XML-like text.

    Parser for XML-like texts



    We restrict ourselves to such XML-like texts:

    <?xml version="1.0" encoding="utf-8"?>
    <book title="Book 1">
      <chapter title="Chapter 1">
        <paragraph>123</paragraph>
        <paragraph>456</paragraph>
      </chapter>
      <chapter title="Chapter 2">
        <paragraph>123</paragraph>
        <paragraph>456</paragraph>
        <paragraph>789</paragraph>
      </chapter>
      <chapter title="Chapter 3">
        ...
      </chapter>
    </book>
    


    To begin with, we will write a pattern that recognizes a named attribute of the form name = "value" - it is obviously often found in XML:

    var name = rgx(/[a-z]+/i).then(s => s.toLowerCase());
    var char = rgx(/[^"&]/i);
    var quoted = seq(txt('"'), rep(char), txt('"')).then(r => r[1].join(''));
    var attr = seq(name, txt('='), quoted).then(r => ({ name: r[0], value: r[2] }));
    


    Here attr parses a named attribute with a value in the form of a string, quoted - parses a string in quotation marks, char - parses a single letter in a string (why should this be written as a separate pattern? Then, then "teach" this char to parse the so-called xml entities ), but name parses the attribute name (note that it parses both capital and small letters, but returns the parsed name where all letters are small). The attr application looks like this:

    assert.deepEqual(
        attr.exec('title="Chapter 1"', 0), 
        { res: { name: "title", value: "Chapter 1" }, end: 17 });
    


    Next, we construct a pattern that can parse a header like <? Xml ...?>:

    var wsp = rgx(/\s+/);
    var attrs = rep(attr, wsp).then(r => {
        var m = {}; 
        r.forEach(a => (m[a.name] = a.value));
        return m;
    });
    var header = seq(txt('<?xml'), wsp, attrs, txt('?>')).then(r => r[2]);
    


    Here wsp parses one or more spaces, attrs parses one or more named attributes and returns the parsed one in the form of a dictionary (rep would return an array of name-value pairs, but the dictionary is more convenient, therefore the array is converted to a dictionary inside then), and header parses the header itself and returns only the title attributes in the form of the same dictionary:

    assert.deepEqual(
        header.exec('<?xml version="1.0" encoding="utf-8"?>', 0),
        { res: { version: "1.0", encoding: "utf-8" }, end: ... });
    


    Now let's move on to parsing constructions of the form <node ...> ...:

    var text = rep(char).then(r => r.join(''));
    var subnode = new Pattern((str, pos) => node.exec(str, pos));
    var node = seq(
        txt('<'), name, wsp, attrs, txt('>'),
        rep(any(text, subnode), opt(wsp)),
        txt('</'), name, txt('>'))
        .then(r => ({ name: r[1], attrs: r[3], nodes: r[5] }));
    


    Here text parses the text inside the node and uses a char pattern that can be learned to recognize xml entities, subnode parses the internal node (actually subnode = node) and node parses the node with attributes and internal nodes. Why such a tricky definition of subnode? If you refer to node directly in the definition of node (something like this: node = seq (..., node, ...)), then it turns out that at the time of the definition of node this variable is still empty. The subnode trick eliminates this circular dependency.

    It remains to determine the pattern that recognizes the entire file with the header:

    var xml = seq(header, node).then(r => ({ root: r[1], attrs: r[0] }));
    


    The application is accordingly:

    assert.deepEqual(
        xml.exec(src), {
            attrs: { version: '1.0', encoding: 'utf-8' },
            root: {
                name: 'book',
                attrs: { title: 'Book 1' },
                nodes: [
                    {
                        name: 'chapter',
                        attrs: { title: 'Chapter 1' },
                        nodes: [...]
                    },
                    ...
                ]
            }
        });
    


    Here I call Pattern # exec with one argument and the meaning of this is that I want to parse the string from the very beginning and make sure that it is parsed to the end, but since it is parsed to the end, it is enough to return only the parsed one without a pointer to that place where the parser stopped (I already know that this is the end of the line):

    function Pattern(name, exec) {
        ...
        this.exec = function (str, pos) {
            var r = exec(str, pos || 0);
            return pos >= 0 ? r : !r ? null : r.end != str.length ? null : r.res;
        };
    }
    


    Actually the entire parser is in 20 lines (do not forget about those 70 that implement rep, seq, any, etc.):

    var name = rgx(/[a-z]+/i).then(s => s.toLowerCase());
    var char = rgx(/[^"&]/i);
    var quoted = seq(txt('"'), rep(char), txt('"')).then(r => r[1].join(''));
    var attr = seq(name, txt('='), quoted).then(r => ({ name: r[0], value: r[2] }));
    var wsp = rgx(/\s+/);
    var attrs = rep(attr, wsp).then(r => {
        var m = {}; 
        r.forEach(a => (m[a.name] = a.value));
        return m;
    });
    var header = seq(txt('<?xml'), wsp, attrs, txt('?>')).then(r => r[2]);
    var text = rep(char).then(r => r.join(''));
    var subnode = new Pattern((str, pos) => node.exec(str, pos));
    var node = seq(
        txt('<'), name, wsp, attrs, txt('>'),
        rep(any(text, subnode), opt(wsp)),
        txt('</'), name, txt('>'))
        .then(r => ({ name: r[1], attrs: r[3], nodes: r[5] }));
    var xml = seq(header, node).then(r => ({ root: r[1], attrs: r[0] }));
    


    With operator overloading in JS (or C ++), it would look something like this:

    var name = rgx(/[a-z]+/i).then(s => s.toLowerCase());
    var char = rgx(/[^"&]/i);
    var quoted = ('"' + rep(char) + '"').then(r => r[1].join(''));
    var attr = (name + '=' + quoted).then(r => ({ name: r[0], value: r[2] }));
    var wsp = rgx(/\s+/);
    var attrs = rep(attr, wsp).then(r => {
        var m = {}; 
        r.forEach(a => (m[a.name] = a.value));
        return m;
    });
    var header = ('<?xml' + wsp + attrs + '?>').then(r => r[2]);
    var text = rep(char).then(r => r.join(''));
    var subnode = new Pattern((str, pos) => node.exec(str, pos));
    var node = ('<' + name + wsp + attrs + '>' + rep(text | subnode) + (wsp | null) + '</' + name + '>')
        .then(r => ({ name: r[1], attrs: r[3], nodes: r[5] }));
    var xml = (header + node).then(r => ({ root: r[1], attrs: r[0] }));
    


    It is worth noting that each var here strictly corresponds to one ABNF rule and therefore if you need to parse something according to the description in the RFC (and there they like ABNF), then transferring those rules is a mechanical matter. Moreover, since the ABNF rules themselves (as well as EBNF and PEG) are strictly formal, you can write a parser for these rules and then, instead of calling rep, seq, etc., write something like this:

    var dataurl = new ABNF('"data:" mime ";" attrs, "," data', {
        mime: /[a-z]+\/[a-z]+/i,
        attrs: ...,
        data: /.*/
    }).then(r => ({ mime: r[1], attrs: r[3], data: r[5] }));
    


    And apply as usual:

    assert.deepEqual(
        dataurl.exec('data:text/plain;charset="utf-8",how+are+you%3f'),
        { mime: "text/plain", attrs: { charset: "utf-8" }, data: "how are you?" });
    


    A few more beeches



    Why does Pattern # exec return null / undefined if parse failed? Why not throw an exception? If you use exceptions this way, then the parser will become slower every twenty. Exceptions are good for exceptional cases.

    The described method allows you to write LL parsers that are not suitable for all purposes. For example, if you need to parse XML of the form

    <book attr1="..." attr2="..." очень много атрибутов ???
    


    That in the place where it stands ??? it may turn out to be both / and simply>. Now if the LL parser tried to parse all this as <book ...> but in place ??? turned out to be>>, then the parser will discard everything that it took so long to parse and start over again under the assumption that this is <book ... />. LR parsers are free from this drawback, but writing them is difficult.

    Also LL parsers are poorly suited for parsing a variety of syntactic / mathematical expressions where there are operators with different priorities, etc. LL parser of course you can write, but it will be somewhat confusing and will work slowly. The LR parser will be confusing on its own, but will be fast. Therefore, such expressions are convenient to parse the so-called Pratt’s algorithm which is well explainedCrockford (if this link is purple, then you probably understand parsers better than me and you probably were very bored to read all this).

    I hope someone comes in handy. At one time, I wrote parsers of varying degrees of clumsiness, and the method described above was a discovery for me.