Parser with operator precedence

Original author: Douglas Crockford
  • Transfer
Douglas Crockford

2007-02-21

Introduction


In 1973, at the first annual symposium, “ Principles of Programming Languages ​​Symposium ”, Won Pratt introduced the article “ Top Down Operator Precedence ”. In this article, Pratt described a parsing method that combines the best aspects of recursive descent and the Floyd operator precedence method . Pratt's method is very similar to recursive descent, but requires less code and works much faster. Pratt stated that his method is easy to learn, implement and use, extremely effective and very flexible. Due to its dynamism, it can be used for extensible languages.

But if the method is really perfect, why do compiler developers still ignore it? In his article, Pratt suggested that BNF-grammars and their numerous modifications, as well as related theorems and automata, occupied a niche earlier and now impede the development of the theory of parsing in other directions.

There is another explanation: this method is most effective for dynamic, functional programming languages ​​and it is much more difficult to use it in a static, procedural language. Pratt illustrates his article with Lisp as an example and playfully constructs syntax trees using a stream of tokens. But parsing methods are not particularly valued in the community of Lisp programmers who preach Spartan abandonment of syntax. Since the creation of Lisp, there have been many attempts to give this language rich syntax in the ALGOL style: CGOL Pratt , Lisp-2 , MLISP , Dylan , Interlisp's Clisp , original McCarthy M-expressionsetc. But they all failed. For the Lisp community, program and data consistency was more important than expressive syntax. On the other hand, the vast majority of programmers love the syntax, which is why Lisp itself has not become popular. The Pratt method needs a dynamic language, but the community of dynamic languages ​​has historically not used the syntax that is so conveniently implemented by the Pratt method.

Javascript


The situation has changed with the advent of JavaScript. JavaScript is a dynamic, functional language, but syntactically it clearly belongs to the C family. It is a dynamic language, and its community loves the syntax.

JavaScript is also object oriented. Pratt's article anticipated an object-oriented approach, but lacked expressive notation for this. JavaScript is an ideal language for implementing the Pratt method. I’ll show you how to quickly and efficiently create JavaScript parsers.

One article is not enough to deal with JavaScript completely and perhaps we don’t want to, because in this language the devil will break his leg. But there are brilliant sides in it, quite worthy of consideration. We will create a parser that can handle Simplified JavaScript. And we will write this parser in Simplified JavaScript. Simplified JavaScript is all the best of the language, including:

  • Functions as objects of the first class. In Simplified JavaScript, functions are lambda expressions with lexical scope.
  • Dynamic objects with prototype inheritance. There are no classes. We can add a new member to the object using the usual assignment. An object can inherit the members of another object.
  • Literals for objects and arrays. To create new objects and arrays, this notation is very convenient. JavaScript literals have served as inspiration for the JSON format .

We will take advantage of JavaScript prototypes to create token objects that inherit from characters. Our implementation will need a method Object.create(creates a new object that inherits the members of an existing object) and a lexical analyzer, which creates an array of token objects on the input line. Moving through this array, we will build a parsing tree.

table of symbols


Each token, such as an operator or identifier, will be inherited from the character. We will write all our possible characters (which determine the types of language tokens) into the object symbol_table.

var symbol_table = {};

An object original_symbolis a prototype for all other characters. His methods are usually overloaded. The meaning of the methods nudand led, as well as the binding power, will be explained later in the Priorities section.

var original_symbol = {
    nud: function () {
        this.error("Undefined.");
    },
    led: function (left) {
        this.error("Missing operator.");
    }
};

Define a function that creates characters. It takes a symbol identifier ( id) and binding strength (an optional parameter bp, defaults to zero), and returns a symbol object for the given one id. If the character is already in symbol_table, the function returns it. Otherwise, it creates a new character that is inherited from original_symbol, stores it in the character table, and returns. The symbol object initially contains id, the value, the left binding force ( lbp) and everything that came from original_symbol.

var symbol = function (id, bp) {
    var s = symbol_table[id];
    bp = bp || 0;
    if (s) {
        if (bp >= s.lbp) {
            s.lbp = bp;
        }
    } else {
        s = Object.create(original_symbol);
        s.id = s.value = id;
        s.lbp = bp;
        symbol_table[id] = s;
    }
    return s;
};

Declare common delimiters and trailing characters.

symbol(":");
symbol(";");
symbol(",");
symbol(")");
symbol("]");
symbol("}");
symbol("else");

Символ (end) указывает на окончание потока лексем. Символ (name) — прототип для новых имён, к примеру, имён переменных. Я включил в их идентификаторы скобки, чтобы избежать возможных совпадений с пользовательскими лексемами.

symbol("(end)");
symbol("(name)");

Лексемы


Мы предполагаем, что исходный текст уже преобразован в массив tokens примитивных лексем, в которых содержится поле type ("name", "string", "number", или "operator"), и поле value (строка или число). Переменная token всегда ссылается на текущую лексему.

var token;

The function advancecreates a new token object from the next primitive token and assigns it to a variable token. If an optional parameter is passed id, the function checks that the token has the corresponding identifier. The prototype of a new token object is a character (name)in the current scope or a character from a character table. The aritynew token field will equal either "name", or "literal", or "operator". Subsequently, when we learn more about the role of the token in the program, this value may change to "binary", "unary"or "statement".

var advance = function (id) {
    var a, o, t, v;
    if (id && token.id !== id) {
        token.error("Expected '" + id + "'.");
    }
    if (token_nr >= tokens.length) {
        token = symbol_table["(end)"];
        return;
    }
    t = tokens[token_nr];
    token_nr += 1;
    v = t.value;
    a = t.type;
    if (a === "name") {
        o = scope.find(v);
    } else if (a === "operator") {
        o = symbol_table[v];
        if (!o) {
            t.error("Unknown operator.");
        }
    } else if (a === "string" || a ===  "number") {
        a = "literal";
        o = symbol_table["(literal)"];
    } else {
        t.error("Unexpected token.");
    }
    token = Object.create(o);
    token.value = v;
    token.arity = a;
    return token;
};

Area of ​​visibility


Most languages ​​have a notation for defining new characters (for example, variable names). In a simple language, when we meet a new word, we automatically identify it and put it in the symbol table. More complex languages ​​have a scope that allows the programmer to control access to the variable and its lifetime.

The scope is the part of the program in which the variable is defined and available. Scopes can be nested. A variable defined in a certain scope is not visible outside it.

We will store the current scope in a separate variable scope.

var scope;

An object original_scopeis a prototype for all objects representing a scope. It contains a method definethat allows you to define new variables. The method defineconverts the name token into a variable token. It throws an error if the variable has already been defined in scope or the name is a reserved word.

var itself = function () {
    return this;
};
var original_scope = {
    define: function (n) {
        var t = this.def[n.value];
        if (typeof t === "object") {
            n.error(t.reserved ?
                "Already reserved." :
                "Already defined.");
        }
        this.def[n.value] = n;
        n.reserved = false;
        n.nud      = itself;
        n.led      = null;
        n.std      = null;
        n.lbp      = 0;
        n.scope    = scope;
        return n;
    },

The method is findused to find the definition by name. It starts the search from the current scope and goes, if necessary, up the chain, ending with a table of characters. If the definition could not be found, it returns symbol_table["(name)"]. The method checks that the value with the given name is not equal undefined(which would mean accessing the undeclared name) and that it is not a function (which would indicate a collision with the inherited method).

    find: function (n) {
            var e = this, o;
            while (true) {
                o = e.def[n];
                if (o && typeof o !== 'function') {
                    return e.def[n];
                }
                e = e.parent;
                if (!e) {
                    o = symbol_table[n];
                    return o && typeof o !== 'function' ?
                            o : symbol_table["(name)"];
                }
            }
        },

The method popcloses the scope, replacing it with the parent.

    pop: function () {
        scope = this.parent;
    },

The method is reserveused to indicate that the given name is a reserved word in the current scope.

    reserve: function (n) {
        if (n.arity !== "name" || n.reserved) {
            return;
        }
        var t = this.def[n.value];
        if (t) {
            if (t.reserved) {
                return;
            }
            if (t.arity === "name") {
                n.error("Already defined.");
            }
        }
        this.def[n.value] = n;
        n.reserved = true;
    }
};

Now we need a strategy for processing reserved words. In some languages, words describing the structure of the program (for example, if) are reserved and cannot be used as variable names. The flexibility of our parser allows you to achieve more. For example, we can say that in any function, any name taken can be used either as a language operator or as a variable name. We will reserve words locally only after they have been used as reserved. This simplifies the life of the creator of the language, because adding new keywords will not break existing programs, and simplifies the life of programmers, because they are not disturbed by unnecessary restrictions on the use of names.

When we want to create a new scope for a function or block, we call a function new_scopethat creates a new instance of the prototype original_scope.

var new_scope = function () {
    var s = scope;
    scope = Object.create(original_scope);
    scope.def = {};
    scope.parent = s;
    return scope;
};

A priority


Token objects contain methods that allow you to make decisions about priorities, select other tokens and build trees (and in more complex projects, also check types, optimize and generate code). The main task of the priorities is as follows: for a given operand between two operators, determine whether the operand refers to the left operator or to the right one:

A  e B f

If A and B are operators, which operand belongs to e? In other words, we have to choose between (d A e) Bed and f
and d A (e Bed and f) .

Ultimately, the main difficulty with parsing is resolving this ambiguity. The token objects from our method store the binding strength (or priority level), and simple methods nud(null denotation, null correspondence) and led(left denotation, left correspondence). The method nuddoes not matter which tokens are to the left, and the method ledis important. The method is nudused by values ​​(variables and literals) and prefix operators. The method is ledused by infix and postfix operators. A token can have both methods defined nudand led. For example, minus ( -) can be either prefix (changing the sign of a number) or infix (subtraction), so both methods are defined for it.

Our parser uses the following binding forces:
0Operators without communication: ;etc.
10Assignment Operators: =etc.
20?:
thirty|| &&
40comparison operators: ===etc.
fifty+ -
60* /
70unary operators: !etc.
80. [ (

Expressions


The main component of the Pratt method is a function expression. It accepts the right binding force as an input, which indicates how actively the expression binds to the right tokens.

var expression = function (rbp) {
    var left;
    var t = token;
    advance();
    left = t.nud();
    while (rbp < token.lbp) {
        t = token;
        advance();
        left = t.led(left);
    }
    return left;
}

The function expressioncalls the nudcurrent token method token, which processes literals, variables, and prefix operators. Then, until the right binding force is less than the left binding force of the next token, the method is called on it led. This method handles infix and postfix operators. The process may be recursive, as methods nudand ledmay themselves cause expression.

Infix operators


An operator +is an infix operator, so it has a method ledthat turns a token object into a tree whose two branches ( firstand second) are operands to the left and right of the sign +. The method ledtakes the left operand as a parameter, and the right one is found using a call expression.

symbol("+", 50).led = function (left) {
    this.first = left;
    this.second = expression(50);
    this.arity = "binary";
    return this;
};

The symbol is *similar +except for the meaning idand strength of binding. It is more strongly connected with operands, therefore its binding strength is higher.

symbol("*", 60).led = function (left) {
    this.first = left;
    this.second = expression(60);
    this.arity = "binary";
    return this;
};

Not all infix operators will look the same, but many will. Therefore, we can simplify our work by defining a function infixthat helps us create infix operators. The function infixaccepts idbinding strength and optionally a function led. If the function is not defined, infixcreates a leddefault function , which is suitable in most cases.

var infix = function (id, bp, led) {
    var s = symbol(id, bp);
    s.led = led || function (left) {
        this.first = left;
        this.second = expression(bp);
        this.arity = "binary";
        return this;
    };
    return s;
}

Now we can describe infix operators in a more declarative style:

infix("+", 50);
infix("-", 50);
infix("*", 60);
infix("/", 60);

=== Is an exact comparison operator in JavaScript.

infix("===", 40);
infix("!==", 40);
infix("<", 40);
infix("<=", 40);
infix(">", 40);
infix(">=", 40);

The ternary operator accepts three expressions, separated by ?and :. This is not an ordinary infix operator, so you have to define a function here led.

infix("?", 20, function (left) {
    this.first = left;
    this.second = expression(0);
    advance(":");
    this.third = expression(0);
    this.arity = "ternary";
    return this;
});

The dot ( .) operator is used to refer to a member of an object. There must be a name to the right of it, but it will be used as a literal.

infix(".", 80, function (left) {
    this.first = left;
    if (token.arity !== "name") {
        token.error("Expected a property name.");
    }
    token.arity = "literal";
    this.second = token;
    this.arity = "binary";
    advance();
    return this;
});

The operator is [used to access a member of an object or an element of an array dynamically. The expression on the right should be followed by a closing bracket ].

infix("[", 80, function (left) {
    this.first = left;
    this.second = expression(0);
    this.arity = "binary";
    advance("]");
    return this;
});

All these infix operators are left-associative. We can also create right-associative operators (for example, logical || and &&), reducing the right binding power.

var infixr = function (id, bp, led) {
    var s = symbol(id, bp);
    s.led = led || function (left) {
        this.first = left;
        this.second = expression(bp - 1);
        this.arity = "binary";
        return this;
    };
    return s;
}

The operator &&returns the first operand, if it is false, otherwise it returns the second. The operator ||returns the first operand, if true, otherwise returns the second. The false value is 0, an empty string "", falseor null. Any other value (including any object) is considered true.

infixr("&&", 30);
infixr("||", 30);

Prefix operators


The code that we used for associative infographic operators can be adapted for prefix operators. Prefix operators are associative. The prefix has no left binding force, because it does not bind to anything on the left. Sometimes prefix operators are reserved words.

var prefix = function (id, nud) {
    var s = symbol(id);
    s.nud = nud || function () {
        scope.reserve(this);
        this.first = expression(70);
        this.arity = "unary";
        return this;
    };
    return s;
}
prefix("-");
prefix("!");
prefix("typeof");

The method nudfor the bracket (calls advance(")")to find the pair bracket ). The token itself (does not fall into the syntax tree, because it nudreturns only the contents of the brackets.

prefix("(", function () {
    var e = expression(0);
    advance(")");
    return e;
});

Assignment Operators


We could use a function to define assignment operators infixr. But we’d better write a separate function assignmentbecause we want to do something else: make sure that the expression on the left is an lvalue, that is, you can really assign something to it, and set the field assignmentso that in the future we can easily find all the assignments.
var assignment = function (id) {
    return infixr(id, 10, function (left) {
        if (left.id !== "." && left.id !== "[" &&
                left.arity !== "name") {
            left.error("Bad lvalue.");
        }
        this.first = left;
        this.second = expression(9);
        this.assignment = true;
        this.arity = "binary";
        return this;
    });
};
assignment("=");
assignment("+=");
assignment("-=");

Note: we implemented something like inheritance. The function assignmentreturns the result of the call infixr, and the infixrresult of the call symbol.

Constants


The function constantcreates language constants. The method nudturns the name token into a literal token.

var constant = function (s, v) {
    var x = symbol(s);
    x.nud = function () {
        scope.reserve(this);
        this.value = symbol_table[this.id].value;
        this.arity = "literal";
        return this;
    };
    x.value = v;
    return x;
};
constant("true", true);
constant("false", false);
constant("null", null);
constant("pi", 3.141592653589793);

A character (literal)is a prototype for all string and numeric literals. The nudliteral token method returns the token itself.

symbol("(literal)").nud = itself;

suggestions


In the original, the Pratt method was created for functional languages ​​where there are only expressions. Most popular languages ​​use statements that are not as difficult to embed as expressions. We can easily process sentences if we add a new method to the tokens: std(statement denotation, sentence matching). The method stdis similar to nud, but it is called only at the beginning of the sentence.

The function statementparses one sentence. If the current token contains a method std, the token is reserved and this method is called. Otherwise, we consider the sentence to be an expression ending with a semicolon. For reliability, we will consider expressions that are not assignments or function calls as errors.

var statement = function () {
    var n = token, v;
    if (n.std) {
        advance();
        scope.reserve(n);
        return n.std();
    }
    v = expression(0);
    if (!v.assignment && v.id !== "(") {
        v.error("Bad expression statement.");
    }
    advance(";");
    return v;
};

The function statementsparses sentences until it encounters a token (end)or }, which marks the end of the block. The function returns a sentence, an array of sentences, or nullif no sentences were found.

var statements = function () {
    var a = [], s;
    while (true) {
        if (token.id === "}" || token.id === "(end)") {
            break;
        }
        s = statement();
        if (s) {
            a.push(s);
        }
    }
    return a.length === 0 ? null : a.length === 1 ? a[0] : a;
};

The function is stmtused to add sentence characters to the character table. It takes a idfunction as parameters std.

var stmt = function (s, f) {
    var x = symbol(s);
    x.std = f;
    return x;
};

A block proposal is a list of sentences in braces for which a new scope is defined. There is no scope for blocks in normal JavaScript, but they will be in our Simplified JavaScript.

stmt("{", function () {
    new_scope();
    var a = statements();
    advance("}");
    scope.pop();
    return a;
});


The function blockparses the block.
var block = function () {
    var t = token;
    advance("{");
    return t.std();
};

A sentence vardefines one or more variables in the current block. The variable name may be followed by an equal sign =and the initial value of the variable.

stmt("var", function () {
    var a = [], n, t;
    while (true) {
        n = token;
        if (n.arity !== "name") {
            n.error("Expected a new variable name.");
        }
        scope.define(n);
        advance();
        if (token.id === "=") {
            t = token;
            advance("=");
            t.first = n;
            t.second = expression(0);
            t.arity = "binary";
            a.push(t);
        }
        if (token.id !== ",") {
            break;
        }
        advance(",");
    }
    advance(";");
    return a.length === 0 ? null : a.length === 1 ? a[0] : a;
});

A sentence whiledefines a cycle. It includes an expression in brackets and a block.

stmt("while", function () {
    advance("(");
    this.first = expression(0);
    advance(")");
    this.second = block();
    this.arity = "statement";
    return this;
});

A sentence ifcreates a conditional construct. If a symbol follows a block else, then we also analyze the next block or the next sentence if.

stmt("if", function () {
    advance("(");
    this.first = expression(0);
    advance(")");
    this.second = block();
    if (token.id === "else") {
        scope.reserve(token);
        advance("else");
        this.third = token.id === "if" ? statement() : block();
    } else {
        this.third = null;
    }
    this.arity = "statement";
    return this;
});

A sentence is breakused to complete a cycle ahead of time.

stmt("break", function () {
    advance(";");
    if (token.id !== "}") {
        token.error("Unreachable statement.");
    }
    this.arity = "statement";
    return this;
});

A sentence is returnused to exit a function. It may contain an optional expression (return value of the function).

stmt("return", function () {
    if (token.id !== ";") {
        this.first = expression(0);
    }
    advance(";");
    if (token.id !== "}") {
        token.error("Unreachable statement.");
    }
    this.arity = "statement";
    return this;
});

Functions


Functions are executable values. A function can have an optional name (so that it can call itself recursively), a list of parameter names in brackets, and a body - a list of sentences in curly brackets. A function has its own scope.

prefix("function", function () {
    var a = [];
    new_scope();
    if (token.arity === "name") {
        scope.define(token);
        this.name = token.value;
        advance();
    }
    advance("(");
    if (token.id !== ")") {
        while (true) {
            if (token.arity !== "name") {
                token.error("Expected a parameter name.");
            }
            scope.define(token);
            a.push(token);
            advance();
            if (token.id !== ",") {
                break;
            }
            advance(",");
        }
    }
    this.first = a;
    advance(")");
    advance("{");
    this.second = statements();
    advance("}");
    this.arity = "function";
    scope.pop();
    return this;
});

Functions are performed using the operator (. When called, you can specify a number of arguments. We will check the left operand to cut off situations where the value on the left cannot be a function.

infix("(", 80, function (left) {
    var a = [];
    if (left.id === "." || left.id === "[") {
        this.arity = "ternary";
        this.first = left.first;
        this.second = left.second;
        this.third = a;
    } else {
        this.arity = "binary";
        this.first = left;
        this.second = a;
        if ((left.arity !== "unary" || left.id !== "function") &&
                left.arity !== "name" && left.id !== "(" &&
                left.id !== "&&" && left.id !== "||" && left.id !== "?") {
            left.error("Expected a variable name.");
        }
    }
    if (token.id !== ")") {
        while (true)  {
            a.push(expression(0));
            if (token.id !== ",") {
                break;
            }
            advance(",");
        }
    }
    advance(")");
    return this;
});

A symbol thisis a special variable. When a method is called, an object reference is stored in it.

symbol("this").nud = function () {
    scope.reserve(this);
    this.arity = "this";
    return this;
};

Literals for objects and arrays


An array literal is a set of expressions in square brackets separated by commas. Each expression is evaluated, and all results form a new array.

prefix("[", function () {
    var a = [];
    if (token.id !== "]") {
        while (true) {
            a.push(expression(0));
            if (token.id !== ",") {
                break;
            }
            advance(",");
        }
    }
    advance("]");
    this.first = a;
    this.arity = "unary";
    return this;
});

A literal for an object is a set of pairs in curly brackets, separated by commas. A pair consists of a key and an expression, which are separated by a colon ( :). A key is a literal or name that is interpreted as a literal.

prefix("{", function () {
    var a = [];
    if (token.id !== "}") {
        while (true) {
            var n = token;
            if (n.arity !== "name" && n.arity !== "literal") {
                token.error("Bad key.");
            }
            advance();
            advance(":");
            var v = expression(0);
            v.key = n.value;
            a.push(v);
            if (token.id !== ",") {
                break;
            }
            advance(",");
        }
    }
    advance("}");
    this.first = a;
    this.arity = "unary";
    return this;
});

What to think about and what to do


Созданное дерево можно передать в генератор кода или интерпретатор. Для создания дерева требуется минимум вычислений. И, как мы видим, требуется не так уж много усилий от программиста, чтобы написать такой парсер.

Мы можем добавить в функцию infix параметр кода операции, который поможет генератору кода. Мы можем также передавать туда дополнительные методы для свёртки констант и генерации кода.

Мы можем добавить другие предложения (например, for, switch и try), метки, больше кода для проверок на ошибки, восстановления после ошибок и кучу новых операторов. Мы можем добавить задание и вывод типов.

We can make our language extensible. We can let a programmer declare new statements and statements as easily as declare new variables.

Try the parser described in this article yourself.

You can find another example of using this parsing method in a JSLint project .

From a translator: I picked JSLint sources and decided that the Russian translation of this wonderful article would not hurt. The JSLint parser is truly extremely clear, powerful and easily extensible. Many thanks to KVie for editing the translation.

This article was published as part of the book Beautiful Code (ninth chapter). Russian translation of the entire book in electronic form can be purchased on the website of the publishing house "Peter".

Also popular now: