SLY_G November 15, 2014 at 00:33

Expressive JavaScript: Project: Programming Language

Transfer

Content

What checks and defines the meaning of expressions in a programming language is, in turn, just a program.

Hal Abelson and Gerald Sasman, “Structure and Interpretation of Computer Programs.”

When the teacher asked the teacher about the nature of the Data and Control cycle, Yuan-Ma replied: "Think of a compiler compiling itself."

Master Yuan-Ma, “The Book of Programming”

Creating your own programming language is surprisingly easy (as long as you don't set transcendental goals) and quite instructive.

The main thing I want to demonstrate in this chapter is that there is no magic in building a language. It often seemed to me that some human inventions were so complex and abstruse that I would never understand them. However, after a little self-education and picking, such things are often quite mundane.

We will build the programming language Egg. It will be small, simple, but powerful enough to express any calculations. It will also implement simple function-based abstractions.

Parsing

What lies on the surface of the language is syntax, notation. A grammar analyzer, or parser, is a program that reads a piece of text and produces a data structure that describes the structure of the program contained in the text. If the text does not describe the correct program, the parser must complain and indicate an error.

Our language will have a simple and uniform syntax. In Egg, everything will be an expression. An expression can be a variable, number, string, or application. Applications are used to call functions and constructs such as if or while.

To simplify parsing, lines in Egg will not support backslashes and the like. A string is simply a sequence of non-double-quoted characters enclosed in double quotes. Number is a sequence of numbers. Variable names can consist of any characters that are not spaces and have no special meaning in the syntax.

Applications are written in the same way as in JS - using brackets after the expression and with any number of arguments in brackets, separated by commas.

do(define(x, 10),
   if(>(x, 5)),
      print("много"),
      print("мало"))

The uniformity of the language means that what is operator in JS is applied in the same way as the rest of the functions. Since there is no concept of blocks in the syntax, we need a do construct to indicate several things that are done sequentially.

The data structure describing the program will consist of expression objects, each of which will have a type property that reflects the type of this expression and other properties that describe the contents.

Expressions of type “value” represent strings or numbers. Their value property contains the string or number that they represent. Expressions of the “word” type are used for identifiers (names). Such objects have a name property containing the name of the identifier as a string. Finally, “apply” expressions represent applications. They have an “operator” property that refers to the expression to use, and an “args” property with an array of arguments.

Part> (x, 5) will be represented as follows:

{
  type: "apply",
  operator: {type: "word", name: ">"},
  args: [
    {type: "word", name: "x"},
    {type: "value", value: 5}
  ]
}

This data structure is called a syntax tree. If you represent the objects in the form of points, and the connections between them in the form of lines, you will get a tree structure. The fact that expressions contain other expressions, which in turn can contain their expressions, is similar to how branches branch.

Syntax tree structure

Compare this with the parser we wrote for the settings file in chapter 9, which had a simple structure: it divided the input into lines and processed them one by one. There were only a few forms that are allowed to take a string.

Here we need a different approach. Expressions are not divided into lines, and their structure is recursive. Application expressions contain other expressions. Fortunately, this task is elegantly solved by applying a recursive function that reflects the recursiveness of the language.

We define a parseExpression function that takes an input string and returns an object containing the data structure for the expression from the beginning of the string, along with the portion of the string left after parsing. When parsing subexpressions (such as an application argument), this function is called again, returning the argument expression along with the remaining text. That text can, in turn, contain more arguments, or it can be a closing bracket ending the argument list.

The first part of the parser:

function parseExpression(program) {
  program = skipSpace(program);
  var match, expr;
  if (match = /^"([^"]*)"/.exec(program))
    expr = {type: "value", value: match[1]};
  else if (match = /^\d+\b/.exec(program))
    expr = {type: "value", value: Number(match[0])};
  else if (match = /^[^\s(),"]+/.exec(program))
    expr = {type: "word", name: match[0]};
  else
    throw new SyntaxError("Неожиданный синтаксис: " + program);
  return parseApply(expr, program.slice(match[0].length));
}
function skipSpace(string) {
  var first = string.search(/\S/);
  if (first == -1) return "";
  return string.slice(first);
}

Since Egg resolves any number of spaces in elements, we need to constantly cut spaces from the beginning of the line. SkipSpace handles this.

Skipping the leading spaces, parseExpression uses three regulars to recognize the three simple (atomic) elements supported by the language: strings, numbers, and words. The parser creates different structures for different types. If the input does not fit into any of the forms, this is not a valid expression, and it throws an error. SyntaxError is a standard error object that is created when you try to run an incorrect JavaScript program.

We can cut off the processed part of the program, and pass it, together with the expression object, to parseApply, which determines whether the expression is an application. If so, it parses the argument list in parentheses.

function parseApply(expr, program) {
  program = skipSpace(program);
  if (program[0] != "(")
    return {expr: expr, rest: program};
  program = skipSpace(program.slice(1));
  expr = {type: "apply", operator: expr, args: []};
  while (program[0] != ")") {
    var arg = parseExpression(program);
    expr.args.push(arg.expr);
    program = skipSpace(arg.rest);
    if (program[0] == ",")
      program = skipSpace(program.slice(1));
    else if (program[0] != ")")
      throw new SyntaxError("Ожидается ',' or ')'");
  }
  return parseApply(expr, program.slice(1));
}

If the next character of the program is a non-opening bracket, then this is not an application, and parseApply simply returns the expression given to it.

Otherwise, it skips the opening bracket and creates a syntax tree object for this expression. It then recursively calls parseExpression to parse each argument until it encounters a closing bracket. Recursion is indirect, parseApply and parseExpression call each other.

Since the application itself can be an expression (multiplier (2) (1)), parseApply should, after parsing the application, call itself again, checking to see if another pair of parentheses goes further.

That's all we need to parse Egg. We will wrap this in a convenient parse function, checking that it reaches the end of the line after parsing the expression (the Egg program is one expression), and this will give us the data structure of the program.

function parse(program) {
  var result = parseExpression(program);
  if (skipSpace(result.rest).length > 0)
    throw new SyntaxError("Неожиданный текст после программы");
  return result.expr;
}
console.log(parse("+(a, 10)"));
// → {type: "apply",
//    operator: {type: "word", name: "+"},
//    args: [{type: "word", name: "a"},
//           {type: "value", value: 10}]}

Works! It does not provide useful information on error, and does not store the row and column numbers with which each expression begins, which could be useful in parsing errors - but that's enough for us.

Interpreter

And what should we do with the syntax tree of the program? Run it! This is what the interpreter does. You give it a syntax tree and an environment object that associates names with values, and it interprets the expression represented by the tree and returns the result.

function evaluate(expr, env) {
  switch(expr.type) {
    case "value":
      return expr.value;
    case "word":
      if (expr.name in env)
        return env[expr.name];
      else
        throw new ReferenceError("Неопределённая переменная: " +
                                 expr.name);
    case "apply":
      if (expr.operator.type == "word" &&
          expr.operator.name in specialForms)
        return specialForms[expr.operator.name](expr.args,
                                                env);
      var op = evaluate(expr.operator, env);
      if (typeof op != "function")
        throw new TypeError("Приложение не является функцией.");
      return op.apply(null, expr.args.map(function(arg) {
        return evaluate(arg, env);
      }));
  }
}
var specialForms = Object.create(null);

The interpreter has code for each type of expression. For literals, it returns their value. For example, the expression 100 is interpreted as 100. For the variable, we must check whether it is defined in the environment, and if so, request its value.

It’s more complicated with applications. If this is a special form like if, we don’t interpret anything, we just pass the arguments along with the environment to the function that processes the form. If this is a simple call, we interpret the operator, check that it is a function, and call it with the result of the interpretation of the arguments.

To represent Egg function values, we will use simple JavaScript function values. We will come back to this later when we define the special form fun.

The recursive structure of the interpreter resembles a parser. Both reflect the structure of the language. It would be possible to integrate the parser into the interpreter and interpret during parsing, but their separation makes the program more readable.

That's all it takes to interpret Egg. So simple. But without defining a few special forms and adding useful values to the environment, you cannot do anything with this language.

Special forms

The specialForms object is used to define Egg special syntax. He juxtaposes words with functions that interpret these special forms. While it is empty. Let's add some forms.

specialForms["if"] = function(args, env) {
  if (args.length != 3)
    throw new SyntaxError("Неправильное количество аргументов для if");
  if (evaluate(args[0], env) !== false)
    return evaluate(args[1], env);
  else
    return evaluate(args[2], env);
};

Egg's if construct has three arguments. It computes the first, and if the result is not false, it computes the second. Otherwise, it calculates the third. This if is more like the ternary operator?:. This is an expression, not an instruction, and it produces a value, namely, the result of the second or third expression.

Egg differs from JavaScript in the way it handles the if condition. It will not consider zero or empty string as false.

if is presented in the form of a special form rather than a regular function, because the arguments of the functions are evaluated before the call, and if must interpret one of the two arguments - the second or third, depending on the value of the first.

The form for while is similar.

specialForms["while"] = function(args, env) {
  if (args.length != 2)
    throw new SyntaxError("Неправильное количество аргументов для while");
  while (evaluate(args[0], env) !== false)
    evaluate(args[1], env);
  // Поскольку undefined не задано в Egg, 
  // за отсутствием осмысленного результата возвращаем false
  return false;
};

Another main part of the language is do, which executes all arguments from top to bottom. Its value is the value returned by the last argument.


specialForms["do"] = function(args, env) {
  var value = false;
  args.forEach(function(arg) {
    value = evaluate(arg, env);
  });
  return value;
};

To create variables and give them values, we create a define form. She expects the word as the first argument, and the expression that produces the value to be assigned to this word as the second. define, like everything, is an expression, so it should return a value. Let it return the assigned value (just like the = operator in JavaScript).

specialForms["define"] = function(args, env) {
  if (args.length != 2 || args[0].type != "word")
    throw new SyntaxError("Bad use of define");
  var value = evaluate(args[1], env);
  env[args[0].name] = value;
  return value;
};

Environment

The environment accepted by the interpreter is an object with properties whose names correspond to the names of variables and the values to the values of these variables. Let's define an environment object representing the global scope.

To use the if construct, we must create Boolean values. Since there are only two of them, a special syntax is not needed for them. We just make two variables with true and false values.

var topEnv = Object.create(null);
topEnv["true"] = true;
topEnv["false"] = false;
Теперь мы можем вычислить простое выражение, меняющее булевское значение на обратное.
var prog = parse("if(true, false, true)");
console.log(evaluate(prog, topEnv));
// → false

To support simple arithmetic operators and comparisons, we will add several functions to the environment. To simplify the code, we will use new Function to create a set of operator functions in a loop, rather than defining them individually.

["+", "-", "*", "/", "==", "<", ">"].forEach(function(op) {
  topEnv[op] = new Function("a, b", "return a " + op + " b;");
});

A way to output values is also useful, so we wrap console.log in a function and call it print.

topEnv["print"] = function(value) {
  console.log(value);
  return value;
};

This gives us enough basic tools for writing simple programs. The following run function provides a convenient way to write and run. She creates a fresh environment, parses and parses the lines that we pass to her, as if they are one program.

function run() {
  var env = Object.create(topEnv);
  var program = Array.prototype.slice
    .call(arguments, 0).join("\n");
  return evaluate(parse(program), env);
}

Using Array.prototype.slice.call is a trick to turning an object that looks like an array, such as arguments, into a real array, so that we can apply join to it. She takes all the arguments passed to run, and believes that they are all lines of the program.

run("do(define(total, 0),",
    "   define(count, 1),",
    "   while(<(count, 11),",
    "         do(define(total, +(total, count)),",
    "            define(count, +(count, 1)))),",
    "   print(total))");
// → 55

We have already seen this program several times - it counts the sum of numbers from 1 to 10 in the Egg language. It is uglier than the equivalent JavaScript program, but not so bad for a language defined with less than 150 lines of code.

Functions

A programming language without functions is a bad language.

Fortunately, it is easy to add a fun construct that treats the last argument as the body of the function, and all the previous ones as the names of the arguments to the function.

specialForms["fun"] = function(args, env) {
  if (!args.length)
    throw new SyntaxError("Функции нужно тело");
  function name(expr) {
    if (expr.type != "word")
      throw new SyntaxError("Имена аргументов должны быть типа word");
    return expr.name;
  }
  var argNames = args.slice(0, args.length - 1).map(name);
  var body = args[args.length - 1];
  return function() {
    if (arguments.length != argNames.length)
      throw new TypeError("Неверное количество аргументов");
    var localEnv = Object.create(env);
    for (var i = 0; i < arguments.length; i++)
      localEnv[argNames[i]] = arguments[i];
    return evaluate(body, localEnv);
  };
};

Functions in Egg have their own local environment, as in JavaScript. We use Object.create to create a new object that has access to variables in the external environment (our prototype), but it can also contain new variables without changing the external scope.

The function created by the fun form creates its own local environment and adds argument variables to it. She then interprets the body in this environment and returns the result.

run("do(define(plusOne, fun(a, +(a, 1))),",
    "   print(plusOne(10)))");
// → 11
run("do(define(pow, fun(base, exp,",
    "     if(==(exp, 0),",
    "        1,",
    "        *(base, pow(base, -(exp, 1)))))),",
    "   print(pow(2, 10)))");
// → 1024

Compilation

We have built an interpreter. During the interpretation, it works with the program view created by the parser.

Compilation - adding one more step between the parser and the launch of the program, which turns the program into something that can be done more efficiently by doing most of the work in advance. For example, in well-organized languages, each time you use a variable, it is obvious which variable is accessed, even without starting the program. This can be used not to look for a variable by name every time it is accessed, but to directly call it from some predefined memory area.

By tradition, compilation also turns a program into machine code - a raw format suitable for execution by a processor. But each process of turning a program into a different kind is, in essence, a compilation.

You could create another Egg interpreter that first turns the program into a JavaScript program, uses the new Function to invoke the JavaScript compiler, and returns the result. With the right implementation, Egg would run very quickly with a relatively simple implementation.

If you are interested in this and want to spend time on it, I encourage you to try making such a compiler as an exercise.

Fraud

When we defined if and while, you might notice that they were simple wrappers around if and while in JavaScript. Values in Egg are also common JavaScript values.

Comparing the JavaScript implementation of Egg with the amount of work needed to create a programming language directly in the machine language, the difference becomes huge. However, this example, I hope, gives you an idea of how programming languages work.

And when you need to do something, cheating will be more effective than doing everything from scratch yourself. Although a toy language is no better than JavaScript, in some situations writing your own language helps you get the job done faster.

Such a language does not have to resemble the usual YP. If JavaScript did not contain regular expressions, you could write your parser and interpreter for such a sub-language.

Or imagine that you are building a giant dinosaur robot and you need to program its behavior. JavaScript is not the most efficient way to do this. Instead, you can choose a language with something like this:

behavior walk
  perform when
    destination ahead
  actions
    move left-foot
    move right-foot
behavior attack
  perform when
    Godzilla in-view
  actions
    fire laser-eyes
    launch arm-rockets

This is usually called the language for the selected area (domain-specific language) - a language specifically designed to work in a narrow direction. Such a language can be more expressive than a general-purpose language, because it is designed to express exactly those things that need to be expressed in this area - and nothing more.

Exercises

Arrays

Add array support to Egg. To do this, add three functions to the main scope: array (...) to create an array containing the values of the arguments, length (array) to return the length of the array and element (array, n) to return the nth element.

// Добавьте кода
topEnv["array"] = "...";
topEnv["length"] = "...";
topEnv["element"] = "...";
run("do(define(sum, fun(array,",
    "     do(define(i, 0),",
    "        define(sum, 0),",
    "        while(<(i, length(array)),",
    "          do(define(sum, +(sum, element(array, i))),",
    "             define(i, +(i, 1)))),",
    "        sum))),",
    "   print(sum(array(1, 2, 3))))");
// → 6

Short circuits

The method for defining fun allows functions in Egg to lock themselves around the environment, and use local variables in the body of the function that are visible during the definition, just like in JavaScript functions.

The following program illustrates this: the function f returns a function that adds its argument to the argument f, that is, it needs access to the local scope inside f to use the variable a.

run("do(define(f, fun(a, fun(b, +(a, b)))),",
    "   print(f(4)(5)))");
// → 9

Explain, using the definition of the form fun, which mechanism allows this construct to work.

Comments

It would be nice to have comments in Egg. For example, we could ignore the rest of the string by encountering the “#” character - just as it does with the // in JS.

You won’t have to make big changes to the parser. We simply change skipSpace so that it skips comments as if they are spaces - and in all places where skipSpace is called, comments will be skipped too. Make this change.

// Поменяйте старую функцию
function skipSpace(string) {
  var first = string.search(/\S/);
  if (first == -1) return "";
  return string.slice(first);
}
console.log(parse("# hello\nx"));
// → {type: "word", name: "x"}
console.log(parse("a # one\n   # two\n()"));
// → {type: "apply",
//    operator: {type: "word", name: "a"},
//    args: []}

Fix the scope

Now we can assign a value to a variable only through define. This design works both when assigning to old variables, and when creating new ones.

This ambiguity leads to problems. If you are trying to assign a new value to a non-local variable, you instead define a local one with the same name. (Some languages do this, but it always seemed to me a wacky way to work with scope).

Add a set form similar to define, which assigns a new value to the variable, updating the variable in the external scope if it is not set in the local one. If the variable is not set at all, throw a ReferenceError (another standard type of error).

The technique of representing scopes in the form of simple objects, which until now was convenient, will now bother you. You may need the Object.getPrototypeOf function that returns the prototype of an object. Also remember that the scope is not inherited from Object.prototype, so if you need to call hasOwnProperty on them, you have to use such a clumsy construction:

Object.prototype.hasOwnProperty.call(scope, name);

This calls the hasOwnProperty method of the Object prototype and then calls it on the scope object.

specialForms["set"] = function(args, env) {
  // Ваш код
};
run("do(define(x, 4),",
    "   define(setx, fun(val, set(x, val))),",
    "   setx(50),",
    "   print(x))");
// → 50
run("set(quux, true)");
// → Ошибка вида ReferenceError

Tags: