The Friday programmer, or how I wrote a library for lexical and syntactic analysis of code

Hello! As a programmer, I am always looking for ways to improve my skills. One Friday evening, a thought popped into my head: “Why not try writing a compiler?”

If you are interested in what came of it, read on.

Based on V.E. Karpov's “Classical Compiler Theory”, there are five main compilation stages:

  1. Lexical analysis
  2. Parsing
  3. Intermediate code generation
  4. Optimization
  5. Target (object) code generation

A great deal could be written about each of these five stages. Today, however, we will talk about the first two and how I extracted them into a separate library.

When I decided to write even a small part of a compiler, I had no particular language in mind; as a result, I ended up with a library for lexical and syntactic analysis of any language.

Tokenization


Before building lexical and syntax trees, generating the resulting code, and doing other tasty things, the source code has to be broken down into strings, symbols, and numbers.

Each of these elements gets a specific type; tokens whose type cannot be determined will be treated as syntax errors during parsing.

In the context of compilation, the source code acts as a source map. It is good practice to keep it throughout lexical and syntactic analysis, so that you can give the programmer feedback and point to syntax errors in the source code.

To break the source code into an array of tokens, you can use a simple regular expression:

/\S+/gm

It may vary depending on additional parsing requirements, such as handling line breaks, tabs, and spaces.

The result of the split will be an array of words from the source code, where each word runs from one whitespace to the next; that is, a construction like this:

let hello = 'world';

Will be converted to the following set of tokens:

["let", "hello", "=", "'world';"]

To get the final set of tokens, you need to run each of them through an additional regular expression:

/(\W)|(\w+)/gm

The result will be the required set of tokens:

["let", "hello", "=", "'", "world", "'", ";"]

We write all of the resulting tokens into an array, along with their indices in the source map.
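
Here is a minimal sketch of this two-pass tokenization in JavaScript, following the two regular expressions above. The function name tokenize and the token shape {value, range} are illustrative and are not taken from the library itself:

function tokenize(source) {
    const tokens = [];
    const wordRegexp = /\S+/gm;          // first pass: split on whitespace
    const tokenRegexp = /(\W)|(\w+)/gm;  // second pass: split each word into tokens
    let word;
    while ((word = wordRegexp.exec(source)) !== null) {
        let part;
        while ((part = tokenRegexp.exec(word[0])) !== null) {
            const start = word.index + part.index;
            tokens.push({
                value: part[0],
                range: [start, start + part[0].length] // position in the source map
            });
        }
    }
    return tokens;
}

tokenize("let hello = 'world';");
// → [{value: "let", range: [0, 3]}, {value: "hello", range: [4, 9]}, ..., {value: ";", range: [19, 20]}]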

Lexical analysis


Now that we have an array of tokens, we need to determine their types in order to pass them on to syntactic analysis.

The algorithm that identifies tokens and their types is called a lexer.
A token together with the type assigned to it by the lexer is called a lexeme.

Each token can have a uniquely defined type, for example:

['let', 'const', 'var'] // Keywords
['=', '+', '-', '*', '/'] // Operators
And so on.

I implemented a scheme for determining token types with the help of so-called Solvers.
It works as follows:

1. You define constants for types of tokens:

const DefaultTokenTypes = {
    KEYWORD: "Keyword",
    IDENTIFIER: "Identifier",
    OPERATOR: "Operator",
    DELIMITER: "Delimiter",
    LINEBREAK: "Linebreak",
    STRING: "String",
    NUMERIC: "Numeric",
    UNKNOWN: "Unknown"
};

2. Next, define the so-called Solvers:

const solvers = {};
solvers[DefaultTokenTypes.KEYWORD] = {
    include: [
        'const',
        'let'
    ]
};
solvers[DefaultTokenTypes.NUMERIC] = {
    regexp: /^[0-9.]*$/gm
};
solvers[DefaultTokenTypes.STRING] = {
    type: StringSolver,
    delimiters: ['"', "'", '`']
};
solvers[DefaultTokenTypes.IDENTIFIER] = {
    regexp: /^[a-zA-Z_][a-zA-Z_0-9]*$/gm
};
solvers[DefaultTokenTypes.DELIMITER] = {
    default: true
};

As you can see, solvers can have different combinations of settings:

include - An array of words; if a token matches one of them, it is assigned this type.
regexp - A regular expression; if a token matches it, it is assigned this type.
default - The fallback type for tokens that no other solver matched.

You can also see the type parameter, which indicates that this solver should inherit from the solver specified in type.

In this case, the solver recognizes strings enclosed in one of the characters listed in delimiters.

3. We apply the solvers to the array of tokens and get an array of typed tokens. For example, for this source code:

let a = 50;

We get the following array of lexemes:

[
  {
    "type": "Keyword",
    "value": "let",
    "range": [0, 3]
  },
  {
    "type": "Identifier",
    "value": "a",
    "range": [4, 5]
  },
  {
    "type": "Delimiter",
    "value": "=",
    "range": [6, 7]
  },
  {
    "type": "Numeric",
    "value": "50",
    "range": [8, 10]
  },
  {
    "type": "Delimiter",
    "value": ";",
    "range": [10, 11]
  }
]

Here, range is the start and end of the fragment in the source code.
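
As a rough sketch of what applying the solvers could look like (the function names resolveType and lex are illustrative and not the library's actual API; the StringSolver case is omitted for brevity):

function resolveType(value, solvers) {
    let fallback = DefaultTokenTypes.UNKNOWN;
    for (const [type, solver] of Object.entries(solvers)) {
        // include: match against a fixed list of words
        if (solver.include && solver.include.includes(value)) return type;
        // regexp: match against a regular expression (rebuilt without the /g flag,
        // because a global regexp keeps state between .test() calls)
        if (solver.regexp && new RegExp(solver.regexp.source).test(value)) return type;
        // default: remember this type as the fallback
        if (solver.default) fallback = type;
    }
    return fallback;
}

function lex(tokens, solvers) {
    // tokens come from the tokenization step as {value, range} objects
    return tokens.map(token => ({
        type: resolveType(token.value, solvers),
        value: token.value,
        range: token.range
    }));
}

Applied to the tokens of let a = 50; with the solvers defined above, this produces lexemes like the ones shown in the listing.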

Parsing


After obtaining the array of typed tokens, we need to parse it and determine the syntactic structure (tree) of the source code.

There are various approaches to parsing, but I chose a straightforward top-down algorithm.

The tokens are parsed one by one against an array of patterns. If a pattern matches the current sequence of tokens, a new branch is created in the syntax tree.

An example of a single pattern from that array:

let declaration = new SequenceNode({
    tokenType: DefaultTokenTypes.KEYWORD,
    type: MyNodeTypes.DECLARATION,
    sequence: [
        {type: DefaultTokenTypes.KEYWORD},
        {type: DefaultTokenTypes.IDENTIFIER},
        {type: DefaultTokenTypes.DELIMITER},
        {type: [DefaultTokenTypes.NUMERIC, DefaultTokenTypes.STRING]},
        ';'
    ],
    onError: (e) => {
        //e - Syntax error
    }
});

tokenType - Describes the token type at which the node starts checking for a match.
type - Describes the type of the node; all node types must also be defined, just like token types.
sequence - The sequence array, containing token types, specific values, or other syntax tree nodes.
onError - A callback invoked when a syntax error occurs while parsing this node; it receives the error type and its location in the source code.

Let us analyze the sequence of this node:

[
        {type: DefaultTokenTypes.KEYWORD}, // Current token -> must be a keyword
        {type: DefaultTokenTypes.IDENTIFIER}, // Current token + 1 -> must be an identifier
        {type: DefaultTokenTypes.DELIMITER}, // Current token + 2 -> must be a delimiter
        {type: [DefaultTokenTypes.NUMERIC, DefaultTokenTypes.STRING]}, // Current token + 3 -> must be a number or a string
        ';' // Current token + 4 -> must be a semicolon
],
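
As an illustrative sketch (not the library's actual SequenceNode implementation), matching such a sequence against the token stream could look roughly like this; nested nodes inside the sequence are not handled here:

function matchSequence(tokens, pos, node) {
    const children = [];
    let current = pos;
    for (const expected of node.sequence) {
        const token = tokens[current];
        if (!token) return null;                        // ran out of tokens
        if (typeof expected === 'string') {
            if (token.value !== expected) return null;  // exact value, e.g. ';'
        } else {
            const types = [].concat(expected.type);     // a single type or an array of types
            if (!types.includes(token.type)) return null;
        }
        children.push(token);
        current++;
    }
    // On success, the matched tokens become a new branch of the syntax tree
    return { type: node.type, children, next: current };
}

The parser walks the array of lexemes and tries a node such as declaration whenever the current token's type equals its tokenType; in the real node, a mismatch would trigger the onError callback instead of simply returning null.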

I implemented several variations of nodes for different purposes: defining sequences of tokens and defining groups of elements (arguments, blocks). This makes it easy to parse, for example, arrow functions.

You can read about all the variations of Solvers and Nodes that I have implemented in the documentation of this library.

Materials


→ Link to the source code: here
→ V.E. Karpov, “Classical Compiler Theory”
