Creating a programming language using LLVM. Part 1: Introduction and lexical analysis

Original author: Chris Lattner, Erick Tryzelaar
  • Transfer
Welcome to the tutorial “Creating a Programming Language with LLVM”. This tutorial introduces you to the creation of a simple programming language, and at the same time shows how easy and interesting it can be, and also gives you basic knowledge that you can then apply to other programming languages. The code in this tutorial can also be used as a launching pad for your creations using LLVM.

The purpose of this textbook is to gradually introduce our language, a description of its step-by-step creation. This will allow us to cover a fairly wide range of issues of language design and the use of LLVM, simultaneously showing and explaining the code without a huge amount of unnecessary details.

It is useful to note in advance that this tutorial does teach some compilation techniques and the specifics of LLVM, but it does not teach the modern principles of software development. In practice, this means that we will use the most simplified and reduced solutions to simplify the presentation. So, for example, the code for working with memory everywhere uses global variables without using excellent design patterns such as Visitor and others - this is not the best way, but it is very simple and clear. If you understand the code and use it as the basis for future projects, then correcting such shortcomings will not be a problem for you.

I tried to compose this tutorial in such a way that you could easily skip parts that you already know or are not interested in. The structure of the textbook is as follows:
  • Chapter 1 : Introduction to the Kaleidoscope language and definition of its lexical analyzer - It will show in which direction we are moving and the basic functionality that we have to develop. In order to make this tutorial easier to understand, and the examples closer to reality, we decided to implement everything only in pure C ++ without the use of parser generators. LLVM, of course, works great with such tools, so you can easily use one of them.
  • Chapter 2 : Implementing the parser and AST- With a ready-made lexical analyzer, we can already talk about parsing methods and the basics of constructing AST. This guide describes recursive descent analysis and parsing operator precedence. Nothing in Chapters 1 or 2 applies to LLVM; the code does not even bind to LLVM in these chapters. :)
  • Chapter 3 : LLVM IR Code Generation- With the AST ready, we can show how easy it is to actually generate LLVM IR.
  • Chapter 4 : Adding JIT and Optimizer Support- Since many people are interested in using LLVM as JIT, we will dive into it and show you 3 steps to add JIT support. LLVM is also useful in many other ways, but it is one of the easiest and most “sexual” ways to demonstrate your strength. :)
  • Chapter 5 : Extending the language: Control flow- Now that we have a working programming language, we can expand it with control constructs (if / then / else and the “for” loop). This gives us the opportunity to talk about simple SSA building and flow control.
  • Chapter 6 : Language Extension: User-Defined Operators is a simple but interesting chapter about the language extension, which allows the user to define their own unary and binary operators (with an assignable priority!). This will allow us to build a significant chunk of the "programming language" as a library of routines.
  • Chapter 7 : Extending the Language: Variable Variables - This chapter talks about adding custom local variables using the assignment operator. The most interesting thing here is how easy and trivial it is to build in the SSA form: and no, LLVM does NOT require the use of the SSA form in your front end!
  • Chapter 8 : Conclusion and other LLVM goodies - This chapter concludes the series by talking about possible ways to expand the language, and also includes a bunch of links to various topics, such as Garbage Collection support, exceptions, debugging, spaghetti support -stack ”(“ spaghetti stacks ”), and a bunch of other tips and tricks.
By the end of the tutorial, we wrote a little less than 700 lines of code without taking into account comments and empty lines. With this small amount of code, we created a very good compiler for a non-trivial programming language, including a handwritten lexical analyzer, parser, AST, as well as support for code generation with JIT. While other systems have Hello, World-style tutorials, the width of this tutorial is a great testament to the power of LLVM. If you are interested in building languages ​​and compilers, you should definitely read this tutorial.

And one more thing: we expect from you that you will expand the language and play with it at our discretion. Take the code and start frantically programming, go crazy, compilers should not scare you - it's so fun to play with languages!

Tongue Kaleidoscope

This tutorial will illustrate a toy programming language, which we called " Kaleidoscope " (the name comes from three Greek words: "beautiful", "look" and "look, watch"). Kaleidoscope is a procedural language that allows you to define functions, use conditions, math, etc. During the study, we will expand the Kaleidoscope to support if / then / else constructs, loops, user statements, JIT compilation with a simple console interface, etc.

For simplicity, there is only one data type in the Kaleidoscope - a 64-bit floating-point number (like “double” in C). Thus, all values ​​are double precision real numbers and the language does not require type declarations. This makes the language very beautiful and simplifies the syntax. For example, the following simple example calculatesFibonacci numbers :

# Расчет x-го числа Фиббоначчи
def fib(x)
  if x < 3 then
    1
  else
    fib(x-1)+fib(x-2)
# Это выражение расчитывает 40-е число.
fib(40)

We also allowed Kaleidoscope to call standard library functions (LLVM JIT makes this quite trivial). This means that you can use the extern keyword to declare a function before using it (this is also useful for mutually recursive functions). For instance:

extern sin(arg);
extern cos(arg);
extern atan2(arg1 arg2);
atan2(sin(.4), cos(42))

A more interesting example is given in Chapter 6, where we will write a small application on the Kaleidoscope that displays the Mandelbrot set in different degrees of approximation.

And now dive into the implementation of this language!

Lexical analyzer

When it comes to implementing a programming language, the first thing you need is the ability to process a text file and recognize code in it. The traditional way to do this is to use a “ lexical analyzer ” (or “scanner”) to break the input stream into “tokens”. Each token returned by the lexical analyzer includes a code unit and potentially some metadata (for example, a numerical value). First, we will identify the possibilities:


// Лексический анализатор возвращает токены [0-255], если это неизвестны, 
// иначе одну из известных единиц кода
enum Token {
  tok_eof = -1,
  // команды (ключевые слова)
  tok_def = -2, tok_extern = -3,
  // операнды (идентификаторы, числа)
  tok_identifier = -4, tok_number = -5,
};
static std::string IdentifierStr;  // Заполняется, если tok_identifier (если токен - идентификатор)
static double NumVal;              // Заполняется, если tok_number (если токен - число)

Each token returned by our lexical analyzer will either be one of the Token enumeration values ​​or it will be “unknown” like the “+” character, which is returned as an ASCII value. If the current token is an identifier, then the global variable IdentifierStr contains the identifier name. If the current token is a numeric literal (e.g. 1.0), NumVal contains its value. Please note that we use global variables for simplicity, but this is far from the best choice for real use :).

In fact, our lexical analyzer is implemented by a single function called gettok. The Gettok function, when called, returns the next token from the standard input stream. Her definition begins like this:


/// gettok - Возвращает следующий токен из стандартного потока ввода.
static int gettok() {
  static int LastChar = ' ';
  // Пропускаем пробелы.
  while (isspace(LastChar))
    LastChar = getchar();

gettok calls the getchar () C function to read characters from one of the standard inputs. It recognizes them and saves the last read character in LastChar, but does not process it. The first thing she should do is to ignore the spaces between tokens. This is achieved in the above loop.

The next gettok action is to recognize identifiers and special keywords such as “def”. Kaleidoscope does this with this simple loop:


  if (isalpha(LastChar)) { // идентификатор: [a-zA-Z][a-zA-Z0-9]*
    IdentifierStr = LastChar;
    while (isalnum((LastChar = getchar())))
      IdentifierStr += LastChar;
    if (IdentifierStr == "def") return tok_def;
    if (IdentifierStr == "extern") return tok_extern;
    return tok_identifier;
  }

Please note that this code collects the global variable “IdentifierStr” while parsing the identifier. In addition, language keywords are checked in the same cycle. For numerical values, everything is similar:


  if (isdigit(LastChar) || LastChar == '.') {   // Число: [0-9.]+
    std::string NumStr;
    do {
      NumStr += LastChar;
      LastChar = getchar();
    } while (isdigit(LastChar) || LastChar == '.');
    NumVal = strtod(NumStr.c_str(), 0);
    return tok_number;
  }

This is pretty simple code to process the input. When reading a numeric value, we use the strtod C function to convert it to a numeric value, which we store in NumVal. Note that there is not enough error checking here: the value “1.23.45.67” will be read and processed as if you entered in “1.23”. You can easily supplement the code with checking for such errors :). Now let's deal with the comments:


  if (LastChar == '#') {
    // Комментарий до конца строки.
    do LastChar = getchar();
    while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
    if (LastChar != EOF)
      return gettok();
  }

Having defined the comment, we skip the characters to the end of the line, and then return the next token. Then, if the input does not coincide with one of the above cases, then this is either an operator like "+" or the end of the file. They are processed by this code:


  // Проверка конца файла.
  if (LastChar == EOF)
    return tok_eof;
  // В противном случае просто возвращаем символ как значение ASCII
  int ThisChar = LastChar;
  LastChar = getchar();
  return ThisChar;
}

At the same time, we have a full lexical analyzer for the Kaleidoscope language (a complete listing of the lexical analyzer code is available in the next chapter from the tutorial). Next, we will develop a simple parser that we use to build the Abstract Syntax Tree. Then we can use the lexical and parser together.

UPD
Useful links that are useful in learning:

Also popular now: