Creating a programming language using LLVM. Part 10: Conclusion and other LLVM goodies
- Translation
- Tutorial
Table of contents:
Part 1: Introduction and lexical analysis
Part 2: Implementing the parser and AST
Part 3: Generating LLVM IR code
Part 4: Adding JIT and optimizer support
Part 5: Language extension: Control flow
Part 6: Language extension: User-defined operators
Part 7: Language extension: Mutable variables
Part 8: Compilation into object code
Part 9: Adding debugging information
Part 10: Conclusion and other LLVM goodies
10.1. Conclusion
Welcome to the final part of the tutorial “Creating a programming language using LLVM”. Over the course of this tutorial, we have grown our small Kaleidoscope language from a useless toy into a fairly interesting (although perhaps still useless) toy.
It is interesting to see how far we have come, and how little code it has taken. We have built a complete lexical analyzer, parser, AST, code generator, an interactive execution loop (with a JIT!), and generation of debugging information in standalone executables, all in fewer than 1000 lines of code (excluding blank lines and comments).
Our small language supports a couple of interesting features: user-defined binary and unary operators, JIT compilation for immediate execution, and a few control flow constructs, generated in SSA form.
Part of the idea of this tutorial was to show you how easy it is to define, build, and play with a language. Building a compiler does not have to be a scary or mystical process! Now that you’ve seen the basics, I highly recommend taking the code and hacking on it. For example, try adding:
global variables - although the value of global variables in modern software engineering is questionable, they are often useful for quick little hacks, such as the Kaleidoscope compiler itself. Fortunately, adding global variables to our current setup is very easy: when looking up a value, check whether an unresolved variable is in the global symbol table before rejecting it. To create a new global variable, create an instance of the LLVM GlobalVariable class.
typed variables - Kaleidoscope currently supports only variables of type double. This makes the language very elegant, because supporting only one type means that you never have to specify variable types. Different languages handle this in different ways. The easiest way is to require the user to specify a type for each variable definition, and to record the type of each variable in the symbol table along with its Value*.
arrays, structures, vectors, etc. - once you add types, you can begin to extend the type system in all sorts of interesting directions. Simple arrays are very easy to add and are useful for many kinds of applications. Adding them is mostly an exercise in learning how the LLVM getelementptr instruction works: it is so elegant and unusual that it has its own FAQ!
standard runtime - in its current form, the language lets the user access arbitrary external functions, and we use this for things like “printd” and “putchard”. As you extend the language with higher-level constructs, it often makes more sense to lower such constructs to calls into runtime functions than to emit them as inline sequences of instructions.
memory management - Kaleidoscope currently has access only to the stack. It would also be useful to allocate memory on the heap, either by calling the standard libc malloc / free interfaces or by using a garbage collector. If you prefer garbage collection, note that LLVM fully supports Accurate Garbage Collection, including algorithms that move objects and need to scan / update the stack.
exception support - LLVM supports generating zero-cost exceptions that interoperate with code compiled in other languages. You could also generate code in which every function implicitly returns an error value that the caller checks. You could also implement exceptions by explicitly using setjmp / longjmp. In general, there are many different ways to go.
OOP, generics, database access, complex numbers, geometric programming, ... - really, there is no end to the crazy things that can be added to the language.
unusual applications - we have been talking about applying LLVM to an area that many people are interested in: building a compiler for a specific language. However, there are many other areas where compiler technology can be used that are not usually considered. For example, LLVM has been used to accelerate OpenGL graphics, to translate C++ code into ActionScript, and for many other interesting things. Perhaps you will be the first to JIT-compile a regular expression interpreter into native code using LLVM?
have fun - try to do something crazy and unusual. Building a language the same way everyone else does is not as much fun as trying something a little crazy. If you want to talk about it, feel free to write to the llvm-dev mailing list: there are many people there who are interested in languages and are often willing to help.
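As a rough illustration of the first two suggestions, here is a minimal sketch of a symbol table that records a type next to each value and falls back to the globals when a name is not found locally. It uses only the C++ standard library. The names `VarType`, `Symbol` and `SymbolTables` are hypothetical, and the `double` payload merely stands in for the `Value*` that the real compiler would store.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical type tags; Kaleidoscope itself only has double.
enum class VarType { Double, Int };

// What a typed symbol table might record per variable. In the real
// compiler the payload would be an llvm::Value* instead of a number.
struct Symbol {
    VarType type;
    double value;
};

struct SymbolTables {
    std::map<std::string, Symbol> locals;   // variables of the current function
    std::map<std::string, Symbol> globals;  // module-level variables

    // Register a global; defining the same global twice is treated as a
    // front-end bug in this sketch.
    void defineGlobal(const std::string &name, const Symbol &sym) {
        assert(globals.count(name) == 0 && "global already defined");
        globals.emplace(name, sym);
    }

    // Look in the local scope first, then fall back to the globals
    // before giving up. Returns nullptr when the name is unresolved.
    const Symbol *lookup(const std::string &name) const {
        auto local = locals.find(name);
        if (local != locals.end())
            return &local->second;
        auto global = globals.find(name);
        if (global != globals.end())
            return &global->second;
        return nullptr;
    }
};
```

A local definition shadows a global of the same name here; whether that is the right rule is a language design decision, not something the tutorial prescribes.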
Before we finish the tutorial, I want to give some tips on generating LLVM IR. These are subtleties that may not be obvious, but that are very useful if you want to take advantage of LLVM’s capabilities.
10.2. LLVM IR Properties
There are a couple of common questions about code in LLVM IR form; let's deal with them now.
10.2.1. Target platform independence
Kaleidoscope is an example of a "portable language": any program written in Kaleidoscope will work the same way on any target platform on which it runs. Many other languages have this property, for example Lisp, Java, Haskell, JavaScript, Python, etc. (note that although these languages are portable, not all of their libraries are).
One nice aspect of LLVM is that it can often preserve target independence at the IR level: you can take the LLVM IR for a program compiled by Kaleidoscope and run it on any target platform supported by LLVM, or even generate C code and compile that on target platforms that LLVM does not support natively. It is easy to see that the Kaleidoscope compiler generates platform-independent code, because it never queries any information about the target platform when generating code.
The fact that LLVM provides a compact, platform-independent representation of code is very attractive. Unfortunately, people are usually thinking about C or C-like languages when they ask about language portability. I say “unfortunately” because there is really no way to make (fully general) C code portable other than shipping the source code around, and of course C source code itself is not portable in the general case either; consider porting an old application from 32 to 64 bits.
The problem with C (again, in the general case) is that it relies heavily on platform-specific assumptions. As a simple example, the preprocessor destroys the target independence of the code when it processes the following text:
#ifdef __i386__
int X = 1;
#else
int X = 42;
#endif
Although problems like this can be attacked with increasingly complex solutions, they cannot be solved in the general case.
That said, a subset of C can be made portable. If you fix the primitive types to a fixed size (for example, int = 32 bits, long = 64 bits), do not care about ABI compatibility with existing binaries, and give up some other features, then you can get portable code. This makes sense for some specialized domains.
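To make the idea of fixed-size primitive types concrete, here is a minimal sketch using the standard fixed-width integer types. The names `portable_int`, `portable_long`, `widenAndMultiply` and `narrow` are hypothetical and only serve this illustration.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <limits>

// A "portable subset" pins each primitive type to one width on every
// target, instead of inheriting the platform's int and long.
typedef std::int32_t portable_int;
typedef std::int64_t portable_long;

static_assert(sizeof(portable_int) * CHAR_BIT == 32, "portable_int must be 32 bits");
static_assert(sizeof(portable_long) * CHAR_BIT == 64, "portable_long must be 64 bits");

// The product of two 32-bit values always fits in 64 bits, so this
// behaves the same on every target that provides these types.
portable_long widenAndMultiply(portable_int a, portable_int b) {
    return static_cast<portable_long>(a) * static_cast<portable_long>(b);
}

// Narrowing is only valid when the value fits; checked in debug builds.
portable_int narrow(portable_long v) {
    assert(v >= std::numeric_limits<portable_int>::min() &&
           v <= std::numeric_limits<portable_int>::max() &&
           "value does not fit in 32 bits");
    return static_cast<portable_int>(v);
}
```

With a plain `long`, the same multiplication could overflow on targets where `long` is 32 bits wide, which is exactly the kind of platform assumption the text warns about.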
10.2.2. Safety guarantees
Many of the languages mentioned above are also “safe” languages: it is impossible for a program written in Java to corrupt its address space and crash the process (assuming that the JVM has no bugs). Safety is an interesting property that requires a combination of language design, runtime support, and often operating system support.
It is definitely possible to implement a safe language on top of LLVM, but LLVM IR by itself does not guarantee safety. LLVM IR allows unsafe pointer casts, use-after-free bugs, buffer overruns, and a variety of other problems. Safety has to be implemented as a layer on top of LLVM and, fortunately, several groups have investigated this. Ask on the llvm-dev mailing list if you are interested in the details.
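One simple form such a layer can take is a runtime check that the front end emits (or calls) before every memory access. The following is only a sketch of that idea, not something the tutorial's compiler contains; `DoubleArray` and `checkedLoad` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical runtime representation of an array: storage plus length.
struct DoubleArray {
    const double *data;
    std::size_t length;
};

// Helper a safe front end could call instead of emitting a raw load.
// Returns false and leaves `out` untouched when the index is out of
// range, so an out-of-bounds read never happens.
bool checkedLoad(DoubleArray array, std::size_t index, double &out) {
    // Internal invariant of the runtime, as opposed to a user error.
    assert((array.data != nullptr || array.length == 0) &&
           "non-empty array needs storage");
    if (index >= array.length)
        return false;
    out = array.data[index];
    return true;
}
```

A real language would decide what a failed check means (an exception, a trap, a default value); the point is only that the decision is made above the IR level.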
10.2.3. Language-specific optimizations
There is one thing about LLVM that puts many people off: it does not solve all the world's problems in one system (sorry, world hunger, someone else will have to solve you some other day). One specific complaint is that LLVM is seen as incapable of performing high-level, language-specific optimization: LLVM "loses too much information".
Unfortunately, this is not the place to give you a complete and unified "theory of compiler design". Instead, I will make a few observations:
First, it is true that LLVM loses information. For example, at the LLVM IR level it is impossible to tell whether an SSA value came from a C "int" or a C "long" on an ILP32 machine (other than from debug information). Both are compiled down to a value of type “i32”, and the information about the source type is lost. The more general issue is that the LLVM type system uses structural equivalence rather than name equivalence. This also surprises people when two types in a high-level language have the same structure (for example, two different structs that each have a single int field): these types are compiled into one LLVM type, and it becomes impossible to tell which source struct a variable belonged to.
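The difference between the two notions of equivalence can be shown with a toy model. This sketch does not use the LLVM API; `SourceStruct`, `lowerOnILP32`, `sameByName` and `sameByStructure` are hypothetical names, and the lowering covers only the two C types mentioned above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of a source-level struct: its name plus the lowered types
// of its fields.
struct SourceStruct {
    std::string name;
    std::vector<std::string> fields;
};

// How a C front end might lower scalar types on an ILP32 target,
// where int and long are both 32 bits wide.
std::string lowerOnILP32(const std::string &cType) {
    if (cType == "int")
        return "i32";
    if (cType == "long")
        return "i32";  // same width as int on ILP32
    assert(false && "only int and long are modelled in this sketch");
    return "";
}

// Name equivalence: what the source language sees.
bool sameByName(const SourceStruct &a, const SourceStruct &b) {
    return a.name == b.name;
}

// Structural equivalence: only the field layout is compared.
bool sameByStructure(const SourceStruct &a, const SourceStruct &b) {
    return a.fields == b.fields;
}
```

Two structs named `Celsius` and `Fahrenheit`, each with one `int` field, are different under `sameByName` but identical under `sameByStructure`, which is the situation the paragraph describes.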
Second, although LLVM does lose information, LLVM is not a fixed target: we continue to extend and improve it in many directions. Besides adding new features (LLVM did not always support exceptions or debugging information), we also extend the IR to capture information that matters for optimization (whether an argument is zero-extended or sign-extended, information about pointer aliasing, etc.). Many of the improvements are user-driven: people want LLVM to include some specific feature, and it gets added.
Third, it is possible and easy to add language-specific optimizations, and there are a number of ways to do it. As a trivial example, it is easy to add a language-specific optimization pass that “knows” things about code compiled from a particular language. For the C family, there is an optimization pass that “knows” about the standard C library functions. If you call “exit(0)” in main(), it knows that the call can safely be turned into “return 0;”, because the C standard specifies what the exit function does.
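The shape of such a library-aware rewrite can be sketched on a toy instruction list. This is not LLVM's pass or its API; `ToyInstr`, `ToyFunction` and `simplifyExitInMain` are hypothetical names, and the toy IR has only calls and returns with a single integer operand.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy instruction: a call to a named function or a return.
struct ToyInstr {
    enum Kind { Call, Ret } kind;
    std::string callee;  // only meaningful for Call
    int operand;         // call argument or return value
};

struct ToyFunction {
    std::string name;
    std::vector<ToyInstr> body;
};

// Rewrites the first `call exit(N)` inside main into `ret N` and drops
// everything after it, since exit never returns. Functions other than
// main are left alone. Returns the number of rewrites performed.
int simplifyExitInMain(ToyFunction &fn) {
    if (fn.name != "main")
        return 0;
    for (std::size_t i = 0; i < fn.body.size(); ++i) {
        ToyInstr &instr = fn.body[i];
        if (instr.kind == ToyInstr::Call && instr.callee == "exit") {
            instr.kind = ToyInstr::Ret;
            instr.callee.clear();
            // Code after exit() was unreachable anyway.
            fn.body.erase(fn.body.begin() + i + 1, fn.body.end());
            assert(fn.body.back().kind == ToyInstr::Ret);
            return 1;
        }
    }
    return 0;
}
```

The rewrite is only valid because the pass assumes the C meaning of `exit`; for a language where a function called `exit` could mean something else, the same pass would be wrong, which is why it counts as language-specific.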
In addition to simple library knowledge, it is possible to embed a variety of other language-specific information in LLVM IR. If you have a specific need, please raise it on the llvm-dev mailing list. In the worst case, you can always treat LLVM as a “dumb code generator” and implement the high-level optimizations you want in your front end, on your language-specific AST.
10.3. Tips and Tricks
There are various useful tips and tricks that you pick up after working on / with LLVM and that are not obvious at first glance. So that not everyone has to rediscover them, this section covers some of them.
10.3.1. Implementing portable offsetof / sizeof
One interesting issue comes up if you are trying to keep the code generated by your compiler “platform independent”: you often need to know the size of some LLVM type or the offset of some field in a structure. For example, you might need to pass the size of a type to a function that allocates memory.
Unfortunately, these values can vary widely between platforms: the size of a pointer is the simplest example. However, there is a clever way to use the getelementptr instruction that lets you compute them portably.
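The text does not spell the trick out. As it is commonly described, you ask getelementptr for the address of element 1 of an array of the type starting at a null pointer, then convert that address to an integer; the result is the size, and the target fills in the number only at code generation time. The sketch below shows the same address arithmetic in C++ as an analogy, with a real base address instead of null. `Pair`, `sizeViaAddressing` and `valueOffsetViaAddressing` are hypothetical names, and the IR in the comment is an approximation in recent LLVM syntax.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Roughly the IR form of the trick (assumed, recent LLVM syntax):
//   %next = getelementptr %T, ptr null, i32 1
//   %size = ptrtoint ptr %next to i64

struct Pair {
    char tag;
    double value;
};

// Size of T as the distance between element 0 and element 1 of an array.
template <typename T>
std::size_t sizeViaAddressing() {
    T storage[2] = {};
    std::uintptr_t base = reinterpret_cast<std::uintptr_t>(&storage[0]);
    std::uintptr_t next = reinterpret_cast<std::uintptr_t>(&storage[1]);
    assert(next > base && "array elements sit at increasing addresses");
    return static_cast<std::size_t>(next - base);
}

// Offset of Pair::value as the distance between the object and the field.
std::size_t valueOffsetViaAddressing() {
    Pair p = {};
    std::uintptr_t base = reinterpret_cast<std::uintptr_t>(&p);
    std::uintptr_t field = reinterpret_cast<std::uintptr_t>(&p.value);
    assert(field >= base && "a field cannot precede its object");
    return static_cast<std::size_t>(field - base);
}
```

On mainstream platforms `sizeViaAddressing<Pair>()` agrees with `sizeof(Pair)` and `valueOffsetViaAddressing()` with `offsetof(Pair, value)`. The difference in the IR version is that nothing is evaluated by the front end, so the same IR stays valid for every target.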
10.3.2. Garbage-collected stack frames
Some languages want to manage their stack frames explicitly, often so that the frames can be garbage collected or to make closures easier to implement. There are often better ways to implement these features than explicit stack frames, but LLVM does support them if you want. To do this, your front end must convert the code to Continuation Passing Style and use tail calls (which LLVM also supports).
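For readers who have not met Continuation Passing Style, here is a minimal sketch of what the transformation does to a function, written by hand in C++. `Continuation` and `factorialCPS` are hypothetical names. Note that C++ does not guarantee tail-call elimination and the temporary `std::function` gets in its way here, so this shows only the shape of the code, not the stack behaviour a front end would get from LLVM's tail calls.

```cpp
#include <cassert>
#include <functional>

// A continuation receives the result instead of the caller getting a
// return value.
typedef std::function<void(long)> Continuation;

// Direct style would be: return n * factorial(n - 1).
// In CPS nothing is returned; the "rest of the computation" is passed
// along as `k`, and each call is the last statement of its function.
void factorialCPS(int n, const Continuation &k) {
    assert(n >= 0 && "factorial is undefined for negative input");
    if (n == 0) {
        k(1);
        return;
    }
    // The new continuation multiplies the partial result by n and then
    // hands it to the original continuation.
    factorialCPS(n - 1, [n, &k](long partial) { k(n * partial); });
}
```

Calling `factorialCPS(5, ...)` with a continuation that stores its argument delivers 120 to that continuation. Because the pending work lives in the continuation objects and not in the call stack, the language implementation is free to allocate them wherever it likes, for example on a garbage-collected heap.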