As I wrote the C compiler in 40 days

Original author: Rui Ueyama
  • Transfer
I suggest you translate the diary of Rui Ueyama, a programmer from Google, who he led while working on the implementation of the C language compiler about three and a half years ago (but published only last December).
This diary does not have any practical benefit and is not a tutorial, but I was very interested to read it, I hope you will like this story too :)

I wrote a C compiler for 40 days, which was called 8cc. This is a diary written by me at that time. The code and its history can be viewed on GitHub .

Day 8

I am writing a compiler. It started working after writing about 1000 lines of code. Here are some examples that already work:

int a = 1;
a + 2;  // => 3
int a = 61;
int *b = &a;
*b;     // => 61

Arrays are correctly converted to pointers, so the code below also works. Function calls are also supported.

char *c = "ab" + 1;
printf("%c", *c); // => b

It was not difficult to implement this, as I am doing this for the second time. I learned how to handle arrays and pointers better.

Day 15

I have come a long way in implementing the compiler and it works surprisingly well. Non-trivial programs, for example this one - the solving task of eight queens , are compiled and launched.

Of course, he lacks many features. These sample programs are chosen so as not to use them.
The implementation is quite simple; not even register allocation.
It compiles the source programs into the code of the stack machine, which uses the system stack as a stack. Each operation requires a memory access. But for now, it suits me.

At the beginning, the compiler fit in about 20 lines and the only thing it was able to do was read the integer value from the standard input and run a program that immediately ends returning the integer.

Now it contains about 2000 lines. If you look at git, it seems that it developed in this way:
  • added "+" and "-"
  • phases of parsing and code generation are separated
  • added "*" and "/"
  • variables added (implicitly implying int type)
  • function call added
  • lines added
  • separate tokenizer and syntax analysis
  • base type declaration support
  • pointers and arrays added
  • support for array initialization expressions
  • added “if”
  • function declaration supported
  • added for and return
  • pointer assignment supported
  • added "=="
  • Array indexing and pointer arithmetic added
  • added "++", "-" and "!"

Day 17

I have successfully implemented the structures. A structure is an object that can occupy more than one machine word. They are harder to implement than primitive types, but it was easier than I expected.

It seems to work as it should; I can define a structure containing a structure. I can define a pointer to a structure and dereference it. Structures containing arrays and arrays of structures also work. Although I already knew that the code should theoretically work, I was still happy when it really worked, even in such a difficult case.

However, I do not feel that I fully understand why this code is working correctly. It feels a little magical because of its recursive nature.

I cannot pass structures to functions. In the calling convention for x86, the structure is copied onto the stack and a pointer to it is passed to the function. But in x86-64 you have to split the structure into several pieces of data and pass them through registers. It’s difficult, so I’ll postpone it for now. Passing structures by value is less necessary than passing pointers to them.

Day 18

It was easier to implement associations, because it's just a variant of the structure in which all fields have the same offset. The "->" operator is also implemented. Easy peasy.

Organizing floating point support was hard. It seems like implicit type conversion between int and float works, but floating point numbers cannot be passed to functions. In my compiler, all function parameters are first pushed onto the stack and then written to registers in the order specified in the x86-64 calling convention. But there is apparently a bug in this process. It returns a memory access error (SIGSEGV). It is difficult to debug, considering assembler output, because my compiler does not optimize the assembler for reading. I thought I could finish it in a day, but I was wrong.

Day 19

I wasted time because I forgot that in accordance with the x86-64 calling convention, the stack frame should be aligned to 16 bytes. I found that printf () crashes with SEGV if I pass multiple floating point numbers to it. I tried to find the conditions under which this can be reproduced. It turned out that the position of the stack frame matters, which made me think about the requirements of ABI x86-64.

I didn’t take care of this at all, so the stack frame was aligned only by 8 bytes, but print () didn’t complain until it accepted only integers. This problem can be easily fixed by adjusting the stack frame before calling the CALL instruction. But such problems cannot be avoided unless you carefully read the specification before writing the code.

Day 20

I changed the indentation in the compiler code from 2 to 4. I’m more used to using 2 white space indents, because these are used at my work at Google. But for some reason, I think the 4-space indentation is more suitable for a "beautiful open source program."

There is another, more significant, change. I rewrote tests from shell scripts to C. Prior to this change, each test function compiled by my compiler was associated with main () which was compiled by GCC and then run by a shell script. It was slow because spawned many processes for each test. I had no choice when I started the project, because my compiler did not have many functions. For example, he could not compare the result with the expected value due to the lack of comparison operators. Now it is powerful enough to compile test code. So I rewrote them to make them faster.

I also implemented large types such as long and double. Writing the code was fun because I very quickly succeeded in implementing these functions.

Day 21

I almost finished implementing the C preprocessor in one day. This is actually a port from my previous attempt to write a compiler.

Implementing the C preprocessor is not an easy task.

This is part of the C standard, as defined in the specification. But the specification says too little for it to be useful for self-implementation. The specification includes several macros with their expanded form, but it says very little about the algorithm itself. I think she doesn’t even explain the details of his expected behavior. In general, it is underspecified.

As far as I know, pdf in this blogthe only and best resource on how to implement the C language preprocessor. The algorithm described in the document (Dave Prosser algorithm) tries to deploy as many tokens as possible, avoiding the endless expansion of the macro. Each token has its own deployment history, so tokens are not deployed by the same macro more than once.

The C preprocessor itself is an independent language. It has many features, and only seasoned C programmers understand it well.

Day 22

Tried to get the compiler to read system headers, so now it understands #include. While I tried, I was getting a lot of errors. This revealed that my preprocessor still lacks many functions, for example, operators that can only be used in #if. There are many bugs besides this. I corrected them as soon as I found.

System header files are large and confusing. They require many functions from the compiler, such as enum and typedef. I implemented them one by one, but sometimes I cut corners. I am trying to read stdio.h. I have no idea how long it can take.

The compiler now consists of 4000 lines. The small LCC compiler contains 12,000 lines. Using it as a guide, I think my compiler will soon be able to work like a real C compiler.

I am surprised to have written 500 lines of code today. I can work in the stream for 12 hours, but it can be inefficient, because I get tired without noticing it. In any case, I must admit that I am a person with a lot of free time.

Day 24

I do not remember what I fixed, but now stdio.h can connect. This is very cool, because the types of functions defined in the header file are now correctly processed.

The scheme that I use to implement my compiler implies creating a compiler for a small subset of the language and then developing it into a real C language. Until recently, I did not try hard to implement functions that I do not fully understand. I could write as much code as I wanted and leave the rest. It was fun.

External things, such as the header system, have caused many problems. Now I have to implement all the functions that are expected from the "real" C compiler. I made a lot of dirty hacks to read stdio.h. For example, I implemented a hack to ignore all occurrences of the “const” token. It upsets me.

You may ask why not do it in the right way from the start. I would say that this is not fun. Too. For example, the syntax for declaring types in C is too confusing for no reason and it’s not at all interesting to implement it.

Despite this, there are some things that I can’t avoid. I should probably change my mind to realize all the features, from start to finish. I can find it interesting as I get closer to the goal. Sometimes, I have to write more code than I want in order to achieve the goal.

Day 25

I was stuck for two days with implementing the syntax of definitions and declarations without any success. Why can't I finish this? I made pointers and structures in one day of work. I feel that I underestimated this. Maybe I need a plan?

Day 26

In a difficult situation like this, I probably should remember the fact that the compiler was in just one file to see how I progressed in a month. He simply read the integer through scanf () and printed it through printf (). Indeed, I have made very serious progress in one month. Yes, I think I can do it.

Day 28

Finished writing a parser for declarations and definitions. I think the reason I was failing was because I tried to write too many details from the very beginning, so I wrote a pseudocode to make sure that I understood everything correctly and then converted it to real code.

I have been writing C for almost 15 years, but only today I felt that I finally understood the type syntax in C. It is not surprising that I could not write working code. This is because I just did not understand it correctly.

The code I just wrote is too complex and fragile that even I can hardly understand it. I do not believe that Dennis Ritchie, creator of C, understood the consequences of what he was doing. I suspect that he came up with syntax, wrote code that was more complicated than he expected, and, in the end, was standardized by the ANSI committee. It is difficult to implement a standardized language because you have to do everything right. It is rather easier to write your own toy language.

Day 29

I implemented many more operators and cleaned the code.

Today, for the first time, my compiler was able to compile one of my files. When I linked it to other files compiled using GCC, it turned out to be working. And the resulting compiler also seems to work. It looks like the target is getting closer.

Day 30

I implemented switch-case, continue, break and goto today. When I wrote test cases for goto, the test code quickly turned into spaghetti code that I could not read. That made me laugh. I made sure why goto is considered harmful .

Day 31

Implemented functions for varargs, namely va_start, va_arg and va_end. They are not used often, but I needed them to compile functions like printf.

The vararg specification for C is not well thought out. If you pass all the arguments to a function through the stack, va_start can be implemented quite easily, but on modern processors and modern calling conventions, arguments are passed through registers to reduce the overhead of calling functions. Therefore, the specification does not correspond to reality.

Roughly speaking, ABI for x86-64, standardized by AMD, requires functions with a variable number of arguments to copy all the registers onto the stack in order to prepare for the next call to va_start. I understand that they had no other choice, but it still looks awkward.

I wondered how other compilers handle functions with a variable number of arguments. I looked at the TCC headers and it looks like they are not compatible with the ABI x86-64. If the data structure for varargs is different, then the functions passing va_list (such as vprintf) become incompatible. Or I'm wrong? [And I'm really wrong - they are compatible.] I also looked at Clang, but it looks confusing. I did not read it. If I read too much code from other compilers, this can ruin the fun from my own implementation.

Day 32

After fixing minor problems and adding escape sequences for string literals (there were still no '\ 0' and similar things), it turned out to compile another file. I feel confident progress.

I tried to implement support for functions with more than six parameters, but could not finish it in one day. In x86-64, the first 6 integer parameters are passed through the registers, and the remaining through the stack. Now only transfer through registers is supported. Passing through the stack is not difficult to implement, but it takes too much time to debug. I think in my compiler there are no functions that have more than six parameters, so for now I will postpone their implementation.

Day 33

Three more files compiled today. Total 6 out of 11. But if you count the lines of code, then this is about 10% of the total. The remaining files are much larger, because they contain the code of the compiler core.

Even worse, I use relatively new C features in kernel files, such as compound literals and designated initializers. They greatly complicate self-compilation. I shouldn't have used them, but rewriting code in plain old C would not be productive, so I want to support them in my compiler. Although it will take time.

Day 34

A few notes on debugging tools. Since the compiler is a complex piece of code that consists of many steps, you need a way to somehow examine it for debugging. My compiler is no exception; I implemented several functions that I found useful.

Firstly, the lexical analyzer remembers its reading position and when it is interrupted for unforeseen reasons, it returns this position. This makes it easy to find a bug when the compiler does not accept the correct input.

There is a command line option for printing an internal abstract syntax tree. If there is an error in the parser, I want to look at the syntax tree.

The code generator allows widespread use of recursion because it generates fragments of assembler code when it traverses an abstract syntax tree. So I was able to implement printing a mini stack trace for each line of assembler output. If I notice something wrong, I can follow the code generator by looking at its output.

Most internal data structures have functions for converting to a string. This is useful when using printf for debugging.

I always write unit tests when I write a new function. Even having implemented it, I try to keep the code compiled in order to run the tests. Tests are written to run in a short amount of time, so you can run them as often as you want.

Day 36

Implemented compound literals and rewrote the initializer of structures and arrays. I did not like the previous implementation. Now the initializer has become better. I had to write beautiful code from the very beginning, but since I understood this only by writing working code, rewriting was inevitable.

I think the only feature that is not enough for self-compilation is the assignment of structures. I hope everything will work as intended without much debugging when it is implemented.

Day 37

Файл, содержащий токенайзер скомпилирован, но получившийся компилятор второго поколения по некоторым причинам не генерирует корректный ассемблерный код. Хотя код, генерируемый компилятором первого поколения проходит все тесты. Такой вот коварный баг.

Думаю, что у меня нет другого выбора, кроме использования printf для отладки, потому что второе поколение компилируется через мой компилятор, который не поддерживает отладочную информацию. Я добавил printf в подозрительных местах. Отладочные сообщения printf отображались при компиляции второго поколения, что меня несколько удивило. Я хотел чтобы отладочные сообщения выводились только когда я использую второе поколение, так что я не ожидал что вывод будет работать, когда второе поколение только создается.

It reminds me of the movie "Beginning." We must go deeper to reproduce this bug. This is the fun part of debugging a self-compiling compiler.

Day 38

I fixed the problem that occurred in the second generation if the lexical analyzer was self-compiled. It caused a bug in which -1> 0 sometimes returned true (I forgot about the sign extension). There is another bug in the placement of structures (struct layout). There are only three files left.

Day 39

The code generator can now compile itself too. There are two files left. The work is almost done, although I may not be too optimistic. Unexpected pitfalls may still remain.

I fixed many problems caused by the poor quality of the code that I wrote at an early stage of this project. It bored me.

I believed that I had every opportunity for self-compilation, but this is not true. There are not even prefix increment / decrement operators. For some of the features of C99, I rewrote the compiler part to make it more convenient to compile. Since I did not expect to get to the possibility of self-compiling so quickly, I used as many new C features as I wanted.

Day 40

Hurrah! My compiler can now compile itself completely!

It took about 40 days. This is a very short time to write a self-compiling C. compiler. Don't you think so? I believe my approach is to first make a small compiler for a very limited subset of C, and then convert it to a real C compiler worked very well. In any case, today I am very happy.

Looking at my code, even knowing that I wrote it myself, I feel it is a little magical, because it can accept itself at the input and convert it to assembler.

There are many bugs and unrealized features. I’m probably going to finish with them, and then I will start work on improving the output code.

Source code is here.. I don’t know if it’s worth reading, but it may be interesting to see it as a C code sample that a simple 5000 line compiler can handle.

Day 41

He returned to systematic development, since a big milestone was reached. I changed the code, trying to read it as if I were a third party, and, as a result, was satisfied with the quality of the code. This is reminiscent of bonsai tree pruning.

I added a test to restart self-compilation to make sure that the results are the same for the second and third generation files. If I remember correctly, GCC has a similar test.

Day 42

There is some information that can be transmitted to the next generation only through executable files in binary format, despite the fact that it is not written to the source code of the compiler.

For example, my compiler interprets "\ n" (the backslash sequence and the character "n") into the string literal "\ n" (in this case, a newline character). If you think about it, you may find this a little strange because it does not have information about the real ASCII code for "\ n". Information about the character code is not present in the source code, but is transmitted from the compiler compiling the compiler. My compiler newlines can be traced back to GCC.
Compilers can convey much more information than a character code.

This amazing story was presented by Ken Thompson in his lecture on receiving the Turing Prize. The information that Ken added to an earlier version of the Unix compiler enabled the user login program to accept some special passwords, so Ken could log into any Unix account. He also made the compiler recognize its own compilation and pass the hack of the input program to the child compiler, so that this backdoor (which was not in the source code) was passed from generation to generation. You could not delete it, even if you carefully examined every line of the compiler source code and recompiled it, because the compiler that processes the source code was infected. This is an amazing story, isn't it?

Day 43

I rewrote the implementation of the parser for operators on the method of recursive descent instead of using the primacy of operators (operator-precedence parser). I think the reason for choosing a parser using operator precedence is simplicity and speed, but the grammar used in for C operators is too confusing to handle in this way (for example, array indices or various kinds of unary operators). Code is now easier to read than before because large functions are broken down into many small ones. It should also now be easier to add error checking to the parser.

Parser writing techniques are one of my most useful skills as a programmer. He has helped me countless times.

However, when I read the C language specification to write a parser for grammar using recursive descent, I found that some derivatives are left recursive. I had to think for a while and open the textbook again to remember how to rewrite the grammar to make it right-recursive. Eliminating left recursion is a basic topic in parsing, which is described in any introductory textbook. But I could not recall such elementary things, since I had not used the technique for a long period of time.

Day 44

The input data is now converted in this way: character string → token sequence → token sequence after macro substitution → abstract syntax tree → x86-64 assembler. The last transition, perhaps doing too much and too confusing. It performs various kinds of operations, such as implicit type conversions and expanding label names while generating assembly code. Theoretically, I probably should define an intermediate language between the ASD and the assembler.

I read Dragon Book on this topic again, without feeling that I fully understood it. The book is too abstract, so I can not immediately apply it to my case.

My compiler is twice as slow as GCC when it compiles itself. This is not as bad as I thought. My compiler generates terribly artless assembler, but such code is not slower by an order of magnitude.

Day 45

I was wondering how gcov would work on my code. He found many branches of code that failed unit tests. I found several bugs by adding tests for these code branches. Code coverage tools are really useful.

Day 46

I feel like I'm out of ideas on what to do with the compiler. It was not difficult for me to get to the current state of development, because I did not have to learn new things, but things outside this point seem difficult.

I pulled the implicit type conversion from a code generator. They are now clearly represented in the ASD. This makes it easy to understand what is happening under the hood. I also made haphazard improvements in various places. I thought that the work was almost completed, but in reality there were many unrealized features and errors.

I began to better understand compiler optimization after reading a few more pages from Dragon Book. I could start writing code if I figure it out a little better.

Day 52

I searched for a mysterious bug for three days: if you round up or down the stack pointer to align to a 16-byte border, the second generation compiler displays an error message for the correct input. These kinds of errors are harder to debug than regular fatal errors.

My first guess was: the stack pointer is not aligned correctly. But this is not so, because the stack pointer was already correctly aligned on the 16-byte boundary. I could not find bugs in assembler output. I decided to split in half.

I tried to compile each file through my compiler, and the rest through GCC to determine the function with which I can reproduce the problem. But it seemed that the function did not contain errors. This is not a function that displays an error message - they are not even in one file. The theory is that one function creates incorrect data that causes errors in other functions.

After lengthy haphazard debugging, I finally found a reason: the compiler does not initialize structure fields with zeros. The C specification requires that when initializing a structure, fields that are not initialized with values, the compiler must automatically fill with zeros. I knew the specification when I wrote the code (so my code depends on this behavior), but I forgot to implement it. As a result, some structure fields were initialized with garbage instead of zeros. The garbage data changes depending on the current position of the stack, so changing the stack pointer randomly changed the behavior of the program.

As a result, after a three-day debugging, only one line was fixed. I would like to have some means to simplify such debugging.

Day 53

Another mysterious bug has been fixed. If you compile some files with GCC, and the rest with my compiler, the resulting compiler reports syntax errors with valid input. Such issues should be related to ABI compatibility. My first assumption was that the problem might be related to the markup of the structures or how the arguments are passed to the functions, but the assembler output seems to be correct for them.

Again, I narrowed down the search to a specific file, shared in half. I transferred functions in turn from one file to another to find the function that was causing the problem. This function was not small, so I separated it and transferred the code to another file. Finally I got some lines of code. I compiled them using GCC and my compiler and compared.

Единственная разница заключалась в этом: мой компилятор проверяет все биты в регистре, чтобы определить содержит ли он булево значение true, тогда как GCC проверяет только младшие 8 бит. Таким образом, например, если значение в регистре 512 (= 29 или 0x100), то мой компилятор воспринимает его как true, тогда как GCC считает по-другому. GCC на самом деле возвращает некоторое очень большое число с нулями в наименее значимых 8 битах, что воспринимается как false.

Due to this incompatibility, a loop that uses a function compiled using GCC to check the exit condition (the loop itself is compiled by my compiler) stops immediately at the first iteration. As a result, preprocessor macros are not defined. And some of the tokens remained undefined and therefore the parser reported syntax errors with some input data. The reason was far from where the error was reported, but I finally caught it.

The x86-64 ABI specification has a quick note that only the lower 8 bits are significant for logical return values. I read this, but for the first time I did not understand what this means, so I did not even remember that such an indication existed. But now for me it is very clear. I have mixed feelings - I learned new things, but I could learn them without spending so much time on it.

Day 55

Implemented bit fields.

You can pack several variables in a small area using bit fields, but the compiler should still create space between them depending on their types. My quick research on the GCC conclusion revealed the following rules (of course, without guarantee that they are correct):

  • Выравнивание первого битового поля подчиняется выравниванию типа. Например, для «int x:5», x выравнивается по 4-байтовой границе (предполагается, что int — это 32 бита).
  • Любая переменная не должна пересекать свои естественные границы. Например, для «int x:10», x не должен пересекать 4-байтовую границу. Если оставшееся число битов слишком мало, чтобы удовлетворить этому условию, они остаются неиспользованными и следующее битовое поле начинается в следующих границах.

Accessing a bitfield requires more than one machine instruction, because the CPU has no instructions for accessing individual bits in memory. You must read the memory containing the bit field into the register, and then use the & mask to read the bit field. When you write it to memory, you must first read the memory containing the bit field, apply the bit mask &, the bitwise “or”, to get a new value, and then write the value back to memory.

Day 56

Implemented computed goto (computed goto). A regular goto can only reference one label, but a computed goto can take a pointer and pass control to that address. This is not in the C standard and exists as an extension to GCC. If you have ever written a virtual machine with a very large switch-case, you are most likely aware of this possibility. This function is often used to replace a large switch-case switch with a conversion table and computed goto.

The computed goto compiles into a single indirect jump instruction. This is probably easier to understand in terms of assembler. My compiler generates assembler, so it was very easy to implement.

Day 57

Implemented C11 _Generic because I wanted to use it in my compiler. But after implementation, I noticed that GCC 4.6.1 does not support _Generic. I cannot use this function because if I use it, then GCC will not be able to compile my compiler.

I also implemented the typeof () function, which is also an extension of GCC. Both of these functions are not used at the moment, but this is normal, since very little code was required to implement them.

Day 58

Added digraphs from C99. Digraph is an unusual function for specific environments in which it is impossible to use some characters. For example, "<:" is defined as an alias for "[". Digraphs can be converted to ordinary characters during tokenization, so this is useless, but also harmless.

There are trigraphs in C89 that are obviously harmful. Trigraphs are three-letter sequences of characters that are converted to one character wherever they appear in the source code. For example, printf ("huh ??!") Will output not "huh ??!", But "huh |" because "??!" This is a trigraph for "|". This is very confusing. I have no plans to support the trigraphs.

Day 62

I am trying to compile TCC . But so far I am far from successful due to unrealized functions and errors.

TCC is a small C compiler that ranges in size from 20K to 30k lines of code. If you remove support for architectures other than x86-64, the number of lines is likely to be around 10K-20K. It's amazing to see that such a small compiler supports so many functions. Fabrice Bellar , creator of TCC, is a genius.

I tried several times to read the TCC source code, but still do not understand the whole picture. Compilers are complex programs, so you usually want to break them into small manageable parts, but the TCC source code feels like a monolithic compiler. I can’t write such great code. I don’t know if I want to imitate, but the fact that there is someone in the world who can write such code really impresses me.

Day 73

Unsuccessfully continued to work on compiling TCC. Of course, this is difficult, because if it passes, it means my compiler has the same functions as TCC. Compiling other compilers is useful for debugging, but by its nature it is nitpicking every detail of the language specification.

Also popular now: