As I wrote the C ++ compiler. Retelling after 15 years

    15 years ago there was no Habrahabr, there was no facebook, and, characteristically, there was no C ++ compiler, with the output of diagnostic messages in Russian. Since then, several new C ++ standards have been released, development technologies have made a giant leap, and it may take several times less time to write your own programming language or code analyzer using existing frameworks. A post about how I started my career and through self-education and writing a C ++ compiler came to an expert level. General details of the implementation, how long it took, what happened in the end, and the meaning of the undertaking is also inside.


    How it all started

    In the distant 2001, when I bought the first Duron 800mhz / 128mb ram / 40gb hdd computer, I quickly set about studying programming. Although no, at first I was constantly tormented by the question, what to install Red Hat Linux, FreeBSD or Windows 98 / Me? The reference point in this endless world of technology for me was the Hacker magazine.
    Old such a mock journal. By the way, since then, the style of presentation in this publication has not changed much.

    Winders, lamers, trojans, elite, linukh - all this blew the roof. I really wanted
    to quickly master this entire stack, which they printed there and hack the Pentagon (without the Internet).
    The internal struggle over whether to become a Linuxoid or hacked into games on Windows continued until the Internet was brought into the house. A 56kb / s modem internet, which occupied the phone for the duration of the connection, and downloaded an mp3 song in the half hour area. With a price of about 0.1 $ / mb, one song pulled 40-50 cents. It is in the afternoon.

    But at night, there were very different prices. It was possible from 23.00 to 6.00 to stick to all sites without disabling images in the browser! Therefore, all that could be downloaded from the network overnight, swung to the screw, and then read already in the afternoon.
    On the first day, when they got me home and set up the network, the admin opened IE 5 and Yandex in front of me. And quickly retreated. Thinking what the first thing to look for on the network, I typed something like "a site for programmers." On that first reference to issue fell recently opened . And I began to hang on it for a long time, feeling dissatisfied with the fact that I understand a little. At that time, C ++ was the flagship and most popular language on the forum (and indeed). Therefore, the challenge was thrown, and there was no choice but to catch up with the bearded uncles in their knowledge of C ++.
    And there was no less interesting site at that time - I still think their method of presentation is the best. In small portions (steps), with small final results. Nevertheless, everything worked out!

    Actively buying books at a flea market, I tried to understand all the basics of programming. One of the first books bought was The Art of Programming - D. Knut. I don’t remember the exact motivation to buy this particular book, and not some C ++ for coffee pots, the seller probably recommended it, but with all my zeal, I started studying the first volume, with the obligatory completion of tasks at the end of each chapter. It was the pulp, and although I didn’t get along with mathematics at school, but with Knut, there was progress, because there was a great desire and motivation to write programs and do it right. Having mastered the algorithms and data structures, I already bought the 3rd volume of “The Art of Programming” Sort and search. It was a bomb. Pyramid sorting, quick sort, binary search, trees and lists, stacks and queues. I wrote all this on a piece of paper, interpreting the result in your head. I read at home, read when I was at sea, read everywhere. One solid theory, without implementation. At the same time, I did not even guess what enormous benefit this basic knowledge would bring in the future.
    Now, conducting interviews with developers, I have not yet met a person who could write an implementation of binary search or quick sort on a piece of paper. It's a pity.

    But back to the topic of the post. Having mastered Knuth, it was necessary to move on. Along the way, I went to Turbo Pascal courses, read Kernigan and Ritchie, and followed by C ++ for 21 days. From C and C ++, I did not understand everything, I just took and rewrote texts from books. There was no one to google or ask, but there was a time car, since I abandoned the school and switched to evening school, which was almost impossible to attend, or to appear 3-4 lessons a week.
    As a result, from morning to night, I fanatically developed, learning more and more new topics. Could write a calculator, could write a simple application on WinApi. On Delphi 6, too, it turned out something to stick. As a result, having received a diploma of secondary education, I was already prepared at the 3-4th year level of the university, and of course, what specialty to study is not an issue.

    Having entered the Department of Computer Systems and Networks, I already freely wrote in C and C ++ tasks of any level of complexity of the university. Although, going to the same, I realized how much more needs to be studied and how experienced forum users pumped me in the pros. This hurt, misunderstanding and at the same time a burning desire to know everything, led me to the book “Compilers. Instruments. Methods Technologies ”- A.Aho, Ravi Networks. In common people called the book of the Dragon. This is where the fun began. Before this book, Herbert Schildt, C ++ Theory and Practice , was read in which he covered advanced development topics such as encryption, data compression, and most interesting - writing your own parser.

    Having started to scrupulously study the book of the dragon, moving from lexical analysis, then to syntax and finally to checking semantics and code generation, a fateful decision came to me - to write my C ++ compiler.
    - And why not, he asked himself?
    “Come on,” answered that part of the brain that, with age, is becoming more skeptical of everything
    new. And compiler development has begun.


    By that time, the modem Internet was blocked for me, due to the change of telephone lines to digital ones, so the 1998 ISO C ++ standard was downloaded for reference . Visual C ++ 6.0 has become an already beloved and familiar tool.
    And in essence, the task boiled down to implementing what is written in the C ++ standard. Help in developing the compiler was a dragon book. And the starting point was the parser calculator from Schildt’s book. All parts of the puzzle came together and development began.


    nrcpp \ KPP_1.1 \
    The second chapter of the ISO C ++ 98 standard includes preprocessor requirements and lexical conventions. That's nice, I thought, because this is the simplest part and can be implemented separately from the compiler itself. In other words, the file preprocessing is started first, the input of which is a C ++ file in the form that you are used to seeing it. And after the preprocessing, the output we have a converted C ++ file, but already without comments, substituted files from #include, substituted macros from #define, saved by #pragma and processed conditional compilation # if / # ifdef / # endif.
    Before preprocessing:
    #define MAX(a, b) \
    	((a) > (b) ? a : b)
    #define STR(s)  #s
     This is the entry point of program
    int main()
    	printf("%s: %d", STR(This is a string), MAX(4, 5));

    After preprocessing:
    int main()
              printf("%s: %d", "This is a string", ((4) > (5) ? 4 : 5));

    On top of that, the preprocessor did a lot of useful work, such as evaluating constant expressions, concatenating string literals, and outputting #warning and #error. Oh yes, have you ever seen digraphs and trigraphs in the C-code? If not, be aware - they exist!
    Example of trigraphs and digraphs
    int a <: 10:>; // equivalent int a [10];
    if (x! = 0) < %% > // equivalent if (x! = 0) {}

    // Trigraph example
    ?? = define arraycheck (a, b) a ?? (b ??) ??! ?? ! b ?? (a ??)
    // maps to
    #define arraycheck (a, b) a [b] || b [a]

    Learn more on the wiki .

    Of course, the main benefit of the C ++ preprocessor is the substitution of macros and the insertion of files indicated in #include.
    What did I learn while writing a C ++ preprossor?
    • How are the vocabulary and syntax of the language
    • C ++ operator priorities. And in general, how expressions are calculated
    • Lines, symbols, contants, postfixes of constants
    • Code structure

    In general, writing a preprocessor took about a month. Not too difficult, but also a non-trivial task, nonetheless.
    At this time, my classmates tried to write the first “Hello, world!”, But at least put it together. Not everyone succeeded. And the following sections of the C ++ standard were waiting for me, with the immediate implementation of the language compiler.

    Lexical analyzer

    nrcpp / LexicalAnalyzer.cpp
    Everything is simple, I already wrote the main part of the vocabulary analysis in the preprocessor. The task of the lexical analyzer is to parse the code into tokens or tokens, which will already be analyzed by the parser.
    What was written at this point?
    • State machine for the analysis of integer, real and symbolic constants. Think it's easy? However, it’s simple when you went through it.
    • State machine for string character analysis
    • Parsing variable names and C ++ keywords
    • Something else like giving a drink. I will remember I will add

    Syntactical analyzer

    nrcpp / Parser.cpp
    The task of the parser is to check the correctness of the arrangement of tokens that were received at the stages of lexical analysis.
    Again, a simple Shieldt parser, pumped to the C ++ syntax level, with a stack overflow check, was taken as the basis of the parser. If we write for example:
    (((((((((((((((((((((((((((((0))))))))))))))))))))))))))))))))); // кол-во скобок может быть больше

    Then my recursive analyzer will eat the stack and tell that the expression is too complicated.
    An attentive reader may have a question. Why reinvent the wheel, after all, there was yacc and lex. Yes there was. But at that stage, I wanted a bike with full control over the code. Of course, in performance it was inferior to the code generated by these utilities. But that was not the goal - technical excellence. The goal was to understand everything.


    nrcpp / Checker.cpp
    nrcpp / Coordinator.cpp
    nrcpp / Overload.cpp

    It occupies chapters 3 to 14 of the ISO C ++ 98 standard, respectively. This is the most difficult part, and I'm sure> 90% of C ++ developers don’t know all the rules described in these sections. For example:
    Did you know that a function can be declared twice, this way:
    void f(int x, int y = 7);
    void f(int x = 5, int y);

    There are such constructions for pointers:
    const volatile int *const volatile *const p;

    And this is a pointer to a member function of class X:
    void (X::*mf)(int &)

    This is the first thing that came to mind. Needless to say, when testing code from the standard in Visual C ++ 6, I often received Internal Compiler Error.

    The development of a language semantics analyzer took me 1.5 years, or one and a half year of uni. During this time, I was almost kicked out, in other subjects besides programming, I got a triple for happiness (well, ok four), and in the meantime the compiler was developed and overgrown with functionality.

    Code generator

    nrcpp / Translator.cpp
    At this stage, when the enthusiasm began to fade a bit, we already have a working version of the front-end compiler. What to do next with this front-end, the developer decides for himself. You can distribute it in this form, you can use it to write a code analyzer, you can use it to create your own converter like C ++ -> C #, or C ++ -> C. At this stage we have a syntactically and semantically valid AST (abstract syntax tree) .
    And at this stage, the compiler’s developer understands that he has comprehended Zen, achieved enlightenment, and can reluctantly understand why the code works in this way. To achieve my goal, to create a C ++ compiler, I decided to end up generating C code, which could then be converted to any existing assembler language or fed to existing Sish compilers (as Straustrup did in the first versions of “C with classes” )

    What is not in nrcpp?

    • Templates (templates) . C ++ templates, this is such a tricky system in terms of implementation, that I had to admit without interfering with the parser and mixing it with semantics - the templates will not work properly.
    • namespace std . You can’t write a standard library without templates. Yes, however, and it would take many, many months, since it occupies the lion's share of the standard.
    • Compiler internal errors . If you play with the code, you can see messages like:
      internal compiler error: in.txt (20, 14): “theApp.IsDiagnostic ()” -> (Translator.h, 484)
      This is either not implemented functionality, or not considered semantic rules.

    Why write your bike?

    And in conclusion, I want to note what this post was written for. Writing your bike, even if it took more than 2 years, has been feeding me so far. This is invaluable knowledge, a base that will be with you throughout the career of a developer. Technologies, frameworks will change, new languages ​​will come out - but the foundation in them will be laid from the past. And their understanding and development will take very little time. - source code of the compiler. You can play the right file in.txt and watch the output in out.txt. - sources of the preprocessor. Assembled using Visual C ++ 6.

    Also popular now: