Papa Carlo and incremental compilers



    Colleagues,

    and remember there was such a translation article on Habr Checklist of the programming language developer Colin Macmillen about the problems of new programming languages? The article is simply amazing! If you haven’t read it, be sure to check it out.

    One of the key issues Colin talks about: languages ​​without good IDE support no one needs. Of course, this is not the only problem that the developer of the programming language faces. But, I think everyone will agree that, ceteris paribus, a language supported by many editors will already have a good competitive advantage.

    By coincidence, I’ve been dealing with compilers and language plugins for IDEs for many years. And I will be glad to share my experience with you about how to make a compiler that will be much easier to integrate with many modern code editors. And at the same time I’ll talk a little about my own developments in this area.


    Problem


    It's no secret, friends, that many of us have thought about creating our own language, and some have even tried or are going to try. The process is very creative, and fascinating. If you do not go into details, then in general terms, the compiler consists of the following components:

    • A parser that parses the source code and turns it into a syntax tree.
    • A semantic analyzer that connects the essence of the language with each other: links between names, variables, classes.
    • The backend responsible for generating and optimizing machine code.


    Most compilers work in a single pass: the programmer submits the source, and the output receives the finished program, or a list of syntax errors that need to be fixed. Depending on the size of the source and the complexity of the language, the compilation process can take from seconds to minutes.

    When developing a language plugin for a code editor, say, for a language with static typing like Java, this approach is not applicable. That is, you can’t keep the programmer waiting until our compiler recompiles the entire project and checks to see if there are any errors in the code after one small change. Of course, if we want to do something not quite so trivial as syntax highlighting, but at least display syntax errors in real time.

    The approach with a full recompilation of the project will not be applicable to the IDE, even if we disable the compiler backend. As the amount of source code in the project increases, compilation time will still increase.

    The so-called incremental approach comes to the rescue . The idea is that the compiler caches almost all the intermediate results of its work: the results of compiling individual files, their syntax trees, and even individual language tokens. And if the user-programmer makes small changes to the source code, the compiler will parse only some local piece of code around these changes. Thus, the compiler's performance becomes proportional to the changes made to the code, and not to the volume of the entire code.

    Unfortunately, developing a parser for an incremental compiler is a fairly non-trivial task. Especially considering that the parser must also be able to parse code that contains syntax errors. For example, if a programmer makes a syntax error at the beginning of a class declaration:

    import javax.swing.JApplet;
    import java.awt.Graphics;
    public class Hello extends JApplet {
        int x = // Я начал объявлять переменную, но еще не закончил.
        public void paintComponent(final Graphics g) {
            g.drawString("Hello, world!", 65, 95);
            // Но это не помешает мне продолжить писать код тут.
        }
    }
    


    In the method below, the programmer can still write code that the editor will understand: the code-completion, jump-to-definition, and many other IDE functions will be available to the user.

    There are few ready-made generators and combinators of incremental parsers, and they are very specific. Say, in a monstrous product like ANTLR , the latest version added support for incremental parsing in some form. I must say that ANTLR is a rather heavy product, working with it is much more difficult than with some PEG combinator of regular (non-incremental) parsers like PEG.js in JavaScript.

    We have to admit the sad fact that today, with rare exceptions, the development of language plugins for each individual text editor or IDE is more or less “on the knee”. And it is a rather difficult task, from which independent products often grow.

    Decision


    I am currently working on a Papa Carlo project that will help greatly simplify the task of creating language plugins for the IDE. This is a library on the Rock, which allows you to build a fully functional incremental parser, suitable for creating a language plug-in, or even a full-fledged incremental compiler.

    The developer sets the grammar of the language directly in the code on the Rock using the API of this library. And the resulting parser can parse, including code containing syntax errors, and create a syntax tree directly “out of the box”. There is no additional step in code generation. The parser is created in runtime, like many modern combinators of ordinary parsers like the same JParsec for Java.

    Then the developer associates the outputs of the created compiler with the API of those code editors that he wants to support. For example, with the Sublime Text API .

    In my opinion, this approach is much simpler than developing separately the compiler and language plugins from scratch.

    The project has not yet been completed, but I have already uploaded a working version on GitHub under the Apache license , and have done some experiments. For example, there is a ready incremental parser for JSON files. A parser is defined by exactly one grammar file in Scala. The parser code can be found here .

    In one of the tests, here is a json containing explicit syntax errors:

    {
      "key 1": "hello world",
      "key 1.1":
      "key 2": ["array value 1", "array value 2", "array value 3"],
    }
    


    Nevertheless, at the output, the parser quite successfully parses those parts that do not contain errors. And he creates such a tree:

    object 1 {
      entry:
        entry 27 {
          key: "key 1"
          value:
            string 26 {
              value: "hello world"
            }
        }
      entry:
        entry 25 {
          key: "key 1.1"
          value:
            string 24 {
              value: "key 2"
            }
        }
    }
    


    Pointing to syntax errors, of course:

     > code mismatched:
    {
      "key 1": "hello world",
      "key 1.1":
      "key 2"<<<: ["array value 1", "array value 2", "array value 3"],>>>
    }
    


    However, another example is much more interesting, in which a relatively large file with a size of 600 lines is introduced . After the first start, the parser safely creates a syntax tree, and 0.27 seconds works. Which, in general, is not enough. Then, small changes are made to the file twice, and on the second and third starts, the parser already works for 0.007 and 0.008 seconds, respectively. Similarly, creating a syntax tree for all 600 lines of these new files. This effect is achieved precisely due to the use of the cache obtained during previous parser launches.

    Input file Size (Rows) The difference with the previous (line) AST parsing and building time (milliseconds)
    step0.json 634 - 270
    step1.json 634 1 7
    step2.json 634 2 8


    Conclusion and references


    Unfortunately, the format of the article makes it possible to outline the topic only in the most general terms, omitting the details. Nevertheless, I hope that I managed to convey the very essence of the problem of incremental compilation and the work of language extensions for code editors.

    I am sure that among us there are still developers who have experience in creating extensions for the IDE. It would be very interesting to hear your additions and comments.

    Here are some useful links at last:
    • Grammar Kit . JetBrains parser constructor used in plugin development for IntelliJ Idea.
    • Java Development Tools . The incremental Java parser used internally by Eclipse.
    • Parboiled . Non-incremental parser constructor for Scala and Java. Despite the fact that he builds ordinary, non-incremental parsers, this is one of the most developed and famous parser constructors in the Scala community. In my opinion, the project deserves attention.
    • Papa Carlo . My own incremental parser constructor for Scala, mentioned above.

    Also popular now: