Mono.Cecil: make your own “compiler”

    One of the richest topics for programmers who enjoy reinventing the wheel is writing their own languages, interpreters, and compilers. A program that can create or execute other programs instinctively inspires awe in coders: it is complex and large, but insanely exciting to build.

    Most people start with their own interpreter, which is essentially one huge switch over commands inside a loop. Interesting and easy, but tedious and very slow. I want something nimbler: a platform that JIT-compiles the code and, preferably, manages memory on its own.

    An excellent solution is to choose .NET as the target platform. Let's leave lexical analysis for another time and today build a simple program that creates a working executable: the generator asks for a name and produces an executable that prints Hello, %username% to the console. There are several ways to create an executable, for example:







    • Translating to C# code and calling csc.exe: simple but unsporting
    • Generating IL code in text form and compiling it with ilasm.exe: inconvenient because you have to hand-write a huge manifest
    • Generating an assembly directly using Reflection (System.Reflection.Emit) or Cecil

    I chose the last option. Unfortunately, I don't know in what ways Cecil is better than Reflection for this task, but the example I came across used Cecil, so that is the one I will walk through.

    Mono.Cecil is a library that lets you work with an assembly as an array of bytes. With it you can both create your own assemblies and inspect and modify existing ones. It provides a wide range of classes that are (usually) convenient to use.

    Subject of conversation


    Here is the finished code (omitting the class declaration, the form, and everything else except the generator method itself):

    using System;
    using Mono.Cecil;
    using Mono.Cecil.Cil;
    public void Compile(string str)
    {
      // create the assembly and set its name, version, and kind: a console application
      var name = new AssemblyNameDefinition("SuperGreeterBinary", new Version(1, 0, 0, 0));
      var asm = AssemblyDefinition.CreateAssembly(name, "greeter.exe", ModuleKind.Console);
      // import the string and void types into the assembly
      // (note: in Mono.Cecil 0.10 and later, Import is named ImportReference)
      asm.MainModule.Import(typeof(String));
      var void_import = asm.MainModule.Import(typeof(void));
      // create the Main method: static, private, returning void
      var method = new MethodDefinition("Main", MethodAttributes.Static | MethodAttributes.Private | MethodAttributes.HideBySig, void_import);
      // keep a short reference to the code generator
      var ip = method.Body.GetILProcessor();
      // this is where the magic happens
      ip.Emit(OpCodes.Ldstr, "Hello, ");
      ip.Emit(OpCodes.Ldstr, str);
      ip.Emit(OpCodes.Call, asm.MainModule.Import(typeof(String).GetMethod("Concat", new Type[] { typeof(string), typeof(string) })));
      ip.Emit(OpCodes.Call, asm.MainModule.Import(typeof(Console).GetMethod("WriteLine", new Type[] { typeof(string) })));
      ip.Emit(OpCodes.Call, asm.MainModule.Import(typeof(Console).GetMethod("ReadLine", new Type[] { })));
      ip.Emit(OpCodes.Pop);
      ip.Emit(OpCodes.Ret);
      // register the type that will own this method: all the attributes were chosen
      // empirically by looking at a disassembled executable
      var type = new TypeDefinition("supergreeter", "Program", TypeAttributes.AutoClass | TypeAttributes.Public | TypeAttributes.AnsiClass | TypeAttributes.BeforeFieldInit, asm.MainModule.Import(typeof(object)));
      // add the type to the assembly
      asm.MainModule.Types.Add(type);
      // attach the method to the type
      type.Methods.Add(method);
      // set the entry point of the executable
      asm.EntryPoint = method;
      // save the assembly to disk
      asm.Write("greeter.exe");
    }
    


    Now let's look more closely at the scary-looking central part, which is what actually generates the code.

    What is going on there?


    Written in C#, the same program would look like this (omitting the class declaration):

    static public void Main()
    {
      Console.WriteLine("Hello, " + "username");
      Console.ReadLine();
    }
    


    To do this, we take two strings: the first is a constant, and the second is known at the moment the generator runs, so it is emitted as a constant too. We push both onto the stack. String.Concat concatenates them and leaves the result on top of the stack, where Console.WriteLine picks it up and prints it.

    After that, so the program doesn't close before we can read the output, we call Console.ReadLine(). It returns the line it read, which we don't need, so we discard it from the stack and then, with a sense of accomplishment, return from Main, a function that by now feels almost like an old friend.
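    The walk-through above can be traced with an ordinary Stack<string>. This is only an illustrative simulation, not how the runtime is implemented; the class name StackWalkthrough, the Run method, and the placeholder text standing in for the user's input are all made up for this sketch.

```csharp
using System;
using System.Collections.Generic;

public static class StackWalkthrough
{
    // Simulates the evaluation stack of the generated Main step by step.
    // Returns the text WriteLine would print and records the stack depth after each step.
    public static string Run(string name, List<int> depths)
    {
        var stack = new Stack<string>();

        stack.Push("Hello, ");                       // ldstr "Hello, "
        depths.Add(stack.Count);

        stack.Push(name);                            // ldstr <name baked in by the generator>
        depths.Add(stack.Count);

        string right = stack.Pop();                  // call String.Concat: takes two operands...
        string left = stack.Pop();
        stack.Push(string.Concat(left, right));      // ...and pushes one result
        depths.Add(stack.Count);

        string printed = stack.Pop();                // call Console.WriteLine: takes one operand, pushes nothing
        depths.Add(stack.Count);

        stack.Push("<line typed by the user>");      // call Console.ReadLine: pushes its return value (placeholder here)
        depths.Add(stack.Count);

        stack.Pop();                                 // pop: discard the unused value
        depths.Add(stack.Count);

        return printed;                              // ret: the stack is empty again
    }
}
```

    Calling StackWalkthrough.Run("username", depths) fills depths with the stack height after each instruction, which makes it easy to see that the stack is empty when the method returns.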

    About Bytecode


    We are generating a program for the .NET virtual machine, so the method body naturally consists of its instructions. .NET is a stack-based virtual machine, so all operations work on operands that lie on the stack. A complete list of instructions can be found on Wikipedia, but I will describe in detail only the ones I used.

    Ldstr pushes a string onto the stack, so it naturally takes a string as its operand. Strictly speaking, “loads a string onto the stack” means that the stack receives not the string itself but only a reference to the place in memory where it lives. For us as IL programmers this doesn't matter; what matters is that the following instructions can take it from there and use it.

    Call, as the name suggests, calls a method. It takes a reference to an object describing that method, which must first be imported. To import it, you “find” the method on its type by passing the name and an array of its parameter types, which is why the code looks so awful. Ideally, you would write a helper that converts a string like “String.Concat(string, string)” into this horror; you can try doing that yourself.
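    One possible shape for such a helper is sketched below. It uses only System.Reflection and a small lookup table, so it understands just the handful of type names listed in it. The names MethodLookup, KnownTypes, Resolve, and Find are invented for this example and are not part of Cecil or .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public static class MethodLookup
{
    // Hypothetical table of the type names this sketch understands; extend as needed.
    private static readonly Dictionary<string, Type> KnownTypes = new Dictionary<string, Type>
    {
        { "string", typeof(string) },
        { "String", typeof(string) },
        { "object", typeof(object) },
        { "int", typeof(int) },
        { "Console", typeof(Console) },
    };

    // Maps a type name from the table to a Type, ignoring surrounding whitespace.
    private static Type Resolve(string name)
    {
        Type type;
        if (!KnownTypes.TryGetValue(name.Trim(), out type))
            throw new ArgumentException("Unknown type: " + name);
        return type;
    }

    // Parses "Type.Method(arg, arg)" and returns the matching reflection MethodInfo.
    public static MethodInfo Find(string signature)
    {
        int open = signature.IndexOf('(');
        int close = signature.LastIndexOf(')');
        int dot = open > 0 ? signature.LastIndexOf('.', open) : -1;
        if (dot < 0 || close < open)
            throw new FormatException("Expected Type.Method(args): " + signature);

        Type owner = Resolve(signature.Substring(0, dot));
        string methodName = signature.Substring(dot + 1, open - dot - 1).Trim();
        string args = signature.Substring(open + 1, close - open - 1);

        // An empty argument list means a parameterless overload.
        Type[] parameters = args.Trim().Length == 0
            ? Type.EmptyTypes
            : args.Split(',').Select(Resolve).ToArray();

        MethodInfo method = owner.GetMethod(methodName, parameters);
        if (method == null)
            throw new MissingMethodException(owner.FullName, methodName);
        return method;
    }
}
```

    With a helper like this, an import in the generator could shrink to something like asm.MainModule.Import(MethodLookup.Find("Console.WriteLine(string)")), assuming the same Import call the generator already uses.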

    Pop removes the top item from the stack. Nothing special. We need it because Console.ReadLine() returns a value and our function is void, so we cannot leave that value on the stack and must clear it.

    Ret (from the word “return”) exits the current function. Every function must end with one, and there may be more than one, depending on how many exit points the function has.
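    To play with these four instructions without writing an executable to disk, the same opcodes can be emitted into an in-memory method. Note that the sketch below swaps Cecil for System.Reflection.Emit's DynamicMethod, purely for convenience; the class name OpcodeDemo and the method Greet are made up for the example. Unlike the void Main above, this method returns a string, so exactly one value is left on the stack when Ret executes.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

public static class OpcodeDemo
{
    // Builds a parameterless method returning string in memory, runs it, and returns its result.
    public static string Greet(string name)
    {
        var method = new DynamicMethod("Greet", typeof(string), Type.EmptyTypes);
        ILGenerator il = method.GetILGenerator();

        MethodInfo concat = typeof(string).GetMethod(
            "Concat", new[] { typeof(string), typeof(string) });

        il.Emit(OpCodes.Ldstr, "unused");    // stack: "unused"
        il.Emit(OpCodes.Pop);                // stack: (empty), the value is discarded
        il.Emit(OpCodes.Ldstr, "Hello, ");   // stack: "Hello, "
        il.Emit(OpCodes.Ldstr, name);        // stack: "Hello, ", name
        il.Emit(OpCodes.Call, concat);       // stack: the concatenated string
        il.Emit(OpCodes.Ret);                // return the single value left on the stack

        var greet = (Func<string>)method.CreateDelegate(typeof(Func<string>));
        return greet();
    }
}
```

    Removing the Pop line from this sketch leaves an extra value on the stack at Ret, which produces invalid IL; that is the same reason the generator discards the result of Console.ReadLine().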

    Results



    In the end, after compiling and running the program, entering your name, and pressing the weighty Compile button, we get a tiny greeter.exe binary in the same folder, weighing exactly 2048 bytes.

    We launch it, and voila!
