ph_piter March 20, 2019 at 14:15

DOTS Stack: C ++ & C #

Transfer

This is a brief introduction to our new Data Oriented Technology Stack ( DOTS ). We will share some insights to help you understand how and why Unity has become just like that today, and also tell you in which direction we plan to develop. In the future, we plan to publish new articles on the DOTS blog on the Unity blog.

Let's talk about C ++. This is the language in which modern Unity is written.
One of the most complex problems a game developer has to deal with one way or another is this: the programmer must provide an executable file with instructions that are clear to the target processor, and when the processor executes these instructions, the game should start.

In the part of the code that is performance sensitive, we know in advance what the final instructions should be. We just need a simple way that allows us to consistently describe our logic, and then check and make sure that the instructions that we need are generated.

We believe that the C ++ language is not too good for this task. For example, I want my loop to be vectorized, but there may be a million reasons why the compiler will not be able to vectorize it. Either today it is being vectorized, and tomorrow it is not, due to some seemingly trifling change. It’s hard to even make sure that all my C / C ++ compilers will even vectorize my code.

We decided to develop our own “quite convenient way to generate machine code” that would meet all our wishes. We could spend a lot of time in order to slightly bend the whole sequence of C ++ design in the direction we need, but we decided that it would be much more reasonable to invest our energy in developing a tool chain that would completely solve all the design problems that confront us. We would develop it taking into account precisely those tasks that the game developer has to solve.

What factors do we prioritize?

Performance = correct. I should be able to say: “if for some reason this loop is not vectorized, then this must be a compiler error, and not a situation from the category“ oh, the code began to work only eight times slower, but still gives true values, business something! ”
Cross platform. The input code that I write should remain exactly the same regardless of the target platform - be it iOS or Xbox.
We should have a neat iteration loop in which I can easily see the machine code generated for any architecture as I change my source code. The machine code “viewer” should be of great help with training / explanation when you need to understand what all these machine instructions do.
Security. As a rule, game developers do not put safety on a high position in their list of priorities, but we believe that one of the coolest features of Unity is that it is really very difficult to damage memory in it. There should be such a mode in which we run any code - and we unambiguously fix an error by which a large letter displays a message about what happened here: for example, I went beyond the limits when reading / writing or tried to dereference zero.

So, having figured out what is important for us, let's move on to the next question: in which language is it better to write programs from which such machine code will then be generated? Let's say we have the following options:

Own language
Some adaptation / subset of C or C ++
Subset of c #

What, C #? For our inner loops whose performance is especially critical? Yes. C # is a completely natural choice, with which in the context of Unity there are a lot of very nice things:

This is the language our users are already working with today.
It has an excellent IDE, both for editing / refactoring, and for debugging.
There is already a compiler that converts C # to an intermediate IL (we are talking about the Roslyn compiler for C # from Microsoft), and you can simply use it instead of writing your own. We have rich experience converting an intermediate language into IL, so we just need to perform code generation and postprocessing a specific program.
C # is devoid of many C ++ problems (hell with the inclusion of headers, PIMPL patterns, long compilation time)

I myself really like writing code in C #. However, traditional C # is not the best language in terms of performance. The C # development team, the teams responsible for the standard library and runtime over the past couple of years have made tremendous progress in this area. However, while working with C #, it is impossible to control exactly where your data is located in memory. And it is precisely this problem that we need to solve to increase productivity.

In addition, the standard library of this language is organized around “objects on the heap” and “objects that have pointers to other objects”.

At the same time, working with a code fragment in which performance is critical, you can almost completely do without a standard library (goodbye to Linq, StringFormatter, List, Dictionary), prohibit selection operations (= no classes, only structures), reflection, disable garbage collector and virtual calls, and add a few new containers that are allowed to use (NativeArray and company). In this case, the remaining elements of the C # language already look very good. See the Aras blog for examples, where it describes a makeshift path tracer project.

Such a subset will help us easily cope with all the tasks that are relevant when working with hot cycles. Since this is a complete subset of C #, you can work with it like with regular C #. We can receive errors associated with going abroad when trying to access, we will get excellent error messages, we will support the debugger, and the compilation speed will be such that you already forgot about it when working with C ++. We often refer to this subset as High Performance C # or HPC #.

Burst compiler: what today?

We wrote a code generator / compiler called Burst. It is available in Unity version 2018.1 and higher as a package in the "preview" mode. A lot of work remains to be done with him, but we are pleased with him today.

Sometimes we manage to work faster than in C ++, often - still more slowly than in C ++. The second category includes performance bugs, which, we are convinced, will be able to cope.

However, simply comparing performance is not enough. No less important is what needs to be done to achieve such performance. Example: we took the culling code from our current C ++ renderer and ported it to Burst. The performance has not changed, but in the C ++ version we had to do incredible balancing act to persuade our C ++ compilers to do vectorization. The version with Burst was about four times more compact.

Honestly, the whole story with "you should rewrite your code critical for performance in C #" at first glance did not appeal to anyone in the internal Unity team. For most of us, it sounded like “closer to the hardware!” When working with C ++. But now the situation has changed. Using C #, we completely control the entire process from compiling the source code to generating machine code, and if we don’t like any detail, we just take it and fix it.

We are going to slowly but surely port all performance-critical code from C ++ to HPC #. In this language, it’s easier to achieve the performance we need, more difficult to write a bug, and easier to work with.

Here's a screenshot of the Burst inspector where you can easily see which build instructions were generated for your various hot loops:

Unity has many different users. Some of them can recall the entire set of arm64 instructions from memory, while others simply create with enthusiasm, even without a PhD in computer science.
All users win when it accelerates the fraction of the frame time that is spent on executing engine code (usually 90% +). The share of working with the executable code of the Asset Store package is really accelerating, as the authors of the Asset Store package are adopting HPC #.

Advanced users will also benefit from the fact that they can write their own high-performance code on HPC #.

Point optimization

In C ++, it is very difficult to get the compiler to make different compromise decisions on optimizing code in different parts of your project. The most detailed optimization you can count on is a file-by-file indication of the level of optimization.

Burst is designed so that you can accept the only method of this program as an input, namely: the entry point to the hot loop. Burst compiles this function, as well as everything that it calls (such called elements must be guaranteed to be known in advance: we do not allow virtual functions or function pointers).

Since Burst operates on only a relatively small part of the program, we set the optimization level to 11. Burst embeds almost every call site. Remove if-checks, which otherwise would not be deleted, since in the embedded form we get more complete information about the function arguments.

How Does It Help Solve Common Threading Problems?

C ++ (as well as C #) do not particularly help developers write thread-safe code.

Even today, more than a decade after a typical game processor began to be equipped with two or more cores, it is very difficult to write programs that efficiently use several cores.

Data racing, non-determinism, and deadlocks are the main challenges that make it so difficult to write multi-threaded code. In this context, we need features from the category of "make sure that this function and everything that it calls will never begin to read or write the global state." We want all violations of this rule to give compiler errors, and not to remain "rules that we hope all programmers will adhere to." Burst throws a compilation error.

We strongly recommend that Unity users (and we keep the same in their circle) write code so that all data transformations planned in it are divided into tasks. Each task is “functional”, and, as a side effect, free. It explicitly indicates read-only buffers and read / write buffers with which it has to work. Any attempt to access other data will cause a compilation error.
Task Scheduler ensures that no one will write to your read-only buffer while your task is running. And we guarantee that for the duration of the task no one will read from your buffer, designed for reading and writing.

Whenever you assign a task that violates these rules, you will receive a compilation error. Not only in such an unfortunate event as the conditions of the race. The error message will explain that you are trying to assign a task that should read from buffer A, but previously you have assigned a task that will write to A. Therefore, if you really want to do this, then the previous task must be specified as a dependency .

We believe that such a safety mechanism helps to catch a lot of bugs before they are fixed, and therefore ensures the efficient use of all cores. It becomes impossible to provoke race conditions or deadlock. The results are guaranteed to be deterministic, regardless of how many threads you have, or how many times a thread is interrupted due to the intervention of some other process.

Master the whole stack

When we can get to the bottom of all these components, we can also ensure that they are aware of each other. For example, a common reason for vectorization failure is this: the compiler cannot guarantee that two pointers will not point to the same memory point (aliasing). We know that two NativeArray will by no means overlap like this, because they have written a collection library, and we can use this knowledge in Burst, so we will not refuse to optimize only for fear that two pointers may be directed to one the same piece of memory.

Similarly, we wrote the Unity.Mathematics math library. Burst she is known "thoroughly" Burst (in the future) will be able to point out opt out of optimization in cases like math.sin (). Since for Burst math.sin () is not just an ordinary C # method that needs to be compiled, it will also understand the trigonometric properties of sin (), it will understand that sin (x) == x for small values of x (which Burst can independently prove ), will understand that it can be replaced by the expansion in the Taylor series, partly sacrificing accuracy. In the future, Burst also plans to implement cross-platform and design determinism with a floating point - we believe that such goals are achievable.

The differences between the game engine code and the game code are blurred

When we write Unity runtime code in HPC #, the game engine and the game as such are all written in the same language. We can distribute the runtime systems that we converted to HPC # as source code. Everyone can learn from them, improve them, adapt them for themselves. We will have a playing field of a certain level, and nothing will prevent our users from writing a better particle system, game physics or a renderer than we wrote. By bringing our internal development processes closer to user development processes, we can also feel better in the shoes of the user, so we will put all our efforts into building a single workflow, rather than two different ones.

Tags:

the books