Optimizing the rendering of a scene from the Disney animated film "Moana". Parts 4 and 5

http://pharr.org/matt/blog/2018/07/15/moana-island-pbrt-4.html


I maintain a pbrt branch that I use to test new ideas, implement interesting techniques from research papers, and generally explore whatever usually ends up in a new edition of the book Physically Based Rendering. Unlike pbrt-v3, which we try to keep as close as possible to the system described in the book, anything in this branch can change. Today we will see how more radical changes to the system significantly reduce memory use for the island scene from Disney's "Moana".

A note on methodology: in the previous three posts, all statistics were measured on the work-in-progress (WIP) version of the scene that I worked with before its release. In this article we switch to the final version, which is somewhat more complex.

When rendering the final Moana island scene, pbrt-v3 used 81 GB of RAM to store the scene description. pbrt-next currently uses 41 GB, roughly half as much. Getting there took only small changes, amounting to a few hundred lines of code.

Smaller primitives

Recall that in pbrt a Primitive is a combination of geometry, its material, its emission function (if it is a light source), and a record of the participating media inside and outside the surface. In pbrt-v3, GeometricPrimitive stores the following:

    std::shared_ptr<Shape> shape;
    std::shared_ptr<Material> material;
    std::shared_ptr<AreaLight> areaLight;
    MediumInterface mediumInterface;

As noted earlier, areaLight is nullptr most of the time, and MediumInterface holds a pair of nullptrs. So in pbrt-next I added a Primitive variant called SimplePrimitive, which stores only pointers to the geometry and the material. It is used in place of GeometricPrimitive wherever possible:

    class SimplePrimitive : public Primitive {
        // ...
        std::shared_ptr<Shape> shape;
        std::shared_ptr<Material> material;
    };

For non-animated object instances we now have a TransformedPrimitive that stores only a pointer to the primitive and a transformation. This saves about 500 bytes of wasted space that the AnimatedTransform instance added to TransformedPrimitive in pbrt-v3.

    class TransformedPrimitive : public Primitive {
        // ...
        std::shared_ptr<Primitive> primitive;
        std::shared_ptr<Transform> PrimitiveToWorld;
    };

(For instances that do need an animated transformation, pbrt-next has AnimatedPrimitive.)

After all these changes, the statistics report that Primitives use only 7.8 GB, down from 28.9 GB in pbrt-v3. Saving 21 GB is great, but it is less than the earlier estimates led us to expect; we will come back to this discrepancy at the end of this part.

Smaller geometry

Geometry memory also dropped significantly in pbrt-next: the space used for triangle meshes went from 19.4 GB to 9.9 GB, and the space for curves from 1.4 GB to 1.1 GB. Slightly more than half of these savings came from simplifying the Shape base class.

In pbrt-v3, Shape carries several members that are inherited by every Shape implementation; these are a few things that are handy to have access to in Shape implementations.

    class Shape {
        // ....
        const Transform *ObjectToWorld, *WorldToObject;
        const bool reverseOrientation;
        const bool transformSwapsHandedness;
    };

To see why these member variables cause problems, it helps to understand how triangle meshes are represented in pbrt. First, there is a TriangleMesh class that stores the vertex and index buffers for the entire mesh:

    struct TriangleMesh {
        int nTriangles, nVertices;
        std::vector<int> vertexIndices;
        std::unique_ptr<Point3f[]> p;
        std::unique_ptr<Normal3f[]> n;
        // ...
    };

Each triangle in the mesh is represented by a Triangle class that inherits from Shape. The idea is to keep Triangle as small as possible: it stores only a pointer to the mesh it belongs to and a pointer to the offset in the index buffer where its vertex indices begin:

    class Triangle : public Shape {
        // ...
        std::shared_ptr<TriangleMesh> mesh;
        const int *v;
    };

When the Triangle implementation needs its vertex positions, it does the appropriate indexing to fetch them from the TriangleMesh.

The problem with pbrt-v3's Shape is that the values it stores are identical for all triangles of a mesh, so it would be better to store them once per mesh in TriangleMesh and give each Triangle access to that single shared copy.

pbrt-next fixes this: its Shape base class has no such members, so each Triangle is 24 bytes smaller. Curve uses a similar strategy and also benefits from the more compact Shape.
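
As a rough illustration of where the 24 bytes come from, the sketch below compares a triangle derived from a pbrt-v3 style base with one derived from an empty base. FatShape, SlimShape and the two triangle types are hypothetical stand-ins rather than pbrt's actual classes, raw pointers replace the smart pointers for simplicity, and the sizes assume a typical 64-bit ABI.

```cpp
#include <cassert>
#include <cstddef>

struct Transform;  // opaque stand-in; only pointers to it are stored
struct MeshData;   // opaque stand-in for TriangleMesh

// pbrt-v3 style base: every shape carries values that are really per-mesh.
struct FatShape {
    virtual ~FatShape() = default;
    const Transform *ObjectToWorld = nullptr, *WorldToObject = nullptr;
    bool reverseOrientation = false;
    bool transformSwapsHandedness = false;
};

// pbrt-next style base: only the vtable pointer remains.
struct SlimShape {
    virtual ~SlimShape() = default;
};

struct FatTriangle : FatShape {
    const MeshData *mesh = nullptr;
    const int *v = nullptr;
};

struct SlimTriangle : SlimShape {
    const MeshData *mesh = nullptr;  // shared values would be read through here
    const int *v = nullptr;
};

// Bytes saved per triangle by slimming the base class:
// two pointers (16 bytes) plus two bools padded to 8 bytes.
constexpr std::size_t BytesSavedPerTriangle() {
    return sizeof(FatTriangle) - sizeof(SlimTriangle);
}
```

The padding matters here: the two bools occupy only 2 bytes, but alignment rounds them up to 8, so removing them is worth as much as removing a pointer.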

Shared triangle buffers

Although the Moana island scene makes extensive use of object instancing for explicitly repeated geometry, I was curious how often index buffers, texture coordinate buffers, and so on are repeated across different triangle meshes.

I wrote a small class that hashes these buffers as they come in and stores them in a cache, and changed TriangleMesh to check the cache and reuse the already stored version of any redundant buffer it needs. The payoff was very good: I got rid of 4.7 GB of redundant storage, far more than I expected.
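
The post does not show that class, so the following is only a minimal sketch of the idea under my own assumptions: buffers are keyed by a hash of their contents, and a full comparison guards against collisions. BufferCache, Intern and BytesSaved are hypothetical names, not pbrt-next's API.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical content-addressed cache for mesh buffers.
template <typename T>
class BufferCache {
  public:
    // Returns a pointer to a stored buffer with the same contents as buf,
    // adding buf to the cache if no identical buffer has been seen before.
    const std::vector<T> *Intern(std::vector<T> buf) {
        const std::size_t h = Hash(buf);
        auto range = table.equal_range(h);
        for (auto it = range.first; it != range.second; ++it) {
            if (*it->second == buf) {  // full compare guards against hash collisions
                bytesSaved += buf.size() * sizeof(T);
                return it->second.get();
            }
        }
        auto it = table.emplace(h, std::make_unique<std::vector<T>>(std::move(buf)));
        return it->second.get();
    }

    std::size_t BytesSaved() const { return bytesSaved; }

  private:
    // Simple multiplicative hash over the elements; any decent hash would do.
    static std::size_t Hash(const std::vector<T> &buf) {
        std::size_t h = 1469598103u;
        for (const T &x : buf)
            h = (h ^ std::hash<T>{}(x)) * 1099511628211ull;
        return h;
    }

    std::unordered_multimap<std::size_t, std::unique_ptr<std::vector<T>>> table;
    std::size_t bytesSaved = 0;
};
```

A mesh would then hold the returned pointer instead of its own copy; the buffers live as long as the cache does, which fits a renderer where scene data stays alive until rendering ends.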

The std::shared_ptr disaster

After all these changes, the statistics report about 36 GB of known allocated memory, while at the start of rendering top shows 53 GB in use.

I was dreading another series of slow massif runs to find out which allocations were missing from the statistics, but then an email from Arseny Kapulkin arrived in my inbox. Arseny explained that my earlier estimates of GeometricPrimitive memory use were badly off. It took me a while to get it, but eventually I did; many thanks to Arseny for pointing out the mistake and for the detailed explanations.

Before Arseny's email, my mental model of the std::shared_ptr implementation was as follows: there is a shared descriptor that stores the reference count and a pointer to the object itself:

    template <typename T> class shared_ptr_info {
        std::atomic<int> refCount;
        T *ptr;
    };

I then assumed that a shared_ptr instance simply points to it and uses it:

    template <typename T> class shared_ptr {
        // ...
        T *operator->() { return info->ptr; }
        shared_ptr_info<T> *info;
    };

In short, I assumed that sizeof(shared_ptr<>) equals the size of a pointer, and that each shared pointer wastes 16 bytes of extra space.

But that is not the case.

In my system's implementation, the shared descriptor is 32 bytes and sizeof(shared_ptr<>) is 16 bytes. Consequently, GeometricPrimitive, which consists mostly of std::shared_ptrs, is roughly twice as large as I had estimated. If you are wondering why, these two Stack Overflow posts explain the reasons in detail: 1 and 2.

In almost all uses of std::shared_ptr in pbrt-next, the pointers do not actually need to be shared. In a frenzy of hacking, I replaced everything I could with std::unique_ptr, which really is the same size as a regular pointer. For example, this is what SimplePrimitive looks like now:

    class SimplePrimitive : public Primitive {
        // ...
        std::unique_ptr<Shape> shape;
        const Material *material;
    };
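
The size difference is easy to check. In the sketch below, SharedPtrMembers and UniquePtrMembers are hypothetical stand-ins for the two SimplePrimitive layouts (without the vtable pointer). The sizes asserted are not guaranteed by the C++ standard, but they hold for the common standard library implementations, where a shared_ptr stores both an object pointer and a pointer to its control block.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

struct Shape { virtual ~Shape() = default; };
struct Material {};

// Stand-in for the members of the shared_ptr-based SimplePrimitive.
struct SharedPtrMembers {
    std::shared_ptr<Shape> shape;
    std::shared_ptr<Material> material;
};

// Stand-in for the members of the unique_ptr-based SimplePrimitive.
struct UniquePtrMembers {
    std::unique_ptr<Shape> shape;
    const Material *material;
};

// Assumption: a mainstream standard library (libstdc++, libc++, or MSVC's).
static_assert(sizeof(std::shared_ptr<Shape>) == 2 * sizeof(void *),
              "shared_ptr holds an object pointer and a control-block pointer");
static_assert(sizeof(std::unique_ptr<Shape>) == sizeof(void *),
              "unique_ptr with the default deleter is a single pointer");
```

On a 64-bit system this halves the member storage from 32 to 16 bytes per primitive, before even counting the separately allocated control blocks that the shared_ptr version needs.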

The payoff was bigger than I expected: memory use at the start of rendering dropped from 53 GB to 41 GB, a saving of 12 GB that I would not have anticipated a few days earlier, and the total is roughly half of what pbrt-v3 used. Excellent!

In the next part we will finally wrap up this series: we will look at rendering speed in pbrt-next and discuss ideas for other ways to reduce the memory needed for this scene.

Part 5.

To wrap up this series, we start by examining the rendering speed of the Moana island scene in pbrt-next, the pbrt branch I use to test new ideas. It allows more radical changes than are possible in pbrt-v3, which has to stick to the system described in our book. We conclude with a discussion of directions for further improvement, from the simplest to the slightly extreme.

Rendering time

Many changes have been made to the light transport algorithms in pbrt-next, including changes to BSDF sampling and improvements to the Russian roulette algorithms. As a result, it traces more rays than pbrt-v3 when rendering this scene, so the run times of the two renderers cannot be compared directly. Speed is generally similar, with one important exception: when rendering the Moana island scene shown below, pbrt-v3 spends 14.5% of its run time on Ptex texture lookups. That used to seem quite normal to me, but pbrt-next spends only 2.2% of its run time there. This is awfully interesting.

Looking at the statistics, we find ¹:

pbrt-v3:

 Ptex block reads 20828624
 Ptex lookups 712324767

pbrt-next:

 Ptex block reads 3378524
 Ptex lookups 825826507

As we can see, pbrt-v3 reads a Ptex texture block from disk on average once every 34 texture lookups. pbrt-next reads one only every 244 lookups, so disk I/O has dropped by about a factor of 7. My guess was that this happens because pbrt-next computes ray differentials for indirect rays, which leads to accesses to higher MIP levels of the textures, which in turn gives a more coherent stream of accesses to the Ptex texture cache, fewer cache misses, and therefore fewer I/O operations ². A quick check confirmed the guess: with ray differentials disabled, Ptex performance became much worse.

The Ptex speedup affected more than just compute and I/O savings. On a 32-CPU system, pbrt-v3 achieved a speedup of only 14.9x once parsing of the scene description was finished. pbrt usually shows close to linear parallel scaling, so this was quite disappointing. Thanks to far less lock contention in Ptex, pbrt-next was 29.2x faster on the 32-CPU system and 94.9x faster on a 96-CPU system; we are back to the numbers we are used to.

Roots from the Moana island scene, rendered with pbrt at 2048x858 resolution with 256 samples per pixel. Total rendering time in pbrt-next on a Google Compute Engine instance with 96 virtual CPUs at 2 GHz: 41 minutes 22 seconds. The speedup from multithreading during rendering was 94.9x. (I do not quite understand what is going on with the bump mapping here.)

Future work

Reducing memory use in scenes this complex is a fascinating exercise: saving a few gigabytes with a small change is far more satisfying than saving tens of megabytes in a simpler scene. I have a good list of things I hope to explore in the future, time permitting. Here is a quick overview.

Further reducing triangle buffer memory

Even with reuse of buffers that hold the same values across multiple triangle meshes, the triangle buffers still use a fair amount of memory. Here is a breakdown of memory use by type of triangle buffer in the scene:

Type	Memory
Positions	2.5 GB
Normals	2.5 GB
UV	98 MB
Indices	252 MB

I understand that nothing can be done about the vertex positions as given, but there are savings to be had for the other data. There are many memory-efficient representations of normal vectors, offering different trade-offs between memory use and computation. Using one of the 24-bit or 32-bit representations would cut the space used by normals to 663 MB or 864 MB respectively, saving more than 1.5 GB of RAM.
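
To make the 32-bit option concrete, here is a sketch of one well-known compact representation, octahedral encoding with 16 bits per component. The post does not say which encoding would be used, so treating this one as the candidate is my assumption; Vec3, EncodeOctahedral and DecodeOctahedral are hypothetical names. It shrinks a normal from three floats (12 bytes) to 4 bytes.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Vec3 { float x, y, z; };  // stand-in for Normal3f

inline float SignNotZero(float v) { return v < 0 ? -1.f : 1.f; }

// Map [-1, 1] to a 16-bit unsigned integer and back.
inline std::uint16_t Quantize(float v) {
    v = std::fmin(1.f, std::fmax(-1.f, v));
    return static_cast<std::uint16_t>(std::lround((v * 0.5f + 0.5f) * 65535.f));
}
inline float Dequantize(std::uint16_t q) { return q / 65535.f * 2.f - 1.f; }

// Project the normal onto the octahedron |x|+|y|+|z| = 1, then unfold the
// lower hemisphere onto the corners of the [-1, 1]^2 square.
inline std::uint32_t EncodeOctahedral(Vec3 n) {
    assert(n.x != 0 || n.y != 0 || n.z != 0);
    const float invL1 = 1.f / (std::fabs(n.x) + std::fabs(n.y) + std::fabs(n.z));
    float u = n.x * invL1, v = n.y * invL1;
    if (n.z < 0) {
        const float fu = (1 - std::fabs(v)) * SignNotZero(u);
        const float fv = (1 - std::fabs(u)) * SignNotZero(v);
        u = fu;
        v = fv;
    }
    return static_cast<std::uint32_t>(Quantize(u)) << 16 | Quantize(v);
}

// Inverse mapping; the result is renormalized to unit length.
inline Vec3 DecodeOctahedral(std::uint32_t e) {
    float u = Dequantize(static_cast<std::uint16_t>(e >> 16));
    float v = Dequantize(static_cast<std::uint16_t>(e & 0xffff));
    const float z = 1 - std::fabs(u) - std::fabs(v);
    if (z < 0) {
        const float fu = (1 - std::fabs(v)) * SignNotZero(u);
        const float fv = (1 - std::fabs(u)) * SignNotZero(v);
        u = fu;
        v = fv;
    }
    const float len = std::sqrt(u * u + v * v + z * z);
    return {u / len, v / len, z / len};
}
```

The cost is a handful of arithmetic operations and a square root on every decode, which is the memory versus computation trade-off mentioned above.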

The amount of memory used for texture coordinates and index buffers in this scene is surprisingly small. I suspect this is because the scene contains many procedurally generated plants, and all variations of the same plant type share the same topology (and hence index buffer) and parameterization (and hence UV coordinates). As a result, reuse of matching buffers is quite effective.

For other scenes, it might well be appropriate to quantize texture UV coordinates to 16 bits or to use half-precision floats, depending on their range of values. In this scene it appears that all texture coordinates are either zero or one, which means they could be represented with a single bit, reducing their memory use by a factor of 32. This is probably a consequence of using the Ptex format for texturing, which removes the need for UV atlases. Given how little space the texture coordinates take up now, implementing this optimization is not particularly urgent.

pbrt always uses 32-bit integers for index buffers. For small meshes with fewer than 256 vertices, 8 bits per index are enough, and for meshes with fewer than 65,536 vertices, 16 bits can be used. Changing pbrt to support this would not be very hard. If we wanted to squeeze out the maximum, we could allocate exactly as many bits as are needed to represent the required index range, at the price of more complex index lookups. Given that vertex indices now use only a quarter of a gigabyte, this task does not look very interesting compared to the others.
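
A minimal sketch of the 8/16/32-bit idea, assuming the index width is chosen once per mesh from its vertex count. PackedIndices and PackIndices are hypothetical names and not part of pbrt.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Index buffer stored with 1, 2, or 4 bytes per index.
struct PackedIndices {
    int bytesPerIndex = 4;
    std::vector<std::uint8_t> data;  // indices packed least significant byte first

    std::size_t size() const { return data.size() / bytesPerIndex; }

    // Reassemble index i from its bytes.
    std::uint32_t operator[](std::size_t i) const {
        std::uint32_t v = 0;
        for (int b = 0; b < bytesPerIndex; ++b)
            v |= std::uint32_t(data[i * bytesPerIndex + b]) << (8 * b);
        return v;
    }
};

// Pick the narrowest width that can address every vertex of the mesh.
PackedIndices PackIndices(const std::vector<int> &indices, std::size_t nVertices) {
    PackedIndices packed;
    packed.bytesPerIndex = nVertices <= 256 ? 1 : nVertices <= 65536 ? 2 : 4;
    packed.data.reserve(indices.size() * packed.bytesPerIndex);
    for (int index : indices) {
        assert(index >= 0 && static_cast<std::size_t>(index) < nVertices);
        for (int b = 0; b < packed.bytesPerIndex; ++b)
            packed.data.push_back(
                static_cast<std::uint8_t>(std::uint32_t(index) >> (8 * b)));
    }
    return packed;
}
```

The byte-by-byte access keeps the sketch short; a real implementation would more likely branch once on the width and read a typed array, which is the extra lookup complexity the paragraph above refers to.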

Peak memory use during BVH construction

There is one more detail of memory use we have not discussed: a short-lived peak of 10 GB of additional memory just before rendering. It occurs when the (large) BVH for the entire scene is built. pbrt's BVH construction code is written to run in two phases: first it builds a BVH with the traditional representation, with two child pointers in each node. Once the tree is built, it is converted to a memory-efficient layout in which a node's first child is located immediately after it in memory and the offset to the second child is stored as an integer.

This split was motivated by teaching: the BVH construction algorithms are much easier to understand without the clutter of having to convert the tree to the compact form during construction. The result, however, is this peak in memory use; given its impact on this scene, eliminating it looks attractive.
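
The second phase can be sketched as a depth-first traversal. This is a simplified illustration rather than pbrt's code: bounding boxes and split axes are left out, and BuildNode, LinearNode and Flatten are stand-in names.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Build-time node: traditional representation with two child pointers.
struct BuildNode {
    std::unique_ptr<BuildNode> left, right;  // both null for a leaf
    int primitive = -1;                      // meaningful for leaves only
};

// Compact node: the first child is implicitly the next array element.
struct LinearNode {
    int secondChildOffset;  // array index of the second child, -1 for a leaf
    int primitive;
};

// Depth-first flattening; returns the index at which `node` was stored.
int Flatten(const BuildNode &node, std::vector<LinearNode> &out) {
    assert((node.left == nullptr) == (node.right == nullptr));
    const int index = static_cast<int>(out.size());
    out.push_back({-1, node.primitive});
    if (node.left) {
        Flatten(*node.left, out);  // lands at index + 1
        const int second = Flatten(*node.right, out);
        out[index].secondChildOffset = second;
    }
    return index;
}
```

Both representations are alive while Flatten runs, which is where the temporary peak comes from; removing it would mean emitting the compact nodes directly during construction.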

Converting pointers to integers

Various data structures contain many 64-bit pointers that could be represented as 32-bit integers. For example, each SimplePrimitive contains a pointer to a Material. Most Material instances are shared by many primitives in the scene, and there are never more than a few thousand of them; so we could keep a single global vector of all materials:

    std::vector<Material *> allMaterials;

and store just a 32-bit integer offset into this vector in SimplePrimitive, which would save us 4 bytes. The same trick could be used for the pointer to the TriangleMesh in each Triangle, as well as in many other places.
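
A sketch of how that could look, with several assumptions of mine: MaterialIndex and IndexedPrimitive are hypothetical, the table holds const pointers, and the shape is also referenced by index (into a similar table that is not shown).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Material { int id; };  // stand-in for pbrt's Material

// Global table of all materials, as suggested in the text.
std::vector<const Material *> allMaterials;

// Returns the table index for m, registering it on first use. A linear
// search is acceptable here only because there are at most a few thousand
// materials and this runs at scene-construction time.
std::uint32_t MaterialIndex(const Material *m) {
    for (std::size_t i = 0; i < allMaterials.size(); ++i)
        if (allMaterials[i] == m) return static_cast<std::uint32_t>(i);
    allMaterials.push_back(m);
    return static_cast<std::uint32_t>(allMaterials.size() - 1);
}

// Primitive that refers to its shape and material by 32-bit index.
struct IndexedPrimitive {
    std::uint32_t shapeIndex;     // hypothetical: index into a shape table
    std::uint32_t materialIndex;  // index into allMaterials

    const Material *GetMaterial() const {
        assert(materialIndex < allMaterials.size());
        return allMaterials[materialIndex];
    }
};
```

One caveat worth keeping in mind: a lone 32-bit index placed next to 8-byte pointers is usually padded back up to 8 bytes by alignment, so the 4 bytes are only recovered when another 4-byte field can share the slot, as the two indices do here.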

After such a change there would be a little overhead in getting to the actual pointers, and the system would become a little harder to follow for students trying to understand how it works; this is probably a case where, in the context of pbrt, it is better to keep the implementation a bit clearer at the cost of not fully optimizing memory use.

Arena-based allocation

A separate call to new (actually make_unique, but it amounts to the same thing) is made for each individual Triangle and primitive. These allocations incur additional bookkeeping overhead that takes up about five gigabytes of memory not accounted for in the statistics. Since all of these allocations have the same lifetime, until rendering finishes, we could get rid of this extra bookkeeping by allocating them from a memory arena.
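
A minimal arena might look like the sketch below. It is an illustration under my own simplifying assumptions rather than pbrt's allocator: objects must be trivially destructible because destructors are never run, and each object must fit in one block. Arena, Create and BlockCount are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Bump allocator: objects are carved out of large blocks, and everything is
// released at once when the arena is destroyed.
class Arena {
  public:
    explicit Arena(std::size_t blockSize = 1 << 20) : blockSize(blockSize) {}

    // Construct a T inside the arena; there is no per-object heap header.
    template <typename T, typename... Args>
    T *Create(Args &&...args) {
        static_assert(std::is_trivially_destructible<T>::value,
                      "destructors are never run for arena objects");
        static_assert(alignof(T) <= alignof(std::max_align_t),
                      "over-aligned types are not supported");
        return new (Allocate(sizeof(T), alignof(T))) T(std::forward<Args>(args)...);
    }

    std::size_t BlockCount() const { return blocks.size(); }

  private:
    void *Allocate(std::size_t size, std::size_t align) {
        assert(size <= blockSize);
        // Round the current offset up to the required alignment.
        std::size_t start = (offset + align - 1) & ~(align - 1);
        if (blocks.empty() || start + size > blockSize) {
            blocks.push_back(std::make_unique<unsigned char[]>(blockSize));
            start = 0;
        }
        offset = start + size;
        return blocks.back().get() + start;
    }

    std::size_t blockSize;
    std::size_t offset = 0;  // next free byte in the newest block
    std::vector<std::unique_ptr<unsigned char[]>> blocks;
};
```

The general-purpose allocator then keeps bookkeeping for one allocation per block instead of one per object, which is where the saving would come from.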

Vtable hacks

My last idea is terrible, and I apologize for it, but it intrigued me.

Each triangle in the scene carries the overhead of at least two vtable pointers: one for its Triangle and one for its SimplePrimitive. That is 16 bytes. The Moana island scene has a total of 146,162,124 unique triangles, which adds up to almost 2.2 GB of redundant vtable pointers.

What if we had no abstract base class for Shape and no geometry implementation inherited from anything? That would save the space for the vtable pointers, but of course, given a pointer to some geometry, we would not know what kind of geometry it is, so it would be useless.

It turns out that on modern x86 CPUs only 48 bits of a 64-bit pointer are actually used. So there are 16 spare bits that we can borrow to store some information... for example, the type of geometry we are pointing to. With a bit of extra work, we can then find our way back to something equivalent to virtual function calls.

Here is how it would work: first we define a ShapeMethods structure that holds function pointers, for example ³:

    struct ShapeMethods {
        Bounds3f (*WorldBound)(void *);
        // Intersect, etc. ...
    };

Each geometry implementation would provide a bounding function, an intersection function, and so on, taking the equivalent of the this pointer as the first argument:

    Bounds3f TriangleWorldBound(void *t) {
        // Only pointers to Triangle are ever passed to this function.
        Triangle *tri = (Triangle *)t;
        // ...
    }

We would have a global table of ShapeMethods structures, where the nth element is for the geometry type with index n:

    ShapeMethods shapeMethods[] = {
        { TriangleWorldBound, /*...*/ },
        { CurveWorldBound, /*...*/ },
        // ...
    };

When creating a geometry, we would encode its type into some of the unused bits of the returned pointer. Then, given a pointer to a geometry on which we want to make a particular call, we would extract the type index from the pointer and use it as an index into shapeMethods to find the appropriate function pointer. In effect, we would implement the vtable by hand and handle dispatch ourselves. If we did this for both geometry and primitives, we would save 16 bytes per Triangle, though we would have taken a rather rough road to get there.
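
Put together, the scheme could be sketched as follows. Everything here is illustrative: Bounds1f is a one-dimensional stand-in for Bounds3f, ShapeHandle and ShapeType are hypothetical, and the code assumes a platform where the top 16 bits of user-space addresses are zero, which it checks with an assert rather than taking on faith.

```cpp
#include <cassert>
#include <cstdint>

struct Bounds1f { float lo, hi; };  // 1D stand-in for Bounds3f

struct ShapeMethods {
    Bounds1f (*WorldBound)(const void *);
    // Intersect, etc. would follow.
};

// Concrete shapes: no base class, hence no vtable pointer.
struct Triangle { float x0, x1; };
struct Curve { float center, halfWidth; };

Bounds1f TriangleWorldBound(const void *t) {
    const Triangle *tri = static_cast<const Triangle *>(t);
    return {tri->x0, tri->x1};
}

Bounds1f CurveWorldBound(const void *c) {
    const Curve *cur = static_cast<const Curve *>(c);
    return {cur->center - cur->halfWidth, cur->center + cur->halfWidth};
}

enum ShapeType : std::uint64_t { kTriangle = 0, kCurve = 1 };

// The nth entry holds the methods for the shape type with index n.
const ShapeMethods shapeMethods[] = {
    {TriangleWorldBound},
    {CurveWorldBound},
};

constexpr std::uint64_t kAddressMask = (std::uint64_t(1) << 48) - 1;

// A pointer with the shape type stored in its upper 16 bits.
class ShapeHandle {
  public:
    ShapeHandle(const void *p, ShapeType type) {
        const std::uint64_t raw = reinterpret_cast<std::uintptr_t>(p);
        assert((raw >> 48) == 0);  // the scheme needs the top bits to be free
        bits = raw | (static_cast<std::uint64_t>(type) << 48);
    }

    // Manual dispatch: look up the type index, then call through the table.
    Bounds1f WorldBound() const {
        return shapeMethods[bits >> 48].WorldBound(Pointer());
    }

  private:
    const void *Pointer() const {
        return reinterpret_cast<const void *>(
            static_cast<std::uintptr_t>(bits & kAddressMask));
    }
    std::uint64_t bits;
};
```

The handle is still 8 bytes, so the type information costs nothing per object; the price is that every call site has to go through the handle rather than an ordinary pointer.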

I suspect that this hack for implementing virtual function dispatch is not new, but I could not find references to it online. Here is the Wikipedia page on tagged pointers; however, it covers things like reference counts. If you know of a better reference, please send me an email.

Having shared this clunky hack, I can wrap up this series of posts. Once again, my sincere thanks to Disney for publishing this scene. It has been a delight to work with, and the gears in my head keep turning.

Notes

¹ In the end, pbrt-next traces more rays in this scene than pbrt-v3, which probably explains the increase in the number of lookups.
² Ray differentials for indirect rays in pbrt-next are computed using the same hack that was used in the texture cache extension for pbrt-v3. It seems to work reasonably well, but its foundations do not strike me as very principled.
³ This is how Rayshade handles method dispatch. The approach is used when an equivalent of virtual methods is needed in C. Rayshade, however, does nothing special to eliminate the per-object pointers.
