Optimizing the rendering of scenes from the Disney cartoon "Moana". Part 2

Transfer

Inspired by the first victory in parsing using the description of the island scene from the Disney cartoon “Moana” , I further delved into the study of memory usage. Much more could be done with the execution time, but I decided that it would be useful to first scout the situation.

I started the runtime study with pbrt built-in statistics; pbrt has a manual setting of significant memory allocations to track memory usage, and after the rendering is complete, a memory allocation report is displayed. This is how the report on the allocation of memory for this scene was originally: As for the execution time, the built-in statistics turned out to be brief and only reported on memory allocation for known objects of 24 GB in size.

 Память

 BVH-дерево 9,01 ГиБ

 Кривые 1,44 ГиБ

 MIP-текстуры 2,00 ГиБ

 Меши треугольников 11,02 ГиБ

topreported that in fact used about 70 GB of memory, that is, the statistics do not take into account 45 GB. Small deviations are quite understandable: dynamic memory allocators require additional space to register resource usage, some are lost due to fragmentation, and so on. But 45 GB? Something bad is definitely lurking here.

To understand what we are missing (and to make sure that the tracking was found), I used massif to trace valid allocations of dynamic memory. It's pretty slow, but at least it works well.

Primitives

The first thing I found when tracing massif was two lines of code that contained in memory the instances of the base class Primitivethat were not counted in the statistics. A little oversight that is fairly easy to fix . After this we see the following:

Primitives 24,67 ГиБ

Oops. So what is a primitive, and why all this memory?

pbrt distinguishes Shapewhich is pure geometry (sphere, triangle, etc.) and Primitivewhich is a combination of geometry, material, sometimes the function of radiation and the involved medium inside and outside the surface of the geometry.

There are several variants of the base class Primitive: GeometricPrimitivewhich is the standard case: a “vanilla” combination of geometry, material, etc., as well asTransformedPrimitivewhich is a primitive with transformations applied to it, either as an instance of an object, or for moving primitives with time-varying transformations. It turns out that in this scene both of these types are a waste of space.

GeometricPrimitive: 50% extra space

Note: some erroneous assumptions are made in this analysis; they are revised in the fourth post of the series .

4.3 GB used on GeometricPrimitive. It's funny to live in a world where 4.3 GB of used RAM is not your biggest problem, but let's still see where we have 4.3 GB from GeometricPrimitive. Here are the relevant parts of the class definition:

classGeometricPrimitive :public Primitive {
    std::shared_ptr<Shape> shape;
    std::shared_ptr<Material> material;
    std::shared_ptr<AreaLight> areaLight;
    MediumInterface mediumInterface;
};

We have a pointer to vtable , three more pointers, and then MediumInterfacecontaining two more pointers with a total size of 48 bytes. In this scene, there are only a few meshes radiating illumination; therefore, it is areaLightalmost always a null pointer, and there is no environment influencing the scene, so both pointers mediumInterfaceare also null. Thus, if we had a specialized implementation of the class Primitive, which could be used in the absence of the radiation function and the environment, we would save almost half of the disk space occupied GeometricPrimitive- in our case, about 2 GB.

However, I did not make a correction and add a new implementation to pbrtPrimitive. We strive as much as possible to minimize the differences between the pbrt-v3 source code on github and the system described in my book, for a very simple reason - keeping them in sync makes it easier to read the book and work with the code. In this case, I decided that a completely new implementation Primitive, never mentioned in the book, would be too great a discrepancy. But this fix will definitely appear in the new version of pbrt.

Before proceeding further, we will perform a verification rendering:

The beach from the island of the movie "Moana", rendered pbrt-v3 with a resolution of 2048x858 and 256 samples per pixel. The total rendering time on a 12-core / 24-thread instance of the Google Compute Engine with a frequency of 2 GHz with the latest version of pbrt-v3 was 2 h 25 min 43 s.

TransformedPrimitives: 95% of wasted space

Memory allocated for 4.3 GB GeometricPrimitivewas a pretty painful hit, but what about 17.4 GB for TransformedPrimitive?

As mentioned above, it is TransformedPrimitiveused both for transformations with a change in time and for instances of objects. In both cases, we need to apply an additional transformation to the existing one Primitive. There TransformedPrimitiveare only two members in the class :

std::shared_ptr<Primitive> primitive;
    const AnimatedTransform PrimitiveToWorld;

So far so good: a pointer to a primitive and a time-varying transformation. But what is actually stored in AnimatedTransform?

const Transform *startTransform, *endTransform;
    const Float startTime, endTime;
    constbool actuallyAnimated;
    Vector3f T[2];
    Quaternion R[2];
    Matrix4x4 S[2];
    bool hasRotation;
    structDerivativeTerm {// ...
        Float kc, kx, ky, kz;
    };
    DerivativeTerm c1[3], c2[3], c3[3], c4[3], c5[3];

In addition to the pointers to the two transition matrices and the associated time, there is also a decomposition of the matrices into components of translation, rotation, and scaling, as well as pre-calculated values used to limit the volume occupied by moving bounding boxes (see section 2.4.9 of our book Physically Based Rendering ). All this adds up to 456 bytes.

But nothing is moving in this scene . From the point of view of transformations, for instances of objects we need one pointer to transformation, and the values for decomposition and moving bounding boxes do not need it. (That is all you need 8 bytes). If you create a separate implementation Primitivefor fixed instances of objects, 17.4 GB are compressed to only 900 MB (!).

As for GeometricPrimitive, fixing it is a non-trivial change compared to that described in the book, so we will also set it aside for the next version of pbrt. At least, we now understand what is happening with the chaos of 24.7 GB of memory Primitive.

Conversion Cache Trouble

The next largest block of unrecorded memory, determined by massif, was TransformCachethat occupied approximately 16 GB. (Here is a link to the original implementation .) The idea is that the same transformation matrix is often used several times in the scene, so it’s best to have a single copy in memory so that everyone using its elements simply stores a pointer to the same thing. transformation. Used

to store the cache , and massif reported that 6 out of 16 GB was used for black and mahogany nodes in . This is a terrible lot: 60% of this volume is used for the transformations themselves. Let's look at the announcement for this distribution:TransformCachestd::mapstd::map

std::map<Transform, std::pair<Transform *, Transform *>> cache;

Here the work is done perfectly: Transformentirely used as keys for distribution. Even better, pbrt Transformstores two 4x4 matrices (the transformation matrix and its inverse matrix), which results in 128 bytes of storage in each node of the tree. All this is absolutely unnecessary for the value stored for it.

Perhaps such a structure is quite normal in a world where it is important for us that the same transformation matrix is used in hundreds or thousands of primitives, and in general there are not so many transformation matrices. But for a scene with a bunch of mostly unique transformation matrices, as in our case, this is just a terrible approach.

In addition to wasting space on keys, searching instd::mapTo traverse a red-black tree, there are a lot of pointer traversal operations, so it seems logical to try something completely new. Fortunately, a TransformCachelittle is written about the book, so it’s perfectly acceptable to completely rewrite it.

And the last thing before we get started: after examining the method signature Lookup(), another problem becomes apparent:

voidLookup(const Transform &t, Transform **tCached,
            Transform **tCachedInverse)

When the calling function provides Transform, the cache stores and returns pointers to the transformation, equal to the transmitted one, but also passes the inverse matrix. To make this possible, in the original implementation, when adding a transformation to the cache, the inverse matrix is always calculated and saved so that it can be returned.

The stupidity here is that most of the dial peers that use the transform cache do not query and do not use the inverse matrix. That is, different types of memory are wasted on unused reversible conversions.

The new implementation adds the following improvements:

It uses a hash table that allows you to speed up the search, and does not require storing anything but an array Transform *, which, in fact, reduces the amount of memory used to the value that is really needed to store all Transform.
The signature of the search method now has the form ; in one place where the calling function wants to get an inverse matrix from the cache, it simply calls twice.Transform *Lookup(const Transform &t)Lookup()

For hashing, I used the hash function FNV1a . After its implementation, I discovered Aras’s post about hash functions ; Maybe I just needed to use xxHash or CityHash, because their performance is better; maybe someday my shame will win and i will fix it.

Thanks to the new implementationTransformCachethe total time of system launch has decreased significantly, to 21 minutes 42 seconds. That is, we saved another 5 minutes 7 seconds, or accelerated 1.27 times. Moreover, more efficient use of memory has reduced the space occupied by the transformation matrices from 16 to 5.7 GB, which is almost equal to the amount of stored data. This allowed us not to try to take advantage of the fact that they are not projective, and to store 3x4 matrices instead of 4x4. (In the usual case, I would be skeptical about the importance of this kind of optimization, but here it would save us more gigabytes - a lot of memory! This is definitely worth doing in the renderer of production.)

Minor performance optimization to complete

Too generalized structure TransformedPrimitivecosts us both memory and time: the profiler reported that a considerable part of the time at launch was spent in a function AnimatedTransform::Decompose()that decomposes the transformations of the matrix into a quaternion rotation, transfer and scale. Since nothing is moving in this scene, this work is not needed, and a thorough implementation check AnimatedTransformhas shown that none of these values are accessed if the two transformation matrices are actually identical.

By adding two lines to the constructor so that decomposition transformations are not performed when they are not required, we saved another 1 min 31 of the launch time: as a result, we came to 20 min 9 s, that is, we generally accelerated 1.73 times.

In the next article we will take seriously the parser and analyze what became important when we accelerated the work of other parts.

Tags: