Optimizing the rendering of scenes from the Disney cartoon "Moana". Part 3

http://pharr.org/matt/blog/2018/07/13/moana-island-pbrt-3.html

Translation

Today we will look at two more places where pbrt spends a lot of time while parsing the scene from Disney's Moana, and see whether we can improve performance there. This concludes what is reasonable to do within pbrt-v3. In another post, I will look at how far we can go if we lift the restriction on making changes; in that case, the source code will diverge too much from the system described in the book Physically Based Rendering.

Optimization of the parser itself

After the performance improvements described in the previous article, the share of time spent in the pbrt parser, already significant at the outset, naturally grew even larger. At this point, the parser accounts for the largest share of startup time.

I finally pulled myself together and implemented a handwritten tokenizer and parser for pbrt scenes. The pbrt scene file format is quite simple to parse: apart from quoted strings, tokens are separated by whitespace, and the grammar is very straightforward (you never need to look ahead more than one token). Still, a custom parser is about a thousand lines of code that have to be written and debugged. It helped that I could test it on many scenes; after fixing the obvious failures, I kept working until I could render exactly the same images as before: replacing the parser should not change a single pixel. At that point I was confident that everything was correct.
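To make the idea concrete, here is a minimal sketch of a whitespace-driven tokenizer of this kind. It is not pbrt's actual code: the function name `tokenize` is hypothetical, and the grammar is simplified under stated assumptions (double-quoted strings form one token, `[` and `]` are single-character tokens, and `#` starts a comment that runs to the end of the line).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: split a scene description into tokens without copying.
// Every token is a view into `src`, so `src` must outlive the returned vector.
std::vector<std::string_view> tokenize(std::string_view src) {
    std::vector<std::string_view> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        const char c = src[i];
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++i;  // whitespace only separates tokens
        } else if (c == '#') {
            // Comment: skip to the end of the line.
            while (i < src.size() && src[i] != '\n') ++i;
        } else if (c == '"') {
            // Quoted string: a single token, quotes included.
            const std::size_t start = i++;
            while (i < src.size() && src[i] != '"') ++i;
            if (i < src.size()) ++i;  // consume the closing quote if present
            tokens.push_back(src.substr(start, i - start));
        } else if (c == '[' || c == ']') {
            tokens.push_back(src.substr(i, 1));
            ++i;
        } else {
            // Bare token: runs until whitespace or a delimiter character.
            const std::size_t start = i;
            while (i < src.size() &&
                   !std::isspace(static_cast<unsigned char>(src[i])) &&
                   src[i] != '"' && src[i] != '[' && src[i] != ']' &&
                   src[i] != '#')
                ++i;
            assert(i > start);  // the first character always belongs to the token
            tokens.push_back(src.substr(start, i - start));
        }
    }
    return tokens;
}
```

Because the tokens are `std::string_view` values, no characters are copied; the cost is that the underlying buffer (for example a memory-mapped file) has to stay alive for as long as the tokens are in use.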

I tried to make the new version as efficient as possible, reading the input files with mmap() whenever possible and using the new std::string_view from C++17 to minimize copies of strings from the file contents. In addition, since earlier traces showed a lot of time spent in strtod(), I wrote the parseNumber() function with special care: single-digit integers and ordinary integers are handled separately, and in the standard case, when pbrt is compiled to use 32-bit floats, I use strtof() instead of strtod()¹.
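The fast paths described above might look roughly like the following sketch. It only illustrates the idea and is not the parseNumber() from pbrt: the signature, the nine-digit limit on the integer fast path, and the temporary std::string used before calling strtof() are all assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <string_view>

// Hypothetical sketch of a number parser with two fast paths.
float parseNumber(std::string_view tok) {
    assert(!tok.empty());

    // Fast path 1: a single digit such as "0" or "1".
    if (tok.size() == 1 && tok[0] >= '0' && tok[0] <= '9')
        return static_cast<float>(tok[0] - '0');

    // Fast path 2: an optional sign followed only by digits. The length is
    // capped at nine digits so the accumulator cannot overflow a 32-bit long.
    const std::size_t first = (tok[0] == '-' || tok[0] == '+') ? 1 : 0;
    if (first < tok.size() && tok.size() - first <= 9) {
        long value = 0;
        std::size_t j = first;
        for (; j < tok.size() && tok[j] >= '0' && tok[j] <= '9'; ++j)
            value = value * 10 + (tok[j] - '0');
        if (j == tok.size())  // every character was a digit
            return static_cast<float>(tok[0] == '-' ? -value : value);
    }

    // General case: strtof() expects a NUL-terminated buffer, which a
    // string_view does not guarantee, so this sketch copies the token first.
    // A tuned implementation would avoid this allocation.
    const std::string buf(tok);
    return std::strtof(buf.c_str(), nullptr);
}
```

Tokens such as "1e2" or "0.981262" fail the digits-only check and fall through to strtof(), so the fast paths only ever handle inputs whose value is unambiguous.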

While writing the new parser, I was a little afraid that the old one would turn out to be faster: after all, flex and bison have been developed and optimized for many years. I had no way of knowing in advance whether the time spent on the new version would be wasted until it was finished and working correctly.

To my delight, the handwritten parser was a huge win: the generality of flex and bison costs so much performance that the new version easily beat them. With the new parser, startup time dropped to 13 minutes 21 seconds, a further speedup of 1.5 times! An added bonus was that all flex and bison support could now be removed from the pbrt build system. It had always been a headache, especially on Windows, where most people do not have them installed by default.

Graphics State Management

After the parser was sped up significantly, a new annoying detail surfaced: at this stage, approximately 10% of the startup time was spent in the functions pbrtAttributeBegin() and pbrtAttributeEnd(), and most of that time went to allocating and freeing dynamic memory. During the first run, which took 35 minutes, these functions accounted for only about 3% of the execution time, so they could be ignored. But this is how optimization always goes: once you get rid of the big problems, the small ones become more important.

The pbrt scene description is based on a hierarchical graphics state, which specifies the current transformation, the current material, and so on. You can take a snapshot of the current state (pbrtAttributeBegin()), make changes to it before adding new geometry to the scene, and then return to the original state (pbrtAttributeEnd()).

The graphics state is stored in a structure with the surprising name... GraphicsState. A std::vector is used as the stack of saved copies of GraphicsState objects. Looking at the members of GraphicsState, we can guess the source of the problem: three std::map instances that map names to instances of textures and materials:

struct GraphicsState {
    // ...
    std::map<std::string, std::shared_ptr<Texture<Float>>> floatTextures;
    std::map<std::string, std::shared_ptr<Texture<Spectrum>>> spectrumTextures;
    std::map<std::string, std::shared_ptr<MaterialInstance>> namedMaterials;
};

Looking through the scene files, I found that most cases of saving and restoring the graphics state look like this:

AttributeBegin
    ConcatTransform [0.981262 0.133695 -0.138749 0.000000 -0.067901 0.913846 0.400343 0.000000 0.180319 -0.383420 0.905800 0.000000 11.095301 18.852249 9.481399 1.000000]
    ObjectInstance "archivebaycedar0001_mod"
AttributeEnd

In other words, the block updates the current transformation and creates an object instance; the contents of these std::map instances are not changed at all. Creating a full copy of them (allocating red-black tree nodes, incrementing the reference counts of shared pointers, allocating space for strings and copying them) is almost always a waste of time. All of it is freed again when the previous graphics state is restored.

I replaced each of these maps with a std::shared_ptr to a map and implemented a copy-on-write approach, in which a map is copied inside an attribute begin/end block only when its contents need to change. The change turned out not to be very difficult, yet it cut startup time by more than a minute, leaving 12 minutes 20 seconds of processing before rendering: a further speedup of 1.08 times.
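A minimal sketch of this copy-on-write scheme follows. The struct `GraphicsStateSketch`, its `read()` and `write()` accessors, and the simplified `int` mapped type are all hypothetical stand-ins rather than pbrt's real code, and the `use_count()` check assumes that parsing runs on a single thread.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for GraphicsState: one map is shown, and the mapped
// type is simplified to int instead of a shared_ptr to a texture or material.
struct GraphicsStateSketch {
    using NamedMap = std::map<std::string, int>;

    // Copying the struct copies only this pointer, so a saved state shares
    // the map with the current state instead of duplicating every node.
    std::shared_ptr<NamedMap> named = std::make_shared<NamedMap>();

    // Read access never copies.
    const NamedMap &read() const { return *named; }

    // Write access clones the map first if a saved state still shares it.
    // Checking use_count() like this is only safe on a single thread.
    NamedMap &write() {
        assert(named != nullptr);
        if (named.use_count() > 1)
            named = std::make_shared<NamedMap>(*named);
        return *named;
    }
};
```

With this layout, pushing a state onto the stack costs one reference-count increment per map, and the expensive deep copy happens only in the rare block that actually defines a new named texture or material.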

What about rendering time?

The attentive reader will notice that so far I have said nothing about rendering time. To my surprise, it turned out to be quite tolerable even “out of the box”: pbrt can render film-quality images of the scene with several hundred samples per pixel on twelve processor cores in two to three hours. For example, this image, one of the slowest, rendered in 2 hours 51 minutes 36 seconds:

Dunes from Moana, rendered with pbrt-v3 at a resolution of 2048x858 with 256 samples per pixel. The total rendering time on a Google Compute Engine instance with 12 cores / 24 threads at 2 GHz and the latest version of pbrt-v3 was 2 h 51 min 36 s.

In my opinion, this is a surprisingly reasonable figure. I am sure improvements are still possible, and a careful look at where most of the time goes would turn up plenty of “interesting” things, but so far there has been no particular reason to investigate.

Profiling showed that approximately 60% of the rendering time was spent on ray-object intersections (most of it during BVH traversal) and 25% on ptex texture lookups. These proportions are similar to those for simpler scenes, so at first glance there is nothing obviously problematic here. (However, I am confident that Embree could trace these rays in somewhat less time.)

Unfortunately, parallel scalability is not as good. I usually see about 1400% CPU utilization during rendering, compared with the ideal 2400% (for 24 virtual CPUs on Google Compute Engine). The problem seems to be related to lock contention in ptex, but I have not investigated it in more detail yet. A likely contributing factor is that pbrt-v3 does not compute ray differentials for indirect rays in the ray tracer; as a result, such rays always access the most detailed MIP level of the textures, which is not good for texture caching.

Conclusion (for pbrt-v3)

After fixing the graphics state management, I hit a limit beyond which further progress without significant changes to the system became unclear; everything that remained would take a lot of time and did not look very promising. So I will stop here, at least as far as pbrt-v3 is concerned.

Overall, the progress was substantial: startup time before rendering dropped from 35 minutes to 12 minutes 20 seconds, a total speedup of 2.83 times. Moreover, thanks to clever work with the transformation cache, memory usage decreased from 80 GB to 69 GB. All these changes are available now if you sync to the latest version of pbrt-v3 (or if you have done so in the last few months). We also came to understand how wasteful the memory use of Primitive is for this scene; we figured out how to save another 18 GB of memory, but did not implement that in pbrt-v3.

This is what these 12 minutes and 20 seconds are spent on after all our optimizations:

Function / operation	Percentage of execution time
Build BVH	34%
Parsing (except `strtof()`)	21%
`strtof()`	20%
Transformation cache	7%
Read PLY files	6%
Allocation of dynamic memory	5%
Inverting transformations	2%
Graphics state management	2%
Other	3%

Going forward, better multithreading of the startup phase is the most promising way to improve performance further: almost everything during scene parsing is single-threaded, and the most natural first target is BVH construction. It would also be interesting to take tasks such as reading PLY files and building BVHs for individual object instances and run them asynchronously in the background while parsing continues in the main thread.
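One way such background work could be organized is sketched below with std::async. This is only an illustration of the idea, not something pbrt-v3 does: `Mesh`, `loadMeshStub`, and `BackgroundLoader` are hypothetical names, and the stub loader fabricates a vertex count from the path length instead of reading a file.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Hypothetical result of one expensive, independent startup job.
struct Mesh {
    std::string source;
    std::size_t vertexCount = 0;
};

// Placeholder for a real PLY reader: it does no I/O and simply pretends
// that each path yields path.size() vertices.
Mesh loadMeshStub(const std::string &path) {
    return Mesh{path, path.size()};
}

class BackgroundLoader {
  public:
    // Called from the parsing thread: start the job and return immediately.
    void enqueue(const std::string &path) {
        pending.push_back(std::async(std::launch::async, loadMeshStub, path));
    }

    // Called once parsing is done: wait for every job and collect the
    // results in the order in which they were submitted.
    std::vector<Mesh> finish() {
        std::vector<Mesh> meshes;
        for (auto &job : pending) {
            assert(job.valid());
            meshes.push_back(job.get());
        }
        pending.clear();
        return meshes;
    }

  private:
    std::vector<std::future<Mesh>> pending;
};
```

Collecting results in submission order keeps the scene deterministic regardless of which job finishes first. A real implementation would more likely feed a fixed-size thread pool than start one task per file, since a scene of this size references a very large number of files.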

At some point, I will check whether there are faster implementations of strtof(); pbrt simply uses whatever the system provides. However, it pays to be cautious about replacements that have not been tested very thoroughly: parsing float values is one of those things a programmer must be able to trust completely.

Further reducing the load on the parser also looks attractive: we still have 17 GB of text input files to parse. We could add support for a binary encoding of pbrt input files (perhaps along the lines of the RenderMan approach), but I have mixed feelings about this idea. The ability to open and edit scene description files in a text editor is quite useful, and I worry that a binary encoding would sometimes confuse students who use pbrt for learning. This is one of those cases where the right decision for pbrt may differ from the right decision for a commercial production renderer.

It was very interesting to work through all these optimizations and to better understand the various design decisions. It turned out that pbrt makes unexpected assumptions that get in the way with a scene of this complexity. All of this is an excellent example of how important it is for the wider rendering research community to have access to real, highly complex production scenes; once again, many thanks to Disney for taking the time to prepare this scene and release it publicly.

In the next article, we will look at changes that can improve performance even further if we allow more radical modifications to pbrt.

Note

On the Linux system where I ran the tests, strtof() is no faster than strtod(). Remarkably, on OS X strtod() is about twice as fast, which makes no sense at all. For practical reasons, I continued to use strtof().
