saul September 25, 2015 at 09:02

Fixing CPU-GPU Synchronization Delays in Galactic Civilizations 3

Translation

Galactic Civilizations 3 (GC3) is a turn-based 4X strategy game developed and published by Stardock Entertainment. The game was released on May 14, 2015. During early access and beta testing, we profiled and analyzed the game's rendering performance. One of the main improvements we made was removing several CPU-GPU synchronization points that were costing the two processors their parallelism. This article describes the problem and the fix, and discusses why performance analysis tools matter during development, bearing in mind the strengths and weaknesses of those tools.

Problem identification

We began studying rendering performance with Intel Graphics Performance Analyzers (GPA), part of the Intel INDE package. The screenshot below shows trace data (with vertical synchronization off) before the improvements. The GPU queue has gaps within and between frames, and less than one frame of work is queued at any moment. When the CPU does not keep the GPU queue fed, those gaps are time the application cannot use to improve performance or rendering quality.

Before: frame duration - about 21 ms; queue length - less than 1 frame; gaps in the GPU queue; very long calls to the Map method.

In addition, the GPA Platform Analyzer interface shows the time spent processing each Direct3D* 11 API call (that is, the time for the command to travel from the application through the runtime to the driver and for the response to come back). The screenshot shows a call to ID3D11DeviceContext::Map that takes about 15 ms to return. During this time, the main application thread is idle.

The image below shows an enlarged timeline covering one frame, from the start of the CPU work to the end of the GPU work. Idle gaps are marked with pink rectangles; they add up to about 3.5 ms per frame. Platform Analyzer also displays the total duration of the various API calls in the trace (4.306 seconds), of which Map calls account for 4.015 seconds!

Note that the Frame Analyzer tool cannot detect a long Map call from a frame capture. Frame Analyzer uses GPU timer queries to measure the time of each erg, which covers state changes, resource binding, and the draw itself. Map calls execute on the CPU, without the GPU being involved.
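Because the stall happens on the CPU, it can also be observed with plain CPU-side timing around the call. The sketch below is a minimal illustration in portable C++, not part of GPA or Direct3D: wallTimeMs and fakeBlockingCall are hypothetical names, and the sleep merely stands in for a call, such as Map, that blocks inside the driver.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: wall-clock time, in milliseconds, that the calling
// thread spends inside `call` (for example, a lambda wrapping a Map call).
double wallTimeMs(const std::function<void()>& call) {
    const auto start = std::chrono::steady_clock::now();
    call();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// Stand-in for an API call that blocks inside the driver for `ms` milliseconds.
void fakeBlockingCall(int ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
}
```

Wrapping each suspect call this way and logging anything above a millisecond or two would flag a 15 ms Map call immediately, although it says nothing about why the call blocked; that still requires a trace.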

Finding the source of the problem

(The section on Direct3D resources at the end of this article covers the basics of using and updating resources.)
A driver debugging tool revealed that the long Map call used the D3D11_MAP_WRITE_DISCARD flag (Platform Analyzer does not display Map call arguments) to update a large vertex buffer that had been created with the D3D11_USAGE_DYNAMIC flag.

This technique is very common in games for streaming data into frequently updated resources. When a dynamic resource is mapped with D3D11_MAP_WRITE_DISCARD, Map returns an alias taken from that resource's alias heap. An alias is the memory allocated for the resource at each mapping. When the current heap has no room left for another alias, a shadow alias heap is allocated. This continues until the resource reaches its maximum number of heaps.

That was exactly the problem in Galactic Civilizations 3. Each time the limit was reached (several times per frame, for several large resources that were each mapped many times), the driver waited for a Draw call that used a previously assigned alias to finish, so that the alias could be reused for the new request. The problem was not specific to the Intel driver. It also occurred with the NVIDIA driver, where we used GPUView to confirm what Platform Analyzer had shown.

The vertex buffer was about 560 KB (as reported by the driver) and was mapped with discard about 50 times per frame. To store aliases, the Intel driver allocates heaps of 1 MB each, on demand, per resource. Aliases are allocated from a heap until it is full, after which another 1 MB shadow alias heap is assigned to the resource, and so on. In the case of the long Map call, each heap could hold only one alias, so every Map call on the resource created a new shadow heap for the new alias until the maximum number of heaps was reached. This happened in every frame, which explains the repeating pattern in the trace. At each call after that, the driver waited for an earlier alias to be released, as described above.
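The arithmetic behind this can be sketched in a few lines of C++. This is only a model under stated assumptions: 1 MB is taken as 1024 KB, an alias never straddles two heaps, no alias is recycled within a frame, and the function names are hypothetical rather than anything exposed by the driver.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kKiB = 1024;
constexpr std::size_t kHeapBytes = 1024 * kKiB;  // 1 MB alias heap, as in the article

// Number of whole aliases of `bufferBytes` that fit in one alias heap.
// Assumes 0 < bufferBytes <= heapBytes.
std::size_t aliasesPerHeap(std::size_t heapBytes, std::size_t bufferBytes) {
    return heapBytes / bufferBytes;
}

// Heaps needed to give every discard-mapping in a frame its own alias,
// assuming no alias is recycled within the frame.
std::size_t heapsPerFrame(std::size_t heapBytes, std::size_t bufferBytes,
                          std::size_t mapsPerFrame) {
    const std::size_t perHeap = aliasesPerHeap(heapBytes, bufferBytes);
    return (mapsPerFrame + perHeap - 1) / perHeap;  // ceiling division
}
```

With a 560 KB buffer, aliasesPerHeap(kHeapBytes, 560 * kKiB) is 1, so 50 mappings per frame would need 50 heaps. With a 50 KB buffer, 20 aliases fit in each heap and the same 50 mappings need only 3 heaps.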

We examined the API log in Frame Analyzer and picked out the resources that were mapped multiple times. Several vertex buffers were mapped more than 50 times, and the user interface system turned out to be the main culprit. The driver debugging tool showed that each mapping updated only a small part of the buffer.

The same resource (with identifier 2322) is repeatedly mapped during the processing of one frame

Solution

At Stardock, we instrumented all the rendering systems to emit additional markers on the Platform Analyzer timeline, partly to confirm that the user interface system was the source of the long Map call and partly to help with future profiling.
We had several possible ways to solve the problem.

One option was to call Map with D3D11_MAP_WRITE_NO_OVERWRITE instead of D3D11_MAP_WRITE_DISCARD. The large vertex buffer is shared by many similar elements. For example, most user interface elements on a screen share one large buffer, and each Map call updates only a small, separate part of it. Spaceships and asteroids, which are drawn with instancing, also use a large vertex buffer (for instance data). D3D11_MAP_WRITE_NO_OVERWRITE would be an ideal fit here, because with this flag the application promises not to overwrite any part of the buffer that the GPU may still be using.
A second option was to split the large vertex buffer into several smaller ones. Because the stalls were caused by alias allocation, a much smaller vertex buffer would let each heap hold several aliases. Galactic Civilizations 3 does not issue many Draw calls, so shrinking the buffers by a factor of 10 or 100 (from 560 KB to 5-50 KB) would solve the problem.
A third option was the D3D11_MAP_FLAG_DO_NOT_WAIT flag. With it, the application can find out whether the GPU is still using a resource and do other work until the resource becomes available for a new mapping. Although this lets the CPU do useful work in the meantime, it was far from the best fit for this problem.
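For reference, the first option is commonly implemented as an append-style cursor: write to unused parts of the buffer with NO_OVERWRITE and fall back to DISCARD only when the buffer is full. The sketch below models just that bookkeeping in portable C++. AppendCursor, MapRequest, and MapMode are hypothetical names; a real renderer would pass the corresponding D3D11_MAP value to ID3D11DeviceContext::Map and write at the returned offset. It is an illustration of the technique, not code from the game.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for D3D11_MAP_WRITE_NO_OVERWRITE and D3D11_MAP_WRITE_DISCARD.
enum class MapMode { NoOverwrite, Discard };

struct MapRequest {
    MapMode mode;        // which flag to pass to Map
    std::size_t offset;  // byte offset at which to write the new data
};

// Hypothetical cursor over one large dynamic vertex buffer.
class AppendCursor {
public:
    explicit AppendCursor(std::size_t capacityBytes) : capacity_(capacityBytes) {}

    // Reserve `bytes` (assumed <= capacity). Appends while space remains;
    // otherwise requests a fresh buffer via Discard and restarts at offset 0.
    MapRequest reserve(std::size_t bytes) {
        if (offset_ + bytes > capacity_) {
            offset_ = bytes;
            return {MapMode::Discard, 0};
        }
        const std::size_t at = offset_;
        offset_ += bytes;
        return {MapMode::NoOverwrite, at};
    }

private:
    std::size_t capacity_;
    std::size_t offset_ = 0;
};
```

With this scheme, a buffer that is mapped 50 times per frame for small updates would trigger a discard only when it wraps around, rather than on every mapping.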

We chose the second option and changed a constant in the buffer creation code. The vertex buffer size for each subsystem was hard-coded and only needed to be reduced. Each 1 MB heap could now hold several aliases and, given the relatively small number of Draw calls in Galactic Civilizations 3, we expected the problem to disappear.
Fixing the problem in one rendering subsystem made it more pronounced in another, so we applied the same change to every subsystem. The screenshot below shows the trace after the fix, with the new instrumentation, along with an enlarged view of one frame.

After: frame duration - about 16 ms; queue length - 3 frames; no gaps in the GPU queue; no long calls to the Map method.

The total time spent in Map calls dropped from 4 seconds to 157 milliseconds! The gaps in the GPU queue disappeared. The queue was consistently 3 frames deep, so when the GPU finished one frame, the next was already waiting. A few simple changes kept the GPU continuously busy. Frame time fell by about 24%, from about 21 ms to 16 ms.
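These figures can be checked with a little arithmetic. The helper names below are hypothetical, and the inputs are the rounded values quoted in this article.

```cpp
#include <cassert>
#include <cmath>

// Relative reduction from `before` to `after`, in percent.
double percentReduction(double before, double after) {
    return (before - after) / before * 100.0;
}

// Frame rate that corresponds to a given frame time in milliseconds.
double framesPerSecond(double frameMs) {
    return 1000.0 / frameMs;
}
```

percentReduction(21.0, 16.0) is roughly 23.8, which matches the figure of about 24%, and the corresponding frame rate rises from about 47.6 to 62.5 frames per second. For Map time, going from about 4,000 ms to 157 ms is a reduction of roughly 96%.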

Conclusion

Optimizing rendering performance in games is not an easy task. Frame capture and replay tools and trace capture tools each provide different, important information about game performance. This article looked at CPU-GPU synchronization stalls, which can only be diagnosed with tracing tools such as GPA Platform Analyzer or GPUView.

Direct3D* Resource Basics

The Direct3D API provides functions for creating and destroying resources, setting the state of the rendering pipeline, binding resources to pipeline stages, and updating certain resources. Most resources are created while levels and scenes load.

A typical game frame involves binding various resources to pipeline stages, setting pipeline state, updating resources in CPU memory (constant buffers, vertex and index buffers) according to the state of the simulation, and updating resources in GPU memory (render targets, unordered access views [UAVs]) through draw, dispatch, and clear operations.

When a resource is created, a value of the D3D11_USAGE enumeration specifies how it will be accessed:

GPU read and write access (DEFAULT - for render targets, UAVs, and rarely updated constant buffers);
GPU read-only access (IMMUTABLE - for textures);
CPU write access + GPU read access (DYNAMIC - for frequently updated buffers);
CPU access, with the GPU able to copy data to the resource (STAGING).

Note that for the third and fourth cases (DYNAMIC and STAGING), the D3D11_CPU_ACCESS_FLAG value for the resource must be set correctly.
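The relationship between usage and CPU access can be summarized in a small function. This is a simplified model in portable C++ and not the Direct3D headers: Usage, CpuAccess, and cpuAccessIsValid are hypothetical stand-ins, the bit values are not the real D3D11_CPU_ACCESS_FLAG values, and the rules are reduced to the four cases listed above.

```cpp
#include <cassert>

// Stand-ins for the D3D11_USAGE values discussed above.
enum class Usage { Default, Immutable, Dynamic, Staging };

// Stand-ins for the D3D11_CPU_ACCESS_WRITE / D3D11_CPU_ACCESS_READ bit flags.
// The numeric values here are illustrative only.
enum CpuAccess : unsigned { kNone = 0, kWrite = 1u << 0, kRead = 1u << 1 };

// Simplified validity rule for the CPU access flags of each usage.
bool cpuAccessIsValid(Usage usage, unsigned cpuAccess) {
    switch (usage) {
        case Usage::Default:
        case Usage::Immutable:
            return cpuAccess == kNone;      // no CPU access in this model
        case Usage::Dynamic:
            return cpuAccess == kWrite;     // CPU writes, GPU reads
        case Usage::Staging:
            return cpuAccess != kNone &&    // CPU read and/or write
                   (cpuAccess & ~(kWrite | kRead)) == 0;
    }
    return false;
}
```

In this model, a dynamic vertex buffer like the one in the article passes only with kWrite, which corresponds to creating the buffer with D3D11_USAGE_DYNAMIC and D3D11_CPU_ACCESS_WRITE.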
The Direct3D 11 API provides three ways to update resource data, each suited to particular tasks (as described above):

One interesting scenario requires implicit synchronization: the CPU has write access to a resource while the GPU has read access. This comes up constantly during a frame, for example when updating the view, model, and projection matrices or the transforms of an animated model. Waiting for the GPU to finish using the resource would cost performance for no good reason, and creating several independent copies of the resource would push the complexity onto application developers. Direct3D versions 9 through 11 therefore hand the job to the driver through the discard map flag (D3D11_MAP_WRITE_DISCARD in Direct3D 11). Each time a resource is mapped with this flag, the driver provides a new region of memory for the CPU to write to. As a result, Draw calls issued between updates of the resource use different aliases of it, which of course increases GPU memory usage.
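This renaming behavior can be modeled with a small queue. The sketch below is a toy model in portable C++, assuming a fixed maximum number of aliases per resource and a GPU that retires Draw calls in order; AliasPool and its methods are hypothetical names, not driver or Direct3D APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical model of a driver's alias pool for one dynamic resource.
class AliasPool {
public:
    // maxAliases is assumed to be at least 1.
    explicit AliasPool(std::size_t maxAliases) : maxAliases_(maxAliases) {}

    // Models a discard-mapping: hand out an unused alias if one exists;
    // otherwise wait for the oldest in-flight Draw and reuse its alias.
    std::size_t mapDiscard() {
        if (inFlight_.size() == maxAliases_) {
            ++stalls_;       // the CPU would block here until the GPU catches up
            retireOldest();
        }
        const std::size_t alias = nextAlias_++ % maxAliases_;
        inFlight_.push_back(alias);
        return alias;
    }

    // Models the GPU finishing the oldest Draw that still reads an alias.
    void retireOldest() {
        if (!inFlight_.empty()) inFlight_.pop_front();
    }

    std::size_t stalls() const { return stalls_; }
    std::size_t aliasesInUse() const { return inFlight_.size(); }

private:
    std::size_t maxAliases_;
    std::size_t nextAlias_ = 0;
    std::size_t stalls_ = 0;
    std::deque<std::size_t> inFlight_;  // aliases still referenced by pending Draws
};
```

In this model, mappings are free as long as the GPU retires Draw calls at least as fast as the CPU issues new mappings. Once every alias is in flight, each further mapping counts as a stall, which is the pattern seen in the trace before the fix.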
Additional information about resource management in Direct3D:

John McDonald's GDC presentation "Efficient Buffer Management"
Resource Management Basics in Direct3D 11
Resource selection in Direct3D 10
UpdateSubresource and Map methods
