Using the Stream Out Stage to Debug Shaders in DirectX 10/11

Original author: jollyjeffers

In early March, I had the pleasure of visiting the Direct3D development team at Microsoft's headquarters in Redmond. During one of our discussions about debugging 3D applications, they advised me to use a new DirectX 10/11 feature, the Stream Out stage, for debugging shaders.

I used this technique to debug tessellation code for DirectX 11 (the code is given below), but DirectX 10 has the same feature, and porting is fairly trivial.

What are we trying to do?

We want to get the results produced by the GPU shader stages (vertex, geometry, tessellation) back to the CPU for further processing. At the same time, we still want to see the rendered result on screen. In other words, the frame is drawn as usual, and all the coordinates also end up in buffers and structures in RAM, where we can read them, write them to a log, and use them in further calculations.

Let's get down to business

You need to perform four basic steps:

Modify your shaders
Add to the shader output any additional fields you want to receive. For example, your shader may not normally output world-space coordinates, but you can add them for debug output through the Stream Out stage.

Change the way the geometry shader is created
Creating an ID3D11GeometryShader (or ID3D10GeometryShader) and adding it to the pipeline happens differently.

Create a buffer to receive the output
Logically enough, you need to store the results somewhere.

Decode the results
The data received in the buffer is an array of structures, each containing information about one vertex in the format defined by the shader. The easiest way to decode the buffer is to declare a structure with the same layout and then cast the pointer to the start of the buffer to a pointer to an array of those structures.

So, modifying the shaders

As you may know, the Direct3D pipeline only passes data forward: the output of one stage is handed to the next stage and never comes back. So if you want to get some additional data out of the vertex shader, you have to carry it through the later HS / DS / GS stages as well.

Let's look at a geometry shader like this:

struct DS_OUTPUT
{
	float4 position : SV_Position;
	float3 colour : COLOUR;
	float3 uvw : DOMAIN_SHADER_LOCATION;
	float3 wPos : WORLD_POSITION;
};
[maxvertexcount(3)]
void gsMain( triangle DS_OUTPUT input[3], inout TriangleStream<DS_OUTPUT> TriangleOutputStream )
{
    TriangleOutputStream.Append( input[0] );
    TriangleOutputStream.Append( input[1] );
    TriangleOutputStream.Append( input[2] );
    TriangleOutputStream.RestartStrip();
}


This geometry shader is completely "transparent": it simply forwards its input to its output. Pay attention to the DS_OUTPUT structure. Later we will choose which of its elements we want to receive.

Note that your pixel shaders do not need to change. In the example above, the pixel shader receives only the second member of the structure, float3 colour : COLOUR, and ignores all the others. So we will use the simplest approach: every new field that we want to send to the Stream Out stage is added to the end of the DS_OUTPUT structure.

Now we modify the way the geometry shader is created. Call CreateGeometryShaderWithStreamOutput() instead of CreateGeometryShader(), passing, in addition to the shader itself, an array of D3D11_SO_DECLARATION_ENTRY structures (or D3D10_SO_DECLARATION_ENTRY, depending on which version of DirectX you are using) that describes the vertex format.

D3D11_SO_DECLARATION_ENTRY soDecl[] = 
{
	{ 0, "COLOUR", 0, 0, 3, 0 }
	, { 0, "DOMAIN_SHADER_LOCATION", 0, 0, 3, 0 }
	, { 0, "WORLD_POSITION", 0, 0, 3, 0 }
};
UINT stride = 9 * sizeof(float); // *NOT* sizeof the above array!
UINT elems = sizeof(soDecl) / sizeof(D3D11_SO_DECLARATION_ENTRY);


There are three things to pay attention to:
  1. Semantic names: they must match the ones written in the HLSL code of your shader. Note that the array above selects three of the four fields declared in the geometry shader.
  2. Start component and component count: for a float3 we want all three components, starting from the first, so the start component is 0 and the count is 3.
  3. Stride between two adjacent vertices: the call to CreateGeometryShaderWithStreamOutput() needs the size of the structure that describes one vertex. It is not hard to calculate, but it is easy to pass the size of the soDecl array by mistake, which is incorrect.
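
The stride in point 3 can also be derived from the declaration rather than written by hand. The following is only a sketch under stated assumptions: SoEntrySketch is a hypothetical stand-in that mirrors just the fields of D3D11_SO_DECLARATION_ENTRY that matter for the layout (the real structure comes from d3d11.h), all entries are assumed to go to the same output slot, every component is assumed to be 32 bits wide, and gap entries are not modelled.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the layout-relevant fields of
// D3D11_SO_DECLARATION_ENTRY; not the real Direct3D type.
struct SoEntrySketch
{
    const char*   semanticName;
    unsigned char startComponent;   // first component to stream (0..3)
    unsigned char componentCount;   // number of components to stream (1..4)
};

// Sum the streamed components of every entry and convert the total to bytes.
// Assumes one output slot and 32-bit components, as in soDecl[] above.
template <std::size_t N>
unsigned strideInBytes(const SoEntrySketch (&decl)[N])
{
    unsigned components = 0;
    for (std::size_t i = 0; i < N; ++i)
    {
        // A semantic has at most four components (x, y, z, w).
        assert(decl[i].componentCount >= 1);
        assert(decl[i].startComponent + decl[i].componentCount <= 4);
        components += decl[i].componentCount;
    }
    return components * static_cast<unsigned>(sizeof(float));
}

// Same three fields as the soDecl[] array in the article.
const SoEntrySketch kDecl[] = {
    { "COLOUR",                 0, 3 },
    { "DOMAIN_SHADER_LOCATION", 0, 3 },
    { "WORLD_POSITION",         0, 3 },
};
```

With these three float3 entries, strideInBytes(kDecl) gives the same 9 * sizeof(float) that the listing above hard-codes, and it differs from sizeof(kDecl), which is the mistake point 3 warns about.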


Now you need to create a buffer to receive the results. It is created in much the same way as vertex and index buffers. We need two buffers: one that the GPU writes to, and a second one that the CPU can read.

D3D11_BUFFER_DESC soDesc;
soDesc.BindFlags			= D3D11_BIND_STREAM_OUTPUT;
soDesc.ByteWidth			= 10 * 1024 * 1024; // 10mb
soDesc.CPUAccessFlags		= 0;
soDesc.Usage				= D3D11_USAGE_DEFAULT;
soDesc.MiscFlags			= 0;
soDesc.StructureByteStride	= 0;
if( FAILED( hr = g_pd3dDevice->CreateBuffer( &soDesc, NULL, &g_pStreamOutBuffer ) ) )
{
	/* handle the error here */
	return hr;
}
// Simply re-use the above struct
soDesc.BindFlags		= 0;
soDesc.CPUAccessFlags	= D3D11_CPU_ACCESS_READ;
soDesc.Usage			= D3D11_USAGE_STAGING;
if( FAILED( hr = g_pd3dDevice->CreateBuffer( &soDesc, NULL, &g_pStagingStreamOutBuffer ) ) )
{
	/* handle the error here */
	return hr;
}


You cannot call Map() on a buffer created with D3D11_USAGE_DEFAULT, and you cannot bind a buffer created with D3D11_CPU_ACCESS_READ to the Stream Out stage of the pipeline, so you create one buffer of each type and copy the data from one to the other.

Now bind the buffer to the Stream Out stage:

UINT offset = 0;
g_pContext->SOSetTargets( 1, &g_pStreamOutBuffer, &offset );

And finally, let's read the results from the buffer:

g_pContext->CopyResource( g_pStagingStreamOutBuffer, g_pStreamOutBuffer );
D3D11_MAPPED_SUBRESOURCE data;
if( SUCCEEDED( g_pContext->Map( g_pStagingStreamOutBuffer, 0, D3D11_MAP_READ, 0, &data ) ) )
{
	struct GS_OUTPUT
	{
		D3DXVECTOR3 COLOUR;
		D3DXVECTOR3 DOMAIN_SHADER_LOCATION;
		D3DXVECTOR3 WORLD_POSITION;
	};
	GS_OUTPUT *pRaw = reinterpret_cast< GS_OUTPUT* >( data.pData );
	/* Work with the pRaw[] array here */
	// Consider StringCchPrintf() and OutputDebugString() as simple ways of printing the above struct, or use the debugger and step through.
	g_pContext->Unmap( g_pStagingStreamOutBuffer, 0 );
}


All of the above must be done after the draw call. Be careful with the structure you cast the buffer pointer to: its layout, including alignment and padding, must match the streamed data exactly.
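
One way to guard against the layout problem mentioned above is to decode into a structure made of plain floats and to check its size at compile time. This is a minimal sketch, not the author's code: StreamedVertex and decodeStreamOut are hypothetical names, mappedData stands in for the pData member of the mapped subresource, and the caller is assumed to know how many vertices were written.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Plain-float mirror of the three float3 fields selected for stream out.
// (Hypothetical name; float[3] keeps the layout free of hidden padding.)
struct StreamedVertex
{
    float colour[3];
    float domainShaderLocation[3];
    float worldPosition[3];
};

// The struct must be exactly as large as the stride passed when the geometry
// shader was created; otherwise every vertex after the first is misread.
static_assert(sizeof(StreamedVertex) == 9 * sizeof(float),
              "StreamedVertex must match the stream-out stride");

// Copy vertexCount vertices out of a mapped buffer into CPU-owned memory.
inline std::vector<StreamedVertex> decodeStreamOut(const void* mappedData,
                                                   std::size_t mappedBytes,
                                                   std::size_t vertexCount)
{
    assert(mappedData != nullptr);
    assert(vertexCount * sizeof(StreamedVertex) <= mappedBytes);

    std::vector<StreamedVertex> vertices(vertexCount);
    if (vertexCount > 0)
    {
        std::memcpy(vertices.data(), mappedData,
                    vertexCount * sizeof(StreamedVertex));
    }
    return vertices;
}
```

Copying instead of casting means the staging buffer can be unmapped right away, and the static_assert fails the build if someone adds a field to the struct without updating the stream-out declaration.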

How much data was received? We can find out with a D3D11_QUERY_PIPELINE_STATISTICS query.

// When initializing/loading
D3D11_QUERY_DESC queryDesc;
queryDesc.Query = D3D11_QUERY_PIPELINE_STATISTICS;
queryDesc.MiscFlags = 0;
if( FAILED( hr = g_pd3dDevice->CreateQuery( &queryDesc, &g_pDeviceStats ) ) )
{
	return hr;
}
// When rendering
g_pContext->Begin(g_pDeviceStats);
g_pContext->DrawIndexed( 3, 0, 0 ); // one triangle only
g_pContext->End(g_pDeviceStats);
D3D11_QUERY_DATA_PIPELINE_STATISTICS stats;
while( S_OK != g_pContext->GetData(g_pDeviceStats, &stats, g_pDeviceStats->GetDataSize(), 0 ) );
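
The listing stops once the statistics are available, so here is a hedged sketch of how they might be turned into a vertex count. It assumes that the GSPrimitives member of D3D11_QUERY_DATA_PIPELINE_STATISTICS is the relevant counter, that the geometry shader emits triangles (streamed out as three vertices each), and that nothing is written once the buffer is full. streamedVertexCount is a hypothetical helper, not a Direct3D function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Estimate how many whole vertices can be read back from the stream-out buffer.
//   gsPrimitives : value of stats.GSPrimitives (assumed to count triangles)
//   strideBytes  : per-vertex stride used when creating the geometry shader
//   bufferBytes  : ByteWidth of the stream-out buffer
inline std::size_t streamedVertexCount(std::uint64_t gsPrimitives,
                                       std::size_t strideBytes,
                                       std::size_t bufferBytes)
{
    assert(strideBytes > 0);

    const std::uint64_t verticesPerTriangle = 3;
    const std::uint64_t written  = gsPrimitives * verticesPerTriangle;
    const std::uint64_t capacity = bufferBytes / strideBytes;

    // Never report more vertices than the buffer can hold.
    return static_cast<std::size_t>(written < capacity ? written : capacity);
}
```

For the single triangle drawn above, with a 36-byte stride and the 10 MB buffer, this yields three vertices. A D3D11_QUERY_SO_STATISTICS query reports the number of primitives written by the Stream Out stage directly and may be a closer fit if the pipeline statistics prove ambiguous.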


Any restrictions?


Unfortunately yes.

  • Performance is not great. The data still has to be copied from video memory to RAM, which is not fast. Remember, however, that this is a debugging mechanism and most likely will not be used in production code.
  • This trick does not work for pixel shaders. The pixel shader comes after the Stream Out stage in the pipeline.
  • This technique requires changing shaders, that is, the code base of your project. You will have to either use different shaders in debug and release builds or put up with some performance drop in release.
  • We are tied to the main pipeline: we cannot get the information we need either more often or less often than a frame is drawn.
  • There are limits on the overall size of the data structure that describes the vertex format: for DirectX 10, 64 scalar values or 2 KB of vector data.
