Stencil buffer implementation example using CUDA

From the sandbox

Stencil buffer is used to mask reflections in those places where they actually are not. The Stencil technique is used in OpenGL and DirectX. Before applying the algorithm, a Stencil test is performed and in places where there is no image, the pixel shader will not be rendered. Thus, we suppress unnecessary work.

Stencil is stored in a buffer with depth. For example, in the D3DFMT_D24S8 format, 24 bits are depth bits and 8 bits are Stencil. For simplicity, we will further assume that the Stencil is stored in the last bit. If this bit = 1, then the pixel is active. Thus, the simplified algorithm presents the following steps:

Zero (fill with zeros) Stencil buffer.
We start recording and draw a plane in the Stencil buffer, relative to which we will consider reflection. Where there is a mirror, units will be stored, and where there is no mirror, zeros.
We reflect all the geometry relative to the plane using a separate matrix, and then draw a reflection, simultaneously running the Stencil test.

Thus, where the mirror was in the image, reflection will be displayed. And where there is none, nothing will change.

CUDA software implementation

CUDA, unfortunately, has no Stencil test mechanism. This is a very useful trick and I’ll explain how to get around this limitation in the next article, but now we’ll look at the implementation details.

So, we start the stencil buffer exactly the size of (N / 32) * sizeof (int) bytes. And bind the texture to it.

cudaMalloc((void**)&m_stencilBuffer, N*sizeof(int)/32);
cudaBindTexture(0, stencil_tex, m_stencilBuffer, N*sizeof(int)/32);
Сама текстура объявлена в каком-нибудь хедере (.h файл) следующим образом:
  Texture<int, 1, cudaReadModeElementType>   stencil_tex;
Далее,  в том же файле объявим такой вспомогательный массив:
static __device__ int g_stencilMask[32] = {
  0x00000001, 0x00000002, 0x00000004, 0x00000008, 0x00000010, 0x00000020, 0x00000040, 0x00000080,
  0x00000100, 0x00000200, 0x00000400, 0x00000800, 0x00001000, 0x00002000, 0x00004000, 0x00008000,
  0x00010000, 0x00020000, 0x00040000, 0x00080000, 0x00100000, 0x00200000, 0x00400000, 0x00800000,
  0x01000000, 0x02000000, 0x04000000, 0x08000000, 0x10000000, 0x20000000, 0x40000000, 0x80000000
};

For those kernels that only read the stencil buffer, use the macro at the beginning of the kernel as follows:

__global__ voidmy_kernel(…){
  uint tid = blockDim.x * blockIdx.x + threadIdx.x;
  STENCIL_TEST(tid);
  // my code here
}

In practice (GTX560), such a stencil test is approximately 20-25% faster than a simple check, type check:

uint activeFlag = a_flags[tid];
  if(activeFlag==0)
    return;

However, given the memory savings, there is definitely a profit. It should also be noted that on video cards with a less wide bus (for example GTS450), acceleration can be more significant.

So, it remains to implement only writing to the stencil buffer. First, we read the value for everything in warp from the stealth buffer to the activeWarp variable; Then each thread receives its bit from this variable using the logical & and stores it in the active variable. At the end of the kernel, we collect from all the active variables for a given warp the value back to one 32-bit uint, and the zero warp thread will write the result back to memory.

// (tid >> 5)   same as (tid/32)// (tid & 0x1f) same as (tid%32)
__global__ voidmy_kernel2(…,uint* a_stencilBuffer){
  uint tid = blockDim.x * blockIdx.x + threadIdx.x;
  uint activeWarp = a_stencilBuffer[tid >> 5];
  if(activeWarp==0) // all threads in warp inactivereturn;
  // each threads will store it's particular bit from group of 32 threads
  uint active = activeWarp & g_stencilMask[tid&0x1f];
  if(!active)
    goto WRITE_BACK_STENCIL_DATA;
  // my code here
    WRITE_BACK_STENCIL_DATA:
  WriteStencilBit(tid, a_stencilBuffer, active);
}

If the thread is inactive, it will immediately go to the kernel skate. If for any reason you inside your code decide that this thread should be inactive, do this:

if(want to kill thread)
{
  active = 0;
  goto WRITE_BACK_STENCIL_DATA;
}

The example intentionally uses the label and goto operator. Although this is a bad programming style, in this case it adds security to your code. The fact is that you are guaranteed to reach the WriteStencilBit function code. If for some reason you decide to make return inside your code, everything will break (a bit later we will discuss why). Instead, return must be set to goto WRITE_BACK_STENCIL_DATA so that before exiting, all streams from warp-a can collect data, and a zero stream (zero inside warp-a) writes them to the stencil buffer. Actually, the WriteStencilBit function looks like this:

__device__ voidWriteStencilBit(int tid, uint* a_stencilBuffer, uint value){
  uint stencilMask = __ballot(value);
  if((tid & 0x1f) == 0) // same as tid%32 == 0
    a_stencilBuffer[tid >> 5] = stencilMask;
}

The __ballot () function returns uint, where every ith bit is 1 if and only if what is in its argument is not zero. That is, it does exactly what it needs there, stitching flags back into uint from different threads inside the warp.

The __ballot () function belongs to the so-called “warp vote functions” and works very quickly. Unfortunately, it is available only for compute capability 2.0, that is, video cards with Fermi architecture. An important note on its operation, the following code will be incorrect:

__device__ voidWriteWrongStencilBit(int tid, uint* a_stencilBuffer, uint value){
  if((tid & 0x1f) == 0) // same as tid%32 == 0
    a_stencilBuffer[tid >> 5] = __ballot(value);
}

The fact is that __ballot () will always put 0 on those bits whose streams are currently masked. And all streams with a number inside the warp non-zero (1..31) will be masked and will not get inside the if statement, therefore 1..31 bits of the __ballot () function result for such a code will always be zero. From here the truth follows an interesting conclusion. If you are guaranteed to write for video cards with Fermi architecture, then even for kernels that write to the stencil buffer, you can kill the stream as follows:

if(want to kill thread) 
  return;

Thus, the streams for which you made return will be masked and __ballot () will return zeros for the corresponding bits in its result. There is really one subtlety. At least for a zero flow inside warp, you cannot do this, otherwise the result simply will not be written back. Therefore, in fact, you can only do this

if(want to kill thread && (tid&0x1f!=0)) 
  return;

Or use the form suggested above:

if(want to kill thread)
{
  active = 0;
  goto WRITE_BACK_STENCIL_DATA;
}

Implementation Features for Old Hardware (G80-GT200)

Now let us consider what extensions must be made for Stencil to work effectively on older GPUs. The __ballot () function is not supported on these video cards. We rewrite the WriteStencilBit function in accordance with the capabilities that we have:

template<int CURR_BLOCK_SIZE>
__device__ inlinevoidWriteStencilBit(int tid, uint* a_stencilBuffer, uint value){
#if COMPUTE_CAPABILITY >= COMPUTE_CAPABILITY_GF100
  uint stencilMask = __ballot(value);
  if((tid & 0x1f) == 0)
    a_stencilBuffer[tid >> 5] = stencilMask;
#elif COMPUTE_CAPABILITY >= COMPUTE_CAPABILITY_GT200if(__all(value==0))
  {
    if((tid & 0x1f) == 0)
      a_stencilBuffer[tid >> 5] = 0;
  }
  elseif(__all(value))
  {
    if((tid & 0x1f) == 0)
      a_stencilBuffer[tid >> 5] = 0xffffffff;
  }
  else
  {
    __shared__ uint active_threads[CURR_BLOCK_SIZE/32];
    uint* pAddr = active_threads + (threadIdx.x >> 5);
    if((tid & 0x1f) == 0)
      *pAddr = 0;
    atomicOr(pAddr, value);
    if((tid & 0x1f) == 0)
      a_stencilBuffer[tid >> 5] = *pAddr;
  }
#else
  __shared__ uint active_threads[CURR_BLOCK_SIZE];
  active_threads[threadIdx.x] = value;
  if((threadIdx.x & 0x1) == 0)
    active_threads[threadIdx.x] = value | active_threads[threadIdx.x+1];
  if((threadIdx.x & 0x3) == 0)
    active_threads[threadIdx.x] = active_threads[threadIdx.x] | active_threads[threadIdx.x+2]; 
  if((threadIdx.x & 0x7) == 0)
    active_threads[threadIdx.x] = active_threads[threadIdx.x] | active_threads[threadIdx.x+4]; 
  if((threadIdx.x & 0xf) == 0)
    active_threads[threadIdx.x] = active_threads[threadIdx.x] | active_threads[threadIdx.x+8]; 
  if((threadIdx.x & 0x1f) == 0)
    active_threads[threadIdx.x] = active_threads[threadIdx.x] | active_threads[threadIdx.x+16];
  uint* perWarpArray = active_threads + ((threadIdx.x >> 5) << 5);
  if((tid & 0x1f) == 0)
    a_stencilBuffer[tid >> 5] = perWarpArray[0];
#endif
}

Thus we can make atoms in shared memory + 2 voting functions, __any and __all are available, so that we can use them. In other cases, only the classical reduction remains.

Testing Stencil

For the needs of ray tracing, such a stencil buffer came up pretty well. On the GTX560 of my old laptop, I get about 4 billion kernel calls per second (that is, 4 billion empty calls per second) - not bad, right ?! With increasing trace depth, performance dropped slightly in accordance with how many real objects we see reflected. Tests were specially performed on the simplest possible reflective scene:

FPS dynamics is as follows: 30, 25, 23.7, 20, 19.4, 18.8

Tags:

Stencil buffer implementation example using CUDA

CUDA software implementation

Testing Stencil

Also popular now: