# Little DirectX and HLSL tricks

Hello, Habr! I decided to write an article about the small tricks I use in my modest engine. It is mostly a note to self, and experienced programmers will only grin, but I think it may come in handy for beginners.

### 1. Matrices in HLSL

Let's say we need to rotate the normal (or tangent, binormal) of a vertex in the vertex shader, and we have a 4x4 world matrix, but we don't want the translation baked into it. Then we simply cast the matrix down to 3x3:

```hlsl
output.Normal = mul(input.Normal.xyz, (float3x3)RotM);
```

By the way, if you need the inverse of a 3x3 rotation matrix and the matrix is orthogonal (a pure rotation), simply transpose it:

```hlsl
float3x3 invMat = transpose(Mat);
```

You can even skip that step: if all you need is a vector transformed by the inverse matrix, it is enough to swap the order of the matrix and the vector in the multiplication:

```hlsl
float3 outVector = mul((float3x3)RotM, inVector.xyz);
```
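To convince yourself of this, here is a small CPU-side sketch in Python (plain lists, no GPU types): for an orthogonal rotation the transpose is the inverse, and swapping the argument order in `mul` is the same as multiplying by the transpose.

```python
import math

def mat_vec(m, v):   # column-vector convention, like mul(M, v)
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]

def vec_mat(v, m):   # row-vector convention, like mul(v, M)
    return [sum(v[r] * m[r][c] for r in range(3)) for c in range(3)]

def transpose(m):
    return [[m[c][r] for c in range(3)] for r in range(3)]

# An orthogonal 3x3 rotation (30 degrees around Z).
a = math.radians(30)
rot = [[math.cos(a), math.sin(a), 0.0],
       [-math.sin(a), math.cos(a), 0.0],
       [0.0, 0.0, 1.0]]

v = [1.0, 2.0, 3.0]

# Swapping the argument order multiplies by the transpose, which for an
# orthogonal rotation is its inverse:
swapped = mat_vec(rot, v)                    # mul(M, v)
via_transpose = vec_mat(v, transpose(rot))   # mul(v, transpose(M))
assert all(abs(x - y) < 1e-9 for x, y in zip(swapped, via_transpose))
```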

You probably know that you can access a single matrix element with syntax like:

```hlsl
float value = World._m30;
```

However, the syntax also lets you pull several values out of the matrix at once. For example, extracting the translation from a transformation matrix:

```hlsl
float3 objPosition = World._m30_m31_m32;
```

### 2. Rendering without a vertex buffer

DX11 lets you submit vertices for rendering without creating a vertex buffer at all. Here is the code for C# with the SharpDX wrapper:

```csharp
System.IntPtr n_IntPtr = new System.IntPtr(0);
device.ImmediateContext.InputAssembler.InputLayout = null;
device.ImmediateContext.InputAssembler.SetVertexBuffers(0, 0, n_IntPtr, n_IntPtr, n_IntPtr);
device.ImmediateContext.InputAssembler.SetIndexBuffer(null, Format.R32_UInt, 0);
device.ImmediateContext.Draw(3, 0);
```

Here we submit three vertices. In the shader we can, for example, build a full-screen triangle out of them that covers the whole viewport:

```hlsl
struct VertexInput
{
    uint VertexID : SV_VertexID;
};

struct PixelInput
{
    float4 Position : SV_POSITION;
};

PixelInput DefaultVS(VertexInput input)
{
    PixelInput output = (PixelInput)0;
    uint id = input.VertexID;
    float x = (id == 2) ? 3.0 : -1.0;
    float y = (id == 1) ? 3.0 : -1.0;
    output.Position = float4(x, y, 1.0, 1.0);
    return output;
}
```
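A quick CPU-side check of the vertex-ID trick, as a Python sketch with a simple point-in-triangle test standing in for the rasterizer: the three generated vertices form one oversized triangle that covers the whole NDC square, so there is no diagonal seam across the screen.

```python
def vertex(vid):
    # Mirrors the HLSL above: one oversized triangle from SV_VertexID.
    x = 3.0 if vid == 2 else -1.0
    y = 3.0 if vid == 1 else -1.0
    return (x, y)

tri = [vertex(i) for i in range(3)]
assert tri == [(-1.0, -1.0), (-1.0, 3.0), (3.0, -1.0)]

def inside(p, a, b, c):
    # Point-in-triangle via signed edge functions (boundary counts as inside).
    def edge(p0, p1, q):
        return (p1[0] - p0[0]) * (q[1] - p0[1]) - (p1[1] - p0[1]) * (q[0] - p0[0])
    d1, d2, d3 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
    return (d1 >= 0 and d2 >= 0 and d3 >= 0) or (d1 <= 0 and d2 <= 0 and d3 <= 0)

# The triangle fully covers the NDC square [-1, 1]^2: every corner is inside,
# so the hardware clips it to exactly one fullscreen pass.
for corner in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    assert inside(corner, *tri)
```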

### 3. Rendering without a pixel shader

Another useful feature is rendering without a pixel shader. This can significantly cut rendering time in some cases, such as a depth prepass or shadow map rendering. We simply don't bind a pixel shader to the pipeline:

```hlsl
pass GS_PSSM
{
    SetBlendState(NoBlending, float4(0.0f, 0.0f, 0.0f, 0.0f), 0xFFFFFFFF);
    SetDepthStencilState(EnableDepth, 0);
    SetPixelShader(NULL); // no pixel shader bound in this pass
}
```

Or, when binding shaders manually, set the pixel shader stage to null (in SharpDX: `context.PixelShader.Set(null);`).

In both cases the pixel shader will not be executed, and the depth interpolated from the vertex shader is written to the depth buffer.

You can go further and bind a pixel shader that returns nothing:

```hlsl
void ZPrepasPS(PixelInputZPrePass input)
{
    float4 albedo = AlbedoMap.Sample(Aniso, input.UV.xy);
    if (albedo.w < AlphaTest.x)
        discard;
}
```

Here an alpha test is performed: if it fails, the pixel is discarded from the pipeline. Otherwise, as in the previous case, the depth interpolated by the vertex shader is written to the depth buffer.

### 4. Alpha to coverage

DX10/11 offers a great way to get a smooth, hardware-assisted alpha test using MSAA. Put simply, the pixel shader can specify directly which samples of each pixel in an MSAA render target pass the test:

```hlsl
static const float2 MSAAOffsets8[8] =
{
    float2(0.0625, -0.1875), float2(-0.0625, 0.1875),
    float2(0.3125, 0.0625),  float2(-0.1875, -0.3125),
    float2(-0.3125, 0.3125), float2(-0.4375, -0.0625),
    float2(0.1875, 0.4375),  float2(0.4375, -0.4375)
};

void ZPrepasPSMS8(PixelInputZPrePass input, out uint coverage : SV_Coverage)
{
    coverage = 0;
    [branch]
    if (AlphaTest.x <= 1 / 255.0)
        coverage = 255; // all 8 samples pass
    else
    {
        float2 tc_ddx = ddx(input.UV.xy);
        float2 tc_ddy = ddy(input.UV.xy);
        [unroll]
        for (int i = 0; i < 8; i++)
        {
            float2 texelOffset = MSAAOffsets8[i].x * tc_ddx + MSAAOffsets8[i].y * tc_ddy;
            float temp = AlbedoMap.Sample(Aniso, input.UV.xy + texelOffset).w;
            if (temp >= 0.5)
                coverage |= 1 << i;
        }
    }
}
```
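The per-sample logic is easy to model on the CPU. A Python sketch of the coverage-mask construction (the alpha values here are made up for illustration):

```python
def coverage_mask(sample_alphas, threshold=0.5):
    # Mirrors the shader loop: set bit i if MSAA sample i passes the alpha test.
    mask = 0
    for i, a in enumerate(sample_alphas):
        if a >= threshold:
            mask |= 1 << i
    return mask

# Half of the 8 samples pass, so half of the coverage bits are set and the
# resolved pixel comes out roughly 50% blended with the background.
alphas = [0.9, 0.1, 0.7, 0.2, 0.6, 0.4, 0.55, 0.3]
mask = coverage_mask(alphas)
assert mask == 0b01010101
assert bin(mask).count("1") == 4
```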

My alpha test happens only at the Z-prepass stage. After the final pass, all we have to do is resolve the MSAA buffer, and the alpha-tested edges come out as smooth as regular geometry (correctly resolving an HDR MSAA buffer is a topic for a separate article).

*Comparative screenshots*

### 5. Screen-space anti-aliasing of normals

This idea occurred to me after implementing the previous trick: I supersample the normal map with UV offsets computed in screen space. Since I use a Forward+ approach with a Z-prepass, the cost of this operation is minimal:

```hlsl
static const float2 MSAAOffsets4[4] =
{
    float2(-0.125, -0.375), float2(0.375, -0.125),
    float2(-0.375, 0.125),  float2(0.125, 0.375)
};

float3 ONormal = float3(0, 0, 0);
float2 tc_ddx = ddx(input.UV.xy);
float2 tc_ddy = ddy(input.UV.xy);
[unroll]
for (int i = 0; i < 4; i++)
{
    float2 texelOffset = MSAAOffsets4[i].x * tc_ddx + MSAAOffsets4[i].y * tc_ddy;
    float4 temp = NormalMap.Sample(Aniso, input.UV.xy + texelOffset * 1.5);
    ONormal += temp.ywy;
}
ONormal *= 0.25;
Normal = ONormal * 2.0f - 1.0f;
```

*Comparative screenshots*

### 6. Normals of double-sided geometry

To avoid lighting artifacts on double-sided triangles, invert the normal when we are looking at the back face:

```hlsl
float3 FinalPS(PixelInput input, bool isFrontFace : SV_IsFrontFace) : SV_Target
{
    input.Normal *= (1 - isFrontFace * 2);
    ...
```

### 7. Getting the texture size in the shader

I don't use this feature myself, since I have doubts about its performance, but it may be useful to someone:

```hlsl
Texture2D texture;
uint width, height;
texture.GetDimensions(width, height);
```

### 8. Sprites with geometry shaders

The arrival of geometry shaders made various optimizations possible, such as faster sprite rendering. Single vertices carrying all the sprite's data are sent to the GPU, and the geometry shader expands each of them into a full sprite:

```hlsl
struct VS_IN
{
    float4 Position : POSITION;
    float4 UV       : TEXCOORD0;
    float4 Rotation : TEXCOORD1;
    float4 Color    : TEXCOORD2;
};

struct VS_OUT
{
    float4 Position : SV_POSITION;
    float4 UV       : TEXCOORD0;
    float4 Rotation : TEXCOORD1;
    float4 Color    : TEXCOORD2;
};

struct GS_OUT
{
    float4 Position : SV_POSITION;
    float2 TexCoord : TEXCOORD0;
    float4 Color    : TEXCOORD1;
};

VS_OUT GSSprite_VS(VS_IN Input)
{
    VS_OUT Output;
    float2 center = (Input.Position.xy + Input.Position.zw) * 0.5;
    float2 size = (Input.Position.zw - center) * 2.0;
    Output.Position = float4(center, size);
    Output.UV = Input.UV;
    Output.Color = Input.Color;
    Output.Rotation = Input.Rotation;
    return Output;
}

[maxvertexcount(6)]
void GSSprite_GS(point VS_OUT In[1], inout TriangleStream<GS_OUT> triStream)
{
    GS_OUT p0 = (GS_OUT) 0;
    GS_OUT p1 = (GS_OUT) 0;
    GS_OUT p2 = (GS_OUT) 0;
    GS_OUT p3 = (GS_OUT) 0;
    In[0].Position.xy = In[0].Position.xy * Resolution.zw * 2.0 - 1.0;
    In[0].Position.y = -In[0].Position.y;
    float2 r = float2(In[0].Rotation.x, -In[0].Rotation.y);
    float2 t = float2(In[0].Rotation.y, In[0].Rotation.x);
    p0.Position = float4(In[0].Position.xy + (-In[0].Position.z * r + In[0].Position.w * t) * Resolution.zw, 0.5, 1.0);
    p0.TexCoord = In[0].UV.xy;
    p0.Color = In[0].Color;
    p1.Position = float4(In[0].Position.xy + (In[0].Position.z * r + In[0].Position.w * t) * Resolution.zw, 0.5, 1.0);
    p1.TexCoord = In[0].UV.zy;
    p1.Color = In[0].Color;
    p2.Position = float4(In[0].Position.xy + (In[0].Position.z * r - In[0].Position.w * t) * Resolution.zw, 0.5, 1.0);
    p2.TexCoord = In[0].UV.zw;
    p2.Color = In[0].Color;
    p3.Position = float4(In[0].Position.xy + (-In[0].Position.z * r - In[0].Position.w * t) * Resolution.zw, 0.5, 1.0);
    p3.TexCoord = In[0].UV.xw;
    p3.Color = In[0].Color;
    triStream.Append(p0);
    triStream.Append(p1);
    triStream.Append(p2);
    triStream.RestartStrip();
    triStream.Append(p0);
    triStream.Append(p2);
    triStream.Append(p3);
    triStream.RestartStrip();
}
```
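The corner math may look cryptic, but `r` and `t` are just the rotated half-axes of the quad. A Python sketch of one corner (assuming `Rotation.xy` holds the cosine and sine of the sprite angle, which is my reading of the shader, not something the article states):

```python
import math

def sprite_corner(half_w, half_h, cos_a, sin_a):
    # Mirrors the GS: r and t are the rotated X and Y half-axes of the quad.
    # The minus sign on sin gives a clockwise rotation, matching Y-down screen space.
    r = (cos_a, -sin_a)
    t = (sin_a, cos_a)
    return (half_w * r[0] + half_h * t[0],
            half_w * r[1] + half_h * t[1])

# Rotating the corner (1, 0) by 90 degrees under this convention gives (0, -1):
a = math.radians(90)
x, y = sprite_corner(1.0, 0.0, math.cos(a), math.sin(a))
assert abs(x) < 1e-9 and abs(y - (-1.0)) < 1e-9
```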

In my measurements, this approach yields roughly a 20-30% speedup on both weak and powerful hardware.

### 9. Lens Flare

I use a similar approach for drawing lens flares, except that the visibility check is done right before the sprite is constructed. First I check how far the effect is from the screen edges, then what fraction of it is occluded by geometry in the depth buffer. If both checks pass, the sprite is built:

```hlsl
static const int2 offset[61] = {
    int2( 0, 0), int2( 1, 0), int2( 1,-1), int2( 0,-1), int2(-1,-1), int2(-1, 0), int2(-1, 1), int2( 0, 1),
    int2( 1, 1), int2( 2, 0), int2( 2,-1), int2( 2,-2), int2( 1,-2), int2( 0,-2), int2(-1, 2), int2(-2,-2),
    int2(-2,-1), int2(-2, 0), int2(-2, 1), int2(-2, 2), int2(-1, 2), int2( 0, 2), int2( 1, 2), int2( 2, 2),
    int2( 2, 1), int2( 3, 0), int2( 3,-1), int2( 1,-3), int2( 0,-3), int2(-1,-3), int2(-3,-1), int2(-3, 0),
    int2(-3, 1), int2(-1,-3), int2( 0, 3), int2( 1, 3), int2( 3, 1), int2( 4, 0), int2( 4,-1), int2( 3,-2),
    int2( 3,-3), int2(-2,-3), int2( 1,-4), int2( 0,-4), int2(-1,-4), int2(-2,-3), int2( 3,-3), int2(-3,-2),
    int2(-4,-1), int2(-4, 0), int2(-4, 1), int2(-3, 2), int2(-3, 3), int2(-2, 3), int2(-1, 4), int2( 0, 4),
    int2( 1, 4), int2( 2, 3), int2( 3, 3), int2( 3, 2), int2( 4, 1)
};

[maxvertexcount(6)]
void GSSprite_GS(point VS_OUT In[1], inout TriangleStream<GS_OUT> triStream, uniform bool MSAA)
{
    LensFlareStruct LFS = LensFlares[In[0].VertexID];
    float4 Position = mul(LFS.Direction, ViewProection);
    float3 NPos = Position.xyz / Position.w;
    float dist = NPos.x + 1;
    dist = min(1 - NPos.x, dist) * ScrRes.z; // aspect correction
    dist = min(NPos.y + 1, dist);
    dist = min(1 - NPos.y, dist);
    dist = min(NPos.z < 0.9, dist);
    dist = saturate(dist * 20);
    if (dist > 0)
    {
        float2 SPos = float2(NPos.x, -NPos.y) * 0.5 + 0.5;
        int2 LPos = round(SPos * ScrRes.xy);
        float v = 0;
        if (MSAA)
        {
            for (int i = 0; i < 61; i++)
                v += DepthTextureMS.Load(LPos + offset[i], 0) < NPos.z;
        }
        else
        {
            for (int i = 0; i < 61; i++)
                v += DepthTexture.Load(uint3(LPos + offset[i], 0)) < NPos.z;
        }
        v = pow(v / 61.0, 2.0);
        dist *= v;
        if (dist > 0)
        {
            float2 Size = LFS.Size.xy * float2(ScrRes.w, 1);
            Quad(triStream, Position, LFS.UV, Size * saturate(dist + 0.1), LFS.Color.xyz * dist);
        }
    }
}
```
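The occlusion part boils down to counting how many of the 61 depth probes the flare wins, then squaring the ratio to soften the falloff. A Python sketch (the comparison direction depends on your depth convention; reversed-Z flips it):

```python
def flare_visibility(depth_samples, flare_depth):
    # Fraction of probes around the flare position that pass the depth test,
    # squared to soften the falloff (mirrors v = pow(v / 61.0, 2.0)).
    passed = sum(1 for d in depth_samples if d < flare_depth)
    return (passed / len(depth_samples)) ** 2

# Every probe passes: full brightness.
assert flare_visibility([0.1] * 61, 0.5) == 1.0
# No probe passes: the sprite is culled in the GS.
assert flare_visibility([0.9] * 61, 0.5) == 0.0
# Partial overlap fades the flare smoothly instead of popping on and off.
v = flare_visibility([0.1] * 30 + [0.9] * 31, 0.5)
assert 0.0 < v < 1.0
```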

### 10. PSSM rendering with geometry shaders

Another great example is the Parallel-Split Shadow Maps optimization with geometry shaders from GPU Gems. Instead of issuing a separate draw call to render an object into each split, we let the GPU duplicate the geometry and render it into different render targets in a single draw call:

```hlsl
// Note: the input struct name and the geometry shader signature below were
// lost in formatting and are restored from context (In[...], triStream).
struct GS_IN
{
    float4 pos : SV_POSITION;
    float4 UV1 : TEXCOORD0;
    nointerpolation uint instId : SV_InstanceID;
};

struct GS_OUT
{
    float4 pos : SV_POSITION;
    float2 Texcoord : TEXCOORD0;
    nointerpolation uint RTIndex : SV_RenderTargetArrayIndex;
};

[maxvertexcount(SPLITCOUNT * 3)]
void PSSM_GS(triangle GS_IN In[3], inout TriangleStream<GS_OUT> triStream)
{
    // For each split this instance has to be rendered into
    for (int split = IstanceData[In[0].instId].Start; split <= IstanceData[In[0].instId].Stop; split++)
    {
        GS_OUT Out;
        // Set the render target index.
        Out.RTIndex = split;
        // For each vertex of the triangle
        [unroll(3)]
        for (int vertex = 0; vertex < 3; vertex++)
        {
            // Transform the vertex with the split-specific crop matrix.
            Out.pos = mul(In[vertex].pos, cropMatrix[split]);
            Out.Texcoord = In[vertex].UV1.xy;
            // Append the vertex to the stream
            triStream.Append(Out);
        }
        // Mark the end of the triangle
        triStream.RestartStrip();
    }
}
```

### 11. Instancing

With the move to DX11, instanced rendering became much simpler. It is no longer necessary to create an additional vertex stream with per-instance data; you can simply specify how many instances you need:

```csharp
device.ImmediateContext.DrawIndexedInstanced(IndicesCount, Meshes.Count, StartInd, 0, 0);
```

And then, in the shader, each instance gets its index, which is used to look up the extra per-instance data:

```hlsl
struct PerInstanceData
{
    float4x4 WVP;
    float4x4 World;
    int Start;
    int Stop;
};

StructuredBuffer<PerInstanceData> IstanceData : register(t16);

PixelInput DefaultVS(VertexInput input, uint id : SV_InstanceID)
{
    PixelInput output = (PixelInput) 0;
    output.Position = mul(float4(input.Position.xyz, 1), IstanceData[id].WVP);
    output.UV.xy = input.UV;
    output.WorldPos = mul(float4(input.Position, 1), IstanceData[id].World).xyz;
    ...
```
...

### 12. Converting a 2D UV and face index into a cubemap vector

This comes in handy when working with cubemaps:

```hlsl
static const float3 offsetV[6] = { float3(1,1,1),  float3(-1,1,-1), float3(-1,1,-1), float3(-1,-1,1), float3(-1,1,1), float3(1,1,-1) };
static const float3 offsetX[6] = { float3(0,0,-2), float3(0,0,2),   float3(2,0,0),   float3(2,0,0),   float3(2,0,0),  float3(-2,0,0) };
static const float3 offsetY[6] = { float3(0,-2,0), float3(0,-2,0),  float3(0,0,2),   float3(0,0,-2),  float3(0,-2,0), float3(0,-2,0) };

float3 ConvertUV(float2 UV, int FaceIndex)
{
    float3 outV = offsetV[FaceIndex] + offsetX[FaceIndex] * UV.x + offsetY[FaceIndex] * UV.y;
    return normalize(outV);
}
```
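A direct Python port of `ConvertUV` is handy for sanity-checking the tables: the center of each face (UV = 0.5, 0.5) should map onto the corresponding axis in the D3D cube-face order +X, -X, +Y, -Y, +Z, -Z.

```python
import math

# Same per-face constant tables as the HLSL above.
OFFSET_V = [(1, 1, 1), (-1, 1, -1), (-1, 1, -1), (-1, -1, 1), (-1, 1, 1), (1, 1, -1)]
OFFSET_X = [(0, 0, -2), (0, 0, 2), (2, 0, 0), (2, 0, 0), (2, 0, 0), (-2, 0, 0)]
OFFSET_Y = [(0, -2, 0), (0, -2, 0), (0, 0, 2), (0, 0, -2), (0, -2, 0), (0, -2, 0)]

def convert_uv(u, v, face):
    # Offset into the face plane, then normalize to a direction vector.
    o, x, y = OFFSET_V[face], OFFSET_X[face], OFFSET_Y[face]
    d = [o[i] + x[i] * u + y[i] * v for i in range(3)]
    n = math.sqrt(sum(c * c for c in d))
    return tuple(c / n for c in d)

# Each face center maps onto its axis in the D3D order +X, -X, +Y, -Y, +Z, -Z:
axes = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
for face, axis in enumerate(axes):
    assert convert_uv(0.5, 0.5, face) == axis
```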

### 13. Optimizing the Gaussian filter

And finally, an easy way to optimize a Gaussian blur. We exploit hardware bilinear filtering: each tap samples between two adjacent texels with a precomputed offset, which minimizes the total number of samples:

```hlsl
static const float Shift[4] = { 0.4861161486, 0.4309984373, 0.3775380497, 0.3269038909 };
static const float Mult[4]  = { 0.194624, 0.189416, 0.088897, 0.027063 };

float3 GetGauss15(Texture2D Tex, float2 UV, float2 dx)
{
    float3 rez = 0;
    for (int i = 1; i < 4; i++)
        rez += (Tex.Sample(LinSampler, UV + (Shift[i] + i * 2) * dx).xyz + Tex.Sample(LinSampler, UV - (Shift[i] + i * 2) * dx).xyz) * Mult[i];
    rez += Tex.Sample(LinSampler, UV).xyz * 0.134598;
    rez += (Tex.Sample(LinSampler, UV + dx).xyz + Tex.Sample(LinSampler, UV - dx).xyz) * 0.127325;
    return rez;
}
```
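Why this works: under linear filtering, one tap placed between two texels at offset `w2/(w1+w2)` with weight `w1+w2` returns exactly the same result as two point taps with weights `w1` and `w2`. A Python sketch with illustrative weights (not the exact kernel above):

```python
def combine(w1, w2):
    # Merge two discrete taps into one bilinear tap: (combined weight, offset).
    return w1 + w2, w2 / (w1 + w2)

def bilinear(t1, t2, frac):
    # What the texture unit returns for a sample `frac` of the way from t1 to t2.
    return t1 * (1.0 - frac) + t2 * frac

w1, w2 = 0.194624, 0.121801      # two adjacent discrete weights (example values)
weight, shift = combine(w1, w2)

t1, t2 = 0.8, 0.3                # two adjacent texel values
direct = w1 * t1 + w2 * t2                 # two point samples
merged = weight * bilinear(t1, t2, shift)  # one hardware-filtered sample
assert abs(direct - merged) < 1e-12
```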

That's the whole baker's dozen. I hope someone finds this material useful.