OpenGL ES 2.0. One million particles
- Tutorial
In this article, we will consider one of the options for implementing a particle system on OpenGL ES 2.0. We will talk in detail about limitations, describe the principles and analyze a small example.

In general, we will require two additional properties from OpenGL ES 2.0 (specification does not require their availability):
Processor Information:
There is a problem with NVIDIA Tegra 2/3/4 , a number of popular devices such as Nexus 7, HTC One X, ASUS Transformer work on this series .
Considering the particle systems generated on the CPU in the context of increasing the amount of processed data (number of particles), the main performance problem is copying (unloading) data from RAM into the video device memory on each frame. Therefore, our main task is to avoid this copying by transferring the calculations to non-operational mode on the GPU.
The essence of the method is to use values (coordinates, accelerations, etc.) - textures as the data storage buffer characterizing the particle, and process them using vertex and fragment shaders. Just like we store and load normals, talking about Normal mapping . The size of the buffer, in our case, is proportional to the number of particles processed. Each texel stores a separate value (values, if there are several) for a single particle. Accordingly, the number of processed quantities is inversely related to the number of particles. For example, in order to process positions and accelerations for 1048576 particles, we need two 1024x1024 textures (if there is no need to maintain aspect ratio)
There are additional restrictions that we need to consider. To be able to record any information, the texture pixel data format must be supported by the implementation as a color-renderable format . This means that we can use the texture as a color buffer as part of the off -screen rendering . The specification describes only three such formats: GL_RGBA4, GL_RGB5_A1, GL_RGB565 . Given the subject area, we need at least 32 bits per pixel to process values such as coordinates or accelerations (for the two-dimensional case). Therefore, the formats mentioned above are not enough for us.
To provide the necessary minimum, we will consider two additional types of textures: GL_RGBA8and GL_RGBA16F . Such textures are often called LDR (SDR) and HDR textures, respectively.
According to GPUINFO for 2013 - 2015, support for extensions is as follows:
Generally speaking, HDR textures are more suitable for our purposes. First, they allow us to process more information without sacrificing performance, for example, manipulating particles in three-dimensional space without increasing the number of buffers. Secondly, there is no need for intermediate mechanisms for unpacking and packing data when reading and writing, respectively. But, due to the weak support for HDR textures, we will opt for LDR.
So, back to the point, the general scheme of what we will do looks like this:

The first thing we need is to break the calculations into passes. The partitioning depends on the quantity and type of characterizing quantities that we are going to process. Based on the fact that we have a texture as a data buffer and taking into account the restrictions on the format of pixel data described above, each pass can process no more than 32 bits of information for each particle. For example, on the first pass, we calculated the accelerations (32 bits, 16 bits per component), on the second, we updated the positions (32 bits, 16 bits per component).
Each pass processes data in double buffering mode. This provides access to the system state of the previous frame.
Pass core, this is a normal texture mappinginto two triangles, where our data buffers act as texture maps. The general view of the shaders is as follows:
The implementation of the unpacking / packing functions depends on the values that we process. At this stage, we rely on the requirement described at the beginning about high accuracy of calculations.
For example, for two-dimensional coordinates (components [x, y] of 16 bits), the functions may look like this:
After the calculation stage, the rendering stage follows. To access the particles at this stage, we need, for enumeration, some external index. Vertex Buffer Object with texture coordinates of the data buffer will act as such an index . The index is created and initialized (uploaded to the video device memory) once and does not change in the process.
At this step, the requirement for access to texture maps takes effect. The vertex shader is similar to the fragment shader from the calculation stage:
As a small example, we will try to generate a dynamic system of 1,048,576 particles, known as a Strange Attractor . Frame processing consists of several stages:


At the stage of calculations, we will have only one independent passage responsible for the positioning of particles. It is based on a simple formula:
This system is also called Peter de Jong Attractors . Over time, we will only change the coefficients.
In the scene rendering stage, we will render our particles with regular sprites.
In conclusion, at the post-processing stage , we will impose several
Gradient mapping effects . Adds color content based on the brightness of the original image.
Bloom . Adds a little glow.
Project repository on GitHub .
Currently available:
In this example, we see that the stage of calculations, the stage of rendering the scene, and post-processing consist of several passes dependent on each other.
In the next part, we will try to consider the implementation of multi-pass rendering, taking into account the requirements imposed by each stage.
I will be glad to comments and suggestions (it is possible by mail yegorov.alex@gmail.com)
Thank you!

Limitations
In general, we will require two additional properties from OpenGL ES 2.0 (specification does not require their availability):
- Vertex Texture Fetch . Allows us to access texture maps through texture units from the vertex shader. You can request the maximum number of units supported by the GPU using the glGetIntegerv function with the parameter name GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS. The table below presents data on today's popular processors.
- Fragment high floating-point precision . Allows us to perform high precision calculations in the fragment shader. You can request accuracy and range of values using the glGetShaderPrecisionFormat function with the parameter names GL_FRAGMENT_SHADER and GL_HIGH_FLOAT for the shader type and data type, respectively. For all processors listed in the table, the accuracy is 23 bits with a range of values from -2 ^ 127 to 2 ^ 127, with the exception of Snapdragon Andreno 2xx , for this series the range is from -2 ^ 62 to 2 ^ 62.
Processor Information:
| CPU | Vertex TIU | Accuracy | Range | 
|---|---|---|---|
| Snapdragon adreno 2xx | 4 | 23 | [-2 ^ 62, 2 ^ 62] | 
| Snapdragon Adreno 3xx | 16 | 23 | [-2 ^ 127, 2 ^ 127] | 
| Snapdragon Adreno 4xx | 16 | 23 | [-2 ^ 127, 2 ^ 127] | 
| Snapdragon Adreno 5xx | 16 | 23 | [-2 ^ 127, 2 ^ 127] | 
| Intel HD Graphics | 16 | 23 | [-2 ^ 127, 2 ^ 127] | 
| ARM Mali-T6xx | 16 | 23 | [-2 ^ 127, 2 ^ 127] | 
| ARM Mali-T7xx | 16 | 23 | [-2 ^ 127, 2 ^ 127] | 
| ARM Mali-T8xx | 16 | 23 | [-2 ^ 127, 2 ^ 127] | 
| NVIDIA Tegra 2/3/4 | 0 | 0 | 0 | 
| NVIDIA Tegra K1 / X1 | 32 | 23 | [-2 ^ 127, 2 ^ 127] | 
| PowerVR SGX (Series5) | 8 | 23 | [-2 ^ 127, 2 ^ 127] | 
| PowerVR SGX (Series5XT) | 8 | 23 | [-2 ^ 127, 2 ^ 127] | 
| PowerVR Rogue (Series6) | 16 | 23 | [-2 ^ 127, 2 ^ 127] | 
| PowerVR Rogue (Series6XT) | 16 | 23 | [-2 ^ 127, 2 ^ 127] | 
| VideoCore IV | 8 | 23 | [-2 ^ 127, 2 ^ 127] | 
| Vivante gc1000 | 4 | 23 | [-2 ^ 127, 2 ^ 127] | 
| Vivante gc4000 | 16 | 23 | [-2 ^ 127, 2 ^ 127] | 
There is a problem with NVIDIA Tegra 2/3/4 , a number of popular devices such as Nexus 7, HTC One X, ASUS Transformer work on this series .
Particle system
Considering the particle systems generated on the CPU in the context of increasing the amount of processed data (number of particles), the main performance problem is copying (unloading) data from RAM into the video device memory on each frame. Therefore, our main task is to avoid this copying by transferring the calculations to non-operational mode on the GPU.
Recall that in OpenGL ES 2.0 there are no built-in mechanisms such as Transform Feedback (available in OpenGL ES 3.0) or Compute Shader (available in OpenGL ES 3.1) that allow you to perform calculations on the GPU.
The essence of the method is to use values (coordinates, accelerations, etc.) - textures as the data storage buffer characterizing the particle, and process them using vertex and fragment shaders. Just like we store and load normals, talking about Normal mapping . The size of the buffer, in our case, is proportional to the number of particles processed. Each texel stores a separate value (values, if there are several) for a single particle. Accordingly, the number of processed quantities is inversely related to the number of particles. For example, in order to process positions and accelerations for 1048576 particles, we need two 1024x1024 textures (if there is no need to maintain aspect ratio)
There are additional restrictions that we need to consider. To be able to record any information, the texture pixel data format must be supported by the implementation as a color-renderable format . This means that we can use the texture as a color buffer as part of the off -screen rendering . The specification describes only three such formats: GL_RGBA4, GL_RGB5_A1, GL_RGB565 . Given the subject area, we need at least 32 bits per pixel to process values such as coordinates or accelerations (for the two-dimensional case). Therefore, the formats mentioned above are not enough for us.
To provide the necessary minimum, we will consider two additional types of textures: GL_RGBA8and GL_RGBA16F . Such textures are often called LDR (SDR) and HDR textures, respectively.
- GL_RGBA8 is supported by the specification, we can load and read textures with this format. For recording, we need to require the OES_rgb8_rgba8 extension .
- GL_RGBA16F is not supported by the specification, in order to load and read textures with this format, we need the GL_OES_texture_half_float extension . Moreover, in order to get an acceptable quality result, we need support for linear minification and magnification filters of such textures. The GL_OES_texture_half_float_linear extension is responsible for this . For the record we need the extension GL_EXT_color_buffer_half_float .
According to GPUINFO for 2013 - 2015, support for extensions is as follows:
| Expansion | Devices (%) | 
|---|---|
| OES_rgb8_rgba8 | 98.69% | 
| GL_OES_texture_half_float | 61.5% | 
| GL_OES_texture_half_float_linear | 43.86% | 
| GL_EXT_color_buffer_half_float | 32.78% | 
Generally speaking, HDR textures are more suitable for our purposes. First, they allow us to process more information without sacrificing performance, for example, manipulating particles in three-dimensional space without increasing the number of buffers. Secondly, there is no need for intermediate mechanisms for unpacking and packing data when reading and writing, respectively. But, due to the weak support for HDR textures, we will opt for LDR.
So, back to the point, the general scheme of what we will do looks like this:

The first thing we need is to break the calculations into passes. The partitioning depends on the quantity and type of characterizing quantities that we are going to process. Based on the fact that we have a texture as a data buffer and taking into account the restrictions on the format of pixel data described above, each pass can process no more than 32 bits of information for each particle. For example, on the first pass, we calculated the accelerations (32 bits, 16 bits per component), on the second, we updated the positions (32 bits, 16 bits per component).
Each pass processes data in double buffering mode. This provides access to the system state of the previous frame.
Pass core, this is a normal texture mappinginto two triangles, where our data buffers act as texture maps. The general view of the shaders is as follows:
// Вершинный шейдер
attribute vec2 a_vertex_xy;
attribute vec2 a_vertex_uv;
varying vec2 v_uv;
void main()
{
    gl_Position = vec4(a_vertex_xy, 0.0, 1.0);
    v_uv = a_vertex_uv;
}// Фрагментный шейдер
precision highp float;
varying vec2 v_uv;
// состояние предыдущего кадра
// (если необходимо)
uniform sampler2D u_prev_state;
// данные из предшествующих проходов
// (если необходимо)
uniform sampler2D u_pass_0;
...
uniform sampler2D u_pass_n;
// функции распаковки
 unpack(vec4 raw);
 unpack_0(vec4 raw);
...
 unpack_1(vec4 raw)
// функция запаковки
vec4 pack( data);
void main()
{
    // распаковываем необходимые для расчетов данные
    // для частицы с индексом v_uv
     data = unpack(texture2D(u_prev_state, v_uv));
     data_pass_0 = unpack_0(texture2D(u_pass_0, v_uv));
    ...
     data_pass_n = unpack_n(texture2D(u_pass_n, v_uv));
    // получаем результат
     result = ...
    // упаковываем и сохраняем результат
    gl_FragColor = pack(result);
}        The implementation of the unpacking / packing functions depends on the values that we process. At this stage, we rely on the requirement described at the beginning about high accuracy of calculations.
For example, for two-dimensional coordinates (components [x, y] of 16 bits), the functions may look like this:
vec4 pack(vec2 value)
{
    vec2 shift = vec2(255.0, 1.0);
    vec2 mask = vec2(0.0, 1.0 / 255.0);
    vec4 result = fract(value.xxyy * shift.xyxy);
    return result - result.xxzz * mask.xyxy;
}
vec2 unpack(vec4 value)
{
    vec2 shift = vec2(1.0 / 255.0, 1.0);
    return vec2(dot(value.xy, shift), dot(value.zw, shift));
}Drawing
After the calculation stage, the rendering stage follows. To access the particles at this stage, we need, for enumeration, some external index. Vertex Buffer Object with texture coordinates of the data buffer will act as such an index . The index is created and initialized (uploaded to the video device memory) once and does not change in the process.
At this step, the requirement for access to texture maps takes effect. The vertex shader is similar to the fragment shader from the calculation stage:
// Вершинный шейдер
// индекс данных
attribute vec2 a_data_uv;
// буфер данных, хранящий позиции
uniform sampler2D u_positions;
// дополнительные данные (если необходимо)
uniform sampler2D u_data_0;
...
uniform sampler2D u_data_n;
// функция распаковки позиций
vec2 unpack(vec4 data);
// функции распаковки дополнительных данных
 unpack_0(vec4 data);
    ...
 unpack_n(vec4 data);
void main()
{
    // распаковываем и обрабатываем позиции частиц
    vec2 position = unpack(texture2D(u_positions, a_data_uv));
    gl_Position = vec4(position * 2.0 - 1.0, 0.0, 1.0);
    // распаковываем и обрабатываем дополнительные данные
     data_0 = unpack(texture2D(u_data_0, a_data_uv));
    ...
     data_n = unpack(texture2D(u_data_n, a_data_uv));
}    Example
As a small example, we will try to generate a dynamic system of 1,048,576 particles, known as a Strange Attractor . Frame processing consists of several stages:


Compute stage
At the stage of calculations, we will have only one independent passage responsible for the positioning of particles. It is based on a simple formula:
    Xn+1 = sin(a * Yn) - cos(b * Xn)
    Yn+1 = sin(c * Xn) - cos(d * Yn)This system is also called Peter de Jong Attractors . Over time, we will only change the coefficients.
// Вершинный шейдер
attribute vec2 a_vertex_xy;
varying vec2 v_uv;
void main()
{
    gl_Position = vec4(a_vertex_xy, 0.0, 1.0);
    v_uv = a_vertex_xy * 0.5 + 0.5;
}// Фрагментный шейдер
precision highp float;
varying vec2 v_uv;
uniform lowp float u_attractor_a;
uniform lowp float u_attractor_b;
uniform lowp float u_attractor_c;
uniform lowp float u_attractor_d;
vec4 pack(vec2 value)
{
    vec2 shift = vec2(255.0, 1.0);
    vec2 mask = vec2(0.0, 1.0 / 255.0);
    vec4 result = fract(value.xxyy * shift.xyxy);
    return result - result.xxzz * mask.xyxy;
}
void main()
{
    vec2 pos = v_uv * 4.0 - 2.0;
    for(int i = 0; i < 3; ++i)
    {
        pos = vec2(sin(u_attractor_a * pos.y) - cos(u_attractor_b * pos.x),
                   sin(u_attractor_c * pos.x) - cos(u_attractor_d * pos.y));
    }
    pos = clamp(pos, vec2(-2.0), vec2(2.0));
    gl_FragColor = pack(pos * 0.25 + 0.5);
}Renderer stage
In the scene rendering stage, we will render our particles with regular sprites.
// Вершинный шейдер
// индекс
attribute vec2 a_positions_uv;
// позиции (буфер, полученный на этапе вычислений)
uniform sampler2D u_positions;
varying vec4 v_color;
vec2 unpack(vec4 value)
{
    vec2 shift = vec2(0.00392156863, 1.0);
    return vec2(dot(value.xy, shift), dot(value.zw, shift));
}
void main()
{
    vec2 position = unpack(texture2D(u_positions, a_positions_uv));
    gl_Position = vec4(position * 2.0 - 1.0, 0.0, 1.0);
    v_color = vec4(0.8);
}// Фрагментый шейдер
precision lowp float;
varying vec4 v_color;
void main()
{
    gl_FragColor = v_color;
}Result

Postprocessing
In conclusion, at the post-processing stage , we will impose several
Gradient mapping effects . Adds color content based on the brightness of the original image.
Result

Bloom . Adds a little glow.
Result

The code
Project repository on GitHub .
Currently available:
- The main code base (C ++ 11 / C ++ 14) along with shaders;
- Examples and demos of applications;- Liquid Fluid simulation;
- Light Scattered. Adaptation of the effect of scattered light;
- Strange Attractors. The example described in the article;
- Wind field. Implementation of Navier-Stokes with a large number of particles (2 ^ 20);
- Flame Simulation. Flame simulation.
 
- Client for Android.
Continuation
In this example, we see that the stage of calculations, the stage of rendering the scene, and post-processing consist of several passes dependent on each other.
In the next part, we will try to consider the implementation of multi-pass rendering, taking into account the requirements imposed by each stage.
I will be glad to comments and suggestions (it is possible by mail yegorov.alex@gmail.com)
Thank you!