CPU vs GPU. Distance field

    Hello everybody. I already wrote about Distance Field once and cited the implementation with a “heuristic” code that gives good speed: “Honest glow and speed” .

    Why is it needed?


    DField can be used:
    • To significantly improve font quality
    • For effects like burning contour. One of the effects I cited in my previous article
    • For the “metaballs” effect, but in 2d and for any complex shapes. (maybe I will someday give an example of the implementation of this effect)
    • And at the moment I need DField for high-quality smoothing of corners and removal of small details.

    And if in the first two cases we can calculate DField in advance, then for other effects we need to calculate it in real time.
    The article will consider the most popular one, I would say the classic Chamfer distance (CDA) with a bunch of pictures explaining the principle of its operation, as well as a two-pass GPU algorithm.
    Both algorithms are implemented in demo programs on FPC .

    Chamfer classic on CPU


    Today I would like to draw attention to the classic way of calculating DField. You can play around here (source code in git). The algorithm has an unsustainable name: chamfer distance algoritm . This method has already described RPG in its topic . However, there is no explanation of why and how this code works. Why there are errors, what they are, and how to reduce them.
    upd. In RPG modification CDA algorithm which successfully fights with errors.

    Under the hood of a Chamfer


    So, we have a monochrome image from which we want to build Distance Field. This is a white dot on a black background:



    Let's start building DField. To do this, first fill our future image with + infinity for voids, and zeros for the object. Something like this:



    Then we start to run through the image (from left to right and from top to bottom), and for each pixel check its neighbors. Take a value in a neighboring pixel, add the distance to a neighbor to this value. For pixels on the right and left, this distance = 1. For pixels on the diagonal sqrt (1 * 1 + 1 * 1) = sqrt (2). We write in our pixel the minimum value between the current one and the one just calculated by the neighbor. After we do this for all the neighbors, we move on to the next pixel. We got the following picture:



    As we ran from left to right and from top to bottom, distances accumulated from previous calculations, and pixels marked with red arrows were automatically counted as “true”. Now, if we run in the opposite direction (from right to left, from bottom to top), then the missing part of the card will be filled.

    So, the main idea is to use an already calculated early distance. To do this, we do not need to check all the neighbors. It is enough to check the pixels that we have already run through:


    Red - the pixels of the neighbors are marked for the first pass (to the right, down). In blue - for the second (left, up).

    The first pass will give us a beautiful loop in one direction: the



    second will give in the other direction , and since we will record only the minimum value in both cases, in total it will look like this:


    The complexity of the algorithm is O (2 * W * H * 5)

    Now let's talk about accuracy. There are no problems in the current image, but if you try to build a path (as in this article ), the result will not be the most plausible. Let's look at the distance% 16 contour:



    With the red arrows, I marked the pronounced octogonality of the image. Why is this happening? And everything is simple, because our result accumulates, and it accumulates incorrectly:



    Our distance will be calculated from the red line: 1 + sqrt (2), while it will correctly calculate from the blue line: sqrt (2 * 2 + 1 * 1 ) = sqrt (3). If we take in the calculations not only the neighbors, but also the neighbors of the neighbors:



    The result, of course, will be better. But computational complexity increases from O (2 * W * H * 5) to O (2 * W * H * 13). But there is good news, we can throw out part of the neighbors without affecting the result:



    The fact is that the thrown out neighbors have a multiple distance and lie on the same beam. So we can throw them away, because the value from the nearest ones will correctly accumulate. The matrix of neighbors I will call the core. In the first case, we had a 3x3 core. Now the core is 5x5. Let's look at the result:



    Our octogonality has become noticeably rounder. (By the way, in Photoshop for layers there is an Outer glow effect with the precise parameter. My result exactly coincided after passing the 5x5 core. It seems they use this algorithm for this effect.)
    Difficulty for optimized 5x5 = O (2 * W * H *9)
    You can continue to increase and increase the size of the core, the quality will increase, but we will not be able to overcome one unpleasant effect. Here is the image for the 13 * 13 core:



    There are step gradients in the image. They are still associated with the notorious error, and in order to completely get rid of them we would have to create a kernel the size of two image widths, because the diagonal with sides (Many; 1) can cover only a core of size 2 * Many + 1. (various modifications of CDA are struggling with these errors, one of the options is to store the vector in the pixel to the nearest one, but I will not touch on them in the article).

    So in total, the advantages of the algorithm

    • Unlimited Distance Field (what is this we will understand when we deal with the algorithm on the GPU).
    • Linear complexity, and high speed for small cores.
    • Ease of implementation.


    Minuses

    • Poor accuracy (especially for small cores)
    • Dependence on previous calculations (no SIMD for you)

    GPU to battle


    Unfortunately, I did not find any popular algorithms that fit into the multi-threaded GPU architecture. Either my hands are not the same, or Google squeezed. Therefore, I will share with you my "bike." Fortunately, it is simple. You can play around here , the sources are in git. For the game you will need a video card that supports OpenGL3 and Rect textures.

    So, let's say it’s important for us to calculate the Distance Field within a radius of 40 pixels. The first pass we consider only the vertical distance from 40 neighbors from above and from 40 neighbors from below for each pixel. We record the minimum value (if there are no filled neighbors, we record the maximum value, 40). We get this picture:



    After that we make the second pass horizontally. Similarly, 40 neighbors on the left, 40 neighbors on the right. Knowing the distance to the neighbor, + the vertical distance from the neighbor, we easily consider the hypotenuse:



    In the result, write the minimum hypotenuse. After the second pass, a beautiful and correctly constructed picture awaits us:



    The complexity of the O (2 * W * H * (2 * D + 1)) algorithm, where W and H are the image sizes and D is the distance field distance. The distance field we get is normalized (there will be no values ​​greater than D in the image).
    The accuracy of the calculations is excellent, as errors do not accumulate:



    There are three modes in the appendix to the article: GL_R8, GL_R16, GL_R32F. If you enable GL_R8 and twist the distance, you can notice the jumping errors. Here's the thing. For an intermediate vertical DField, the texture for storage is 8 bits per pixel. Since I normalize the distance to the range [0; 1], errors arise. You need to either store the non-normalized distance (but then for an eight-bit texture it will be limited to 255 pixels), or increase the texture bit depth. For the R16 texture, these errors are smaller, and for R32F these errors are generally “absent”, because this is a float texture. Keep this in mind if you want to implement a similar distance field.

    So, the pros:

    • Absolute accuracy
    • It fits perfectly on SIMD.
    • Ease of implementation
    • The data immediately lies on the GPU!


    Minuses

    • Limited distance. We need to know what distance we need in advance.
    • The complexity of the algorithm depends on the distance
    • Data on the GPU (this may require additional time to receive, if we plan to continue working with this data on the CPU)

    Conclusions?


    I have a GF450 in my home computer. For an image of Hello world and DField = 500 pixels, my GPU manages to do 20 frames per second, which is approximately 50ms per frame. All this with excellent quality and simple transparent code. On the CPU, the 3x3 distance field core calculates about 100ms. 5x5 about 200ms. Even if you accelerate by optimizing for the CPU 4 times (which is very cool), then I will get quality noticeably worse in the same time. Use a GPU if algorithms allow it.

    Related Links


    1. sourceforge.net/projects/dfsamples - Actually examples for the article. Binaries and source codes.
    2. www.springer.com/cda/content/document/cda_downloaddocument/9780387312019-c1.pdf is an excellent analysis of both the Chamfer algorithm and its modifications. Comparison of error and runtime.
    3. www.valvesoftware.com/publications/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf - Using DField in font rendering. Perhaps the most famous article.
    4. habrahabr.ru/post/215905 - the above article from RPG . It shows the application of DField well.

    Also popular now: