Yarr: a dataflow framework for image processing in Haskell



    Sounding out the situation on Reddit showed that hardly anyone is seriously doing image processing in Haskell, even though the fairly popular Repa library names working with images as one of its main applications. I hope the Yarr library (documentation, github) can change that.

    I call the library a dataflow framework because it is generalized to processing arrays (from one- to three-dimensional) of elements of any type, including vectors of numbers such as coordinates or complex numbers. The main intended use, though, is processing two-dimensional arrays of vectors of color components, i.e. images. The framework does not itself contain image processing algorithms; rather, it provides a powerful infrastructure for writing them.


    Essence


    The framework defines a family of array representation types (UArray). Representations come in 2 kinds:
    • Explicit: array elements really reside somewhere in memory. There are representations backed by low-level GHC arrays and by chunks of raw memory from malloc (the latter only for elements of the Storable class).
    • Implicit: such representations encapsulate a function (index -> element).


    There is also a large number of functions in three main areas (a short sketch tying them together follows the list):
    • Mappings (map), combining functions (zipWith), and convolution with a kernel take one or more arrays as input and return an implicit array.
    • Loading (Load) functions turn an implicit array into an explicit one, i.e. actually write the former into memory in some form.
    • Walking (Walk) functions traverse an array, optionally with state, e.g. folds (fold).
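    To make the split concrete, here is a minimal sketch of the implicit-to-explicit round trip. It uses only functions that appear in this article's examples; the exact shapes of newEmpty and extent are my reading of them, not copied from the docs.

    import Data.Yarr
    import Data.Yarr.Shape as S
    import Data.Yarr.Repr.Foreign

    sumArrays :: UArray F L Dim1 Float -> UArray F L Dim1 Float
              -> IO (UArray F L Dim1 Float)
    sumArrays a b = do
        let delayed = dzip2 (+) a b     -- implicit: essentially index -> element
        result <- newEmpty (extent a)   -- explicit: raw foreign memory
        loadS S.fill delayed result     -- compute: write the elements into memory
        return result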


    Those familiar with the Repa library will notice that Yarr is very similar to it. The main differences:
    • Most functions have "component-wise" variants, which is convenient when working with arrays of vectors, for example images in RGB format.
    • The folding (fold) capabilities are greatly extended.
    • Yarr is often faster than Repa. For example, the Canny edge detector on Yarr runs about 2 times faster than the one on Repa.


    Examples


    Using the framework yields a mixture of imperative and functional code rather than purely functional code, because literally everything is immersed in the IO monad. This is the price paid for high performance.

    Computing a lightness histogram, one of the stages of equalization (the first arrow in the picture at the top):
    import Data.Yarr
    import Data.Yarr.Walk
    import Data.Yarr.Shape as S
    import Data.Yarr.Repr.Foreign
    ...
    computeHist :: UArray FS L Dim2 Float -> IO (UArray F L Dim1 Int)
    computeHist pixelValues =
        -- parallel stateful walk: each thread starts from its own
        -- zeroed 256-bin histogram (newEmpty 256); per-thread results
        -- are merged pairwise by joinHists
        walkP caps (mutate (S.unrolledFill n8 noTouch) incHist)
                   (newEmpty 256) joinHists pixelValues
      where
        -- increment the bin corresponding to one pixel's lightness
        incHist :: UArray F L Dim1 Int -> Float -> IO ()
        incHist hist value = do
            let level = truncate (value * 255.99)
            c <- index hist level
            write hist level (c + 1)
        -- element-wise sum of two partial histograms, stored into h1
        joinHists h1 h2 = do
            loadS S.fill (dzip2 (+) h1 h2) h1
            return h1
    


    This article reports the minimum runtime over many runs, per image pixel, for varying numbers of threads. The unit of measure is the processor cycle. The Yarr version is 1.2.3; test configuration, common Makefile, test image (2560 × 1600, 6 MB).


    Computing the direction and magnitude of the gradient at each point of a smoothed black-and-white image, one of the stages of edge detection (the second arrow in the picture at the top).
    The gradientMagOrient function:
    I don't think it makes sense to explain every line; the examples are given to show what Yarr code looks like as a whole.
    {-# LANGUAGE QuasiQuotes #-}
    ...
    import Data.Yarr.Convolution
    ...
    noOrient = 0 :: Word8
    posDiag  = 64 :: Word8
    vert     = 128 :: Word8
    negDiag  = 192 :: Word8
    horiz    = 255 :: Word8
    gradientMagOrient
        :: Word8
        -> UArray F L Dim2 Float
        -> FImage (VecTuple N2 Word8)
        -> IO ()
    gradientMagOrient !threshLow image target =
        loadP S.fill caps delayedMagOrient target
      where
        delayedMagOrient = dzip2 magnitudeAndOrient gradientX gradientY
        -- Sobel-X operator application
        gradientX =
            dConvolveLinearDim2WithStaticStencil
                [dim2St| -1  0  1
                         -2  0  2
                         -1  0  1 |]
                image
        -- Sobel-Y operator application
        gradientY =
            dConvolveLinearDim2WithStaticStencil
                [dim2St|  1   2   1
                          0   0   0
                         -1  -2  -1 |]
                image
        magnitudeAndOrient :: Float -> Float -> VecTuple N2 Word8
        magnitudeAndOrient gX gY =
            VT_2 (mag, if mag < threshLow then noOrient else orient gX gY)
          where
            mag = floatToWord8 $ sqrt (gX * gX + gY * gY)
            orient :: Float -> Float -> Word8
            orient gX gY
                | atan_1q < 0.33 = horiz
                | atan_1q > 0.66 = vert
                | otherwise      = posDiag + diagInv
              where
                rr = gY / gX
                (r, diagInv) =
                    if rr < 0 then (negate rr, negDiag - posDiag) else (rr, 0)
                -- 2nd order Taylor series of atan,
                -- see http://stackoverflow.com/a/14101194/648955
                br = 0.596227 * r
                num = br + (r * r)
                atan_1q = num / (1.0 + br + num)
    




    Here is a comparison of the performance of the complete edge detectors, on Yarr and on OpenCV, the most popular image processing library. Yarr code, OpenCV code. OpenCV does not support parallel processing of a single image.

    Conclusion: edge detection on a large number of ordinary images will be faster with OpenCV (if you spread the images across threads), while on a single giant photo of a couple of gigapixels it will be faster in Haskell.

    Valuable optimizations


    Automatic parallelization

    In the examples above you may have noticed functions with the -P suffix and caps as the first argument: walkP, loadP. They are responsible for the "magic" speedup of programs when you vary, from the command line, the number of cores available to the process. caps is an abbreviation for getNumCapabilities, a function from the standard Control.Concurrent module. Instead of caps you can write, for example, (threads 2), with the obvious result.
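    As a sketch (the type annotation below is my assumption; the point is only the second argument of loadP):

    import Data.Yarr
    import Data.Yarr.Shape as S

    loadOnCaps, loadOnTwo
        :: UArray D L Dim2 Float -> UArray F L Dim2 Float -> IO ()
    loadOnCaps delayed target = loadP S.fill caps        delayed target  -- all RTS capabilities (+RTS -N)
    loadOnTwo  delayed target = loadP S.fill (threads 2) delayed target  -- exactly 2 threads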

    Tools for exploiting the full power of a specific processor

    If the per-element computation is fairly simple and the target processor's pipeline is wide, you can unroll the loops by several iterations to keep the pipeline better saturated. An example is in the first line of the computeHist function (the histogram calculation is unrolled by 8 iterations). The performance of this computation with other unroll factors, and without unrolling at all (the scale is slightly logarithmic):

    As you can see, in this case unrolling speeds the program up about three times. Intuitively, this is because my processor's pipeline has 3 parallel execution channels: with a long unroll they work at full capacity, while without unrolling 2 out of 3 sit idle.

    Unfortunately, neither GHC's own backend nor LLVM (which Yarr is meant to be used with) actually unrolls loops by itself, although LLVM at least theoretically should.
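    For reference, a sketch of how an unroll step is plugged into a load, by analogy with the first line of computeHist (n4 and the argument names are assumptions, not code from the article):

    import Data.Yarr
    import Data.Yarr.Shape as S

    -- same parallel load as loadP S.fill caps, but with the inner
    -- fill loop unrolled by 4 iterations (n4 is a type-level number)
    loadUnrolled4 delayed target =
        loadP (S.unrolledFill n4 noTouch) caps delayed target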

    Conversely, when processing arrays of vectors, if the computations of the individual components are independent and at the same time contain a lot of arithmetic, you can try computing the components separately in time (in the parallel case they will also be spread across different threads) in order to relieve the pipeline and the register file. For example, on my processor this gives a performance gain of 15% for Gaussian smoothing of a color image with a 5 × 5 kernel (blur.hs). Component-wise variants of the computing and walking functions carry the infix -Slices- or -SlicesSeparate-.

    2D unrolling of kernel convolution loops to reduce read operations

    For example, take Gaussian smoothing of a black-and-white image with a 3 × 3 kernel:
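    A hedged sketch of what such a blur can look like with the framework's static stencils (blur3x3 and its argument names are mine, not the library's; compare with gradientX above):

    {-# LANGUAGE QuasiQuotes #-}
    import Data.Yarr
    import Data.Yarr.Convolution
    import Data.Yarr.Shape as S

    -- the classic 3 × 3 Gaussian kernel; its weights sum to 16,
    -- hence the final normalizing division
    blur3x3 image target = do
        let blurred =
                dConvolveLinearDim2WithStaticStencil
                    [dim2St| 1  2  1
                             2  4  2
                             1  2  1 |]
                    image
        loadP S.fill caps (dmap (/ 16) blurred) target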



    If we compute the values of 4 neighboring pixels in a row, LLVM can prove that, between the repeated reads, the original pixel values cannot change (since there are no write operations in between) and reuse the values already read.



    Above is the original two-dimensional array of pixel intensities; the red, blue, yellow and green rectangles are the kernel positions for computing the smoothed pixel values at coordinates (y = 2, x = 2), (2, 3), (3, 2) and (3, 3) respectively. The kernel multipliers are written out only for the first pixel.

    If the processor has enough general-purpose registers, LLVM compiles this computation with 16 read operations instead of 36: the GVN (Global Value Numbering) optimization kicks in.

    This example is unrolled by 2 indices horizontally and 2 vertically. Of course, that is not always optimal. The speed of the smoothing for different unroll steps:

    The framework provides a function for parameterized unrolling of two-dimensional array computations along both dimensions: dim2BlockFill (a usage sketch follows).
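    A usage sketch (the argument order is assumed by analogy with unrolledFill; blurred and target are the names from the blur sketch above):

    -- compute blocks of 2 × 2 neighboring pixels per loop iteration,
    -- i.e. exactly the unrolling shown in the figure
    loadBlocked blurred target =
        loadP (S.dim2BlockFill n2 n2 noTouch) caps blurred target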

    Disadvantages


    Loops are not vectorized. As with unrolling, LLVM promises to do this itself but in practice doesn't vectorize anything. LLVM 3.3 announced improvements in this direction, so in the foreseeable future we can hope for even greater speedups of the framework. The LLVM version currently considered stable as a GHC 7.6.1 backend is 3.1.

    There is no support for filtering or stream fusion of arrays (streaming). This feature will appear in Repa 4 (a version now in development). It will not be implemented in Yarr, because that would mean rewriting the library from scratch, and I cannot find the motivation for it; bolting streaming onto the current architecture does not work. Then again, Ben Lippmeier is also rewriting Repa 3 from scratch, so apparently the motivation can be found :)


    Implementation


    Most mapping/combining functions (maps, zips) and the functions that turn an implicit array into an explicit one (Load) are defined in multi-parameter type classes. Simplifying, for mappings the 1st parameter is the type of the source array and the 2nd is the type of the resulting implicit array; for loading, the 1st parameter is the type of the input implicit array and the 2nd is the type of the resulting explicit array. As a result, implicit arrays built by mappings carry information about the "underlying" explicit sources, and during loading, knowledge of the structure of both the input and the output array makes it possible to choose the iteration method that is optimal for speed.
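    A toy model of the idea, not Yarr's actual classes: the pair of representations (source, target) selects the instance, and with it the iteration strategy, at compile time.

    {-# LANGUAGE MultiParamTypeClasses #-}

    newtype Delayed e = Delayed (Int -> e)                 -- implicit
    data Foreign' e   = Foreign' Int (Int -> e -> IO ())   -- explicit: size and writer

    class Load' src tgt where
        load' :: src e -> tgt e -> IO ()

    -- for this pair of representations a simple linear fill is chosen
    instance Load' Delayed Foreign' where
        load' (Delayed f) (Foreign' n writeAt) =
            mapM_ (\i -> writeAt i (f i)) [0 .. n - 1]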

    The result is a kind of static double dispatch. The topic is covered in more detail in my coursework write-up, most of which is devoted to it.

    There is a separate post about fighting GHC's slowness.

    Loop-unrolling functions such as unrolledFill take the unroll step not as an ordinary number (Int) but as a number lifted to the type level. In the computeHist function (from the example above), n8 is a value of type N8, which belongs to the Arity class from the fixed-vector library. The framework is in fact built on top of this library and uses it very actively.

    Inside unrolledFill, conceptually, a vector of actions of the given arity is formed. GHC reduces this vector away completely; after compilation a flat sequence of actions remains, i.e. exactly the unrolled loop. Amazingly, GHC even copes with reducing the doubly nested vectors with which dim2BlockFill is implemented!
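    A toy model of how a type-level unroll step can work (this is not the real implementation, which goes through fixed-vector's action vectors):

    data N2' = N2'

    class Unroll n where
        unrolledFill' :: n -> (Int -> IO ()) -> Int -> Int -> IO ()

    -- the instance bakes in a statically flat pair of actions per
    -- iteration; no per-step loop structure remains after inlining
    instance Unroll N2' where
        unrolledFill' _ body start end = go start
          where
            go i
                | i + 2 <= end = body i >> body (i + 1) >> go (i + 2)
                | i < end      = body i >> go (i + 1)   -- odd remainder
                | otherwise    = return ()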

    Development


    The hardest part of developing the framework was coming up with the architecture, the type system. I drastically reworked it about 8 times, rereading my own note on design in Haskell through tears of self-irony. The abstractions leaked, or rather gushed; the architecture either didn't allow implementing the functions I needed or was incompatible with the GHC optimizer, and I would create a new folder and start writing from scratch.

    The first 3-4 approaches were attempts to bolt component-wise operations onto Repa. When I finally realized this was impossible, I deliberately moved as far away from the Repa architecture as possible. The remaining "iterations" brought the design back closer to Repa, so from the outside it once again resembles a mere extension of Repa.

    It is funny that Alexey Khudyakov uploaded the fixed-vector library only in November of last year, that is, 2 months after I decided to write the framework (I took it as my coursework topic). But since I only started in December, in the end it made no difference. You could say I was very lucky :)
