# Massively parallel image stabilization

##### Foreword

Good day! Today I decided to share something near to my heart: one of my favorite homegrown solutions (a "bicycle", as we say).

I'll start from afar. For quite a long time I worked at a radio factory in Chelyabinsk, where we had (and the team still has, I'm just no longer there) one mega-project: an optical-electronic module for guarding physical sites. It is a hefty thing on a pan-tilt mount, with three cameras for all occasions (a color one for daylight, a high-sensitivity black-and-white one for twilight, and a thermal imager for night observation). You put such a module on a 50-meter tower, and you can keep the territory within a radius of 4-5 kilometers under observation day and night. I won't go into details; this post is not about that. Those who are curious will find them on their own.

Of course, there were many interesting image-processing problems along the way. I want to talk about one of them: how to use massively parallel computing to compensate for camera shake in real time, and why SURF is not always the right tool. Details under the cut.

##### The essence of the problem

As I mentioned above, the modules are mounted on fairly tall towers. Because of the strong wind at that altitude they shake slightly all the time, and the narrower the camera's field of view, the more noticeable it is. Meanwhile, the operator has to watch the camera feed, and for quite a long time (a whole shift). That is very hard; it can literally make you sick. This is why one of the main requirements for such systems is a mechanism that minimizes picture shake. Some systems use gyro-stabilized platforms for this, but not all. Most often software stabilization is used, and most often that means SURF. But SURF is poorly suited for the job.

##### Why SURF and similar algorithms are not suitable for this

In order:

1 - SURF is not intended for this. All we need to solve this problem is the offset of the new frame relative to the previous one, in pixels. The camera does not rotate about the image plane and does not move toward or away from the scene. SURF is better suited for tracking moving objects in a video sequence than for finding the displacement of the entire image.

2 - Performance. I will not go into details; suffice it to say the time limits are serious: the algorithm must find the offset between the last two frames in under 13 milliseconds, and the less, the better. With SURF we did not come close to that.

3 - Noise. SURF analyzes the displacement of individual keypoints, not of the whole frame. Because of matching errors, eliminating false matches becomes a problem that is quite difficult to solve. On live video this showed up as occasional "jerks" of the image in one direction or another.

##### My bicycle

By the time I got down to this problem, I already had some experience with nVidia CUDA: I had rewritten a couple of video-processing filters (brightness, contrast). The speedup compared to CPU processing delighted me; it was from then on that I became interested in massively parallel computing and GPGPU in particular.

So massively parallel computing is what I decided to use for this problem. At first I looked for ready-made solutions and tried them: I searched for SURF implementations on CUDA but, as far as I remember, found nothing suitable. So I decided to invent my own. Along the way I came across an excellent article by comrade BigObfuscator, after which I understood which direction to dig in.

I called the algorithm “Ratel Deshake”, and it consists of 3 stages:

- preparation of a new frame

- translation of the frame into a segment-integral representation

- search for the best match

All three stages lend themselves well to massively parallel processing. I implemented them on CUDA, but as far as I can tell, all of this could also be implemented quite well on an FPGA, for example.

##### New frame preparation

At this stage we need to strip from the input everything that does not interest us, namely color and, as far as possible, noise. For this I converted the image to grayscale and extracted contours with a threshold. In code it looks something like this:

```
struct pixel {
    unsigned char R, G, B;
};
pixel Image[width, height];
int GSImage[width, height];
unsigned int processedImage[width, height];
// grayscale: a plain sum of the channels is enough here
for (int x = 0; x < width; x++) {
    for (int y = 0; y < height; y++) {
        GSImage[x, y] = Image[x, y].R + Image[x, y].G + Image[x, y].B;
    }
}
// contour extraction: count the neighbors darker by more than the threshold
// (borders are skipped so the x+1 / y+1 accesses stay in bounds)
for (int x = 1; x < width - 1; x++) {
    for (int y = 1; y < height - 1; y++) {
        processedImage[x, y] = 0;
        if (GSImage[x, y] - GSImage[x - 1, y] > threshold) processedImage[x, y]++;
        if (GSImage[x, y] - GSImage[x, y - 1] > threshold) processedImage[x, y]++;
        if (GSImage[x, y] - GSImage[x + 1, y] > threshold) processedImage[x, y]++;
        if (GSImage[x, y] - GSImage[x, y + 1] > threshold) processedImage[x, y]++;
    }
}
```

##### Translation of a frame into a segment-integral representation

Perhaps the most interesting stage. First, a couple of points need clarifying.

First, what is an integral representation of an image. You can get the idea by reading this article here, if you have not already done so via the link above. In short, it is a two-dimensional (in our case) array in which the element with coordinates X and Y equals the sum of all elements of the original array with coordinates [0..X, 0..Y].

Second, what I call a segment-integral representation. Due to the specifics of my algorithm, we need a two-dimensional array in which the element with coordinates X and Y equals the sum of all elements of the original array with coordinates [(X-N)..X, (Y-M)..Y], where N and M are arbitrary numbers that control the speed/accuracy trade-off of the algorithm. I used the values 15 and 15, i.e. each element of the resulting array was the sum of 256 elements (a 16x16 square) of the original array. Yes, in effect this is a kind of blurring of the image.

But back to the integral representation of the image. With a little thought you can see that in a single thread it is computed efficiently like this:

```
unsigned int Image[width, height];
unsigned int processedImage[width, height];
processedImage[0, 0] = Image[0, 0];
for (int x = 1; x < width; x++) {
    processedImage[x, 0] = Image[x, 0] + processedImage[x-1, 0];
}
for (int y = 1; y < height; y++) {
    processedImage[0, y] = Image[0, y] + processedImage[0, y-1];
}
for (int x = 1; x < width; x++) {
    for (int y = 1; y < height; y++) {
        processedImage[x, y] = Image[x, y] + processedImage[x-1, y] + processedImage[x, y-1] - processedImage[x-1, y-1];
    }
}
```

But there is one problem: this formulation cannot be split into a large number of threads. Or can it? Of course it can. It is enough to divide the computation into two separate passes, horizontal and vertical. Each pass then parallelizes perfectly, with one thread per row for the horizontal pass and one per column for the vertical one. Something like this:

```
unsigned int Image[width, height];
unsigned int horizontalImage[width, height];
unsigned int processedImage[width, height];
// horizontal pass: each row is independent
for (int y = 0; y < height; y++) {
    horizontalImage[0, y] = Image[0, y];
    for (int x = 1; x < width; x++) {
        horizontalImage[x, y] = Image[x, y] + horizontalImage[x-1, y];
    }
}
// vertical pass: each column is independent
for (int x = 0; x < width; x++) {
    processedImage[x, 0] = horizontalImage[x, 0];
    for (int y = 1; y < height; y++) {
        processedImage[x, y] = horizontalImage[x, y] + processedImage[x, y-1];
    }
}
```

And now the same thing, but for the segment-integral representation with N = M = 15:

```
unsigned int Image[width, height];
unsigned int horizontalImage[width, height];
unsigned int processedImage[width, height];
// horizontal pass: running 16-wide window along each row
for (int y = 0; y < height; y++) {
    horizontalImage[0, y] = Image[0, y];
    for (int x = 1; x < 16; x++) {
        horizontalImage[x, y] = Image[x, y] + horizontalImage[x-1, y];
    }
    for (int x = 16; x < width; x++) {
        horizontalImage[x, y] = Image[x, y] + horizontalImage[x-1, y] - Image[x-16, y];
    }
}
// vertical pass: running 16-high window down each column
for (int x = 0; x < width; x++) {
    processedImage[x, 0] = horizontalImage[x, 0];
    for (int y = 1; y < 16; y++) {
        processedImage[x, y] = horizontalImage[x, y] + processedImage[x, y-1];
    }
    for (int y = 16; y < height; y++) {
        processedImage[x, y] = horizontalImage[x, y] + processedImage[x, y-1] - horizontalImage[x, y-16];
    }
}
```

A bit of witchcraft, that. In fact you can parallelize even further by splitting the image into several independent parts, but that is a separate song and dance. In any case, after all this we have the segment-integral representation of the image. One more important choice has to be made at this stage: besides N and M, you need to decide how many "blocks" will be compared.

This point is also worth explaining in detail. When searching for the best match, we look for where the previous frame best matches the new one, not the other way around. After an iteration of the algorithm completes, data about the previous frame is kept for the next iteration, but not all of it: only the values of a certain number of blocks (16x16 in my case). This is another knob in the speed/accuracy trade-off: fewer blocks means more speed, more blocks means more accuracy.

Specifically, in my case the image size was 704x576 pixels. From the segment-integral representation of the __previous__ frame I kept the values of only 1140 segments (38x30 non-overlapping blocks from the center of the image, 4560 bytes). It is clearer as a picture:

Here we have a hypothetical 256x256-pixel image. For convenience it is divided into separate 16x16-pixel blocks (although in the segment-integral representation there are far more blocks, and they overlap each other). In the center, an array of blocks is marked out, 10 by 10 of them. It is the values of these blocks that must be remembered for the next iteration of the algorithm. In code it looks something like this:

```
unsigned int IntegralImage[256, 256];
unsigned int NextIteration[10, 10];
// remember the bottom-right element of each of the 10x10 central blocks
for (int x = 0; x < 10; x++) {
    for (int y = 0; y < 10; y++) {
        NextIteration[x, y] = IntegralImage[(x+4)*16-1, (y+4)*16-1];
    }
}
```

Why we need this array will become clear in the next step.

##### Search for the best match

Almost there; the home stretch. If only one frame has passed through the algorithm so far, this step must be skipped (there is nothing to stabilize yet). If this is not the first frame, the NextIteration array should be left over from the previous iteration. Since it comes from the __previous__ iteration, at this stage it is called PrevIteration; don't mix them up.

So, we have the segment-integral representation of the new frame and the array of control blocks from the previous one. Now we just need to find the position at which superimposing the previous frame's blocks onto the new frame's segment-integral representation gives the minimum difference.

An important note: it makes little sense to check every possible position; it is better to fix the maximum expected deviation in pixels as a constant, for example plus or minus 30 pixels in X and Y.

First of all, we need an array holding the total difference at each candidate offset. Example code for the 256x256 image with the blocks above:

```
unsigned int IntegralImage[256, 256];
unsigned int PrevIteration[10, 10];
unsigned int Differences[61, 61];
for (int X = 0; X < 61; X++) {
    for (int Y = 0; Y < 61; Y++) {
        Differences[X, Y] = 0;
    }
}
// with a maximum expected deviation of +-30 pixels, this part parallelizes into 3721 threads
for (int X = 0; X < 61; X++) {
    for (int Y = 0; Y < 61; Y++) {
        for (int x = 0; x < 10; x++) {
            for (int y = 0; y < 10; y++) {
                Differences[X, Y] += abs((int)PrevIteration[x, y] - (int)IntegralImage[(x+4)*16-31+X, (y+4)*16-31+Y]);
            }
        }
    }
}
```

That's all; it only remains to find the minimum element of the Differences array. If it is at [0, 30], the new frame has shifted 30 pixels to the right; if at [60, 30], then 30 pixels to the left.

##### Summary

I have not said anything innovative or terribly non-obvious in this article. But it is a significant piece of my experience on this topic, and I sincerely hope it will be useful to someone. The article came out long and tangled, so there are surely obvious mistakes somewhere, or places I did not explain clearly enough; I will be glad of constructive criticism and any other feedback. Many thanks to those who made it to the end.

As for my implementation, it grew into more or less working test software. On a video card of the GeForce GTX 560 class it handled the task at hand: processing 75 frames per second at 704x576, and with a decent margin of time to spare. Of course, I have laid out only the general principles here; much can still be optimized. Unfortunately, it was decided not to use CUDA in the project, so I have not touched the code for more than a year.

Once again, my thanks to comrade BigObfuscator for an excellent article, without which I would hardly have come up with anything viable.

##### Source code

As I wrote above, I have not touched this implementation for more than a year. The source code exists, but it is so unsightly that I am ashamed to post it in its current state. If enough people are interested, I will dust off the files, tidy everything up, and publish it.

Happy coding and good ideas to you!