Optical trackers: ASEF and MOSSE

One of the important subtasks of video analytics is tracking objects on video. It is not so primitive that it was necessary to descend to the pixel-by-pixel level, but it is not so complicated as to unequivocally require a multilayered neural network for the solution. Tracking can be used as an end in itself, and as part of other algorithms:

Counting unique people who entered a certain zone or crossed the border in the frame
Identify typical parking routes and people in the store
Automatic rotation of the surveillance camera when the object is shifted

Even without looking at the literature, I can say with confidence that the best way to solve the problem is to use neural networks. In general, one could not write anything further, but it is not always possible to throw a pair of GTX 1080Ti into the task. Who cares how to track objects in the video in such cases, please under the cat. I will try not just to explain how the ASEF and MOSSE trackers work, but to take you to the decision so that the formulas seem obvious.

To begin, let's answer a preliminary question: why invent something, when can we feed a tensorflow a pack of video and leave the computer for a couple of weeks? The neural network approaches have one serious drawback: even on modern video cards, it is difficult to get from networks of good rate of fire. This is not a problem if we analyze the recorded video, but it puts a stick in the wheels if you want to work in real time. Suppose we want to process video from five cameras at 10 FPS. Even with such relatively mild conditions, the neural network should have inference time less than

$$ \ frac {1000} {5 \ times 10} = $ 20$ milliseconds (in conditions of complete non-parallelism). For comparison, YoloV3, a classifier network with a relatively simple architecture, can spit out the image in

$inline$ milliseconds In addition, solutions based on powerful graphics cards can be trite roads.

This article assumes that you have already dealt with computer image processing, are familiar with the basic operations of linear algebra (convolution, norm, matrix inversion), and understand in general terms the Fourier transform.

Hereinafter:

$A \ odot B$ denotes the elementwise multiplication of matrices $inline$ and $inline$
$A \ otimes B$ denotes matrix convolution $inline$ and $inline$
$\ hat {A} (\ omega, \ nu) = \ mathcal {F} (A (x, y))$ denotes what $\ hat {A} (\ omega, \ nu)$ - frequency matrix of the fast Fourier transform applied to the image $inline$ .
$\ parallel A \ parallel_2$ - denotes the sum of the squares of the elements of the matrix $inline$

Trivial solution

At first glance, the task of tracking a specific item does not seem so difficult.

Let us have

$inline$ consecutive video frames

$inline$ size

$inline$ on

$inline$ pixels At the opening frame of the video

$inline$ a rectangle is circled around some object

$inline$

$inline$ on

$inline$ . It is required to find the location of this body in all other frames.

$inline$ .

Remove the noise in the images, then normalize each of them in the range from -1 to 1, so that the total brightness changes do not affect the detection. Take the first frame of the video without markup

$inline$ . If a

$inline$ and

$inline$ - adjacent video frames with good FPS, then the desired object is unlikely far removed from its original position. Take advantage of this. Cut out

$inline$ rectangle

$inline$ from the place where the body was previously located. "Dragging"

$inline$ through

$inline$ and at each point we calculate the square of the sum of differences

$G_ {L_2} (i, j) = \ parallel F_ {1} (i, j) - F_0 \ parallel_ {2}, i \ in [0, m], j \ in [0, n]$

where to calculate the difference in

$G_ {L_2} (i, j)$ need to combine center

$inline$ with element

$inline$ at

$inline$ , and the missing values are zero. After that in the matrix

$G_ {L2}$ looking for a minimum; its location minus the coordinates of the center

$inline$ and will be the displacement of the desired object on

$inline$ .

In order for the abrupt transition to zero not to “ring” during detection, it is better to initially take a rectangle a little more than necessary and smoothly reduce to zero values closer to the boundaries

$inline$ and

$inline$ . To do this, each of

$inline$ need to multiply by mask. In the case of square objects, the good old exhibitor will do.

$A = \ exp {\ frac {(x - i) ^ 2 + (y - j) ^ 2} {\ sigma ^ 2}}$ (Where

$inline$ - the center of the window), but in general it is better to take the two-dimensional window of Henning.

For

$inline$ window

$inline$ taken from the position predicted on the frame

$inline$ , and so on.

Example

Instead

$inline$ norms can be used

$inline$ (

$G_ {L_ {1}} (i, j) = | F_1 (i, j) - F_0 |$ ) and minus cross-correlation of matrices (

$G_ {nCC} (i, j) = - \ sum_ {kl} {F_ {1, kl} (i, j) F_ {0, kl}}, k \ in [0, m], l \ in [ 0, n]$ ). Both are considered slightly faster than

$inline$ , but they have their own characteristics.

$inline$ non-differentiable and less sensitive to large differences in pixel values. Cross-correlation can give false positives if the sample is low contrast and there are very light or very dark areas in the image.

$inline$ -version of metrics does not have such a flaw:

$G_ {L_2} = \ sum {(F_ {1, kl} (i, j) - F_ {0, kl}) ^ 2}$

$= \ sum {(F_ {1, kl} (i, j)) ^ 2 - 2 F_ {1, kl} (i, j) F_ {0, kl} + (F_ {0, kl}) ^ 2 }$

$= \ sum {(F_ {1, kl} (i, j)) ^ 2} - \ sum {2 F_ {1, kl} (i, j) F_ {0, kl}} + \ sum {(F_ {0, kl}) ^ 2}$

$= E_ {F_1 (i, j)} + 2 G_ {nCC} (i, j) + E_ {F_0}$

$E_ {F_1 (i, j)}$ , The "energy" of the selected site on

$inline$ acts as a balancing factor (

$E_ {F_0}$ , the sum of the squares of the pixel values of the sample is the same for all window positions and has no practical significance here).

Even such a primitive algorithm does a pretty good job in the case of linear movement of objects (for example, a camera looking down on the conveyor). However, due to the simplicity of the model and performance, this tracking method has several drawbacks:

A simple linear motion of an object without changes in nature occurs infrequently. As a rule, bodies in the field of view of the camera may undergo some classes of changes. In order of increasing complexity: size increase / decrease, rotations, affine transformations, projective transformations, nonlinear transformations, changes of the object. Even if you omit object changes and nonlinear transformations, I would like the algorithm to be restored after relatively simple turns and changes in size. Obviously, the above procedure does not have this property. Probably, $inline$ it will still give a noticeable response on the object, but it will be difficult to determine the exact location of the sample, and the track will be intermittent.

Example
We showed the algorithm only one positive sample, it is difficult to say which response will give $inline$ if another similar object appears in the window. Well, if the desired object is contrasting and has a rare structure, but what if we want to follow the car in the stream of other machines? The tracker may unpredictably jump from one car to another.
At each frame, we discard all the prehistory. Probably, it should also be used somehow.
Moreover, we learn only at one point of the image. It would be better if near the correct location of the object the tracker will also give a good response. A bit counterintuitive, but think: if the filter is in the exact location of the object in the picture $inline$ gives the best value and in $inline$ - some random, then it is too sensitive to small details that can easily change. Conversely, if $inline$ and in $inline$ approximately the same good values, the filter is “hooked” on larger and, we hope, more permanent features.
With a naive implementation of the tracking procedure, for each pixel of the frame we multiply the entire window with the selected object by the corresponding part of this frame. The complexity of this operation - $inline$ . In such cases, it is not very pleasant to track even objects 50 by 50 pixels. This problem is partially solved by reducing the size of the video, but when you reduce the image to less than 240 pixels in width, even large significant details start to get lost, which makes the algorithm meaningless.

ASEF, MOSSE

Trivial approach ++?

Roll up the sleeves and try to solve the above problems.

Augment the original image. Let's apply to it some slack affine transformations. You can also add noise or change gamma. Thus, instead of one image, a microdatasset is obtained from

$inline$ pictures. There are a lot of images, but one window. So now we will not just cut a rectangle from the image, but look for some filter

$inline$ which will give a good response for everyone

$inline$ . Transform the problem into the minimization problem:

$W: \ min_W {\ parallel F ^ p - W \ parallel_2}, p \ in [1, P]$

Where

$\ parallel F ^ p - W \ parallel_2$ - the sum of squares of pixel differences between

$inline$ and the corresponding site of the exact location of the object on

$inline$ -this synthetic image created from the frame, which has a true markup.

In addition, you can sample the rectangles far from the location of the monitored object and maximize the difference shown above.

With the suggestion that the filter give a good response at points near the exact location of the object is more difficult. We know that in

$inline$ applying filter with

$inline$ with metrics should give 0, next - more, far - even more. Moreover, we do not have any preferred direction, the response should be centrally symmetric with respect to

$inline$ . It looks like we can mathematically express what the response of the filter applied to the reference images should look like! The exact view may differ depending on the specific attenuation function of the response, but everyone loves Gaussians? So suppose that

$inline$ applied to

$inline$ ideally should give a result

$G_p = 1 - \ exp {\ frac {(x - i) ^ 2 + (y - j) ^ 2} {\ sigma ^ 2}}$ . Therefore, the minimization problem turns into:

$D_p (i, j) = \ parallel F ^ p (i, j) - W \ parallel_2$

$W: \ min_W {\ parallel D_p (i, j) - G_p (i, j) \ parallel_2}, p \ in [1, P]$

Now we minimize not the response at one point, but minimize the deviation of the response from the desired one.

Wait a second ... We did it.

$P \ times m \ times n$ equations with

$m \ times n$ variables to minimize. It seems we overdid it. Let's go back a bit.

Main reception

Of all the problems, the greatest difficulty is the difficulty

$inline$ . Is it possible to come up with something here besides tricky splitting of one search window into several small ones or search in the picture in small resolution and fine-tuning at high precision?

It turns out you can! Mathematical analysis tells us that the convolution of functions in ordinary space is the multiplication of their Fourier transforms. Apply the fast Fourier transform to images, multiply their frequencies element by element, and then we can turn the result back into a matrix for

$O (mn \ log {mn})$ , which is much faster than folding matrices in an honest way. Fourier! Who would have thought! In the century of tensorflow, it can still help us in computer vision.

(This, by the way, demonstrates a general mathematical principle: if you don’t want to solve a problem in space

$inline$ transfer it to space

$inline$ decide there and bring the decision back. The workaround is often shorter than the direct one.)

As shown above, we can use cross-correlation to localize the sample in the image. But cross-correlation is a convolution with horizontally and vertically reflected

$inline$ . Mathematical analysis suggests that in this case it will be necessary to multiply the frequencies

$inline$ on complex conjugate matrix to frequency matrix

$inline$ :

$\ hat {W} (\ omega, \ nu) = \ mathcal {F} (W (x, y))$

$\ hat {F} (\ omega, \ nu) = \ mathcal {F} (F (x, y))$

$\ hat {G} _ {conv} (\ omega, \ nu) = \ mathcal {F} (G_ {conv} (x, y))$

$G_ {conv} = F \ otimes W \ rightarrow \ hat {G} _ {conv} = \ hat {F} \ odot \ hat {W} ^ *$

Where

$G_ {conv} = \ exp {\ frac {(x - i) ^ 2 + (y - j) ^ 2} {\ sigma ^ 2}}$ - function of the ideal response on the reference image. note that

$inline$ we minimize the metric, and maximize the convolutional one, so now, the more response, the better.

If we had one image, then we would find the exact filter matrix of frequencies:

$\ hat {W} ^ * = \ frac {\ hat {G} _ {conv}} {\ hat {F}}$

where in the right part we mean element by element division. But a little earlier we generated

$inline$ images from the source. We can apply them with the Fourier approach. There is no filter with frequencies that would perfectly suit all images, but you can get something pretty good. There are two ways to solve a problem:

You can find a set of ideal filters, and then average them into one. This is the way the authors of Average of Synthetic Exact Filters (ASEF) work:
$\ hat {W} ^ * = \ frac {1} {P} \ sum_ {p = 1} ^ {P} \ hat {W} ^ * _ p = \ frac {1} {P} \ sum_ {p = 1} ^ {P} {\ frac {\ hat {G} _p} {\ hat {F} ^ p}}$
Here we use the Fourier transform linearity property. By adding the frequencies, as shown above, we seem to be averaging several filter weights.
You can find filter frequencies that satisfy all the pictures on average, just like for $inline$ :
$\ hat {W} ^ *: \ min _ {\ hat {W} ^ *} {\ sum_ {p = 1} ^ {P} {\ parallel \ hat {F} ^ p \ odot \ hat {W} ^ * - \ hat {G} _p \ parallel_2}}$
To find the minimum, you need to take the derivative of the filter elements:
$\ frac {\ delta} {\ delta \ hat {W} ^ *} \ sum_ {p = 1} ^ {P} {\ parallel \ hat {F} ^ p \ odot \ hat {W} ^ * - \ hat {G} _p \ parallel_2} = 0$
The honest take of this derivative can be found in Visual Object Tracking using Adaptive Correlation Filters , which offers Minimum Output Sum of Squared Error (MOSSE) filters. The result is:
$\ hat {W} ^ * = \ frac {\ sum_ {p = 1} ^ {P} {\ hat {G} _p \ odot \ hat {F} ^ {p *}}} {\ sum_ {p = 1} ^ {P} {\ hat {F} ^ p \ odot \ hat {F} ^ {p *}}}$

Hmm, as if similar elements are involved in formulas. With

$inline$ formulas for ASEF and MOSSE completely coincide.

$\ hat {W} ^ *$ for one image can be represented as

$\ hat {W} ^ * = \ frac {\ hat {G_p}} {\ hat {F ^ p}} = \ frac {\ hat {G_p} \ odot \ hat {F} ^ {p *}} { \ hat {F ^ p} \ odot \ hat {F} ^ {p *}}$

Substitute in the formula for ASEF and get

$\ hat {W} ^ * = \ sum_ {p = 1} ^ {P} {\ frac {\ hat {G} _p \ odot \ hat {F} ^ {p *}} {\ hat {F} ^ p \ odot \ hat {F} ^ {p *}}}$

Aha Now it’s much better to see that ASEF and MOSSE differ only in the method of averaging the filters! It is argued that MOSSE generates better filters than ASEF. It sounds logical: adjusting for the entire pack of images in general is better than filter averaging.

After we received

$\ hat {W} ^ *$ , we calculate the frequency domain response

$\ hat {G} _ {conv} = \ hat {F} \ odot \ hat {W} ^ *$ then translate it into a spatial domain and look for the maximum in the resulting matrix

$inline$ . Where the maximum - there is a new position of the object.

Additional points

The sum terms in the denominators of formulas have an interesting physical meaning. $\ hat {F} ^ p \ odot \ hat {F} ^ {p *}$ Is the energy spectrum of a rectangle with $inline$ of that image.
Pay attention to the interesting symmetry. It was necessary to multiply the filter frequencies by the image frequencies in order to get a response. Now you need to multiply the frequency response to the image frequency (and normalize) to get the filter frequency.
In real life, division by division may result in division by zero, so a regularizing constant is usually added to the denominator. $\ epsilon$ . It is argued that such a regularization causes the filter to pay more attention to the low frequencies, which improves the generalizing ability.
When processing a real video, you usually want to save information about the monitored object obtained from previous frames. When moving to the next frame, you can not calculate $\ hat {W}$ from scratch, and update the previous one. Formula update for ASEF:
$\ hat {W} ^ * _ i = \ frac {\ eta} {P} \ sum_ {p = 1} ^ {P} {\ frac {\ hat {G_p}} {\ hat {F ^ p}}} + (1 - \ eta) \ hat {W} ^ * _ {i-1}$
For MOSSE, you need to accumulate the numerator and denominator separately:
$A_i = \ eta \ sum_ {p = 1} ^ {P} \ hat {G} _p \ odot \ hat {F} ^ {p *} + (1 - \ eta) A_ {i-1}$
$B_i = \ eta \ sum_ {p = 1} ^ {P} \ hat {F} ^ p \ odot \ hat {F} ^ {p *} + (1 - \ eta) B_ {i-1}$
$\ hat {W} ^ * _ i = \ frac {A_i} {B_i}$
Where $\ eta$ - learning speed.
It is important to remember that the Fourier transform is not exactly the same as the calculation $inline$ honestly, as described at the beginning of the article. When calculating the FFT, the missing elements are not zeroed out, but are substituted from the reverse side, as if the image were looping from right to left and from bottom to top. But at the very beginning of the article we have already decided to darken the edges $inline$ so this problem will not have a noticeable effect.
As mentioned above, cross-correlation has an unpleasant feature: in general, a light filter can give a strong response in the white areas of the image, even if they do not match in contrast areas. These problems are not limited. Even a single matching pixel with a strongly positive or highly negative value can interfere with the filter if the whole sample is low contrast. To smooth out this effect, in the preprocessing you need to include a non-linear transformation of image pixels, which will “press” too light and too dark areas to the middle. Because of this, the coincidence of these contrasting sections makes a stronger contribution to the metric. The ASEF and MOSSE articles use the logarithm:
$I = \ log {I + 1}$
where are the pixels $inline$ from 0 to 255. In my opinion, this is too hard, and ignores the problem of the strong response of the dark filter on the black areas. The following scheme works better:
$I = sign (I - 127) \ sqrt {| I - 127 |}$
Then comes the normalization, and it turns out that most of the elements are concentrated around zero.
How with this algorithm to determine that the tracked object disappeared from the frame? Here will help a more detailed analysis of the response received from the next frame. The creators of MOSSE offer PSR - Peak to Sidelobe Ratio. Let be $g_ {max}$ - maximum element $inline$ corresponding to the new position of the object $inline$ . Exclude from consideration the square $11 \ times 11$ around this high. We calculate the mean and standard deviation of the remaining pixels ( $\ mu_ {sl}, \ sigma_ {sl}$ ). Then
$Psr = \ frac {g_ {max} - \ mu_ {sl}} {\ sigma_ {sl}}$
If this value is above a certain threshold, then the detection is considered successful. The threshold is usually taken in the region between 3 and 10. For confident detections, PSR usually keeps above 20.

(note that PSR here does not mean what it usually means in signal processing theory; so don't google it, nothing good will come of it. )
The algorithm is extremely simple. Performing the tracking procedure on Core-i7 in pictures with 320x400 using OpenCV implementation takes from 0.5 to 2 milliseconds depending on the size of the object being tracked.

MOSSE algorithm

Let's put it all together.

General condition:

Filter frequency matrix:

$\ hat {W}$
Auxiliary matrix for calculating the frequency of the filter:

$inline$
Frequency matrix of the desired ideal response:

$\ hat {G}$
Learning speed during tracking:

$\ eta$
The rectangle of the current position of the object:

$inline$
Number of transformations:

$inline$
Response threshold:

$inline$

Auxiliary function: Training . Login: image

$inline$ current learning rate

$\ eta_ {current}$

$A_ {new}: = 0, B_ {new}: = 0$
Not yet typed transformations:
1. Apply a small random affine transformation to the image with the center in the center. $inline$
2. Cut a rectangle with an object from an image $inline$
3. Apply a mask to it to smooth the edges.
4. Translate $inline$ to the frequency domain: $\ hat {F}$
5. $A_ {new} = A_ {new} + \ hat {G} \ odot \ hat {F} ^ *$
6. $B_ {new} = B_ {new} + \ hat {F} \ odot \ hat {F} ^ *$
If a $\ eta_ {current} \ geq 1.0$ then replace $inline$ and $inline$ on $A_ {new}$ and $B_ {new}$ . Otherwise:
$B: = \ eta B_ {new} + (1 - \ eta) B$
$A: = \ eta A_ {new} + (1 - \ eta) A$
Calculate filter frequencies:
$\ hat {W} ^ * = \ frac {A} {B}$

Initialization . Login: image

$inline$ , a rectangle around the position of the object

$R_ {init}$

$R: = R_ {init}$
Prepare the desired response $inline$ . This is usually a fully zanulirovannaya matrix with a small Gaussian in the center.
Training : $inline$ , 1.0
Translate $inline$ to the frequency domain: $\ hat {G}$

Tracking : Login: image

$inline$

Cut rectangle $inline$ of $inline$ for the existing previous position of the object $inline$
Apply a mask to it to smooth the edges.
Translate $inline$ to the frequency domain: $\ hat {F}$
$\ hat {G} _ {response} = \ hat {W} \ odot \ hat {F} ^ *$
Translate $\ hat {G} _ {response}$ in the spatial domain: $G_ {response}$
Find the maximum in $G_ {response}$ : $g_ {max}, (x, y)$
Calculate the power of response $Psr: = \ frac {g_ {max} - \ mu_ {sl}} {\ sigma_ {sl}}$
If a $PSR <PSR_ {thr}$ out with failure
Update position $inline$ . Adjust R, if it went beyond the image, or it was decided that the object has increased / decreased.
Training : $inline$ , $\ eta$
Return $inline$

Implementation details may vary. For example,

Can only be preprocessing $inline$ and not the whole image.
$inline$ can be re-created for each image transformation with varying function and width of response.
It is possible to simultaneously train several different filters on several scales of an object, in order to detect movements far and near.

What it looks like

For starters, a few examples of two-dimensional Fourier transform.

Some simple examples

Напомню, что результат преобразования комплекснозначный. На картинках ниже представлены группы «изображение — абсолютные значения частотной области в обычном масштабе — абсолютные значения частотной области в логарифмическом масштабе».

Вертикальные линии:

Картинка изменяется слева направо одинаково для любого положения по вертикали. Более того, изменение периодическое, с чётким периодом и явным паттерном. Поэтому на картинках частот вы видите только частоты вдоль оси

$inline$ .

Клетка:

Обратите внимание, что есть как ожидаемые серии частот вдоль осей

$inline$ и

$inline$ , так и странные паразитные частоты. Они появились из-за того, что во-первых, картинка конечна, тогда как Фурье-образ разлагается в красивую сумму только для бесконечного периодического сигнала. Во-вторых, можно заметить, что изображение не образует точный период на краях.

Наклонные линии:

Снова видны как частоты, соответствующие основному направлению, так и паразитные частоты.

Наклонные линии плюс дисторсия:

На образе частот видно несколько характерных направлений, но интуитивно представить по ним картинку уже становится сложно.

Для изображений настоящего мира представить картинку в голове по её частотам ещё сложнее:

(частоты в центре закрыты, чтобы они не «засвечивали» остальной спектр)

We now turn to the real example of work:

Pack of pictures

Изображение с размеченным объектом:

Вырезанный и предобработанный объект, его спектр в обычном и логарифмическом масштабе (

$F, |\hat{F}|, \log{|\hat{F}|}$ ):

Желаемый отклик (

$inline$ ):

Частоты фильтра в обычном и логарифмическом масштабе (

$W, |\hat{W}|$ ):

Явно выраженные веса фильтра (без применения трансформаций

$inline$ ):

Обратите внимание, что они нигде не участвуют в алгоритме — их можно посчитать разве что ради интереса. Также обратите внимание, что фильтр выглядит как чёрте что. Можно было бы ожидать, что фильтром оказалось бы что-то похожее на изначальное изображение, но это отнюдь не всегда правда. Фильтр, похожий на само изображение вряд ли бы дал желаемую гауссиану откликом.

Отклик от следующего кадра:

Хоть он и не такой чистый, как желаемый отклик, на нём легко определить максимум.

The same example with a narrower desired response:

Pack of pictures

Уже:

$inline$ :

Ещё уже:

$inline$ :

При очень узком максимуме на фильтре вместо чёрного пятна становится явно виден глаз.

$inline$ for the previous three cases

$inline$ when used for learning 16 transformations of the input image:

More pack of pictures

Широкий максимум:

Средний максимум:

Узкий максимум:

Чем больше трансформаций, тем меньше фильтр цепляется за случайные элементы. Особенно хорошо видно, что пропали случайные чёрно-белые пятна из середины

$inline$ . С другой стороны, для узкой гауссианы обучение на нескольких картинках может сыграть в минус: посмотрите на «звон» образовавшийся в фильтре вокруг глаза.

If you want to see how it looks live, download from here my test repository with the implementation of MOSSE with the output of debug images. You can find more MOSSE options on the githaba. In addition, it is in OpenCV .

Conclusion

Thank you for your attention, habrovchane. MOSSE and ASEF tracking are not the most complex algorithms in the world. The easier it is not only to apply them effectively, but also to understand what their creators were guided by. I hope that my explanation has helped you to penetrate into the mind of researchers, to follow the course of their thoughts. This can be useful: machine learning is not a static area of knowledge, it has a place for creativity and its own research. Try to dig into some well-established algorithm: sprinkle unnecessary limbs to accelerate or add a couple so that it will work better in your particular case. You'll like it!

This article was written with the support of the company DSSL.

Tags:

Optical trackers: ASEF and MOSSE

Trivial solution

ASEF, MOSSE

Trivial approach ++?

Main reception

Additional points

MOSSE algorithm

What it looks like

Conclusion

Also popular now: