Optical trackers: ASEF and MOSSE
One of the important subtasks of video analytics is tracking objects in video. It is not so primitive that one is forced to go down to the pixel-by-pixel level, but it is not so complicated as to unequivocally require a multilayer neural network for its solution. Tracking can be used as an end in itself, and as a part of other algorithms:

- Counting unique people who entered a certain zone or crossed a line in the frame
- Identifying typical routes of cars in a parking lot and of people in a store
- Automatically rotating a surveillance camera when the object moves
Even without looking at the literature, I can say with confidence that the best way to solve the problem is to use neural networks. In general, one could stop writing right here, but it is not always possible to throw a pair of GTX 1080Ti at the task. For those who want to know how to track objects in video in such cases, welcome under the cut. I will try not just to explain how the ASEF and MOSSE trackers work, but to walk you to the solution, so that the formulas seem obvious.
To begin with, let's answer a preliminary question: why invent anything at all, when we can feed TensorFlow a pack of videos and leave the computer alone for a couple of weeks? Neural network approaches have one serious drawback: even on modern video cards, it is difficult to get a good rate of fire out of them. This is not a problem if we analyze recorded video, but it puts a stick in our wheels if we want to work in real time. Suppose we want to process video from five cameras at 10 FPS. Even under such relatively mild conditions, the neural network must have an inference time of less than $\frac{1000}{5 \cdot 10} = 20$ milliseconds (and that is assuming no parallelism at all). For comparison, YoloV3, a detection network with a relatively simple architecture, needs on the order of tens of milliseconds per image even on a good GPU. In addition, solutions based on powerful graphics cards can simply be expensive.
This article assumes that you have already dealt with computer image processing, are familiar with the basic operations of linear algebra (convolution, norm, matrix inversion), and understand the Fourier transform in general terms.

Hereinafter:

- $A \odot B$ denotes the elementwise multiplication of matrices $A$ and $B$
- $A * B$ denotes the convolution of matrices $A$ and $B$
- $\hat{A} = \mathcal{F}(A)$ denotes that $\hat{A}$ is the frequency matrix of the fast Fourier transform applied to image $A$
- $\|A\|^2$ denotes the sum of the squares of the elements of matrix $A$
Trivial solution

At first glance, the task of tracking a specific object does not seem all that difficult.

Suppose we have $N$ consecutive video frames $f_1, f_2, \ldots, f_N$ of size $W \times H$ pixels. On the first frame of the video $f_1$, a rectangle $R$ of size $m \times n$ is drawn around some object. It is required to find the location of this body in all the other frames.
Remove the noise from the images, then normalize each of them to the range from $-1$ to $1$, so that overall changes in brightness do not affect detection. Take the first frame of the video without markup, $f_2$. If $f_1$ and $f_2$ are adjacent frames of a video with a decent FPS, the desired object is unlikely to have moved far from its original position. Let's take advantage of this. Cut out the rectangle $w$ from the place of $f_1$ where the body was previously located. "Dragging" $w$ across $f_2$, at each point we calculate the sum of squared differences

$$D_{ij} = \sum_{k=1}^{m} \sum_{l=1}^{n} \left( w_{kl} - p^{(ij)}_{kl} \right)^2 = \left\| w - p^{(ij)} \right\|^2$$

where $p^{(ij)}$ is the patch of $f_2$ whose center is aligned with element $(i, j)$ of $f_2$, and the missing values are taken to be zero. After that we look for the minimum in the matrix $D$; its location minus the coordinates of the center of $R$ will be the displacement of the desired object between $f_1$ and $f_2$.
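For concreteness, here is a minimal NumPy sketch of this naive search. The function and parameter names are mine, the frames are assumed to be already denoised and normalized float arrays, and instead of zero-padding it simply skips positions where the window falls off the frame:

```python
import numpy as np

def ssd_search(frame, w, prev_xy, radius=20):
    """Drag the template w around the previous position and return the
    point with the minimal sum of squared differences."""
    m, n = w.shape
    px, py = prev_xy
    best_d, best_xy = np.inf, prev_xy
    for y in range(py - radius, py + radius + 1):
        for x in range(px - radius, px + radius + 1):
            y0, x0 = y - m // 2, x - n // 2
            if y0 < 0 or x0 < 0 or y0 + m > frame.shape[0] or x0 + n > frame.shape[1]:
                continue  # skip border positions instead of zero-padding
            d = np.sum((w - frame[y0:y0 + m, x0:x0 + n]) ** 2)
            if d < best_d:
                best_d, best_xy = d, (x, y)
    return best_xy
```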
In order for the abrupt transition to zero not to "ring" during detection, it is better to initially take a rectangle a little larger than necessary and smoothly decay the values toward zero near the boundaries of $w$ and $p^{(ij)}$. To do this, each of them needs to be multiplied by a mask. In the case of roughly square objects, the good old exponential will do:

$$M(x, y) = \exp\!\left( -\frac{(x - x_c)^2 + (y - y_c)^2}{\sigma^2} \right)$$

(where $(x_c, y_c)$ is the center of the window), but in general it is better to take a two-dimensional Hanning window.
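A two-dimensional Hanning window is just the outer product of two one-dimensional Hann windows; a possible sketch (OpenCV also has cv2.createHanningWindow for the same purpose):

```python
import numpy as np

def hanning_mask(m, n):
    # Outer product of two 1-D Hann windows: smoothly fades a patch
    # to zero at its borders, suppressing edge "ringing".
    return np.outer(np.hanning(m), np.hanning(n))

w = w * hanning_mask(*w.shape)  # apply to the cut-out template
```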
For $f_3$, the window $w$ is taken from the position predicted on frame $f_2$, and so on.
Instead of the $L_2$ norm, one can use $L_1$ ($D_{ij} = \sum_{k,l} |w_{kl} - p^{(ij)}_{kl}|$) or minus the cross-correlation of the matrices ($D_{ij} = -\sum_{k,l} w_{kl} \, p^{(ij)}_{kl}$). Both are considered slightly faster than $L_2$, but they have their own peculiarities. $L_1$ is non-differentiable and less sensitive to large differences in pixel values. Cross-correlation can give false positives if the sample is low-contrast and there are very light or very dark areas in the image. The normalized version of the metric does not have this flaw:

$$D_{ij} = -\frac{\sum_{k,l} w_{kl} \, p^{(ij)}_{kl}}{\sqrt{\sum_{k,l} \left( p^{(ij)}_{kl} \right)^2}}$$

Here the "energy" of the selected site of $f_2$ acts as a balancing factor (the analogous factor for the sample, $\sqrt{\|w\|^2}$, is the same for all window positions and has no practical significance here).
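In practice one rarely writes these loops by hand: OpenCV's matchTemplate implements the metrics above (TM_SQDIFF for $L_2$, TM_CCORR for plain cross-correlation). A sketch with the normalized variant, assuming frame and w are float32 grayscale images as prepared above:

```python
import cv2

scores = cv2.matchTemplate(frame, w, cv2.TM_CCORR_NORMED)
_, _, _, top_left = cv2.minMaxLoc(scores)          # maximum of the response
center = (top_left[0] + w.shape[1] // 2,           # matchTemplate reports the
          top_left[1] + w.shape[0] // 2)           # template's top-left corner
```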
Even such a primitive algorithm does a pretty good job in the case of linear movement of objects (for example, a camera looking down at a conveyor). However, due to the simplicity of the model and its performance, this tracking method has several drawbacks:

- Simple linear motion of an object, with no changes to the object itself, occurs rarely in nature. As a rule, bodies in the camera's field of view undergo some classes of changes. In order of increasing complexity: size increase/decrease, rotations, affine transformations, projective transformations, nonlinear transformations, changes of the object itself. Even leaving aside object changes and nonlinear transformations, one would like the algorithm to recover after relatively simple rotations and changes in size. Obviously, the procedure above does not have this property. Probably, $w$ will still give a noticeable response on the transformed object, but it will be hard to determine the exact location of the sample, and the track will be intermittent.
- We showed the algorithm only one positive sample, so it is difficult to say what response it will give if another similar object appears in the window. It is fine if the desired object is contrasty and has a rare structure, but what if we want to follow a car in a stream of other cars? The tracker may unpredictably jump from one car to another.
- At each frame, we throw away the entire history. It should probably be put to use somehow.
- Moreover, we learn from only one point of the image. It would be better if the tracker also gave a good response near the correct location of the object. This is a bit counterintuitive, but think about it: if the filter gives the best value at the exact location of the object $(x^*, y^*)$ and some random value at $(x^* + 1, y^*)$, then it is too sensitive to small details that can easily change. Conversely, if the values at $(x^*, y^*)$ and $(x^* + 1, y^*)$ are approximately equally good, the filter has "hooked onto" larger and, one hopes, more permanent features.
- With a naive implementation of the tracking procedure, for each pixel of the frame we multiply the entire window with the selected object by the corresponding part of the frame. The complexity of this operation is $O(W \cdot H \cdot m \cdot n)$. With such costs, it is not very pleasant to track even 50-by-50-pixel objects. This problem is partially alleviated by downscaling the video, but when the image shrinks below roughly 240 pixels in width, even large significant details begin to get lost, which makes the algorithm meaningless.
ASEF, MOSSE

Trivial approach ++?

Let's roll up our sleeves and try to solve the problems listed above.
Let's augment the original image by applying some weak affine transformations to it. You can also add noise or vary the gamma. Thus, instead of one image, a micro-dataset of $P$ pictures is obtained. There are now many images, but still one window. So now we will not simply cut a rectangle out of the image, but look for some filter $w$ that gives a good response on every $f_i$. The problem turns into a minimization problem:

$$\min_{w} \sum_{i=1}^{P} D_i(w)$$

where $D_i(w)$ is the sum of squared pixel differences between $w$ and the patch at the exact location of the object in the $i$-th synthetic image created from the frame that has true markup.
In addition, you can sample rectangles far from the location of the tracked object and maximize the difference metric shown above on them.
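A sketch of producing one such synthetic sample via a small random affine transformation; the ranges here are my guesses, the articles do not prescribe exact values:

```python
import cv2
import numpy as np

def random_affine(img, center, max_angle=5.0, max_scale=0.05):
    """A slightly rotated/rescaled copy of img around the object's center:
    one element of the micro-dataset described above."""
    angle = np.random.uniform(-max_angle, max_angle)
    scale = 1.0 + np.random.uniform(-max_scale, max_scale)
    m = cv2.getRotationMatrix2D(center, angle, scale)
    return cv2.warpAffine(img, m, (img.shape[1], img.shape[0]),
                          borderMode=cv2.BORDER_REFLECT)
```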
The suggestion that the filter should give a good response at points near the exact location of the object is harder to handle. We know that at $(x^*, y^*)$, applying the filter with the $L_2$ metric should give 0; a little farther away, more; farther still, even more. Moreover, we do not have any preferred direction: the response should be centrally symmetric with respect to $(x^*, y^*)$. It looks like we can mathematically express what the response of the filter applied to the reference images should look like! The exact form may differ depending on the chosen attenuation function of the response, but everyone loves Gaussians, right? So let's suppose that $w$ applied to the $i$-th image at the point $(x, y)$ should ideally give the response

$$d_i(x, y) = 1 - \exp\!\left( -\frac{(x - x^*_i)^2 + (y - y^*_i)^2}{\sigma^2} \right)$$

which is zero at the exact location and grows smoothly and symmetrically away from it. The minimization problem then turns into:

$$\min_{w} \sum_{i=1}^{P} \sum_{x=1}^{W} \sum_{y=1}^{H} \left( D_i(w, x, y) - d_i(x, y) \right)^2$$

where $D_i(w, x, y)$ is the $L_2$ difference between $w$ and the patch of the $i$-th image centered at $(x, y)$.
Wait a second... We have ended up with $P \cdot W \cdot H$ equations over $m \cdot n$ variables to minimize. It seems we overdid it. Let's step back a bit.
Of all the problems, the greatest difficulty is the computational complexity $O(W \cdot H \cdot m \cdot n)$. Is it possible to come up with something here, besides the tricky splitting of one search window into several small ones, or searching on a downscaled picture followed by refinement at full resolution?
The main trick

It turns out you can! Mathematical analysis tells us that convolution of functions in ordinary space is equivalent to multiplication of their Fourier transforms. We apply the fast Fourier transform to the images, multiply their frequencies elementwise, and then turn the result back into a matrix in $O(W H \log(W H))$, which is much faster than convolving the matrices honestly. Fourier! Who would have thought! In the age of TensorFlow, it can still help us in computer vision.
(This, by the way, demonstrates a general mathematical principle: if you don't want to solve a problem in space $X$, transfer it to space $Y$, solve it there, and bring the solution back. The workaround is often shorter than the direct path.)
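As a sketch, cross-correlation through the frequency domain looks like this in NumPy; note the complex conjugate, which will reappear just below:

```python
import numpy as np

def xcorr_fft(patch, w):
    # Correlation theorem: corr(f, w) = IFFT(FFT(f) ⊙ conj(FFT(w)))
    F = np.fft.fft2(patch)
    H = np.fft.fft2(w, s=patch.shape)   # zero-pad the filter to patch size
    return np.real(np.fft.ifft2(F * np.conj(H)))
```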
As shown above, we can use cross-correlation to localize the sample in the image. And cross-correlation is convolution with a horizontally and vertically flipped filter. Mathematical analysis suggests that in this case the frequencies of the image must be multiplied by the matrix complex-conjugate to the frequency matrix of the filter:

$$\hat{G} = \hat{F} \odot \hat{H}^*$$

where $\hat{G}$ is the frequency matrix of the ideal response on the reference image. Note that we minimized the $L_2$ metric, but we maximize the convolutional one, so now the bigger the response, the better.
If we had a single image, then we would find the exact frequency matrix of the filter by elementwise division:

$$\hat{H}^* = \frac{\hat{G}}{\hat{F}}$$

But we have $P$ images derived from the source frame, and we can apply the Fourier approach to them. There is no filter whose frequencies would perfectly suit all the images at once, but you can get something pretty good. There are two ways to solve the problem:

- You can find a set of exact filters, and then average them into one. This is the way the authors of Average of Synthetic Exact Filters (ASEF) went:

$$\hat{H}^*_{\mathrm{ASEF}} = \frac{1}{P} \sum_{i=1}^{P} \frac{\hat{G}_i}{\hat{F}_i} = \frac{1}{P} \sum_{i=1}^{P} \frac{\hat{G}_i \odot \hat{F}_i^*}{\hat{F}_i \odot \hat{F}_i^*}$$

Here we use the linearity of the Fourier transform: by adding the frequencies, as shown above, we are in effect averaging the weights of several filters.

- You can find the filter frequencies that satisfy all the pictures on average, just as for the $L_2$ problem:

$$\min_{\hat{H}^*} \sum_{i=1}^{P} \left\| \hat{F}_i \odot \hat{H}^* - \hat{G}_i \right\|^2$$

To find the minimum, you need to take the derivative with respect to the filter elements. The honest derivation can be found in Visual Object Tracking using Adaptive Correlation Filters, which proposes Minimum Output Sum of Squared Error (MOSSE) filters. The result:

$$\hat{H}^*_{\mathrm{MOSSE}} = \frac{\sum_{i=1}^{P} \hat{G}_i \odot \hat{F}_i^*}{\sum_{i=1}^{P} \hat{F}_i \odot \hat{F}_i^*}$$
Hmm, similar elements are involved in both formulas. With $P = 1$, the formulas for ASEF and MOSSE coincide exactly: the filter for a single image can be represented as

$$\hat{H}^* = \frac{\hat{G} \odot \hat{F}^*}{\hat{F} \odot \hat{F}^*}$$
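Side by side, the two filters differ by a single line: ASEF averages the per-image ratios, while MOSSE takes the ratio of the sums. A sketch, where Fs and Gs are lists of frequency matrices $\hat{F}_i$ and $\hat{G}_i$, and eps is the regularizer discussed below:

```python
import numpy as np

def asef(Fs, Gs, eps=1e-5):
    # average of exact per-image filters
    return np.mean([G * np.conj(F) / (F * np.conj(F) + eps)
                    for F, G in zip(Fs, Gs)], axis=0)

def mosse(Fs, Gs, eps=1e-5):
    # ratio of the summed numerators and denominators
    num = sum(G * np.conj(F) for F, G in zip(Fs, Gs))
    den = sum(F * np.conj(F) for F in Fs)
    return num / (den + eps)
```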
After we have obtained $\hat{H}^*$, we calculate the frequency-domain response of a new frame, $\hat{G}_{\mathrm{new}} = \hat{F}_{\mathrm{new}} \odot \hat{H}^*$, then translate it into the spatial domain and look for the maximum in the resulting matrix. Wherever the maximum lands, there is the new position of the object.
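Reading the new position off the response can look like this; the sketch assumes the desired response was built with its Gaussian at the window center, as in the initialization below:

```python
import numpy as np

def locate(g):
    """Find the response peak; the object's displacement is the
    peak's offset from the window center."""
    py, px = np.unravel_index(np.argmax(g), g.shape)
    m, n = g.shape
    return px - n // 2, py - m // 2, (px, py)   # (dx, dy, raw peak)
```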
Additional points
- The summed terms in the denominators of the formulas have an interesting physical meaning: $\hat{F}_i \odot \hat{F}_i^*$ is the energy spectrum of the rectangle with the object cut from the $i$-th image.
- Pay attention to the interesting symmetry: we had to multiply the filter frequencies by the image frequencies to get the response, and now we multiply the response frequencies by the image frequencies (and normalize) to get the filter frequencies.
- In real life, the elementwise division can lead to division by zero, so a regularizing constant $\epsilon$ is usually added to the denominator:

$$\hat{H}^* = \frac{\sum_{i=1}^{P} \hat{G}_i \odot \hat{F}_i^*}{\sum_{i=1}^{P} \hat{F}_i \odot \hat{F}_i^* + \epsilon}$$

It is argued that such regularization makes the filter pay more attention to low frequencies, which improves its generalizing ability.
- When processing a real video, you usually want to keep the information about the tracked object accumulated from the previous frames. When moving to the next frame, $\hat{H}^*$ need not be calculated from scratch; the previous one can be updated instead. The update formula for ASEF:

$$\hat{H}^*_t = \eta \, \hat{H}^*_{\mathrm{new}} + (1 - \eta) \, \hat{H}^*_{t-1}$$

For MOSSE, the numerator and the denominator must be accumulated separately:

$$A_t = \eta \, \hat{G}_t \odot \hat{F}_t^* + (1 - \eta) \, A_{t-1}$$
$$B_t = \eta \, \hat{F}_t \odot \hat{F}_t^* + (1 - \eta) \, B_{t-1}$$
$$\hat{H}^*_t = \frac{A_t}{B_t + \epsilon}$$

where $\eta$ is the learning rate. (A code sketch follows this list.)
- It is important to remember that the Fourier transform does not compute exactly the same $D$ as the honest calculation described at the beginning of the article. When computing via the FFT, the missing elements are not zeroed out, but substituted from the opposite side, as if the image wrapped around from right to left and from bottom to top. But at the very beginning of the article we already decided to darken the edges of $w$, so this problem will not have a noticeable effect.
- As mentioned above, cross-correlation has an unpleasant feature: in general, a light filter can give a strong response on the white areas of the image, even if they do not match it in the contrasty areas. The problems are not limited to this: even a single matching pixel with a strongly positive or strongly negative value can mislead the filter if the whole sample is low-contrast. To smooth out this effect, the preprocessing must include a nonlinear transformation of the image pixels that "presses" too-light and too-dark areas toward the middle, so that the coincidence of contrasty sections makes a stronger contribution to the metric. The ASEF and MOSSE articles use the logarithm, $I' = \ln(I + 1)$, where the pixels $I$ run from 0 to 255. In my opinion, this is too harsh, and it ignores the problem of a strong response of a dark filter on black areas. A scheme that compresses deviations from the middle gray symmetrically works better: very light and very dark values are both pressed toward the mean. Then comes the normalization, and it turns out that most of the elements are concentrated around zero. (A preprocessing sketch follows this list.)
- How do we determine with this algorithm that the tracked object has disappeared from the frame? A more detailed analysis of the response obtained from the next frame helps here. The creators of MOSSE suggest PSR, the Peak to Sidelobe Ratio. Let $g_{\max}$ be the maximum element of $g$, corresponding to the new position of the object $(x^*, y^*)$. Exclude from consideration an 11×11 square around this peak. Calculate the mean $\mu_{sl}$ and the standard deviation $\sigma_{sl}$ of the remaining pixels (the "sidelobe"). Then

$$\mathrm{PSR} = \frac{g_{\max} - \mu_{sl}}{\sigma_{sl}}$$

If this value is above a certain threshold, the detection is considered successful. The threshold is usually taken in the region between 3 and 10; for confident detections, PSR usually stays above 20. (A sketch is given after this list.)
(Note that PSR here does not mean what it usually means in signal processing theory, so don't google it, nothing good will come of it.)
- The algorithm is extremely simple: performing the tracking procedure on a Core i7 on 320×400 pictures using the OpenCV implementation takes from 0.5 to 2 milliseconds, depending on the size of the tracked object.
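Three of the points above translate directly into code. First, a sketch of the running MOSSE update, where lr plays the role of $\eta$ (helper names here and below are mine):

```python
import numpy as np

def update_filter(A, B, F, G, lr=0.125, eps=1e-5):
    # Accumulate numerator and denominator separately, then divide.
    A = lr * (G * np.conj(F)) + (1 - lr) * A
    B = lr * (F * np.conj(F)) + (1 - lr) * B
    return A, B, A / (B + eps)
```

The preprocessing could look roughly like this; this is the logarithmic variant from the MOSSE paper, and the softer scheme suggested above would replace the first line:

```python
def preprocess(patch, window):
    p = np.log(patch.astype(np.float32) + 1.0)  # compress bright areas
    p = (p - p.mean()) / (p.std() + 1e-5)       # concentrate values near zero
    return p * window                           # the Hanning mask from earlier
```

And the PSR check, with the 11×11 exclusion zone from the MOSSE paper:

```python
def psr(g, peak, exclude=11):
    px, py = peak
    mask = np.ones(g.shape, dtype=bool)
    mask[max(0, py - exclude // 2):py + exclude // 2 + 1,
         max(0, px - exclude // 2):px + exclude // 2 + 1] = False
    side = g[mask]                  # "sidelobe": everything except the peak
    return (g[py, px] - side.mean()) / (side.std() + 1e-5)
```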
MOSSE algorithm
Let's put it all together.
General state:

- Filter frequency matrix: $\hat{H}^*$
- Auxiliary matrices for calculating the filter frequencies: $A$, $B$
- Frequency matrix of the desired ideal response: $\hat{G}$
- Learning rate during tracking: $\eta$
- Rectangle of the current position of the object: $R$
- Number of transformations: $P$
- Response threshold: $\mathrm{PSR}_{\min}$
Auxiliary function: Training. Input: image $f$, current learning rate $\eta'$.

- While $P$ transformations have not yet been accumulated:
  - Apply a small random affine transformation to the image, centered at the center of $R$.
  - Cut the rectangle $R$ with the object out of the image.
  - Apply the mask to it to smooth the edges.
  - Translate it into the frequency domain: $\hat{F}_i$.
  - Accumulate the new sums: $A_{\mathrm{new}} \mathrel{+}= \hat{G} \odot \hat{F}_i^*$ and $B_{\mathrm{new}} \mathrel{+}= \hat{F}_i \odot \hat{F}_i^*$.
- If $\eta' = 1.0$, replace $A$ and $B$ with $A_{\mathrm{new}}$ and $B_{\mathrm{new}}$. Otherwise:

$$A \leftarrow \eta' A_{\mathrm{new}} + (1 - \eta') A, \qquad B \leftarrow \eta' B_{\mathrm{new}} + (1 - \eta') B$$

- Calculate the filter frequencies: $\hat{H}^* = \dfrac{A}{B + \epsilon}$.
Initialization. Input: image $f_1$, rectangle around the position of the object $R$.

- Prepare the desired response $g$. This is usually a matrix of all zeros with a small Gaussian in the center (see the sketch below).
- Translate $g$ into the frequency domain: $\hat{G}$.
- Training: $f_1$, 1.0.
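The desired response from the initialization step, as a sketch (the width sigma is a tunable assumption):

```python
import numpy as np

def desired_response(m, n, sigma=2.0):
    """All zeros except a small Gaussian at the window center."""
    ys, xs = np.mgrid[0:m, 0:n]
    g = np.exp(-((xs - n // 2) ** 2 + (ys - m // 2) ** 2) / (2 * sigma ** 2))
    return g / g.max()
```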
Tracking. Input: image $f_t$.

- Cut the rectangle $R$ out of $f_t$ at the existing previous position of the object.
- Apply the mask to it to smooth the edges.
- Translate it into the frequency domain: $\hat{F}$.
- Calculate the response in the frequency domain: $\hat{G}_{\mathrm{resp}} = \hat{F} \odot \hat{H}^*$.
- Translate $\hat{G}_{\mathrm{resp}}$ into the spatial domain: $g_{\mathrm{resp}}$.
- Find the maximum in $g_{\mathrm{resp}}$: $g_{\max}$ at $(x^*, y^*)$.
- Calculate the strength of the response: $\mathrm{PSR}$.
- If $\mathrm{PSR} < \mathrm{PSR}_{\min}$, exit with failure.
- Update the position of $R$ using $(x^*, y^*)$. Adjust $R$ if it went beyond the image, or if it was decided that the object has grown or shrunk.
- Training: $f_t$, $\eta$.
- Return $R$.
Implementation details may vary. For example:

- Preprocessing can be applied only to $R$, not to the whole image.
- $g$ can be re-created for each transformation of the image, with varying response function and width.
- Several different filters can be trained simultaneously on several scales of the object, in order to detect it approaching and receding.
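Tying the pseudocode together, here is a compact sketch of the whole tracker. It reuses the hypothetical helpers from the earlier snippets (random_affine, preprocess, desired_response, locate, psr) and is deliberately simplified: fixed window size, grayscale input, no scale adaptation, no negative sampling:

```python
import cv2
import numpy as np

class MosseSketch:
    def __init__(self, frame, rect, lr=0.125, n_aug=8, psr_min=8.0, eps=1e-5):
        x, y, n, m = rect                      # (x, y, width, height)
        self.pos = (x + n // 2, y + m // 2)    # object center
        self.size, self.lr, self.psr_min, self.eps = (n, m), lr, psr_min, eps
        self.win = np.outer(np.hanning(m), np.hanning(n)).astype(np.float32)
        self.G = np.fft.fft2(desired_response(m, n))
        self.A = np.zeros_like(self.G)
        self.B = np.zeros_like(self.G)
        for i in range(n_aug):                 # "Training" with rate 1.0
            img = random_affine(frame, self.pos) if i else frame
            F = np.fft.fft2(self._patch(img))
            self.A += self.G * np.conj(F)
            self.B += F * np.conj(F)

    def _patch(self, frame):
        p = cv2.getRectSubPix(frame.astype(np.float32), self.size, self.pos)
        return preprocess(p, self.win)

    def track(self, frame):
        H = self.A / (self.B + self.eps)       # current filter frequencies
        F = np.fft.fft2(self._patch(frame))
        g = np.real(np.fft.ifft2(F * H))       # spatial-domain response
        dx, dy, peak = locate(g)
        if psr(g, peak) < self.psr_min:
            return None                        # object lost
        self.pos = (self.pos[0] + dx, self.pos[1] + dy)
        F = np.fft.fft2(self._patch(frame))    # re-cut at the new position
        self.A = self.lr * self.G * np.conj(F) + (1 - self.lr) * self.A
        self.B = self.lr * F * np.conj(F) + (1 - self.lr) * self.B
        return self.pos
```

Usage would be along the lines of: create MosseSketch(first_gray_frame, (x, y, w, h)), then call track(next_gray_frame) on every subsequent frame.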
What it looks like
For starters, a few examples of the two-dimensional Fourier transform.
Some simple examples
A reminder: the result of the transform is complex-valued. The pictures below come in groups of "the image, the absolute values of the frequency domain on a linear scale, the absolute values of the frequency domain on a logarithmic scale".

Vertical lines:

The picture changes from left to right identically for every vertical position. Moreover, the change is periodic, with a clear period and an explicit pattern. That is why in the frequency pictures you see only frequencies along the $x$ axis.

Checkerboard:

Note that there are both the expected series of frequencies along the $x$ and $y$ axes and strange parasitic frequencies. They appear because, first, the picture is finite, whereas the Fourier image decomposes into a neat sum only for an infinite periodic signal; and second, the image does not form an exact period at its edges.

Slanted lines:

Again, both the frequencies corresponding to the main direction and parasitic frequencies are visible.

Slanted lines plus distortion:

Several characteristic directions are visible in the frequency image, but it is already hard to picture the original image from them intuitively.

For real-world images, imagining the picture in your head from its frequencies is harder still:

(the frequencies at the center are masked so that they do not "overexpose" the rest of the spectrum)
Now let's turn to a real example of the tracker at work:
Pack of pictures
The image with the marked-up object:

The cut-out and preprocessed object, and its spectrum on a linear and a logarithmic scale ($\hat{F}$):

The desired response ($g$):

The filter frequencies on a linear and a logarithmic scale ($\hat{H}^*$):

The explicit filter weights, without applying transformations ($h$):

Note that they do not participate anywhere in the algorithm; they can be computed purely out of curiosity. Note also that the filter looks like a mess. One might expect the filter to resemble the original image, but that is by no means always true: a filter similar to the image itself would hardly produce the desired Gaussian as its response.

The response from the next frame:

Although it is not as clean as the desired response, the maximum on it is easy to pick out.
The same example with a narrower desired response:
Pack of pictures
Narrower:

Narrower still:

With a very narrow maximum, instead of a black spot an eye becomes clearly visible on the filter.
The filters for the previous three cases, when 16 transformations of the input image are used for training:

More pack of pictures
Wide maximum:

Medium maximum:

Narrow maximum:

The more transformations, the less the filter clings to random elements. It is especially noticeable that the random black-and-white spots have disappeared from the middle of the filter. On the other hand, for a narrow Gaussian, training on several pictures can be a net loss: look at the "ringing" that has formed in the filter around the eye.
If you want to see what it looks like live, download from here my test repository with an implementation of MOSSE that outputs debug images. You can find more MOSSE variants on GitHub. It is also available in OpenCV.
Conclusion
Thank you for your attention, dear Habr readers! MOSSE and ASEF tracking are not the most complex algorithms in the world. That makes it all the easier not only to apply them effectively, but also to understand what guided their creators. I hope that my explanation has helped you get inside the researchers' heads and follow the course of their thoughts. This can be useful: machine learning is not a static area of knowledge; there is room in it for creativity and research of your own. Try digging into some well-established algorithm: saw off unnecessary parts to speed it up, or add a couple so that it works better in your particular case. You'll like it!
This article was written with the support of the company DSSL.