Make3D from one photo, part 1
The project from Stanford University (now Cornell University ) " Make3D ", is notable for the fact that he set himself the task, which has not yet become typical, of restoring a three-dimensional model of a scene from just one photograph. Until now, in order to achieve a similar result, developers have restored three-dimensional information by combining several (two or more) pictures of the same object. In this case, it was demonstrated that a significant amount of information is contained in monocular signs ( monocular cues ) of the image itself, which until then were often ignored. In practical implementation, it has already been possible to achieve satisfactory results by more than 60%arbitrary photographs provided and evaluated by external users of the system during its testing.
The publication consists of: Part 1, Part 2
Published to quench curiosity, in order to
Content:
Part one ( current ):
- Algorithm Summary
- List of used theories and algorithms
- Ising model or where did this formula come from in MRF
- Effective graph image segmentation
- We want 3D Model and Polygon Location
- You’ll have to look wider ...
- Monocular image features
- Borders and refractions
- MRF Input and Output
- What will MRF do? Target: min (distance error)
- The wish list is ready, cram it into the MRF Model
- 1. Local features: Preliminary determination of distance by local features
- 2. Connection: Connection
- 3. Co-planar: Coplanarity
- 4. Co-linear: Collinearity
- Briefly about MRF Training
- Briefly about MRF Solution
- Source code and model parameters studied
- Program execution, testing
- Total
- Some files
Algorithm Summary
The image is divided into superpixels , small segments of the image that are similar in texture. Homogeneous objects and even even surfaces, as a rule, are divided into several superpixels, but, nevertheless, superpixels lie entirely within the same object or surface if the surface has a pronounced border.
The algorithm for each superpixel tries to find out its position (depth) and orientation in three-dimensional space relative to the point from which the photograph was taken, i.e. find the surface (plane) on which it lies. This surface can have any location in space, not necessarily horizontal or vertical, as was assumed in similar studies conducted earlier.
The project uses preliminary training in the relationships of the totality of the set of features of superpixels (image sections) and their distances. The training algorithm uses the MRF model ( Markov field ), taking into account restrictions on the relative distances between adjacent superpixels, i.e. given that with some probability two neighboring sections are most likely at approximately the same distance from the observation point, or even they can be on the same plane than belong to different objects that are far apart in space (such as the fence and the background behind it).
By processing a segmented superpixel image in a pre-trained MRF, the algorithm receives at the output the position and orientation of each superpixel. This is enough to build a three-dimensional model of the scene, the texture of which is the photograph itself.
Pic.1 Result of processing photos in Make3D

List of used theories and algorithms
In the method of restoring a three-dimensional model of a scene, the following theories and algorithms are used:
- Image segmentation for splitting an image into superpixels
- The use of filters to the image to identify monocular signs
- Random Markov Fields ( MRF ) for modeling monocular features and the relative position of different parts of the image
- Selecting edges and borders in an image ( edge detection )
- Elements of computational geometry , linear programming , etc.
Ising model or where did this formula come from in MRF
You can scroll quickly, this is just an explanation of why the formula looks like this in MRF and in “ Make3D ”. In general, historically, it happened ...
The Ising model ( Ising model ) served as the prototype of Markovskaya , so let's start with it. This is a mathematical model of statistical physics designed to describe the magnetization of material.
In the Ising physical model, each vertex of the crystal lattice is associated with a number called a spin and a numerical one designated as “ +1 ” or “ −1 ”, which determines the direction of the spin: “ up ” or “ down ”. Spin - this is the proper angular momentum of elementary particles, which has a quantum nature and is not related to the movement of a particle as a whole.
Pic.2 One-dimensional Ising model

Let us consider the simplest one-dimensional case when the nodes of the crystal lattice are located along one straight line. The model describes the probability distribution in a space
of all possible configurations of the substance
, wherein
- the number of elementary particles,
spin i -th particle. Such a distribution is called a random field - it is a random function defined on the set of points of a multidimensional space.
= how likely is it that the substance will be in a state where all the nodes have spins equal to 
Each of the
possible spins
, the energy is attributed from the pairwise interaction of the spins of adjacent atoms and an external field:Total energy of the system in the Ising model:

In the formula, the first sum is taken over all pairs of neighboring nodes, and is equal to the interaction energy of the spins of the nodes. Ising simplified the model, taking into account the interaction only between neighboring nodes located in close proximity. Exchange interaction constant J is a characteristic of the material and describes the spin energy of interaction. If J> 0 , then it describes the attractive energy of the ferromagnet ( attractive case ), when the neighboring nodes of the lattice try to line up in one direction, i.e. collinearly , for values of J <0 it describes the repulsive energy of an antiferromagnet ( repulsive case), when neighboring nodes of the lattice try to line up in opposite directions, i.e. coplanarly . The second sum describes the effect of an external magnetic field of intensity H , where m - characterizes the magnetic properties of the substance.
In the case when J> 0 , then the first sum will be minimal, in the case when all nodes of the crystal lattice will be aligned in a collinear direction. The second sum reaches a minimum when the direction of the spin lattice nodes coincides with the direction of the external field H .
At physics, we know that any system, brought out of equilibrium, tends to the state with the least energy reserve. Therefore, sooner or later, the model will “calm down” and take a position in which the total energy
will be equal to the lowest value ( min ). This will be her most likely condition (
- max ). The probability distribution in space of
all possible configurations of matter in the Ising model obeys the distribution given by the so-called Gibbs distribution function ( Gibbs measure ):Distribution in the Ising model:

where k is the Boltzmann constant, T is the temperature of the substance, the
normalizing value. We do not need these quantities, it is important that the formula has this form and has been working since the beginning of the 19th century .
For a more convenient form of the probability distribution in the model recording associate with each i -th node of its lattice ( personal ) own the total energy:
Own energy of a node of the Ising model:

The Gibbs measure function is interesting from two perspectives. The first is related to entropy, which probably led to its widespread use in statistical studies.
Pic.3 Two-dimensional Ising model

The second important property of the distribution in this form, from the point of view of probability theory, is as follows. Let
the set of all neighbors of node j , ie, many nodes in the immediate vicinity. Then:Markov type property:

This property is called the Markov type property. The probability that the spin at the point
faces a , for given spins in all other nodes of the crystal lattice, is equal to the probability in which only spins in neighboring nodes are taken into account
. This property is called a local characteristic , and a distribution with a similar property is called a random Markov field . The Markov network was presented as a generalization of the Ising model, and in it the probability distribution function
can be written in multiplicative form , as the product of all the self-energies of the nodes of the crystal lattice:Model in multiplicative form:

In practice, the main purpose of applying a Markov random field is to detect the most likely
random field configuration
. The main difficulties associated with the use of this method are the right choice of parameters that control the strength of the spatial interaction.
Now we have a formula, on its basis the MRF model for “Make3D” will be built ...
Effective graph image segmentation
A picture is worth a thousand words:
Segmentation example with various parameters

These large monochrome areas are superpixels, in the terminology of the authors of “Make3D”. In more detail in an accessible form, see the article " Effective segmentation of images on graphs ."
We want 3D Model and Polygon Location
We want to get a 3D model . The reconstructed three-dimensional model is represented by a combination of polygons and texture , i.e. parts of the original photograph superimposed on these polygons. The polygon has an arbitrary location and orientation in space.
Pic.4 Super Pixel Location and Orientation
More formally, the three-dimensional location of the polygon will be determined by the plane specified by the parameters
. For every point
belonging to this plane α , the equality
. The shortest distance from the observation point, the point from which the photograph was taken, to the plane along the perpendicular dropped to it is: 1 / | α | . The normal vector
uniquely determines the orientation of the plane in space. If we are given a unit vector
(called a ray
) connecting the observation point ( camera center ) with an arbitrary pixel i on the image belonging to the plane α, The distance from the viewpoint to the pixel i is given by:
. The distance to the superpixel will also be determined.You’ll have to look wider ...
A person is well distinguished by monocular signs of the image, helping him to restore a three-dimensional model of the scene and determine the distance to objects. At one glance, he is able to notice differences in image textures, color gradients, color differences, fuzzy outlines and out-of-focus objects in the distance.
However, only local monocular signs are still not enough, because the blue sky and the blue object on the earth have similar local monocular signs. Therefore, when reconstructing a scene, a person has to take into account not only local signs, but also the relative relationship between the various parts of this image. So, a person can determine the distance to an object with dimly expressed signs, for example, a uniformly illuminated large gray area, relative to other objects in the photograph. And determine that it is a wall and even approximately see its length.
Therefore, in the model of MRF takes into account not only local signs of monocular ( monocular cue sheet ), but also the relative relationship between different parts of the image:
- Локальные особенности (local features): локальные особенности суперпикселя, монокулярные, во многом влияют на его положение и ориентацию в пространстве
- Соединения (connections): за исключением границ различных объектов, наиболее вероятно, что соседние суперпиксели соединены на границах и в трехмерном пространстве
- Копланарность (co-planarity): если соседние пиксели обладают схожими монокулярными особенностями (feature), и не имеют ярко выраженных границ (edge) между ними, то наиболее вероятно, что они лежат в одной и той же плоскости
- Collinearity ( co-linearity ): long straight line, most likely, it will also be long straight lines in space trehmernomi
But before moving on to the MRF model ... what is so interesting that the authors found in just one photograph?
Image Features
Superpixels For each tuple in the algorithm computed characteristics ( features ), in order to detect signs of monocular ( monocular cues ), mentioned above, as well as the calculated characteristics to determine the boundaries and refractive superpixels (objects in the image). In order to make the algorithm more robust and flexible, so that he could restore the scene in an arbitrary photograph, not necessarily similar to those on which they were trained model, calculated with respect to a large number of features ( features ).
Monocular image features
For each superpixel i , statistical features related to the image texture, shape and position are calculated. The project “Make3D” uses 17 filters
, where n = 1, ..., 17 :- 9 masks Laws for calculating an average image intensity ( averaging ), detecting spots ( spot detection ) and borders ( edge detection )
- 6 masks for identifying oriented edges ( oriented edge detection ), rotated by ( 30 ° × k ), k = 0, ..., 5 degrees
- 2 color channels in Cb and Cr in YCbCr format
Pic.5 Masks used in Make3D

Pic.6 Two channels of color Cb and Cr

The application of filters to superpixels i image I (x, y) is formed by a 34-dimensional vector features calculated as
Pic.7 Super pixel layout features

В проекте "Make3D" учитываются не только особенности отдельного суперпикселя, а так же принимается в расчет «контекст» в виде, особенностей (features) 4-х самых больших соседних суперпикселей, и данная операция повторяется в 3-х различных масштабах. Таким образом, каждый суперпиксель содержит дополнительно информацию и об изображении вокруг него, что приводит к более качественным результатам, чем если ограничиться только локальными особенностями суперпикселя. В конечном итоге с каждым суперпикселем ассоциируется вектор, который содержит 34*(4+1)*3+14=524 элемента, т.е.

Pic.8 Особенности каждого суперпикселя рассматриваются в контексте

Pic.9 Выделение особенностей изображения на различных масштабах

Pic.10 On the same level, each superpixel is connected to 4 max neighbors

Borders and refractions
To determine the boundaries, parameters
are used that determine the presence of a boundary between two superpixels (" edgel "). In view of the fact that the determination of boundaries occurs, as a rule, inaccurately, the model does not use discrete quantities, but softer conditions for
. More formally, equality
= 0 means the presence of an object boundary ( occlusion boundary ) or an infold ( unfold ) between the superpixels i and j , and
= 1 means the absence, i.e. super pixels lie on a flat surface.Pic.11 edge detection - edgels

Pic.12 edge detection

In some cases, the presence of a large gradient or difference in lighting, which are taken into account in the algorithms for determining the boundaries of objects ( edge detection ), is also a sign that these parts of the image belong to different objects. For example, the shadow of a building cast on the lawn may cause such a difference, and algorithms for determining the boundaries of objects can erroneously determine the presence of a border and divide the lawn into 2 objects. Nevertheless, if we take into account more visual features ( visual cues ) of the image to determine whether there are borders between these objects, whether they are connected, coplanar or not, and not just gradients, then better results can be obtained.
In this work, an approach was proposed for determining the boundary of objects ( edge detection ), using training that takes into account lighting, color and texture of image areas. Using their result, in “Make3D” a value
that determines the presence of a boundary between superpixels i and j is derived based on a model with the distribution:
features ( features ) superpixels pairs i and j
model parameter (after training)
Pic.13 Пример сегментации одного изображения с различными параметрами

Pic.14 y — 14-мерные вектора различия суперпикселей при различных сегментациях

In the 14-dimensional vector obtained after segmentation,
each element
is a sign: did both superpixels i and j fall into the same segment when performing the i- th segmentation. After all, two superpixels that are in the same segments in all 14 cases are most likely to be coplanar or interconnected. The conclusions about the presence of boundaries between superpixels made on the basis of multiple segmentation are more reliable than with a single segmentation. This vector
is an input parameter in determining the “soft” boundary
between two superpixels i and j .Pic.15 Parameter y (right) indicating the presence or absence of a border

MRF Input and Output
Now you can take the interim results:
The model MRF participates 5 types of settings. Input are:
- X variables related to computed image features
- Empirically calculated variables θ (parameters of the trained model)
- Parameters υ represents the “level of trust” ( confidence ) to the distance to the object, calculated based only on the local properties of the image area
- The y parameters indicate the presence or absence of a clear border between the objects in the image. These parameters are used to identify coplanarity and superpixel connections.
- Parameters α , specifying the location and orientation of the planes in space, on which the polygons are located in the three-dimensional model
Pic.16 Model MRF combines all the features of the image to find 3D

What will MRF do? Target: min (distance error)
When constructing a three-dimensional model of the scene, the relative error in determining the distance is the most significant. For true distance ( ground-truth depth )
and estimated in the model
, the relative error is defined as
. Thus, the MRF model will be constructed in order to minimize this value. Continuation: here