# 3D reconstruction of faces from a photo and their animation using video. A lecture at Yandex

The film Mission: Impossible 3 showed how the famous spy masks are made, thanks to which some characters become indistinguishable from others. In the story, the person the hero wanted to impersonate first had to be photographed from several angles. In 2018, a simple 3D model of a face can be, if not printed, then at least created in digital form, and from just one photo. A VisionLabs researcher described the process in detail at the Yandex event “The World through the Eyes of Robots” from the Data & Science series, with details on specific methods and formulas.

- Good day. My name is Nikolai, and I work at VisionLabs, a company that specializes in computer vision. Our main focus is face recognition, but we also have technologies applicable to augmented and virtual reality. In particular, we have a technology for building a 3D face from a single photo, and today I will talk about it.

Let's start with what this is. On the slide you can see the original photo of Jack Ma and the 3D model built from this photo in two variants: with texture and without texture, just the geometry. This is the task we solve.

We also want to be able to animate this model: change the gaze direction, the facial expression, and so on.

The applications lie in different areas. The most obvious is games, including VR. You can also make virtual fitting rooms to try on glasses, beards and hairstyles. You can do 3D printing, because some people are interested in personalized accessories fitted to their face. And you can make faces for robots, either printed or shown on a display on the robot.

I will start with how 3D faces can be generated in general, and then we will move on to 3D reconstruction as the inverse of the generation task. After that, we will focus on animation and move on to the challenges that arise in this area.

What is the face generation task? We would like to have some way to generate three-dimensional faces that differ in shape and expression. Here are two rows of examples. The first row shows faces that differ in shape, as if they belonged to different people. Below is the same face with different expressions.

One way to solve the generation problem is deformable models. The leftmost face on the slide is a kind of averaged model to which we can apply deformations by adjusting sliders. Here are three sliders. The top row shows the faces as the slider intensity increases, the bottom row as it decreases. Thus, we have several adjustable parameters, and by setting them we can give faces different shapes.

An example of a deformable model is the famous Basel Face Model, built from face scans. To build a deformable model, you first need to take a number of people, bring them to a special laboratory and capture their faces in 3D with special equipment. Then, based on these scans, you can make new faces.

How does it work mathematically? We can represent a three-dimensional face model as a vector in 3n-dimensional space. Here n is the number of vertices in the model; each vertex has three coordinates in 3D, so we get 3n coordinates.

If we have a set of scans, then each scan is represented by such a vector, and we have a set of m such vectors, one per scan.

Next, we can build new faces as linear combinations of the vectors from our database. We would like the coefficients to be meaningful. Obviously, they cannot be completely arbitrary, and I will soon show why. One possible constraint is to require that all coefficients lie in the interval from 0 to 1. Some constraint is needed, because with completely arbitrary coefficients the faces will be implausible.

Here I would like to give the parameters a probabilistic meaning. That is, we want to look at a set of parameters and tell whether the resulting face is plausible or not. In this way we want distorted faces to correspond to low probabilities.

Here is how to do it. We can apply principal component analysis (PCA) to the set of scans. As output we get the average face S0, the matrix V of principal components, and the variance of the data along each principal component. Then we can take a fresh look at face generation: a face is the average face plus the matrix of principal components multiplied by a parameter vector.

The parameter values are exactly the slider intensities that I talked about on one of the earlier slides. We can also give the parameter vector a probabilistic meaning. In particular, we can assume that this vector is Gaussian.
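The PCA construction just described can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the "scans" are random numbers standing in for registered face scans, and the function names `build_pca_model` and `generate_face` are hypothetical, not taken from the talk.

```python
import numpy as np

def build_pca_model(scans, k):
    """scans: (m, 3n) array, one flattened mesh per row. Returns S0, V, sigma."""
    S0 = scans.mean(axis=0)                       # average face
    centered = scans - S0
    # SVD of the centered data: rows of Vt are the principal directions
    _, svals, Vt = np.linalg.svd(centered, full_matrices=False)
    V = Vt[:k].T                                  # (3n, k) principal components
    sigma = svals[:k] / np.sqrt(len(scans) - 1)   # std. deviation along each component
    return S0, V, sigma

def generate_face(S0, V, alpha):
    """New face = average face + principal components * parameter vector."""
    return S0 + V @ alpha

rng = np.random.default_rng(0)
m, n, k = 20, 50, 5                  # toy sizes: 20 scans, 50 vertices, 5 sliders
scans = rng.normal(size=(m, 3 * n))  # stand-in for registered face scans
S0, V, sigma = build_pca_model(scans, k)

alpha = rng.normal(size=k) * sigma   # Gaussian parameters: alpha_i ~ N(0, sigma_i^2)
face = generate_face(S0, V, alpha).reshape(n, 3)  # back to n vertices with (x, y, z)
```

Sampling `alpha` from the Gaussian keeps the sliders within the range seen in the data, which is what makes improbable parameter vectors correspond to distorted faces.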

Thus, we have a method that allows us to generate 3D faces, and this generation is controlled by the following parameters. Unlike the previous slide, we now have two sets of parameters, two vectors α_id and α_exp. They play the same role as before, but α_id is responsible for the shape of the face and α_exp for the expression.

There is also a new vector T, the texture vector. It has the same dimension as the shape vector: each vertex has three RGB values in it. Like the shape, the texture vector is generated from a parameter vector, β. The parameters responsible for the lighting of the face and for its position are not formalized here, but they exist too.
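Written out, the generative model could look as follows. This is a sketch in the same "average plus components times parameters" form as before; the names of the basis matrices V_id, V_exp, V_tex and of the mean texture T_0 are assumed notation, not taken from the slides.

```latex
S(\alpha_{id}, \alpha_{exp}) = S_0 + V_{id}\,\alpha_{id} + V_{exp}\,\alpha_{exp},
\qquad
T(\beta) = T_0 + V_{tex}\,\beta
```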

Here are examples of faces that can be generated with the deformable model. Note that they differ in shape and skin color, and are also rendered under different lighting conditions.

Now we can move on to 3D reconstruction. It is called the inverse problem because we want to choose parameters of the deformable model such that the face we render from it is as close as possible to the original. This slide differs from the first one in that the face on the right is completely synthetic. On the first slide the texture was taken from the photo; here it is taken from the deformable model.

As output we get all the parameters: the slide shows α_id and α_exp, and we also get the lighting parameters, texture parameters, and so on.

We said that we want the generated model to look like the photo. This similarity is measured by an energy function. Here we simply take the pixel-by-pixel difference of the images over those pixels where we consider the face to be visible. For example, if the face is rotated, occlusion occurs: part of the cheek will be hidden by the nose. The visibility matrix M should reflect such occlusions.
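A minimal sketch of such a masked pixel term is shown below. The talk only says "pixel-by-pixel difference"; using the sum of squared differences is an assumption, and `photometric_energy` is a hypothetical name.

```python
import numpy as np

def photometric_energy(rendered, target, visible):
    """Sum of squared pixel differences over pixels where the face is visible.

    rendered, target: (H, W, 3) float images; visible: (H, W) boolean mask M.
    """
    diff = (rendered - target) ** 2
    return float(diff[visible].sum())   # masked-out pixels contribute nothing

target = np.zeros((4, 4, 3))
rendered = np.ones((4, 4, 3))
M = np.zeros((4, 4), dtype=bool)
M[1:3, 1:3] = True                      # pretend only the 2x2 center shows the face
energy = photometric_energy(rendered, target, M)   # 4 pixels * 3 channels * 1.0
```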

In essence, 3D reconstruction means minimizing this energy function. But to solve this minimization problem, it helps to have initialization and regularization. Regularization is needed for an obvious reason: as we said, if we do not regularize the parameters and let them be arbitrary, we may get distorted faces. Initialization is needed because the problem as a whole is hard and has local minima, and we would rather not get stuck in them.

How can we do the initialization? For this we can use 68 key points of the face. Since 2013-2014, many algorithms have appeared that detect these 68 points with fairly good accuracy, and by now their accuracy is approaching saturation. So we have a reliable way to detect 68 points on the face.

We can add a new term to our energy function that says we want the projections of the same 68 points of the model to coincide with the detected key points of the face. We mark these points on the model, then deform and rotate the model, project the points, and require that the positions coincide. The photo on the left shows points in two colors, purple and yellow: some were detected by the algorithm, the others were projected from the model. On the right is the marking of the points on the model; for points on the contour of the face, a whole line is marked rather than a single point. This is done because when the face rotates, the marking of these points must change, so the point is chosen along the line.

Here is the term I was talking about: it is the coordinate-wise difference of two vectors, one describing the key points detected on the face and the other the key points projected from the model.
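The key point term can be sketched like this. The camera model is not specified in the talk; a weak-perspective projection (rotate, drop depth, scale, shift) and a sum of squared differences are assumed here, and all function names are hypothetical.

```python
import numpy as np

def project_weak_perspective(points3d, R, t, scale):
    """Rotate, drop depth, then scale and shift: a simple (assumed) camera model."""
    return scale * (points3d @ R.T)[:, :2] + t

def landmark_energy(model_landmarks3d, detected2d, R, t, scale):
    """Squared distance between projected model points and detected key points."""
    projected = project_weak_perspective(model_landmarks3d, R, t, scale)
    return float(np.sum((projected - detected2d) ** 2))

rng = np.random.default_rng(0)
model_pts = rng.normal(size=(68, 3))          # 68 marked vertices on the model
R, t, scale = np.eye(3), np.array([10.0, 20.0]), 2.0

detected = project_weak_perspective(model_pts, R, t, scale)   # perfect detections
e_perfect = landmark_energy(model_pts, detected, R, t, scale)
e_shifted = landmark_energy(model_pts, detected + 1.0, R, t, scale)
```

With perfect detections the term is zero; shifting every detected point by one pixel in both coordinates raises it to 68 * 2 = 136.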

Let us return to regularization and look at the whole problem from the perspective of Bayesian inference. The probability of the vector α given a known image is proportional to the probability of observing the image for a given α multiplied by the prior probability of α. If we take the negative logarithm of this expression, which is what we have to minimize, we see that the term responsible for regularization takes a specific form: it is the second term. Recalling our earlier assumption that the vector α is Gaussian, we see that the regularization term is the sum of squares of the parameters, each normalized by the variance along the corresponding principal component.
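The argument can be written out as follows. Here I denotes the image and σ_i² the variance along the i-th principal component; both symbols, and the assumption that the components are independent, are added for illustration.

```latex
\begin{aligned}
p(\alpha \mid I) &\propto p(I \mid \alpha)\, p(\alpha), \qquad
p(\alpha) \propto \exp\Big(-\frac{1}{2} \sum_i \frac{\alpha_i^2}{\sigma_i^2}\Big), \\
-\log p(\alpha \mid I) &= -\log p(I \mid \alpha) + \frac{1}{2} \sum_i \frac{\alpha_i^2}{\sigma_i^2} + \text{const}.
\end{aligned}
```

The first term on the right is the data term, and the second is the regularizer: parameters along low-variance components are penalized more strongly.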

So we can write out the full energy function, which contains three terms. The first term is responsible for the texture: the pixel difference between the generated image and the target image. The second term is responsible for the key points, and the third for regularization.

The weights of the terms are not optimized during minimization; they are simply fixed in advance.

Here the energy function is written as a function of all the parameters: α_id are the face shape parameters, α_exp the expression parameters, β the texture parameters, and p the remaining parameters that we mentioned but did not formalize, namely position and lighting.
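One possible way to write this energy down is given below. The weights w_tex, w_lm, w_reg, the rendered image I_model, the projected key points L_model, the detected key points L, and the per-component variances σ² are assumed notation; the squared norms and the inclusion of β in the regularizer are likewise assumptions rather than the exact formula from the slide.

```latex
E(\alpha_{id}, \alpha_{exp}, \beta, p) =
  w_{tex} \, \big\| M \odot \big( I_{model}(\alpha_{id}, \alpha_{exp}, \beta, p) - I \big) \big\|^2
+ w_{lm} \, \big\| L_{model}(\alpha_{id}, \alpha_{exp}, p) - L \big\|^2
+ w_{reg} \, \Big( \sum_i \frac{\alpha_{id,i}^2}{\sigma_{id,i}^2}
                 + \sum_i \frac{\alpha_{exp,i}^2}{\sigma_{exp,i}^2}
                 + \sum_i \frac{\beta_i^2}{\sigma_{tex,i}^2} \Big)
```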

Let me make one remark here. This energy function can be simplified: you can drop the term responsible for the texture and use only the information carried by the 68 points. This still lets you build some kind of 3D model. However, look at the profile of the model. On the left is a model built from key points only; on the right is a model built using the texture as well. Note that the profile on the right is more consistent with the central photograph, which shows the frontal view of the face.
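To see why the key-point-only variant is fast, here is a minimal sketch in which the pose is assumed to be known (frontal, orthographic projection that simply drops z). Under that assumption the projected key points are linear in the shape parameters, and the regularized fit has a closed-form solution. The data is synthetic and the function name is hypothetical.

```python
import numpy as np

def fit_shape_from_landmarks(S0_lm, V_lm, detected2d, sigma, lam):
    """Ridge-regularized least squares for shape parameters.

    S0_lm: (68, 3) mean positions of the landmark vertices.
    V_lm:  (68, 3, k) principal components restricted to those vertices.
    detected2d: (68, 2) detected key points; pose is assumed known (frontal,
    orthographic), so projection simply drops the z coordinate.
    """
    k = V_lm.shape[2]
    A = V_lm[:, :2, :].reshape(-1, k)            # projected basis, (136, k)
    b = (detected2d - S0_lm[:, :2]).reshape(-1)  # residual to explain, (136,)
    D = np.diag(1.0 / sigma ** 2)                # Gaussian prior on alpha
    return np.linalg.solve(A.T @ A + lam * D, A.T @ b)

rng = np.random.default_rng(1)
k = 4
S0_lm = rng.normal(size=(68, 3))
V_lm = rng.normal(size=(68, 3, k))
sigma = np.array([3.0, 2.0, 1.0, 0.5])
alpha_true = np.array([1.5, -1.0, 0.5, 0.2])
detected = (S0_lm + V_lm @ alpha_true)[:, :2]    # noise-free synthetic key points

alpha_fit = fit_shape_from_landmarks(S0_lm, V_lm, detected, sigma, lam=1e-8)
```

Note that the fit only ever sees the x and y coordinates of 68 points, so depth is constrained mostly by the prior. That is one plausible reason the profile of a key-point-only model looks worse.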

Animation is quite simple once we have an algorithm for building a 3D face model. Recall that when building a 3D model we get two parameter vectors, one responsible for shape and the other for expression. The user and the avatar will always have their own shape vectors: the user has one vector of shape parameters, the avatar has another. However, we can make the vectors responsible for expression the same. We take the parameters responsible for the user's facial expression and simply substitute them into the avatar's model. In this way we transfer the user's facial expression to the avatar.
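The parameter swap can be sketched as follows. The bases and parameters are random placeholders, and `build_mesh` and `transfer_expression` are hypothetical helper names; the only point is that the avatar keeps its own shape vector and receives the user's expression vector.

```python
import numpy as np

def build_mesh(S0, V_id, V_exp, alpha_id, alpha_exp):
    """Mesh = average face + identity deformation + expression deformation."""
    return S0 + V_id @ alpha_id + V_exp @ alpha_exp

def transfer_expression(user_params, avatar_params):
    """Keep the avatar's shape parameters, take the user's expression parameters."""
    return {"alpha_id": avatar_params["alpha_id"],
            "alpha_exp": user_params["alpha_exp"]}

rng = np.random.default_rng(2)
n, k_id, k_exp = 30, 6, 4
S0 = rng.normal(size=3 * n)
V_id = rng.normal(size=(3 * n, k_id))
V_exp = rng.normal(size=(3 * n, k_exp))

user = {"alpha_id": rng.normal(size=k_id), "alpha_exp": rng.normal(size=k_exp)}
avatar = {"alpha_id": rng.normal(size=k_id), "alpha_exp": np.zeros(k_exp)}

animated = transfer_expression(user, avatar)
avatar_mesh = build_mesh(S0, V_id, V_exp, **animated)
```

Running this per video frame, with the user's expression parameters re-estimated each time, would animate the avatar.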

Let's talk about two challenges in this area: speed and the limitations of the deformable model.

Speed really is a problem. Minimizing the full energy function is computationally very expensive. It can take from 20 to 40 seconds, 30 on average. That is quite long. If we build the three-dimensional model from key points only, it is much faster, but the quality suffers.

How can we deal with this problem? You can use more resources: some people solve this problem on the GPU. You can use key points only, but the quality suffers. And you can use machine learning methods.

Let's go through these in order. Here is a 2016 work in which the user's expression is transferred to a given video, so you can control the video with your own face. Here the 3D model is built in real time using a GPU.

Here are the approaches that use machine learning. The idea is that we first take a large database of faces, build a 3D model for each face with the slow but accurate algorithm, represent each model as a set of parameters, and then train a network to predict these parameters. In particular, this 2016 work uses a ResNet that takes an image as input and outputs the model parameters.

The three-dimensional model can also be represented differently. In this 2017 paper, the 3D model is represented not as a set of parameters but as a set of voxels. The network predicts voxels, turning the image into a three-dimensional representation. It is worth noting that there are also ways to train a network that do not require 3D models at all.

It works as follows. The most important part here is a layer that takes the parameters of the deformable model as input and renders an image. It has the wonderful property that errors can be backpropagated through it. The network takes an image as input, predicts the parameters, feeds them to the layer that renders an image, compares this image with the input image, gets an error, backpropagates it, and keeps learning. Thus, the network learns to predict the parameters of a three-dimensional model with only images as training data. And that is very interesting.

We have talked a lot about accuracy, in particular that it suffers if we drop some of the terms of the energy function. Let's formalize what this means and how to evaluate the accuracy of 3D face reconstruction. This requires a database of ground truth scans obtained with special equipment, using methods that come with some guarantees of accuracy. If we have such a database, we can compare our reconstructed models with the ground truth. This is done simply: we compute the average distance from the vertices of the model we built to the vertices of the ground truth, and normalize it by the size of the scan. The normalization is needed because faces differ in size, some larger, some smaller, and on a small face the error would be smaller simply because the face itself is smaller.
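A minimal sketch of such a metric is given below. Two things are assumed that the talk does not specify: the vertices are already in one-to-one correspondence, and "size of the scan" is taken to be the diagonal of its bounding box.

```python
import numpy as np

def normalized_mean_error(pred, gt):
    """Mean vertex-to-vertex distance divided by the scan's bounding-box diagonal.

    pred, gt: (n, 3) arrays with one-to-one vertex correspondence (an assumption;
    real scans usually need alignment and nearest-point matching first).
    """
    mean_dist = np.linalg.norm(pred - gt, axis=1).mean()
    scan_size = np.linalg.norm(gt.max(axis=0) - gt.min(axis=0))
    return mean_dist / scan_size

gt = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
pred = gt + np.array([0.0, 0.0, 0.5])       # every vertex is off by 0.5 in depth
err_small = normalized_mean_error(pred, gt)
err_large = normalized_mean_error(10 * pred, 10 * gt)   # same face, 10x bigger
```

Both calls give the same value, which is the point of the normalization: scaling the face does not change the score.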

I would like to tell you about our own work, which will be presented at one of the ECCV workshops. We do similar things: we train MobileNet to predict the parameters of a deformable model. As training data we use 3D models built for photos from the 300W dataset. We evaluate the accuracy on the BU4DFE scans.

Here is what we get. We compare our two algorithms with the state of the art. The yellow curve on this graph is the algorithm that takes 30 seconds and minimizes the full energy function. The X axis shows the error we just talked about, the average distance between vertices. The Y axis shows the proportion of images for which the error is smaller than the value on the X axis. On this graph, the higher the curve, the better. The next curve is our network based on the MobileNet architecture. Then come the three works we talked about, including a network that predicts parameters and a network that predicts voxels.
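A curve of this kind can be computed as sketched below. The per-image errors are made-up numbers used only to show the computation; they are not results from the talk.

```python
import numpy as np

def cumulative_error_curve(errors, thresholds):
    """For each threshold, the fraction of images whose error is below it."""
    errors = np.asarray(errors)
    return np.array([(errors < t).mean() for t in thresholds])

errors = [0.01, 0.02, 0.02, 0.05, 0.10]     # made-up per-image normalized errors
thresholds = [0.015, 0.03, 0.06, 0.2]       # values along the X axis
curve = cumulative_error_curve(errors, thresholds)   # values along the Y axis
```

A method whose curve rises faster has more images with a small error, which is why a higher curve is better.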

We also compared our network with its counterparts in terms of model size and speed. Here we win, because we use MobileNet, which is fairly lightweight.

The second challenge is the limitations of the deformable model.

Look at the face on the left, at the wings of the nose. There are shadows on the wings of the nose. The borders of the shadows do not coincide with the borders of the nose in the photograph, which produces a defect. The reason may be that the deformable model is in principle incapable of producing a nose of the required shape, because it was obtained from scans of only 200 people. We would like the nose to be correct, as in the photo on the right. So we need to somehow go beyond the deformable model.

This can be done with a non-parametric deformation of the mesh. There are three things we would like to do: modify a local part of the face, for example the nose, then embed it into the original face model, and leave everything else unchanged.

This can be done as follows. Let us return to the representation of the mesh as a vector in 3n-dimensional space and look at the averaging operator. Applied to S, this operator yields Ŝ (S with a hat), in which each vertex is replaced by the average of its neighbors. The neighbors of a vertex are the vertices connected to it by an edge.
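The averaging operator can be written as a matrix acting on the vertex positions. The sketch below uses a tiny hypothetical mesh (a square with one diagonal) and assumes every vertex has at least one neighbor.

```python
import numpy as np

def averaging_matrix(n_vertices, edges):
    """Row i holds weight 1/deg(i) for every neighbor of vertex i."""
    A = np.zeros((n_vertices, n_vertices))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0                  # undirected edge
    return A / A.sum(axis=1, keepdims=True)      # assumes no isolated vertices

# Toy mesh: a square 0-1-2-3 with one diagonal 0-2 (hypothetical example)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
S = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 0.0]])
W = averaging_matrix(4, edges)
S_hat = W @ S       # each vertex replaced by the average of its neighbors
```

Here the mesh is stored as an (n, 3) array for readability; flattening it gives the 3n-dimensional vector used in the talk.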

We define an energy function that describes the position of each vertex relative to its neighbors. We want the position of a vertex relative to its neighbors to remain unchanged, or at least not to change much, while we modify S in some way. This energy function is called internal, because there will also be an external component that says, for example, that the nose should take a given shape.
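One common way to make this concrete, assumed here rather than taken from the talk, is to measure each vertex's offset from the average of its neighbors and ask that these offsets be preserved, while a few vertices are pulled toward target positions. Both terms are quadratic, so the deformation reduces to a linear least-squares problem. The toy "mesh" is a chain of five vertices, and the `deform` function and its `weight` parameter are hypothetical.

```python
import numpy as np

def deform(S, W, constrained, targets, weight=10.0):
    """Minimize ||L S' - L S||^2 + weight^2 * ||S'[constrained] - targets||^2.

    L = I - W maps each vertex to its offset from the average of its neighbors;
    the first term is the internal energy, the second the external one.
    """
    n = len(S)
    L = np.eye(n) - W
    C = np.zeros((len(constrained), n))
    C[np.arange(len(constrained)), constrained] = 1.0   # picks constrained vertices
    A = np.vstack([L, weight * C])
    b = np.vstack([L @ S, weight * targets])
    S_new, *_ = np.linalg.lstsq(A, b, rcond=None)
    return S_new

# Toy "mesh": a chain of 5 vertices on a line; W averages the available neighbors
n = 5
W = np.zeros((n, n))
for i in range(n):
    nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
    W[i, nbrs] = 1.0 / len(nbrs)
S = np.column_stack([np.arange(n, dtype=float), np.zeros(n), np.zeros(n)])

# External constraint: pin both ends, lift the middle vertex by 1 along y
constrained = np.array([0, 2, 4])
targets = S[constrained] + np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]])
S_new = deform(S, W, constrained, targets)
```

The unconstrained vertices 1 and 3 follow the lifted vertex part of the way, so the local change blends into the rest of the shape instead of producing a spike.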

Such techniques were used, for example, in a 2015 work on 3D face reconstruction from several photos. The authors took several photos with a phone, obtained a point cloud, and then fitted the face model to this cloud using non-parametric modification.

There is another way to go beyond the deformable model. Let us look at what the smoothing operator does. Here, for simplicity, is a two-dimensional mesh to which this operator has been applied. The model on the left has many details; on the model on the right these details have been smoothed out. Can we do something to add details rather than remove them?

To answer this, we can look at the basis of eigenvectors of the smoothing operator. The smoothing operator modifies the coefficients of the mesh in its expansion over this basis.
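The following sketch illustrates this on a toy closed 2D contour, where every vertex has exactly two neighbors and the averaging operator is therefore a symmetric matrix (an assumption that lets `np.linalg.eigh` be used directly; a general mesh would need more care). It checks that smoothing scales each coefficient by the corresponding eigenvalue, and shows how editing a coefficient directly changes the shape.

```python
import numpy as np

n = 16
# Closed 2D contour: vertex i is connected to i-1 and i+1 (a ring)
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5   # averaging (smoothing) operator

# W is symmetric here, so eigh applies; eigenvalues come back in ascending order
eigvals, eigvecs = np.linalg.eigh(W)
order = np.argsort(eigvals)[::-1]                 # smoothest vectors first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 5
B = eigvecs[:, :k]                                # first few vectors of the basis

angles = 2 * np.pi * np.arange(n) / n
contour = np.column_stack([np.cos(angles), np.sin(angles)])   # base shape (a circle)

gamma = np.zeros((k, 2))                          # one coefficient per vector and axis
gamma[3] = [0.3, 0.0]                             # edit a single coefficient
detailed = contour + B @ gamma                    # shape with the extra deformation

# Smoothing scales every coefficient in this basis by its eigenvalue
coeffs = eigvecs.T @ contour
smoothed_coeffs = eigvecs.T @ (W @ contour)
```

Since all eigenvalues lie between -1 and 1, repeated smoothing can only shrink coefficients; setting them by hand is what allows detail to be added instead.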

Do we have to modify them in this particular way? We can do it differently: modify these coefficients in some other, external way. Let's just take the first few eigenvectors of the smoothing operator and add them to our deformable model as a new set of sliders. This technique really does give improvements; this is what is done in a 2016 work. This concludes my talk, thank you all.
