
3D object detection methods for self-driving cars. A Yandex talk
A self-driving car cannot operate without understanding what is around it and where exactly. Last December, developer Viktor Otliga (vitonka) gave a talk on 3D object detection at the Data-Christmas Tree event. Viktor works on Yandex's self-driving cars, in the group responsible for understanding the traffic scene (he also teaches at the Yandex School of Data Analysis, ShAD). He explained how we detect other road users in a three-dimensional point cloud, how this problem differs from detecting objects in an image, and how to benefit from combining different types of sensors.
- Hello! My name is Viktor Otliga, I work in the Yandex office in Minsk, and I develop self-driving cars. Today I will talk about a task that is quite important for them: detecting the 3D objects around us.

To drive, the car needs to understand what is around it. I will briefly describe which sensors are used on self-driving cars and which ones we use. Then I will explain what the 3D object detection task is and how to measure detection quality, and what data that quality can be measured on. After that I will give a short overview of good modern algorithms, including the ones our solution is based on. And at the end, a brief comparison of these algorithms, including ours.

This is what our working self-driving car prototype looks like today. Anyone can take such a driverless taxi in Innopolis, Russia, as well as in Skolkovo. If you look closely, there is a large box on the roof. What is inside it?

Inside is a fairly simple set of sensors: a GNSS and GSM antenna to determine where the car is and to communicate with the outside world, and of course such a classic sensor as a camera. But today we will be interested in lidars.


A lidar produces roughly this kind of point cloud around itself, where each point has three coordinates, and that is what we have to work with. I will explain how to detect objects using a camera image and a lidar cloud.

What is the task? The input is an image from a camera that is synchronized with the lidar: it would be strange to take a picture from a second ago, combine it with a lidar cloud from a completely different moment, and try to detect objects on that.

Synchronizing cameras and lidars is a separate, difficult task, but we handle it successfully. So this data comes in as input, and at the output we want bounding boxes around the objects: pedestrians, cyclists, cars, and other road users, and not only them.
The task is set. How will we evaluate the result?

The problem of 2D recognition of objects in an image has been widely studied.

You can use standard metrics or their analogues. There is the Jaccard coefficient, or intersection over union (IoU), a nice metric that shows how well we localized an object. We take the box where we think the object is and the box where it actually is, and compute this metric. There are standard thresholds: for cars, say, a threshold of 0.7 is often used. If the value is above 0.7, we consider the object successfully detected; the object is there, and we can move on.
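To make the metric concrete, here is a minimal Python sketch of IoU for two axis-aligned 2D boxes; the (x_min, y_min, x_max, y_max) box format and the function name are just assumptions for this illustration, not something from the talk.

```python
# A minimal sketch of intersection over union (the Jaccard coefficient) for two
# axis-aligned 2D boxes given as (x_min, y_min, x_max, y_max).
def iou_2d(a, b) -> float:
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # overlap along x
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # overlap along y
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts as correct if IoU with a ground-truth box exceeds the class
# threshold, e.g. 0.7 for cars:
print(iou_2d((0, 0, 10, 10), (2, 2, 12, 12)) > 0.7)   # False: IoU is about 0.47
```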
In addition, for each detection we would like some confidence score indicating how sure we are that the object is really there, and we want to measure that too. A simple way is average precision: take the precision-recall curve and the area under it, and say that the larger the area, the better.
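Here is a simple sketch of how average precision can be computed from a precision-recall curve, assuming we already know which detections are true positives. The official KITTI evaluation uses an interpolated version of this area, so this only illustrates the basic idea.

```python
# A minimal sketch of average precision, assuming for each detection (sorted by
# confidence) we already know whether it is a true positive.
import numpy as np

def average_precision(scores: np.ndarray, is_true_positive: np.ndarray, num_gt: int) -> float:
    """scores: (N,) detector confidences; is_true_positive: (N,) bools (IoU above the
    threshold and matched to an unused ground-truth box); num_gt: number of GT objects."""
    order = np.argsort(-scores)                      # most confident detections first
    tp = is_true_positive[order].astype(float)
    cum_tp = np.cumsum(tp)
    precision = cum_tp / np.arange(1, len(tp) + 1)
    recall = cum_tp / max(num_gt, 1)
    # Area under the precision-recall curve (simple trapezoid over recall).
    return float(np.trapz(precision, recall))
```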

To measure the quality of 3D detection, the dataset is usually split into several parts, because objects can be near or far and can be partially occluded by something else. The validation set is therefore often divided into three subsets: objects that are easy to detect, objects of medium difficulty, and hard objects that are far away or heavily occluded, and quality is measured separately on each. In the comparison below we use the same split.

Quality can be measured directly in 3D with an analogue of intersection over union that uses volumes instead of areas. But a self-driving car, as a rule, does not care much about what happens along the Z coordinate. We can take a bird's-eye view from above and compute the metric as if everything were in 2D. A person navigates more or less in 2D, and a self-driving car does the same; how tall the box is does not matter much.
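For completeness, the same sketch extended to 3D for axis-aligned boxes. Real KITTI boxes are rotated around the vertical axis, which requires polygon intersection, so this only illustrates the volume-based idea.

```python
# IoU in 3D for axis-aligned boxes: intersect volumes instead of areas.
def iou_3d_axis_aligned(a, b) -> float:
    """a, b: (x_min, y_min, z_min, x_max, y_max, z_max)."""
    dims = []
    for i in range(3):
        # Overlap along each axis, clipped at zero.
        dims.append(max(0.0, min(a[i + 3], b[i + 3]) - max(a[i], b[i])))
    inter = dims[0] * dims[1] * dims[2]
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0
# Dropping the z terms gives the bird's-eye-view variant discussed above.
```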

What do we measure on?

Probably everyone who has dealt with 3D detection in lidar clouds in any way has heard of the KITTI dataset.

The dataset was recorded in several cities in Germany: a car equipped with sensors drove around carrying GPS receivers, cameras, and lidars. About 8000 scenes were then annotated and split into two parts: a training part, on which anyone can train, and a validation part for measuring results. The KITTI validation set is considered a standard quality benchmark. There is a leaderboard on the KITTI website where you can submit your solution's results on the validation data and compare them with the solutions of other market players and researchers. The dataset is also publicly available, so you can download it, quietly evaluate your own method, and compare with competitors without uploading anything publicly.

External datasets are good because you do not have to spend your own time and resources on them, but the car that drove around Germany is, as a rule, equipped with completely different sensors from ours. It is always good to have your own internal dataset: it is hard to get someone else's dataset extended, but easy to manage your own. For this we use the wonderful Yandex.Toloka service.

We built a dedicated task interface on top of it. A user who wants to help with annotation and earn a reward for it gets a camera image and a lidar cloud that can be rotated and zoomed in and out, and is asked to place bounding boxes so that a car, a pedestrian, or some other object fits inside them. This is how we collect an internal dataset for our own use.
So, we have decided what task we are solving and how we will judge whether we did it well or badly, and we have obtained the data.
What about the algorithms? Let's start with 2D. The 2D detection task is very well known and well studied.

Many people have surely heard of the SSD algorithm, one of the state-of-the-art methods for 2D object detection, so we can assume that the problem of detecting objects in an image is solved reasonably well, and if needed, we can use those results as additional information.
But a lidar cloud has properties that make it very different from an image. First, it is very sparse: where an image is a dense structure with pixels packed tightly together, the cloud is thin, there are not that many points, and it has no regular grid. For purely physical reasons there are far more points nearby than far away, and the farther out you go, the fewer points there are, the lower the accuracy, and the harder it is to recognize anything.
Second, the points arrive from the cloud in no particular order. No one guarantees that one point will always come before another; the order is essentially random. You could agree to sort or reorder them in advance and only then feed them into the model, but that is inconvenient and costs extra time.
We would like an architecture that is invariant to these issues and handles all of them for us. Fortunately, such an architecture, PointNet, was presented at CVPR last year. How does it work?

A cloud of n points arrives at the input, each with three coordinates. Each point is first normalized by a small learned transform, then passed through a shared fully connected network that enriches it with features; then the transform is applied again and the features are enriched further. At some point we have n points, each with roughly 1024 features, all normalized in the same way. But we still have not solved the invariance problems mentioned earlier. Here the authors propose max-pooling: take the maximum over all points in each channel and obtain a single vector of 1024 features, a descriptor of the whole cloud that carries information about the entire cloud. And with this descriptor you can then do many different things.
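As an illustration, here is a minimal PointNet-style encoder sketch in PyTorch; the learned input transforms (T-Nets) from the paper are omitted and the layer sizes are only indicative, so this is not the exact published architecture.

```python
# A minimal PointNet-style encoder sketch (PyTorch), omitting the input transforms.
import torch
import torch.nn as nn

class PointNetEncoder(nn.Module):
    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        # Shared per-point MLP implemented as 1x1 convolutions over the point axis.
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, feat_dim, 1), nn.BatchNorm1d(feat_dim), nn.ReLU(),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, n_points, 3) -> (batch, 3, n_points) for Conv1d.
        x = self.mlp(points.transpose(1, 2))      # (batch, feat_dim, n_points)
        global_feat, _ = torch.max(x, dim=2)      # max-pool over points: order-invariant
        return global_feat                        # (batch, feat_dim) cloud descriptor

cloud = torch.randn(2, 4096, 3)                   # two toy clouds of 4096 points each
descriptor = PointNetEncoder()(cloud)             # (2, 1024)
```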

For example, you can concatenate it with the descriptors of individual points and solve the segmentation problem: determine for each point what kind of object it belongs to, whether it is just road, a person, or a car. Here are the results from the paper.

You can see that the algorithm does a very good job. I particularly like the small table where part of the tabletop data was removed, and it still determined where the legs are and where the tabletop is. This algorithm can be used as a building block for larger systems.
One approach that uses it is Frustum PointNets, the "truncated pyramid" approach. The idea is roughly this: let's first detect objects in 2D, which we are good at.

Then, knowing how the camera works, we can estimate the region of space where the object of interest, say a car, can lie, project into it, cut out only that region, and solve the detection problem there. This is much easier than looking for an arbitrary number of cars across the whole cloud; searching for exactly one car in a small piece of the cloud is much simpler and more efficient.
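A rough sketch of what this frustum-cropping step could look like, assuming an ideal pinhole camera with intrinsic matrix K and lidar points already transformed into the camera frame; the function and its box format are hypothetical, not the authors' code.

```python
# Hypothetical sketch of frustum cropping: keep only the lidar points whose camera
# projection falls inside a 2D detection box.
import numpy as np

def crop_frustum(points_cam: np.ndarray, box_2d, K: np.ndarray) -> np.ndarray:
    """points_cam: (N, 3) points in camera coordinates (z points forward).
    box_2d: (x_min, y_min, x_max, y_max) from the 2D detector, in pixels.
    K: 3x3 camera intrinsic matrix."""
    x_min, y_min, x_max, y_max = box_2d
    z = points_cam[:, 2]
    in_front = z > 0                               # ignore points behind the camera
    safe_z = np.where(in_front, z, 1.0)            # avoid division by zero
    # Pinhole projection: u = fx * x / z + cx, v = fy * y / z + cy.
    u = K[0, 0] * points_cam[:, 0] / safe_z + K[0, 2]
    v = K[1, 1] * points_cam[:, 1] / safe_z + K[1, 2]
    inside = in_front & (u >= x_min) & (u <= x_max) & (v >= y_min) & (v <= y_max)
    return points_cam[inside]
```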

The architecture looks roughly like this: first we select the regions of interest, then we run segmentation in each region, and finally we estimate the bounding box of the object of interest.

The approach has proven itself: the pictures show that it works quite well. But it also has drawbacks. It is a two-stage approach, which can make it slow: we first have to run a network to detect 2D objects, then cut out the frustum, and only then solve segmentation and bounding-box estimation on that piece of the cloud, so it can be somewhat slow.
Another approach: why not turn the cloud into a structure that resembles an image? The idea is to look at it from above and discretize the lidar cloud into cells, getting small cubes of space.

Each cube contains some points. We can compute hand-crafted features on them, or we can use PointNet to compute a descriptor for each piece of space. We end up with voxels, each with its own feature description, and the result looks more or less like a dense structure, like an image. On top of that we can build various architectures, for example an SSD-like detector.
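A toy sketch of such a voxelization step, with a simple per-voxel mean feature standing in for the learned per-voxel PointNet descriptor; the grid sizes and ranges are arbitrary example values, not the settings of any particular published network.

```python
# Toy voxelization: bucket points into a bird's-eye-view grid and compute a simple
# per-voxel feature (here just the mean point), which a learned encoder could replace.
import numpy as np

def voxelize(points: np.ndarray, voxel_size=(0.2, 0.2), x_range=(0.0, 70.0), y_range=(-40.0, 40.0)):
    """points: (N, 3). Returns a dict mapping (ix, iy) grid cells to a feature vector."""
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]
    ix = ((pts[:, 0] - x_range[0]) / voxel_size[0]).astype(int)
    iy = ((pts[:, 1] - y_range[0]) / voxel_size[1]).astype(int)
    voxels = {}
    for i, key in enumerate(zip(ix, iy)):
        voxels.setdefault(key, []).append(pts[i])
    # One feature vector per non-empty voxel; a per-voxel PointNet would go here.
    return {key: np.mean(np.stack(v), axis=0) for key, v in voxels.items()}
```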

The last approach was one of the very first to combine data from multiple sensors. It would be a shame to use only lidar data when we also have the camera. One such approach is the Multi-View 3D Object Detection Network (MV3D). Its idea is to feed three channels of input data into one large network.

These are the camera image and two projections of the lidar cloud: a bird's-eye view from above and a front view, what we see ahead of us. We feed all this into the network, it fuses everything internally and gives us the final result: the objects.
Now I want to compare these models. On the KITTI validation set, quality is evaluated as average precision, in percent.
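Schematically, the fusion could look like the sketch below: per-region features from the three views are concatenated and passed through a small head. This is only an illustration of the idea, not the published MV3D architecture, and it assumes region proposal and ROI pooling have already produced aligned features for each view.

```python
# A schematic late-fusion head: three per-region feature vectors, one per input view,
# are concatenated and mapped to class scores.
import torch
import torch.nn as nn

class ThreeViewFusionHead(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(),
            nn.Linear(dim, num_classes),
        )

    def forward(self, bev_feat, front_feat, image_feat):
        # Each input: (num_regions, dim) features pooled from one view for the same regions.
        return self.head(torch.cat([bev_feat, front_feat, image_feat], dim=1))
```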

You can see that F-PointNet works quite well and fast enough, beating everyone else in various categories, at least according to its authors.
Our approach builds on more or less all the ideas I have listed. In the comparison the picture is roughly this: where we do not take first place, we take at least second, and on objects that are hard to detect we come out among the leaders. Most importantly, our approach is fast enough, which means it is well suited to real-time systems, and it is especially important for a self-driving car to track what is happening on the road and to detect all these objects in time.

In conclusion, an example of our detector in action:
You can see that the scene is complicated: some objects are occluded, some are not visible to the camera at all; there are pedestrians and cyclists. But the detector copes well enough. Thank you!