How to turn satellite images into maps. Computer vision in Yandex
One of the main data sources for the Yandex.Maps service is satellite imagery. In order to work comfortably with the map, objects are shown on the images by polygons: forests, water reservoirs, streets, houses, etc. Usually, cartographers deal with the layout. We decided to help them and teach the computer to add polygons of houses without the participation of people.
For operations with images, the IT area is responsible, which is called computer vision. The last few years, most of the problems in this area have been successfully solved using neural networks. Today we will tell about our experience of using neural networks in mapping to readers of Habr.
First of all, we will train the neural grid, which will be engaged in semantic segmentation, that is, it will determine whether each point on the satellite image is related to the house. Why semantic segmentation, and not just the detection of objects? When the detection problem is solved, we get a set of rectangles at the output, and specific ones: two sides are vertical, two are horizontal. At home, houses are usually rotated relative to the image axes, and in some buildings there is also a complex shape.
The task of semantic segmentation is currently being solved by various networks ( FCN , SegNet , UNet , etc.). We only need to choose which one suits us best.
Having obtained the mask from the satellite image, we will select rather large clusters of points belonging to houses, assemble them into connected regions and present the boundaries of the regions in the vector form in the form of polygons.
It is clear that the mask will not be absolutely accurate, which means that nearby houses can stick together in one connected area. To cope with this problem, we decided to further train the network. She will find on the image of the rib (the borders of the houses) and divide the buildings that have stuck together.
So, loomed such a scheme:
We did not completely reject the detection networks and tried Mask R-CNN . Its plus in comparison with the usual segmentation is that Mask R-CNN detects objects and generates a mask, so you don’t have to mess around with the division of the common mask into coherent areas. But a minus (as without it) in the fixed resolution of the mask of each object, i.e., for large houses with a complex border, this border will certainly be simplified.
Then it was necessary to decide on the tools. Everything was quite obvious here: OpenCV is best suited for computer vision tasks . The choice of neural networks is somewhat broader. We settled on Tensorflow . Its advantages:
For training and other heavy calculations, they planned to use Nirvana - a wonderful Yandex platform, which we have already talked about .
Success in work with a neural network by 80 percent consists of a good dataset. So, for starters, we should have collected such a dataset. Yandex has a huge number of satellite images with already marked objects. It seems, everything is simple: it is enough to unload this data and collect it in dataset. However, there is one nuance.
When a person searches for a house in a satellite image, the first thing he sees is the roof. But the height of houses is different, the satellite can take the same terrain from different angles - and if we put a polygon on the vector map corresponding to the roof, there is no guarantee that the roof will not leave when the picture is updated. But the foundation is dug into the ground and, from whatever angle you take it off, it always remains in one place. That is why the houses on the vector Yandex.Map are marked “on the foundations”. This is correct, but for the problem of segmentation of images it is better to teach the network to look for roofs: the hope that the grid will be trained to recognize the foundations is very small. Therefore, in dataset everything should be marked on the roofs. So, to create a good datasset, we need to learn how to move the vector marking of houses from the foundations to the roofs.
We tried and did not move, but the quality was not very good, and this is understandable: the satellite shooting angles are different, the heights of the houses are different, as a result, the foundation shifted in different directions and at different distances from the roof in the photos. The network is lost from such diversity and, at best, it trains at something average, at worst - at something incomprehensible. And the network for semantic segmentation produces a result that looks like something acceptable, but when searching for edges, the quality drops dramatically.
Since we have climbed into the field of computer vision, the first thing we tried was an approach that is relevant to this computer vision itself. At first, the vector map is rasterized (the polygons of the houses are drawn with white lines on a black background), the Sobel filter highlights the edges in the satellite image. And then the two images are offset relative to each other, which maximizes the correlation between them. The edges after the Sobel filter are quite noisy, therefore, if we apply this approach to a single building, an acceptable result is not always obtained. However, the method works well in areas with buildings of the same height: if you look for an offset at once over a large portion of the image, the result will be more stable.
If the territory is built up not by the same type, but by various houses, the previous method will not work. Fortunately, sometimes we know the height of buildings on the Yandex vector map and the position of the satellite during the shooting. Thus, we can use school knowledge of geometry and count where and how far the roof will move relative to the foundation. This method has improved dataset in areas with high-rise buildings.
The most time consuming way: roll up your sleeves, uncover a mouse, stare at the monitor and manually move the vector marking of houses from the foundations to the roofs. The technique brings a result that is simply amazing in quality, but it is not recommended to use it in large quantities: developers engaged in such tasks quickly fall into apathy and lose interest in life.
As a result, we received enough satellite images, well-marked on the roofs. So, there is a chance to train the neural network (so far, however, not for segmentation, but for improving the layout of other satellite images). And we did it.
The inputs to the convolutional neural network were satellite imagery and offset rasterized markup. At the output, we obtained a two-dimensional vector: vertical and horizontal displacements.
With the help of the neural network, we found the necessary displacement, which allowed us to achieve good results on buildings that do not have a height. As a result, we have significantly reduced the correction of the markup manually.
There are many interesting territories and states on Yandex.Maps. But even in Russia, houses are extremely diverse, which affects the way they look in satellite imagery. Means, it is necessary to reflect a variety in. And initially we didn’t really understand how to deal with all this magnificence correctly. Collect a huge dataset and then train one network on it? Make your dataset for each (conditional) type of building and train a separate network? To train a certain basic network and then to test it for a specific type of building?
Experienced we found that:
Semantic segmentation is a fairly well-studied task. After the appearance of the Fully Convolutional Networks article, it is mainly solved with the help of neural networks. It remains only to choose a network (we considered FCN , SegNet and UNet ), consider whether we need additional tricks like CRF at the output, and decide how and what function of the error will be trained.
As a result, we stopped at the U-Net-like architecture with the generalized Intersection Over Union function as a function of the error. For training, satellite images and the corresponding markup (of course, rasterized) were cut into squares and collected in datasets. It turned out quite nice, and sometimes just fine.
In areas with single buildings, semantic segmentation was enough to move on to the next stage - vectorization. Where the building is dense, houses are sometimes glued together in a cohesive area. It took to separate them.
To cope with this task, you can find the edges in the image. To detect edges, we also decided to train the network (edge search algorithms that do not use neural networks are clearly a thing of the past). We trained the HED type network, which is described in Holistically-Nested Edge Detection . In the original article, the network was trained on the BSDS-500 dataset, in which all edges are marked on the images. The trained network finds all clearly defined edges: the boundaries of houses, roads, lakes, etc. This is enough to divide nearby buildings. But we decided to go further and use the same dataset for learning as for semantic segmentation, only during rasterization, not to paint over the polygons of buildings entirely, but to draw only their boundaries.
The result turned out to be so stunningly beautiful that we decided to vectorize the buildings directly along the edges obtained from the network. And it was completely successful.
Since the HED network gave excellent results on the edges, we decided to train it to find the vertices. In fact, we have a network with common weights on the convolutional layers. She had two exits at the same time: for the edges and for the vertices. As a result, we have made another version of the vectorization of buildings, and in some cases, it showed quite sane results.
Mask R-CNN is a relatively new extension of Faster R-CNN networks. Mask R-CNN searches for objects and allocates a mask for each of them. As a result, for houses we will get not only the bounding rectangles, but also a refined structure. This approach compares favorably with simple detection (we do not know how the building is located inside the rectangle) and from the usual segmentation (several houses can stick together into one, and it’s not clear how to separate them). With Mask R-CNN, you no longer need to think about additional tricks: it is enough to vectorize the mask boundary for each object and immediately get the result. There is also a minus: the mask size for the object is always fixed, that is, for large buildings the pixel marking accuracy will be low. The result of the Mask R-CNN is as follows:
We tried Mask R-CNN last and made sure that for some types of development this approach benefits others.
With all the modern architectural diversity of the house on the satellite images are still most often look like rectangles. Moreover, for the masses of the territories do not need to mark up complex polygons. But I still want to mark the houses on the map. (Well, for example, a gardener's partnership: there are usually a lot of houses, it’s not so important to manually mark, but to mark the rectangles on the map is very good.) Therefore, the first approach to vectorization was extremely simple.
It is clear that the quality of this approach is far from ideal. However, the algorithm is quite simple and in many cases it is working.
If the quality of the segmentation is good enough, you can more accurately recreate the contour of the house. In most buildings of complex shape, the angles are mostly straight, so we decided to reduce the task to the problem of constructing a polygon with orthogonal sides. Solving it, we want to achieve two goals at once: to find the most simple polygon and to repeat the shape of buildings as accurately as possible. These goals conflict with each other, so we have to introduce additional conditions: limit the minimum length of the walls, the maximum deviation from the raster area, etc.
The algorithm that first came to our mind was based on constructing a projection of points on straight lines:
This algorithm is extremely simple and quickly brings results, but still the outline of the building sometimes turns out to be quite noisy. Trying to cope with this problem, we stumbled upon a rather interesting solution to the problem, which uses a square grid in space to approximate a polygon. Briefly, the algorithm consists of three actions:
Since the required angle of grid rotation is not known in advance, we have to sort out several values, which has a bad effect on performance. However, the algorithm allows to achieve more visually beautiful results.
While we actually worked with each house separately. When the first stage is completed, you can work with the whole picture and improve the result. For this, the post processing algorithm for a set of polygons has been added. We used the following heuristics:
As a result, the following algorithm appeared:
As a result, we got a tool that can recognize buildings of various types of buildings. He helps cartographers in their difficult work: significantly speeds up the search for missing houses and filling in new, not yet processed areas. At the moment, with the help of this tool, more than 800 thousand new objects have been added to the People’s Map.
Below you will see some examples of recognition.
For operations with images, the IT area is responsible, which is called computer vision. The last few years, most of the problems in this area have been successfully solved using neural networks. Today we will tell about our experience of using neural networks in mapping to readers of Habr.
First of all, we will train the neural grid, which will be engaged in semantic segmentation, that is, it will determine whether each point on the satellite image is related to the house. Why semantic segmentation, and not just the detection of objects? When the detection problem is solved, we get a set of rectangles at the output, and specific ones: two sides are vertical, two are horizontal. At home, houses are usually rotated relative to the image axes, and in some buildings there is also a complex shape.
The task of semantic segmentation is currently being solved by various networks ( FCN , SegNet , UNet , etc.). We only need to choose which one suits us best.
Having obtained the mask from the satellite image, we will select rather large clusters of points belonging to houses, assemble them into connected regions and present the boundaries of the regions in the vector form in the form of polygons.
It is clear that the mask will not be absolutely accurate, which means that nearby houses can stick together in one connected area. To cope with this problem, we decided to further train the network. She will find on the image of the rib (the borders of the houses) and divide the buildings that have stuck together.
So, loomed such a scheme:
We did not completely reject the detection networks and tried Mask R-CNN . Its plus in comparison with the usual segmentation is that Mask R-CNN detects objects and generates a mask, so you don’t have to mess around with the division of the common mask into coherent areas. But a minus (as without it) in the fixed resolution of the mask of each object, i.e., for large houses with a complex border, this border will certainly be simplified.
Instruments
Then it was necessary to decide on the tools. Everything was quite obvious here: OpenCV is best suited for computer vision tasks . The choice of neural networks is somewhat broader. We settled on Tensorflow . Its advantages:
- a sufficiently developed set of ready-made cubes from which you can assemble your networks;
- Python API, convenient for quickly creating a network structure and for training;
- A trained network can be used in its program through a C ++ interface (very poor compared to the Python part, but quite sufficient to run ready-made networks).
For training and other heavy calculations, they planned to use Nirvana - a wonderful Yandex platform, which we have already talked about .
Dataset
Success in work with a neural network by 80 percent consists of a good dataset. So, for starters, we should have collected such a dataset. Yandex has a huge number of satellite images with already marked objects. It seems, everything is simple: it is enough to unload this data and collect it in dataset. However, there is one nuance.
Dataset update
When a person searches for a house in a satellite image, the first thing he sees is the roof. But the height of houses is different, the satellite can take the same terrain from different angles - and if we put a polygon on the vector map corresponding to the roof, there is no guarantee that the roof will not leave when the picture is updated. But the foundation is dug into the ground and, from whatever angle you take it off, it always remains in one place. That is why the houses on the vector Yandex.Map are marked “on the foundations”. This is correct, but for the problem of segmentation of images it is better to teach the network to look for roofs: the hope that the grid will be trained to recognize the foundations is very small. Therefore, in dataset everything should be marked on the roofs. So, to create a good datasset, we need to learn how to move the vector marking of houses from the foundations to the roofs.
We tried and did not move, but the quality was not very good, and this is understandable: the satellite shooting angles are different, the heights of the houses are different, as a result, the foundation shifted in different directions and at different distances from the roof in the photos. The network is lost from such diversity and, at best, it trains at something average, at worst - at something incomprehensible. And the network for semantic segmentation produces a result that looks like something acceptable, but when searching for edges, the quality drops dramatically.
Raster approach
Since we have climbed into the field of computer vision, the first thing we tried was an approach that is relevant to this computer vision itself. At first, the vector map is rasterized (the polygons of the houses are drawn with white lines on a black background), the Sobel filter highlights the edges in the satellite image. And then the two images are offset relative to each other, which maximizes the correlation between them. The edges after the Sobel filter are quite noisy, therefore, if we apply this approach to a single building, an acceptable result is not always obtained. However, the method works well in areas with buildings of the same height: if you look for an offset at once over a large portion of the image, the result will be more stable.
"Geometric" approach
If the territory is built up not by the same type, but by various houses, the previous method will not work. Fortunately, sometimes we know the height of buildings on the Yandex vector map and the position of the satellite during the shooting. Thus, we can use school knowledge of geometry and count where and how far the roof will move relative to the foundation. This method has improved dataset in areas with high-rise buildings.
"Manual" approach
The most time consuming way: roll up your sleeves, uncover a mouse, stare at the monitor and manually move the vector marking of houses from the foundations to the roofs. The technique brings a result that is simply amazing in quality, but it is not recommended to use it in large quantities: developers engaged in such tasks quickly fall into apathy and lose interest in life.
Neural network
As a result, we received enough satellite images, well-marked on the roofs. So, there is a chance to train the neural network (so far, however, not for segmentation, but for improving the layout of other satellite images). And we did it.
The inputs to the convolutional neural network were satellite imagery and offset rasterized markup. At the output, we obtained a two-dimensional vector: vertical and horizontal displacements.
With the help of the neural network, we found the necessary displacement, which allowed us to achieve good results on buildings that do not have a height. As a result, we have significantly reduced the correction of the markup manually.
Different territories - different houses
There are many interesting territories and states on Yandex.Maps. But even in Russia, houses are extremely diverse, which affects the way they look in satellite imagery. Means, it is necessary to reflect a variety in. And initially we didn’t really understand how to deal with all this magnificence correctly. Collect a huge dataset and then train one network on it? Make your dataset for each (conditional) type of building and train a separate network? To train a certain basic network and then to test it for a specific type of building?
Experienced we found that:
- Undoubtedly, it is necessary to expand the datas for different types of buildings on which it is planned to use the tool. A network trained on one type is able to allocate buildings of another type, although very poorly.
- It is better to train one large network on the entire data set. It is fairly well generalized to different territories. If you train individual networks for each type of development, the quality will either remain the same or barely increase. So to introduce different networks for different territories is meaningless. In addition, it requires more data and an additional classifier of the type of development.
- If you use old networks when new territories are added to the data, networks learn much faster. Additional training of old networks on extended data leads to approximately the same result as network training from scratch, but it takes much less time.
Solution options
Semantic segmentation
Semantic segmentation is a fairly well-studied task. After the appearance of the Fully Convolutional Networks article, it is mainly solved with the help of neural networks. It remains only to choose a network (we considered FCN , SegNet and UNet ), consider whether we need additional tricks like CRF at the output, and decide how and what function of the error will be trained.
As a result, we stopped at the U-Net-like architecture with the generalized Intersection Over Union function as a function of the error. For training, satellite images and the corresponding markup (of course, rasterized) were cut into squares and collected in datasets. It turned out quite nice, and sometimes just fine.
In areas with single buildings, semantic segmentation was enough to move on to the next stage - vectorization. Where the building is dense, houses are sometimes glued together in a cohesive area. It took to separate them.
Edge detection
To cope with this task, you can find the edges in the image. To detect edges, we also decided to train the network (edge search algorithms that do not use neural networks are clearly a thing of the past). We trained the HED type network, which is described in Holistically-Nested Edge Detection . In the original article, the network was trained on the BSDS-500 dataset, in which all edges are marked on the images. The trained network finds all clearly defined edges: the boundaries of houses, roads, lakes, etc. This is enough to divide nearby buildings. But we decided to go further and use the same dataset for learning as for semantic segmentation, only during rasterization, not to paint over the polygons of buildings entirely, but to draw only their boundaries.
The result turned out to be so stunningly beautiful that we decided to vectorize the buildings directly along the edges obtained from the network. And it was completely successful.
Vertex Detection
Since the HED network gave excellent results on the edges, we decided to train it to find the vertices. In fact, we have a network with common weights on the convolutional layers. She had two exits at the same time: for the edges and for the vertices. As a result, we have made another version of the vectorization of buildings, and in some cases, it showed quite sane results.
Mask R-CNN
Mask R-CNN is a relatively new extension of Faster R-CNN networks. Mask R-CNN searches for objects and allocates a mask for each of them. As a result, for houses we will get not only the bounding rectangles, but also a refined structure. This approach compares favorably with simple detection (we do not know how the building is located inside the rectangle) and from the usual segmentation (several houses can stick together into one, and it’s not clear how to separate them). With Mask R-CNN, you no longer need to think about additional tricks: it is enough to vectorize the mask boundary for each object and immediately get the result. There is also a minus: the mask size for the object is always fixed, that is, for large buildings the pixel marking accuracy will be low. The result of the Mask R-CNN is as follows:
We tried Mask R-CNN last and made sure that for some types of development this approach benefits others.
Vectorization
Vectoring Rectangles
With all the modern architectural diversity of the house on the satellite images are still most often look like rectangles. Moreover, for the masses of the territories do not need to mark up complex polygons. But I still want to mark the houses on the map. (Well, for example, a gardener's partnership: there are usually a lot of houses, it’s not so important to manually mark, but to mark the rectangles on the map is very good.) Therefore, the first approach to vectorization was extremely simple.
- Take the raster area corresponding to the "home".
- Find the minimum area rectangle that contains this area (for example, like this: OpenCV :: minAreaRect ). Problem solved.
It is clear that the quality of this approach is far from ideal. However, the algorithm is quite simple and in many cases it is working.
Polygons vectorization
If the quality of the segmentation is good enough, you can more accurately recreate the contour of the house. In most buildings of complex shape, the angles are mostly straight, so we decided to reduce the task to the problem of constructing a polygon with orthogonal sides. Solving it, we want to achieve two goals at once: to find the most simple polygon and to repeat the shape of buildings as accurately as possible. These goals conflict with each other, so we have to introduce additional conditions: limit the minimum length of the walls, the maximum deviation from the raster area, etc.
The algorithm that first came to our mind was based on constructing a projection of points on straight lines:
- Find the contour of the raster area corresponding to one house.
- Reduce the number of points in the contour by simplifying it, for example, using the Douglas-Pecker algorithm .
- Find the longest side in the contour. Its angle of inclination will determine the angle of the entire future orthogonal polygon.
- Build a projection from the next point of the contour to the previous side.
- Extend the side to the projection point. If the distance from the point to its projection is more than the shortest wall of the building, add the resulting segment to the contour of the building.
- Repeat steps 4 and 5 until the contour is closed.
This algorithm is extremely simple and quickly brings results, but still the outline of the building sometimes turns out to be quite noisy. Trying to cope with this problem, we stumbled upon a rather interesting solution to the problem, which uses a square grid in space to approximate a polygon. Briefly, the algorithm consists of three actions:
- Build a square grid in space centered at zero.
- On grid points that are located not farther than a certain distance from the original contour, construct various polygons.
- Select a polygon with a minimum number of vertices.
Since the required angle of grid rotation is not known in advance, we have to sort out several values, which has a bad effect on performance. However, the algorithm allows to achieve more visually beautiful results.
Improved vectorization
While we actually worked with each house separately. When the first stage is completed, you can work with the whole picture and improve the result. For this, the post processing algorithm for a set of polygons has been added. We used the following heuristics:
- Usually the walls of adjacent houses are parallel. Moreover: most often, houses can be combined into sets, within which all elements are aligned.
- If the picture has already marked the streets, then it is very likely that the sides of the polygons will be parallel to the streets.
- If the polygons intersect, then most likely it makes sense to move the walls so that the intersection disappears.
As a result, the following algorithm appeared:
- Cluster found houses by the distance between them and the angle of rotation. We average the turns of buildings in each cluster. We repeat until the position of the buildings stops changing or until the houses start to deviate too much from the initial position.
- We choose houses near the roads, we find the side that is the longest and closest to the road. We turn the house to the parallelism of the selected side and the road.
- We remove the intersection between polygons, shifting the sides of two intersecting buildings in proportion to the size of the sides.
Result
As a result, we got a tool that can recognize buildings of various types of buildings. He helps cartographers in their difficult work: significantly speeds up the search for missing houses and filling in new, not yet processed areas. At the moment, with the help of this tool, more than 800 thousand new objects have been added to the People’s Map.
Below you will see some examples of recognition.