GIS: determining the nesting of administrative districts
- Tutorial
The task was to organize administrative centers in a clear hierarchy according to the nested doll principle, for example, Ukraine - Crimea - South Coast - Yalta, and to correct existing errors in the current database.
In this article, I will describe how I solved this problem using KML border-framing files and Postgres + Postgis.
The fact is that the database that we use for our project is not commercial (user generated, open source) and there are errors in it. For example, the most common case is that many cities are assigned to a country, but do not belong to any of its regions and regions, we call them orphaned cities .
Plus, our business is tourism, so administrative and political fragmentation of countries is not always suitable, sometimes no, no, and you have to add tourist regions manually. For example, there is no such administrative region as “Southern Coast of Crimea”, but there is such a tourist region where tourists choose where to go - they look for “houses in the South Coast” rather than “houses in Yalta, Gaspra, Gurzuf and generally somewhere there ".
The question is how to automatically find such an administrative region of the parent (Crimea) and ask all the children included in it (cities like Yalta and Sudak).
By the way: our country consists of regions, regions consist of regions, regions of subregions. Just like the death of Koscheev ...
To solve this problem, it was decided to use data with the borders of the regions, and to compare parents and children.
A database of the borders of the regions of the part of the world you are interested in can be found on the Internet. I used files for Google Earth - either KML (XML with coordinates) or KMZ (KML in the zip archive), they are convenient to watch directly in Google Earth.
KMZ could not be automatically turned into KML - Unicode characters in toponymy names break down, so I opened KMZ files in Google Earth and saved it as KML. Of course, “it is unworthy of a talented person to spend hours, like a slave, on calculations, which, of course, could be entrusted to any person, if one were to use a machine”, but if there were more files, he would definitely look for an automation tool.
The second way, if you need custom areas like the South Coast, is hand-drawn in Google Earth and saved in a KML file.
In the data files that I had, there were distinct regions of each region, but there were no relationships between parents and children. For example, a separate file “all regions of the country”, and separately - “all regions of the country”. In principle, this is the task that we need to solve, but for this we now have all the data.
At first I wanted to use MySQL, fortunately they have support for spatial data types and operations on them, but then it turned out that these very operations were implemented in a very simplified form and were unsuitable for real tasks. A simplified view means that MySQL only works with MBR, the rectangle that describes the polygon. It turns out that instead of the beautiful borders of Crimea, operations will be carried out with a blunt rectangle around it, and the calculation of the real area would be replaced simply by the area of this rectangle (which will be used to solve this problem). This makes MySQL unsuitable for this task.
We use Postgres and install the PostGis library on it to work with geographic data types.
I have prepared a few tables -
The data itself can be loaded in 2 ways:
The resulting borders came out too detailed, with a large number of points. This means longer processing, as well as more data to send to the client in the browser via AJAX, if these polygons need to be drawn on the map. Looking ahead, I’ll say that in this way the data size was reduced by 10 times, and ajax requests began to “fly”.
Therefore, I decided to simplify the polygons, since PostGis has a function for this
After playing with the test data, I chose the smoothing parameter equal to 0.001.
The request is as follows:
Compare the “before” and “after” themselves:
PostGis has different functions for clarifying the relationship between polygons, but I came to the conclusion that it is best to use the following idea: a territory is considered a child if the child territory is more than half the parent territory.
In the formulaic form, it looks like this: [intersection area] / [area of the subsidiary territory]> 0.5
In terms of PostGis functions, it looks like this:
Next, I wrote some similar stored procedures to find the right parent of a given territory.
Here is an example of finding the parent region for the region:
To get the ID of the region that this region belongs to, you need to run the polygons of this region through this function:
If the received ID is greater than zero, then you can save the received region as the parent region of this area.
For cities, it’s even simpler, since the city is set only by the coordinate of the center - we use the function of checking the full entry
I hope this article will be useful for other lovers of surveying.
In this article, I will describe how I solved this problem using KML border-framing files and Postgres + Postgis.
The fact is that the database that we use for our project is not commercial (user generated, open source) and there are errors in it. For example, the most common case is that many cities are assigned to a country, but do not belong to any of its regions and regions, we call them orphaned cities .
Plus, our business is tourism, so administrative and political fragmentation of countries is not always suitable, sometimes no, no, and you have to add tourist regions manually. For example, there is no such administrative region as “Southern Coast of Crimea”, but there is such a tourist region where tourists choose where to go - they look for “houses in the South Coast” rather than “houses in Yalta, Gaspra, Gurzuf and generally somewhere there ".
The question is how to automatically find such an administrative region of the parent (Crimea) and ask all the children included in it (cities like Yalta and Sudak).
By the way: our country consists of regions, regions consist of regions, regions of subregions. Just like the death of Koscheev ...
To solve this problem, it was decided to use data with the borders of the regions, and to compare parents and children.
Preparing data
A database of the borders of the regions of the part of the world you are interested in can be found on the Internet. I used files for Google Earth - either KML (XML with coordinates) or KMZ (KML in the zip archive), they are convenient to watch directly in Google Earth.
KMZ could not be automatically turned into KML - Unicode characters in toponymy names break down, so I opened KMZ files in Google Earth and saved it as KML. Of course, “it is unworthy of a talented person to spend hours, like a slave, on calculations, which, of course, could be entrusted to any person, if one were to use a machine”, but if there were more files, he would definitely look for an automation tool.
The second way, if you need custom areas like the South Coast, is hand-drawn in Google Earth and saved in a KML file.
In the data files that I had, there were distinct regions of each region, but there were no relationships between parents and children. For example, a separate file “all regions of the country”, and separately - “all regions of the country”. In principle, this is the task that we need to solve, but for this we now have all the data.
We are preparing the platform
At first I wanted to use MySQL, fortunately they have support for spatial data types and operations on them, but then it turned out that these very operations were implemented in a very simplified form and were unsuitable for real tasks. A simplified view means that MySQL only works with MBR, the rectangle that describes the polygon. It turns out that instead of the beautiful borders of Crimea, operations will be carried out with a blunt rectangle around it, and the calculation of the real area would be replaced simply by the area of this rectangle (which will be used to solve this problem). This makes MySQL unsuitable for this task.
We use Postgres and install the PostGis library on it to work with geographic data types.
Download data
I have prepared a few tables -
country
, region
, area
, subarea
. For each of them, tables were also made with the suffix " _boundary
", since the same region can have more than one border (for example, islands). The data itself can be loaded in 2 ways:
- being unfamiliar with the PostGis arsenal, I wrote my KML handler - in fact, XML nodes with coordinates were parsed and turned into SQL polygons,
- alternative to manual work - a function built into PostGis
ST_GeomFromKML
Clean data
The resulting borders came out too detailed, with a large number of points. This means longer processing, as well as more data to send to the client in the browser via AJAX, if these polygons need to be drawn on the map. Looking ahead, I’ll say that in this way the data size was reduced by 10 times, and ajax requests began to “fly”.
Therefore, I decided to simplify the polygons, since PostGis has a function for this
ST_Simplify
- it needs to give the polygon and the smoothing value to the input. After playing with the test data, I chose the smoothing parameter equal to 0.001.
The request is as follows:
UPDATE "subarea_boundary"
SET "path" = ST_Simplify("path", 0.001);
Compare the “before” and “after” themselves:
We select a parent
PostGis has different functions for clarifying the relationship between polygons, but I came to the conclusion that it is best to use the following idea: a territory is considered a child if the child territory is more than half the parent territory.
In the formulaic form, it looks like this: [intersection area] / [area of the subsidiary territory]> 0.5
In terms of PostGis functions, it looks like this:
ST_Area(ST_Intersection(region_path, area_path)) / ST_Area(area_path) > 0.5
Next, I wrote some similar stored procedures to find the right parent of a given territory.
Here is an example of finding the parent region for the region:
CREATE OR REPLACE FUNCTION "SuggestRegion" ("area_path" geometry) RETURNS integer AS
'DECLARE
admId integer := 0;
BEGIN
SELECT INTO admId "parent_id"
FROM "region_boundary"
WHERE ST_Area(ST_Intersection(region_path, area_path))/ST_Area(area_path) > 0.5
LIMIT 1;
RETURN admId;
END;'
LANGUAGE "plpgsql" COST 100
VOLATILE
RETURNS NULL ON NULL INPUT
SECURITY INVOKER
To get the ID of the region that this region belongs to, you need to run the polygons of this region through this function:
SELECT "SuggestRegion" ("path") AS parent_id
FROM area_boundary
WHERE area_id = XXX
LIMIT 1;
If the received ID is greater than zero, then you can save the received region as the parent region of this area.
For cities, it’s even simpler, since the city is set only by the coordinate of the center - we use the function of checking the full entry
ST_Within
. I hope this article will be useful for other lovers of surveying.