Open data from the developer's point of view
While working on a mobile application that uses open data, I had to get closely acquainted with the contents of a number of portals. Out of that experience came some proposals for improving the "inner world" of open data portals in the developer's interest.
If this topic interests you and you already have experience in the field, you can compare your own conclusions with what is written below.
At the heart of working with any portal is the dataset passport. If you want to access a dataset, you find its passport and extract the set's name, its number, a link to the set, and a description of its fields.
That all looks logical from the point of view of a card file maintained by hand, but from the point of view of an application it is not enough, because the contents of a dataset cannot be discovered programmatically.
The developer first has to figure out what kind of set it is, where it is located, and what format its data comes in.
The passport of a set should help the application understand what the set contains.
This is achievable. All it takes is an online register of all Russian open data portals, a unified numbering structure (an ID) for datasets, and a standardized approach to naming dataset fields and defining their content.
1. Register of Russian open data portals.
A single place online where all open data portals are listed (a registry).
Today we look for links to portals with a search engine, and only after finding a portal do we get acquainted with its contents. A site or page listing all of Russia's open data portals with links to them does not exist yet (or at least is unknown to me).
2. A unified structure for the dataset number (or ID).
So you arrive at a portal (found through the registry from item 1) and want to understand what is published on it. Each set is identified by its name and its number / ID.
Today everyone numbers sets however they like: on one portal the IDs are numbers, on another they are words, on a third they are whole sentences. On some portals the publisher's TIN is included in the number or name of the dataset; that is already something, since you can extract the region from it (if you realize it is a TIN), but it is not enough.
Sets are grouped into categories (not everywhere yet), which is perfectly logical. But every portal implements its own directory of categories. Portals also differ in their level of coverage: there are federal, regional, city and even village ones.
As a result, if you want to find and use information from different portals in a single application, you end up inventing your own numbering for portals, sets and categories.
Why not give the dataset number / ID some meaning, so that the contents of a dataset can be understood programmatically?
To do this, it is enough to include in the set's number (its ID):
ID1 - a unique portal number (tied to a federal / regional / city code ...), that is, the portal's ID in a unified Russian classification,
ID2 - the code of the information category the set belongs to (which means a unified directory of categories has to be developed and approved),
ID3 - the set's own sequential number on the portal.
Any department that wants to set up an open data portal submits an application to the responsible state agency and receives its ID1 (the portal code) along with the unified directory of categories. As a result, every dataset on the portal gets a number of the form:
ID1-ID2-ID3, and the developer gets a ready-made mechanism for quickly finding the datasets they need on any portal.
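Purely as an illustration, here is a minimal Swift sketch of how an application could represent and parse such a composite ID. No such standard exists today; the "-" separator and all sample codes below are my own assumptions.

```swift
/// A sketch of the proposed composite dataset identifier.
/// The "-" separator and the sample codes are illustrative assumptions only.
struct DatasetID {
    let portalCode: String   // ID1: the portal's number in a unified national registry
    let categoryCode: String // ID2: a code from a unified directory of categories
    let localNumber: String  // ID3: the set's own number on its portal

    /// Parses a string such as "7700-12-0345" into its three components.
    init?(_ raw: String) {
        let parts = raw.split(separator: "-")
        guard parts.count == 3 else { return nil }
        portalCode   = String(parts[0])
        categoryCode = String(parts[1])
        localNumber  = String(parts[2])
    }
}

// With IDs like these, finding all sets of one category across portals is trivial:
let ids = ["7700-12-0345", "5000-12-0012", "7700-03-0101"].compactMap(DatasetID.init)
let sameCategory = ids.filter { $0.categoryCode == "12" } // "12" is a made-up category code
```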
There would be no need for every city to build its own separate wonderful application implementing a unique service. Switch the portal link depending on the user's geolocation and use the same application in any region: it will easily find the right portal and category, and, if needed, just as easily pull in everything available in the selected category for any city, region or the whole country.
3. A standardized approach to naming dataset fields and defining their content.
So you have found the right set. Now you need to deal with its internal structure.
And here, too, everyone currently does it their own way.
- Coordinate fields (geographical coordinates). Some sets call them latitude and longitude (two separate fields), some use a single geo-coordinates field holding "(x, y)", some store them as an array, and some as a dictionary. Several of these variants can easily coexist on one portal (RosTourism, the Moscow Government). The developer would be glad to see a single way of storing geo-coordinates on all portals; any of them, as long as it suits everyone. And the set's passport should note not only that coordinates are present, but also their type: point, line or region. (A normalization sketch in Swift follows after this list.)
- Content fields. They can contain anything except embedded HTML pages, which is what the Ministry of Culture portal is filled with today. The developer needs information in machine-readable form, as the governing documents put it, not HTML pages that then have to be parsed and searched for the actual data. The developer is not trying to replicate someone's website; they are building their own application. And do not forget about traffic, which grows dramatically when information arrives as HTML dragging markup and fonts along with it. A mobile application is a way of avoiding HTML and getting "bare" information, not just another way of displaying web pages. (A crude tag-stripping sketch follows after this list.)
- Links to images. Today this is another playground for the imagination of portal owners. It goes as far as storing links to PDF files containing the pictures (the Ministry of Culture), and without indicating the real file extension. A developer who sees a link to a picture expects to find anything there but a PDF file. (A content-type check is sketched after this list.)
- Links to documents. Standardization is needed here as well.
- Photo sizes. Unfortunately, photos are almost always published exactly as they were shot, with no thought given to the fact that web and mobile applications do not need them in full resolution; whatever is on hand gets uploaded. A very illustrative example is the directory of employees on the Moscow Government portal. Take a look: everything from tiny photos scanned from documents to huge shots taken in office interiors. It would be nice to have a standard here too.
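Here is the coordinate normalization sketch promised above: a rough Swift function that reduces the variants met in the wild to one type. The field names ("latitude", "geoData", "coordinates") are examples I chose for illustration, not names any portal is guaranteed to use.

```swift
import CoreLocation

// A sketch: reduce the coordinate variants met in the wild to one type.
// The field names below are illustrative assumptions, not a real standard.
func coordinate(from record: [String: Any]) -> CLLocationCoordinate2D? {
    // Variant 1: two separate numeric fields.
    if let lat = record["latitude"] as? Double,
       let lon = record["longitude"] as? Double {
        return CLLocationCoordinate2D(latitude: lat, longitude: lon)
    }
    // Variant 2: a GeoJSON-style dictionary with a [longitude, latitude] array.
    if let geo = record["geoData"] as? [String: Any],
       let pair = geo["coordinates"] as? [Double], pair.count == 2 {
        return CLLocationCoordinate2D(latitude: pair[1], longitude: pair[0])
    }
    // Variant 3: a single string such as "(55.75, 37.61)".
    if let text = record["coordinates"] as? String {
        let numbers = text
            .split(whereSeparator: { "(), ".contains($0) })
            .compactMap { Double(String($0)) }
        if numbers.count == 2 {
            return CLLocationCoordinate2D(latitude: numbers[0], longitude: numbers[1])
        }
    }
    return nil
}
```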
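And, just to show the extra work HTML-filled content fields push onto the client, a crude sketch of reducing such a field to plain text. A regular-expression strip like this is nowhere near a real HTML parser, which is exactly the point: the client should not have to do this at all.

```swift
import Foundation

// A crude sketch: detect that a field actually contains HTML and strip the tags.
// Anything non-trivial would need a real HTML parser.
func plainText(from field: String) -> String {
    guard field.contains("<"), field.contains(">") else { return field }
    return field
        .replacingOccurrences(of: "<[^>]+>", with: " ", options: .regularExpression)
        .replacingOccurrences(of: "&nbsp;", with: " ")
        .trimmingCharacters(in: .whitespacesAndNewlines)
}
```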
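Finally, the content-type check mentioned for image links: one defensive option is to ask the server what is really behind the link before treating it as a picture. This sketch only trusts the reported Content-Type header; the function name and usage are mine.

```swift
import Foundation

// A sketch: issue a HEAD request and trust the reported Content-Type
// rather than the promise that the link points to a picture.
func verifyImageLink(_ url: URL, completion: @escaping (Bool) -> Void) {
    var request = URLRequest(url: url)
    request.httpMethod = "HEAD"
    URLSession.shared.dataTask(with: request) { _, response, _ in
        let mime = response?.mimeType ?? ""
        completion(mime.hasPrefix("image/")) // false for "application/pdf" and friends
    }.resume()
}
```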
4. Data availability.
Sometimes portals go down for maintenance, and nobody is told about it. Obviously that should not happen. But while it lasts, an application built on open data stops working and absorbs all the anger of its users, who have no way of knowing that the application is not at fault: to them it is simply "buggy". Because of this, users abandon the application or leave bad reviews. On the Moscow Government portal, for example, this happens periodically on Saturdays.
Portals should be required to add an Open / Closed status request to their APIs. And if the portal also returns the estimated date when it expects to be back up ... then we can live with that.
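The status endpoint sketched below does not exist on any portal today; it is only meant to show how a client could use the proposed Open / Closed call, including an optional reopening date, instead of surprising the user with cryptic errors.

```swift
import Foundation

// A sketch of the client side of the proposed (hypothetical) status call.
struct PortalStatus: Decodable {
    let open: Bool
    let reopensAt: Date?   // estimated end of maintenance, if the portal reports one
}

func checkPortal(_ baseURL: URL, completion: @escaping (PortalStatus?) -> Void) {
    let statusURL = baseURL.appendingPathComponent("status") // hypothetical endpoint
    URLSession.shared.dataTask(with: statusURL) { data, _, _ in
        let decoder = JSONDecoder()
        decoder.dateDecodingStrategy = .iso8601
        completion(data.flatMap { try? decoder.decode(PortalStatus.self, from: $0) })
    }.resume()
}
```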
5. Data quality.
Portal data quality suffers from two kinds of problems: errors in the data itself and errors in the structure of the exported data.
Spelling mistakes in field names and content, such as a misspelled "photograph", we can somehow live with.
Errors in the data.
It is bad when, for more than a year, the metro entrances / exits dataset on the Moscow Government portal contains whole metro stations instead of individual entrances. Or when the "public transport stops" dataset implies that the bus on route XX stops at only one stop, and where the rest of its route runs is anyone's guess. Or when the surname is written into the middle name field and the middle name into the surname field (the Ministry of Culture portal). There is redundancy too: why have separate fields for first name, surname and middle name if the same set already contains a full name field? Trivia such as swapped latitude and longitude we will not even count yet.
Errors in the data structure.
This group of errors appears when data is exported to CSV and has to do with the delimiters that format uses. It is very common to find data using "," as the separator while the same comma also appears inside the field values; as you know, such a line cannot be split into fields correctly, and the same goes for line breaks inside a field. A related situation is when some lines are simply short of delimiters: there are not enough of them. Apparently a nicely formatted XLS file is taken and dumped onto the portal as is, headers and all. You just cannot do that.
Because of such errors, a CSV file is a quiet horror for the developer, a nightmare. Try explaining to the user that the received data cannot be understood because its structure is broken. Yet this format still dominates.
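For reference, here is a compact sketch of the kind of parser the client ends up writing instead of a naive split by commas and newlines. It follows the usual CSV quoting rules (commas and line breaks inside double-quoted fields, doubled quotes as an escape); it will not, of course, rescue files whose delimiters were simply lost on export.

```swift
// A sketch of a forgiving CSV parser: handles commas and line breaks
// inside quoted fields and "" as an escaped quote.
func parseCSV(_ text: String, delimiter: Character = ",") -> [[String]] {
    var rows: [[String]] = []
    var row: [String] = []
    var field = ""
    var insideQuotes = false
    var iterator = text.makeIterator()
    var pending: Character? = nil

    func endField() { row.append(field); field = "" }
    func endRow()   { endField(); rows.append(row); row = [] }

    while let ch = pending ?? iterator.next() {
        pending = nil
        if insideQuotes {
            if ch == "\"" {
                // A doubled quote is an escaped quote; anything else closes the field.
                if let next = iterator.next() {
                    if next == "\"" {
                        field.append("\"")
                    } else {
                        insideQuotes = false
                        pending = next      // re-process this character as unquoted
                    }
                } else {
                    insideQuotes = false
                }
            } else {
                field.append(ch)
            }
        } else {
            switch ch {
            case "\"":      insideQuotes = true
            case delimiter: endField()
            case "\n":      endRow()
            case "\r":      break            // tolerate CRLF line endings
            default:        field.append(ch)
            }
        }
    }
    if !field.isEmpty || !row.isEmpty { endRow() }
    return rows
}

// "a,b" stays one field; the embedded line break stays inside the second field.
let rows = parseCSV("name,note\n\"a,b\",\"line one\nline two\"")
```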
And the most unpleasant errors, which I am not even sure how to classify, are when the developer documentation for building requests on the Moscow Region Government portal says you should get JSON, but the answer arrives as CSV. You have to build a check into the code right away and choose how to process the response based on what actually arrived (CSV or JSON), not on what the API description promises. Reality, not the documentation, determines how the application behaves.
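A sketch of that defensive check: classify the payload by what actually arrived, not by what the documentation promises. The enum and function names are mine; the CSV branch just hands the raw text on to a CSV parser such as the one above.

```swift
import Foundation

// Classify a response body by its actual content.
enum PortalPayload {
    case json(Any)       // successfully parsed JSON object or array
    case csv(String)     // plausible CSV text, to be handed to a CSV parser
    case unknown(Data)   // neither; keep the raw bytes for diagnostics
}

func classify(_ data: Data) -> PortalPayload {
    if let object = try? JSONSerialization.jsonObject(with: data) {
        return .json(object)
    }
    if let text = String(data: data, encoding: .utf8),
       text.contains(",") || text.contains(";") {
        return .csv(text)
    }
    return .unknown(data)
}
```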
6. Data freshness. Data for 2014 - 2015 is there for everyone, but 2016 and beyond ... that is harder.
7. Organization of data storage and access.
Some portals use OpenData to query and return data, others MongoDB, and so on. For the developer, every new portal means yet another parser, and the application keeps growing. Instead of working on new functionality, you end up debugging the next variant of receiving data (request - response), even though the same scheme (passport - set) is there every time. It should not be like this.
What is needed is an agreed, single solution with a single API.
From the developer's side, the best option is a cloud portal from a single provider, built on a single DBMS with a single API, where any agency can get space for its open data portal and a universal tool for working with it. I would hope this is not hard to arrange for the regulatory body responsible for the country's open data program. As a bonus, it would gain real control over how the funds allocated to the program are spent.
It could easily be implemented on top of, say, Windows Azure. I am sure the advantages of this option in terms of cost, time to launch, reliability and cost of ownership are clear not only to developers.
And a little about where this experience comes from, that is, about the mobile application whose development led to the observations above.
The application is written for iOS and works with five open data portals: 1147 datasets in total, plus information from the Central Bank of Russia.
- the Moscow Government portal - 697 sets,
- the Moscow Region Government - 266 sets,
- the Ministry of Culture - 49 sets,
- the Federal Agency for Tourism of Russia - 135 sets,
- the Central Bank portal, which is not formally called an open data portal but in practice is one, since it provides the exchange rates it quotes for any period. That information is needed to convert ruble statistics published on open data portals into any other currency.