All-Russian Population Census: how your data is processed
I have been working on the recognition and processing of population and agricultural census data since 2000. This is a case where you spend more than a year writing software that has to run only once, but without errors.
Why? In 2010, 500 thousand people and another 10 thousand IT users in all constituent entities of the Russian Federation took part in the All-Russian Population Census. A scanner takes 150 sheets per minute, and recognition runs in real time at roughly the same speed. Multiply that by the number of scanners in the country and you get a data stream in which any bug immediately ruins the work of a huge number of people.
The second point: together with the Research Institute of Statistics, we are doing research on data recovery algorithms.
How the census works
In an all-Russian population census, about half a million people (mostly students) visit everyone in the country. The task is to reach every person and ask a series of questions; the answers are recorded on paper, on a special machine-readable form. In an agricultural census fewer people are sent out, but it is still a lot. Here, for example, is the standard briefcase an agricultural census taker carries around his area:
Next, tens of millions of tables have to be produced from these forms, each holding specific data on areas that matter to agencies at different levels.
That is, the procedure is as follows:
• Prepare lists of the objects to be surveyed and split them into areas for the enumerators.
• Collect the data physically, on foot.
• Load the machine-readable forms into a high-speed feed scanner that flips through them quickly and gently.
• Recognize whatever can be recognized (and this, mind you, is handwriting).
• Run several correction passes on what was not recognized, so that an operator can enter the remaining data from the forms by hand.
• Check the data once more for logical consistency (a grandfather cannot be younger than his son, and so on).
• Assemble a single database for the whole country.
• If needed, load this database into the analytics system so that the customer can build non-standard reports on their own and slice a vast number of reports out of it.
• Send scans of the paper forms for storage by secure courier mail.
• Arrange on-site storage of the paper forms.
Many census participants see a computer for the first time in their lives (I'm not exaggerating: we have seen people wrestle with a mouse for lack of habit, and much more has happened in the villages). On top of that, not everyone fully understands the census procedure, which has many non-trivial operations. Naturally, this sharply increases the load on support, which is extremely undesirable on peak days. So, even though nobody asked us to, we recorded a 40-minute training video that explains step by step how to conduct the census. Here is a short excerpt from 2004 (as pirated discs used to say, “voiced by professional programmers”):
On the other hand, in agricultural censuses the interviewing was done by former agronomists and cooperative chairpersons. They know the subject firsthand and care about the result, because they have used this kind of data in their own work many times. These people are a pleasure to work with. They often do not know which end of a computer is which either, but they are not afraid to ask questions and learn. They also have a property that is hugely important for data integrity: they can tell at a glance how many pigs a grandmother keeps and whether she has hidden one from the census taker. Speaking of deep subject knowledge: not all testers knew that several hectares of hemp are grown in one of the regions. It is a highly valuable strategic raw material for medicine and light industry.
For the next agricultural censuses, the customer wants to get rid of paper altogether: hand out tablets to the enumerators so that data is entered into them right away. There are, of course, particulars around personal data: the solution has to prevent leaks even if the device is rooted, but all of that is solvable.
I'll start a little from the end. Given the size of the database, a suitable solution is Microsoft SQL + Microsoft OLAP. When we started using MS OLAP for report generation, we had very little experience with it, but we had faith in ourselves and the will to win. We never regretted the choice afterwards. There are only a few Microsoft OLAP projects of this scale in the world. Naturally, we stepped on plenty of rakes and ran into errors that tests could not catch: the developers simply did not have a live database of that volume, nor a couple of powerful data centers on the side to crunch it. The data center, by the way, is Rosstat's.
All primary data is processed locally and checked for completeness and consistency. It then travels to the data center in Moscow in two ways:
- Processed digital data: over VPN from the operators' workstations.
- Scans of the paper originals: by courier mail. Everything is loaded from the disks into the database once it arrives here. Physically, all of this is kept in secure premises, and a courier service of this class is designed even for shipping top-secret documents.
So we receive about 6 TB of raw data for processing, which yields a database of just under 500 GB. At this level the data has to be restored to a representative state. For example, a district had about 2 thousand people who took part in the census and 15 “refuseniks” who could not be found or could not be reached for other reasons. It is logical to assume that statistically (and we are only interested in large numbers) they will, on average, match the other residents of the district. This is a very simplified example of how data is restored. In practice, together with the research institute, we confirmed the following hypothesis in a series of experiments: if you take a sufficiently large array of fully completed answers (real census data from past years), randomly delete up to 10% of the answers, and then restore the data,
Many techniques are used, starting with a database search for similar profiles (for example, if we know the sex and age structure of a farmer's family that was not surveyed, the algorithm looks for similar families in regions with similar conditions and relies on them), and so on. In practice, a ready-made mechanism for running such algorithms exists only on our side. The research institute that works with the statistics cannot do it alone: it lacks the data center capacity to crunch databases this large.
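The "look for a similar family and rely on it" idea can be illustrated with a nearest-donor (hot-deck style) sketch. This is only an illustration under stated assumptions, not the algorithm actually used in the project: the `Household` fields, the `region_type` grouping and the `distance` weights are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Household:
    region_type: str        # hypothetical label for "regions with similar conditions"
    members: int            # family size
    head_age: int           # age of the head of the household
    pigs: Optional[int]     # None means the answer is missing


def distance(a: Household, b: Household) -> float:
    """Hypothetical similarity measure: smaller means more alike."""
    return abs(a.members - b.members) + abs(a.head_age - b.head_age) / 10


def impute_pigs(target: Household, donors: List[Household]) -> Optional[int]:
    """Borrow the missing value from the most similar surveyed household."""
    candidates = [
        d for d in donors
        if d.pigs is not None and d.region_type == target.region_type
    ]
    if not candidates:
        return None  # no suitable donor, leave the gap for another method
    best = min(candidates, key=lambda d: distance(target, d))
    return best.pigs


donors = [
    Household("steppe", 4, 50, 3),
    Household("steppe", 2, 70, 0),
    Household("forest", 4, 50, 9),
]
target = Household("steppe", 4, 52, None)
print(impute_pigs(target, donors))
```

A single imputed household means little on its own; the approach only makes sense for aggregate statistics, which matches the "large numbers" caveat in the text.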
Another important component of report processing is a specialized BI tool from our Australian colleagues who work with Big Data. Its key feature is privacy protection. The first layer makes it impossible to export a report that would let you get down to figures for an individual person. No matter how hard you try, the smallest internal unit of processing is 3 people. A second piece of analytics makes sure you cannot export a report whose matrix corresponds to another matrix with similar data. During the discussion of the protection scheme, crafty pentesters learned to subtract one matrix from another in order to extract specifics about people, so now a dedicated mechanism watches for that. The BI tool is called SuperStar.
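Both layers can be sketched in a few lines. This is a simplified illustration, not SuperStar's actual mechanism: the function names are hypothetical, and treating empty cells as safe to publish is an assumption made here for brevity.

```python
from typing import List, Optional

MIN_CELL = 3  # smallest group that may appear in a report, as described above


def suppress_small_cells(
    table: List[List[int]], min_cell: int = MIN_CELL
) -> List[List[Optional[int]]]:
    """Hide every non-empty cell that counts fewer than min_cell people."""
    return [
        [None if 0 < cell < min_cell else cell for cell in row]
        for row in table
    ]


def differencing_risk(
    released: List[List[int]], candidate: List[List[int]], min_cell: int = MIN_CELL
) -> bool:
    """True if subtracting one table from the other isolates a small group."""
    for row_a, row_b in zip(released, candidate):
        for a, b in zip(row_a, row_b):
            if 0 < abs(a - b) < min_cell:
                return True
    return False


released = [[10, 7], [5, 4]]
almost_the_same = [[9, 7], [5, 4]]   # differs by one person in one cell
print(suppress_small_cells([[10, 2], [0, 3]]))
print(differencing_risk(released, almost_the_same))
```

The second function shows why the first layer alone is not enough: both tables pass the small-cell rule, yet their difference describes a single person.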
Data in the region
Unlike elections, where residents come to the polls themselves (and if someone does not come, that's fine), in a census you have to visit everyone and collect the most complete data possible. So the student has collected the forms, filled them in as correctly as possible, checked them and brought them to the district center. From there they travel under police guard to the territorial statistical offices, where the scanner for machine-readable documents is located. After scanning, the paper goes into guarded storage.
The forms arrive tied to enumeration areas. For example: "here is a package, it covers 400 people, this is such-and-such village." The system of dividing the territory into accounting units was worked out back in the USSR and runs like clockwork.
Next comes the completeness check, a complex job that makes it possible, for example, to tell from the questionnaire of a grandfather with three grandchildren that those grandchildren must be recorded somewhere, and if they are not, something went wrong. This procedure, for instance, turned up a single camel in the Chelyabinsk region. The customer nearly went crazy, suspected a bug and asked us to check: someone there really does keep a camel. Filling errors are common too, such as two cows, five of them dairy. With tablets it will be easier, since many checks will happen at the UI level.
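Checks of this kind are easy to picture as small rules applied to each form. The sketch below is a minimal illustration; the record layout (`cows`, `dairy_cows`, `parent_id`, and so on) is hypothetical and does not reflect the real census forms.

```python
from typing import Dict, List


def check_farm(form: Dict[str, int]) -> List[str]:
    """A part cannot exceed the whole: dairy cows are a subset of cows."""
    errors = []
    if form["dairy_cows"] > form["cows"]:
        errors.append(
            f"dairy cows ({form['dairy_cows']}) exceed total cows ({form['cows']})"
        )
    return errors


def check_household(persons: List[dict]) -> List[str]:
    """Every referenced parent must exist and be older than the child."""
    by_id = {p["id"]: p for p in persons}
    errors = []
    for p in persons:
        parent_id = p.get("parent_id")
        if parent_id is None:
            continue
        parent = by_id.get(parent_id)
        if parent is None:
            errors.append(f"person {p['id']}: parent {parent_id} not found")
        elif parent["age"] <= p["age"]:
            errors.append(f"person {p['id']}: parent {parent_id} is not older")
    return errors


print(check_farm({"cows": 2, "dairy_cows": 5}))
print(check_household([
    {"id": 1, "age": 70, "parent_id": None},
    {"id": 2, "age": 75, "parent_id": 1},   # older than own parent
    {"id": 3, "age": 10, "parent_id": 9},   # parent is missing from the data
]))
```

A rare but valid answer, such as the lone camel, would pass rules like these; flagging unusual values for human review is a separate step.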
The input system is one of the interesting parts. At first we used Russian-made industrial scanners, as in the photo, but in the last census foreign ones were used. They process 150 sheets per minute. Common practice worldwide is to pass the scans on to a recognition machine and then to a verification machine. Three machines are a wild luxury, so we build a single hardware and software system (PAK) where, right during scanning, the operator sees the data on screen and can correct whatever the system failed to “chew”.
Naturally, the greatest difficulty at this stage comes from the variety of handwriting. Fortunately, we have a lot of reference information: machine-readable forms carry many marks that let us determine precisely the orientation of the text, its position on the page, and so on. We know where digits are expected and where the name of a village goes, which reduces the number of hypotheses. That is why we were able to feed the recognizer not only more or less block-printed digits but also many handwriting samples. During the first censuses we built a database of the most common handwriting features and managed to recognize the vast majority of handwritten text on our forms.
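How knowing the field type cuts down the hypotheses can be shown with a toy example. It assumes a recognizer that returns scored candidates for each character; the field types, alphabets and scores below are hypothetical, and Latin letters stand in for the real form alphabet.

```python
import string
from typing import Dict, Optional

# Hypothetical alphabets allowed in each kind of field on the form.
FIELD_ALPHABETS = {
    "digits": set(string.digits),
    "text": set(string.ascii_uppercase + " -"),
}


def best_char(candidates: Dict[str, float], field_type: str) -> Optional[str]:
    """Pick the highest-scoring candidate that is legal for this field."""
    allowed = FIELD_ALPHABETS[field_type]
    filtered = {ch: score for ch, score in candidates.items() if ch in allowed}
    if not filtered:
        return None  # nothing plausible: hand the field to the operator
    return max(filtered, key=filtered.get)


# The recognizer cannot decide between the letter O and the digit zero.
candidates = {"O": 0.6, "0": 0.4}
print(best_char(candidates, "digits"))
print(best_char(candidates, "text"))
```

The same ambiguous glyph resolves differently depending on where on the form it was written, which is the effect the marks on the form make possible.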
The “help the robots” training screen: fewer loops, strokes as continuous as possible, do not trace over digits a second time, try to stay inside the field. Bad samples still occur, but after the training there are far fewer of them.
As a result, very little, well under one percent, has to be corrected by hand. Poorly recognized documents are gathered in a separate database, which operators then work through.
Then comes another check, this time a physical one. By weight there should be a kilogram of documents, but 20 sheets are missing. Did you leave them under the table?
Then comes formal logical control and the linking of related records.
Only after that is the data sent.
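The weight check comes down to simple arithmetic. The sketch below assumes a sheet weighs about 5 g (roughly an A4 sheet of 80 g/m² paper) and allows a small tolerance; both values and the function name are assumptions made for illustration, not figures from the project.

```python
def missing_sheets(
    expected_sheets: int,
    measured_grams: float,
    sheet_grams: float = 5.0,     # assumed weight of one form
    tolerance_sheets: int = 2,    # assumed allowance for scale error
) -> int:
    """Estimate how many sheets are missing from a package by its weight."""
    estimated = round(measured_grams / sheet_grams)
    shortfall = expected_sheets - estimated
    return shortfall if abs(shortfall) > tolerance_sheets else 0


# 200 sheets should weigh about a kilogram; the package weighs only 900 g.
print(missing_sheets(200, 900.0))
```

A negative result would mean the package is heavier than expected, for example because forms from another area were mixed in.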
By automating almost every step, we have cut the number of staff required very significantly. For example, even the route sheet is compiled automatically, which optimizes the time needed to walk around the area.
Staff is the most expensive part of such projects; even 5-7 days of a Tier III data center cost pennies by comparison.
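The text does not say how the route sheets are optimized, so the following is only a sketch of one common heuristic, nearest neighbour, over hypothetical house coordinates. It is fast and simple but does not guarantee the shortest possible route.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def nearest_neighbour_route(start: Point, houses: List[Point]) -> List[Point]:
    """Always walk to the closest house that has not been visited yet."""
    remaining = list(houses)
    route = []
    current = start
    while remaining:
        nxt = min(remaining, key=lambda h: math.dist(current, h))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    return route


def route_length(start: Point, route: List[Point]) -> float:
    """Total walking distance along the route, beginning at start."""
    total = 0.0
    current = start
    for house in route:
        total += math.dist(current, house)
        current = house
    return total


start = (0.0, 0.0)
houses = [(5.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
route = nearest_neighbour_route(start, houses)
print(route, route_length(start, route))
```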
The way tasks are set on such projects is very, very unusual. The customer understands their own domain perfectly and is ready to explain it, but does not think in terms of software development. The first time we received a 700-page brick: an almost literary text serving as the technical specification, which an analyst turned into requirements. From the second time on, the customer had a better idea of how to explain things to us, and we came to understand the subject in depth along with their jargon. Practice shows that if you bring in, say, a lead tester after the task has been received rather than before, then sooner or later they will trip over not knowing the specifics. We are valued highly for our deep knowledge of the subject, and that is the key to building solutions like this.
We crunch a pile of data in a short time. There is no chance to repeat the procedure, so huge budgets go into testing. We even recruit specially trained retired collective farmers whose job is to be as difficult as possible. We cope. We understand that census participants are professionals in their own field and that it is perfectly normal for them not to work with IT. We make very simple interfaces. We think about the usability of recognition and verification solutions. We save many people time and nerves. It is difficult and very interesting.
The next All-Russian Agricultural Census will be in 2016. The All-Russian Population Census is scheduled for 2020. For professional questions, you can write to me at ICherepov@croc.ru or right here in the comments.