How to choose an address filter algorithm
Articles describing new algorithms for automatically parsing single-line addresses appear on Habr quite often, and various IT companies offer address processing services. In this article we explain how to use your own address base to select an automatic address parsing algorithm, and what to pay attention to when testing and developing address filter algorithms.
This article is for everyone who stores customer data and wants to solve one of the following problems:
- verify that an address exists, so that a parcel or letter is not sent to nowhere;
- break an address down into components, to better understand where sales are strongest;
- fill in missing address information, to optimize couriers' work plans;
- standardize addresses, to find duplicate records for the same client;
- update addresses and bring them to the directory format, to pass regulatory checks.
At first glance, automatically parsing mail addresses looks simple: just match words from the input line against an address directory (for example, FIAS). But everyone who takes it on quickly gets buried under a mass of address quirks...
What we know about addresses
First, let us introduce ourselves. We have been working on automated address parsing for more than nine years. During this time we have worked with both large companies and small firms, and we have accumulated a large collection of addresses covering the formats of customer data, which lets us see clearly how our ideas affect address processing quality in real systems.
Over the past year we have been developing a new version of the algorithm (we call it an address filter), aiming to settle the address parsing problem once and for all.
Define the task
We know three ways to obtain current, valid addresses:
- get a good address from the client right away (for example, using address hints);
- hire operators to parse addresses manually;
- parse the data automatically.
The first option is the best, but it does not help those who already have a large database of addresses of dubious quality.
The second option yields a high percentage of well-parsed addresses but, as our practice shows, is expensive and slow.
The third option on its own will never reach the percentage of well-parsed addresses that the second achieves, but it is cheaper and much faster.
We recommend combining the second and third options:
- parse addresses automatically, producing a parsing quality indicator for each address;
- send addresses with a good quality indicator into business processes, and route those with a bad one to operators for manual review.
This way you get a high percentage of parsed addresses at an acceptable cost.
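The hybrid workflow above can be sketched in a few lines. This is a minimal illustration, not a real API: the `parse_address` stub, the quality flag, and the queue names are all hypothetical.

```python
from typing import NamedTuple

class ParsedAddress(NamedTuple):
    standardized: str
    quality_good: bool  # the parser's confidence flag (the "quality code")

def parse_address(raw: str) -> ParsedAddress:
    # Stand-in for a real parsing algorithm: here we merely pretend
    # that non-empty strings without a "?" parse confidently.
    ok = raw.strip() != "" and "?" not in raw
    return ParsedAddress(standardized=raw.strip(), quality_good=ok)

def route(addresses):
    """Split addresses: confident parses go straight into business
    processes; the rest are queued for manual review by operators."""
    auto, manual = [], []
    for raw in addresses:
        parsed = parse_address(raw)
        (auto if parsed.quality_good else manual).append(parsed.standardized)
    return auto, manual
```

In a real system the manual queue would also carry the parser's intermediate results, so operators fix a parse rather than start from scratch.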
If you choose this hybrid option, or decide to parse addresses only automatically, you will need to pick the right algorithm for automatic parsing. Below we explain how.
To select an algorithm, you need to compare the results of processing some volume of addresses with different algorithms. It seems logical to take part of the addresses from real data and supplement them with cosmetically distorted copies, to check what percentage of addresses with errors and typos is recognized correctly.
The first misconception: automatically correcting any typo is good
Most of our customers who first encounter automatic address parsing, and we ourselves at first, assume that typo correction is the main thing any self-respecting algorithm must do.
Later we realized that typo correction only looks impressive during demos, when, instead of testing the algorithm on their own addresses, customers invent outlandish cases and admire transformations like “Ikhonravova, Maskva, Yubileinoy, M.K.” into “Moscow Region, Yubileiny, Tikhonravova Street”. In production, this functionality is not merely unused: it actively harms work with the main address base.
Our research shows that source addresses in corporate systems rarely contain typos: among all our customers, fewer than 5% of systems have more than 2% of addresses with typos. Moreover, most typos (about 95% of them) are systematic, that is, either a common misspelling, such as Maskva instead of Moskva, or a fixed pattern in how the street is written, such as pr. 3rd Mytishchinskaya >>> st. 3rd Mytishchinskaya. Such typos can be described by a finite set of rules that corrects them.
Why is typo correction bad in the general case? By correcting every typo via n-grams, Levenshtein distance, and the like, the algorithm tries to force the address onto the directory, with a high chance of producing something completely different from what the source address meant. In addition, the source address may contain extra information that is absent from the directory: a company name, a business center, directions from the metro, and so on. A typo-correcting algorithm will very likely treat these additions as ordinary address components.
Over nine years of work we concluded that typos should be corrected only by rules that guarantee the typo can be resolved to exactly one correct variant.
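A rule-based corrector in this spirit can be sketched as a whitelist of unambiguous rewrites. The rules below are illustrative examples, not our real rule set.

```python
# Each rule maps a known systematic typo to its single valid correction.
# Only rewrites with exactly one possible target belong in this table;
# anything ambiguous must NOT be added, per the principle above.
RULES = {
    "maskva": "moskva",
    "pr. 3rd mytishchinskaya": "st. 3rd mytishchinskaya",
}

def apply_rules(address: str) -> str:
    """Lowercase the address and apply only whitelisted rewrites;
    tokens not covered by any rule are left untouched, so the
    corrector can never drag the address to a wrong directory entry."""
    result = address.lower()
    for typo, fix in RULES.items():
        result = result.replace(typo, fix)
    return result
```

The key property is that the function is conservative: an unknown misspelling stays as-is and later gets a poor quality code, instead of being "fixed" into the wrong address.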
Thus, we advise testing algorithms only on real data, without artificial distortions. For example, if your database contains the address Moscow Pushkin 13, use it as-is, not Maskva Pushikino 13.
Typos should be treated with caution. The worst possible outcome of the typo-correction logic described above is incorrectly parsed addresses carrying a good quality code.
The second misconception: the percentage of well-parsed addresses is the main criterion for choosing a filter (apart from cost, of course)
Any automatic address parsing algorithm accepts an address as input and outputs it in standardized form. It usually also returns a flag indicating whether the algorithm is confident in its parse. This flag is commonly called a quality code.
For our customers, addresses with a good quality code go into business processes automatically, while those with a poor quality code are sent for manual parsing. The higher the percentage of addresses with a good quality code, the more the customer saves on manual processing.
It might seem, then, that the main criterion for choosing an algorithm is the percentage of addresses with a good quality code.
But one important point is often forgotten: it is much cheaper to manually bring an address with a bad quality code to a good one than to fix the downstream consequences of incorrectly recognized addresses that carry a good quality code.
For example, we are currently developing a real estate valuation system in which the cost per square meter is known for each house and is used to assess a client's solvency when issuing a loan. The system automatically analyzes new apartment-sale listings online, standardizes their addresses, and adjusts the average cost in the directory. If many of the standardized addresses are parsed incorrectly yet carry a good quality code, the directory will accumulate errors, with values several times higher or lower than the real average cost. Such addresses are hard to find, yet they strongly harm business processes.
This is exactly why automatically correcting all typos is bad: the algorithm forces a knowingly bad address onto the directory with a good quality code, which inflates the reverse error, that is, the percentage of addresses with a good quality code that are standardized incorrectly.
Which addresses to test on
When comparing address filter algorithms, look not only at the percentage of addresses with a good quality code, but also at the percentage of incorrectly parsed addresses among them. It is best to prepare a sample of your own addresses that includes the high-risk cases, namely:
- Addresses with typos or an incorrectly specified address component (for example, 3rd Mytishchi instead of 3rd Mytishchinskaya).
- Ambiguous addresses, where the source data alone does not allow an unambiguous reading, even by an operator. For example, a missing or incorrectly specified address component: Moscow, Tverskaya may mean either Tverskaya Square or Tverskaya Street.
- An error in the type of an address component. According to our data, about 5% of customer addresses contain such an error: "village" instead of "urban-type settlement", "lane" instead of "dead end", and so on.
- An error in the component itself. Most often, people incorrectly specify:
- The district in which a settlement is located, when the settlement sits on the border of two districts. For example, in the address Moscow Region, Dmitrovsky District, Zaprudnya village, the district is wrong; the correct one is Taldomsky.
- The region in which the object is located. This especially affects Moscow and St. Petersburg addresses, for example:
- Leningrad Region, St. Petersburg, Fontanka
- Moscow Region, Moscow, st. Rastorgueva
- Moscow Region, Zelenograd, k. 3113
- A mistake in a name, rank, or another common confusion. For example, instead of Aleksey Tolstoy Street they write Leo Tolstoy, or instead of Marshal Zhukov they write General Zhukov. Sometimes an outdated or purely local name, known only to residents, is given instead.
- Duplicated words in the source line. After many conversions and migrations between systems, addresses with duplicated components appear. For example, one conference listed the address: Moscow, Moscow, Moscow, Moscow, Moscow, Leningradsky Prospekt, 39, p. 79. Obviously the word Moscow was repeated by mistake, and the algorithm should ignore the duplicates. But can duplicates always be deleted? Another example: Sakhalinskaya, Yuzhno-Sakhalinsk, Sakhalinskaya actually means Sakhalin Region, Yuzhno-Sakhalinsk, Sakhalinskaya Street. A good algorithm removes duplicates only when they really are duplicates and do not divert the address to a wrong parse.
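One conservative way to handle repeats, sketched under assumptions: drop a repeated token only when it cannot play more distinct address roles than the number of copies present. The `levels_of` mapping (token to the set of address levels it may denote in the directory) is a hypothetical stand-in for a directory lookup.

```python
def drop_safe_duplicates(tokens, levels_of):
    """levels_of maps a lowercased token to the set of address levels it
    may denote (e.g. {"region", "street"}). A repeat is dropped only when
    the token has a single possible role, so the extra copies cannot be
    meaningful (as with five copies of "Moscow"). A token like
    "Sakhalinskaya", which can be both a region and a street, keeps
    two copies."""
    result, counts = [], {}
    for tok in tokens:
        key = tok.lower()
        counts[key] = counts.get(key, 0) + 1
        if counts[key] > len(levels_of.get(key, {key})):
            continue  # more copies than possible roles: a true duplicate
        result.append(tok)
    return result
```

A production filter would go further and check that each retained copy actually fits a distinct slot in the final parse, but the asymmetry is the point: deleting is only safe when no alternative reading exists.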
- Garbage or extra information in the source data. Usually this is a person's name or details about the building and how to reach it. For example: Ivanov Ivan, Pirogova 20 to 1 total 8/1 room 313, Novosibirsk, NSO, or Moscow, Turchaninov, BC Krymsky bridge, house 6, 2, 2 minutes from the metro and a good salary. In such cases, any parts of the input line that are neither address components nor common auxiliary address information (such as a metro station or city district) should be sent to an operator for review, since they can affect parsing quality.
- Outdated addresses. These are addresses that are written differently today: streets get renamed, transferred to another settlement, merged, and so on. Given the two addresses Samara 13 passage and Samara George Ratner, it would be good to recognize that they are the same address. The algorithm should be able to update such an address, and assign it a good quality code only if the update succeeds.
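Updating outdated addresses boils down to a lookup against a rename history. The sketch below assumes a hypothetical `RENAMES` table of (city, old street) pairs; the entries are illustrative, not real directory data.

```python
# Hypothetical fragment of a rename history: old form -> current form.
RENAMES = {
    ("samara", "13 passage"): ("samara", "george ratner"),
}

def actualize(city: str, street: str):
    """Return the current form of a (city, street) pair plus a flag
    saying whether the update is known. Unknown pairs come back
    unchanged with updated=False, so the caller can withhold the
    good quality code, as the criterion above requires."""
    key = (city.lower(), street.lower())
    if key in RENAMES:
        return RENAMES[key], True
    return key, False
```

Both forms then standardize to the same record, which is exactly what deduplication needs.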
Once the test sample is prepared, the rest is simple: process the addresses with the candidate algorithms and compare them by the following criteria:
- The percentage of correctly parsed good addresses (i.e., addresses without garbage, ambiguity, or typos). The algorithm should parse good addresses correctly and give them a good quality code.
- The percentage of correctly parsed bad addresses. The algorithm should handle bad addresses as well as possible: if an address is bad but can still be parsed reliably, the algorithm should do so and assign a good quality code.
- The percentage of addresses with a reverse error. The algorithm should keep the reverse error minimal, that is, not assign a good quality code to incorrectly parsed addresses. We consider this the most important criterion of all.
- The availability of additional properties on a standardized address. The algorithm should provide convenient tools for analyzing and working with addresses that received a poor quality code, and those tools should be simple and straightforward.
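With a manually labeled test sample in hand, the first three criteria reduce to counting a few rates per algorithm. A minimal sketch; the input format (pairs of "got a good quality code" and "parse matched the label") is an assumption of this illustration.

```python
def evaluate(results):
    """results: one (good_quality_code, parsed_correctly) pair per
    address, where parsed_correctly is judged against manual labels.
    Returns the share of addresses the algorithm is confident about,
    and the reverse error: confident but wrong."""
    n = len(results)
    good_code = sum(1 for gq, _ in results if gq)
    reverse_err = sum(1 for gq, ok in results if gq and not ok)
    return {
        "good_code_pct": 100.0 * good_code / n,
        # reverse error: good quality code, but the parse is wrong
        "reverse_error_pct": 100.0 * reverse_err / n,
    }
```

Comparing algorithms then means preferring a higher good-code percentage only among candidates whose reverse error stays near zero.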
Automatic address parsing is not as simple as it seems at first glance. If you decide to choose a parsing algorithm, or to write your own, approach the process properly: analyze your existing addresses and build a representative test sample. We hope this article helps with that work, and that all your addresses end up parsed automatically and correctly.
P.S. Within a month we will deploy the new version of the address filter discussed at the beginning of this article on dadata.ru. Register to stay informed and be among the first to try the new algorithm.
Thanks to chipQA for helping with this article.