t13s August 5, 2014 at 19:44

Parsing email addresses from a string in C #

Not so long ago, I faced the task of uploading the data of one of my customers into another near-state format. Among other things, the unloading required to provide the postal addresses of individual customers, including the postcode, region, district, and so on, to the apartment number.

Everything would be fine, only the ambush is that the customer’s source addresses were packed in a simple line like “Kitezhgrad, 22, Volzhebnaya St., Building 15, Quarter 15”. That is, on the one hand, no one has ever heard of zip codes, but on the other hand, the text input field offers a wide scope for self-expression and applied art.

Having no hesitation, I climbed to look for a solution to this problem on the net, reasoning that such a situation should be very common and someone definitely defeated. And it turned out, however, instead of source codes or simply compiled libs, online services eagerly stared at me, offering using their API to parse email addresses for a very real bribe (the minimum price I managed to find was 10 kopecks per address).
Since I didn’t like giving voluntary revenues to some third-party organization, and by that time there was some excitement, I wanted to put the decision together on my own, with as little effort as possible. The task was facilitated by the fact that the customer did not need high parsing accuracy - the presence of errors of any kind would not lead to fatal problems.

To begin with, I looked towards Tomit-parser, but after getting acquainted with the multi-page config of an example that allows me to determine in the text which city lives in ( http://api.yandex.ru/tomita/doc/dg/concept/example.xml ) , optimism somewhat diminished, but the desire to write some kind of your bike became stronger.

Naturally, with rather stringent restrictions on the input data, under which we will continue our research:

The address is always written without typos: “Arphography Avenue” should remain on the conscience of the introducer.
The address is recorded from the most common element (region) to the most private (apartment number).
Subject to clause 2, we hammer a bolt into the hint words such as “region”, “street”, “avenue”, “house”. So if the city has Telepuzikov Avenue and a street named after them, then we will not be able to catch such a fine line. Given the rarity of this situation and the availability of the right to make a mistake, this is quite a working option.

Next, I was puzzled by the search for a data source from which I could draw information on zip codes. As it turned out, to this day KLADR is already yesterday, FIAS rules ( http://fias.nalog.ru ). Having downloaded an offline copy of this database, I began to study the possibilities it provided.

I was especially interested in two tables there: ADDROBJ - it contains a tree-like list of all address objects, starting from the subject of the Russian Federation and ending with the street, and HOUSE <region number> - where house numbers are stored with reference to the entries in ADDROBJ along with their indexes. The information stored in these two tables is enough to achieve both goals: checking the correctness of parsing the address (if we managed to find the address in the database, it means we recognized it correctly), as well as to determine the zip code.

An algorithm began to loom in my head:

We divide the line of the mailing address into the address elements. By an address element I mean what a record of which can be found in the form of a row of the FIAS table: district, city, street, house, as well as the apartment number.
1. Unconditional delimiters of address elements include periods, commas, semicolons, slashes.
2. Conditional delimiters include a hyphen / dash if the address element after the hyphen is a number. For example, in “Depression Lane 38a-117,” the hyphen is a separator, and in “g. Ust-Zazhopinsk "- no.
3. A space may or may not be a separator. So in the “Eighth of March d.15” the gap between the “Eighth” and “March” obviously should not divide the elements, and between “March” and “d.” - it should. The easiest option in the forehead is to compile all possible options for separating address elements by spaces and to continue further work of the algorithm with each of them separately.
Addressable elements such as "street" ("street"), "region" ("region") and so on are completely bitten out.
Starting from the very first element, all of them are sequentially run on the basis of the FIAS.
1. If the element is in the base, then its GUID and LEVEL (level in the hierarchy) are remembered, while the next element is searched with a higher LEVEL value and a fixed PARENTGUID equal to the GUID of the previous element found.
2. If no element is found for the given PARENTGUID, try to build a chain that includes intermediate elements.
3. The initial search is in the table ADDROBJ, as soon as we look for the next element after the street (LEVEL of the street is 7), we switch to the table of houses HOUSEXX.
4. If the address element is not found, just ignore it.
The option wins (and there can be several of them according to the results of step 1.3.), Which has the longest recognized chain.
For the sake of order, it is completed according to the ADDROBJ table to the very top. This is necessary because, for example, the region and the district were not indicated in the original address bar, but the city at once.
Then a little cheat. The apartment number is considered to be the last address element (if it was not recognized as a house number), and the address elements between the recognized house number and apartment number are considered to be the building, building, letter and everything else. It would be possible to build a more detailed analysis - the HOUSEXX table allows this - but it seemed to me unnecessary, if only because postal codes are unlikely to be different for houses "113" and "113 st. 1 building 4 lit. Zh."

The algorithm turned out to be empirical, naive, providing for not all possible situations ... But for the restrictions on the speed of implementation and the extensive rights to error, it looked quite satisfactory. It was possible to compose and implement it in about 1 evening.

For habit and convenience, I overtook ADDROBJ and HOUSEXX tables from DBF to MS SQL (I read how to easily convert them here: http://blogs.technet.com/b/isv_team/archive/2012/05/14/3497825.aspx )

The result is the AddressParser class, which takes an address string at the input and returns an instance of the Address class in response. You can submit your own implementation of IKnwonAddressComparator to the AddressParser constructor if the current implementation, sharpened by MS SQL, is not suitable for something.

The parsing speed turned out to be something around 2-5 addresses per second. Bad, but all better than handles. The main problem: a serious number of options for verification, generated by paragraph 1.3. In a good way, this point should be completely rewritten, using the address database already at this stage to verify the existence of address elements. As an intermediate option, you can limit the number of options to a certain value.

According to a random sample, the quality of parsing was 62% on real data. For evaluation, it was believed that the address can either be fully recognized (up to the apartment) or not. No halftones.

The error distribution is as follows:

37% - typos. As a rule, banal omissions and additions of letters: "Kiyrova", "Moskv" ...
21% - the use of abbreviations: “K.Marx”, “R.Luxembourg” ...
42% - the lack of houses in the FIAS database, and, as a result, the inability to determine the index and validate the entire chain. A very unexpected reason for me, although many write that FIAS is still damp for industrial use

What conclusions can be drawn?

If you need, like me, low quality parsing and low speed - you can use it.

If you need to significantly increase the accuracy of recognition, you can try to introduce a fuzzy search. In addition, by adding a list of popular abbreviations, you can also pull up the percentage of successfully recognized addresses.

Performance is a separate song, which, given the elementary and suboptimal implementation, can also be tackled. The first candidate here is smart space breaks.

But all this is a completely different story.

Source codes can be downloaded from here: https://yadi.sk/d/muzi9b6qZ8DWh
MS SQL test database with houses in 38 and 78 regions can be found here: https://yadi.sk/d/ERXyDXv7Z8Dab

Tags:

Parsing email addresses from a string in C #

Also popular now: