Phonetic Algorithms

Phonetic algorithms match two words with a similar pronunciation with the same codes, which allows you to compare and index the set of such words based on their phonetic similarity.

It is often quite difficult to find an atypical surname in the database, for example:
- Lech, look in our base for Adolf Schwarzenegger ,
- Schwarzeneggir ? There's no such thing!
In this case, the use of phonetic algorithms (especially in combination with fuzzy matching algorithms) can greatly simplify the task.

Such algorithms are very convenient to use when searching the databases for lists of people, in spell checking programs. They are often used in conjunction with fuzzy search algorithms (which undoubtedly deserve a separate article), providing users with a convenient search by first and last name in various databases, employee lists, and so on.

In this article I will consider the most famous algorithms such as Soundex , Daitch-Mokotoff Soundex , NYSIIS , Metaphone , Double Metaphone , Russian Metaphone ,Caverphone .

Soundex


One of the first was the Soundex algorithm , invented back in the 10s of the last century by Robert Russell. This algorithm (or rather, its American version) matches words with a numerical index of the form A126 . The principle of its work is based on the division of consonants into groups with serial numbers, from which the resulting value is then compiled. Later, a number of improvements were also proposed.

image

The first letter is saved, subsequent letters are compared with the numbers in the table. Characters not shown in the table (which are all vowels and some consonants) are ignored. Adjacent characters, or characters separated by the letters H or Wincluded in the same group are recorded as one. The result is truncated to 4 characters. Missing positions are filled with zeros. It is easy to notice that after all these procedures there are only 7 thousand different variations of such a code, which entails a lot of completely dissimilar words having the same Soundex code. Thus, the result in most cases includes a large number of "false positive" values.

image

In the improved version, as you can see, the letters are divided into more groups. In addition, no special attention is paid to the letters H and W, they are simply ignored. In addition, no operations are performed with the length of the result - the code has no fixed length and is not truncated.

Examples

Original Soundex:
D341 → Dedlovsky, Dedlovsky, Didilev, Ditelev, Dudalev, Dudolev, Dutlov, Dydalev, Dyatlov, Dyatlovich.
N251 → Nagimov, Nagmbetov, Nazimov, Nasimov, Nassonov, Nezhnov, Neznaev, Nesmeev, Nizhnevsky, Nikonov, Nikonovich, Nisenblat, Nisenbaum, Nissenbaum, Noginov, Nozhnov.

Superior Soundex:
N8030802 → Nasimov, Nassonov, Nikonov.
N80308108 → Niesenbaum, Nissenbaum.
N8040802 → Nagimov, Nagonov, Neganov, Noginov.
N804810602 → Nagmbetov.
N8050802 → Nazimov, Nezhnov, Scabbard.

On average, there are 21 surnames per Soundex code value. In the case of an improved version of Soundex, only 2-3 surnames are converted to the same code.

NYSIIS


Developed in 1970 as part of the New York State Identification and Intelligence System, this algorithm produces slightly better results relative to the original Soundex, using more complex rules for converting the source word into the resulting code. This algorithm is designed to work specifically with American surnames.

NYSIIS Code Calculation Algorithm
  1. Convert word start according to the following rules:
    MAC → MCC
    KN → N
    K → C
    PH, PF → FF
    SCH → SSS
  2. Convert the end of a word according to the following rules:
    EE → Y
    IE → Y
    DT, RT, RD, NT, ND → D
  3. Then, all letters except the first are converted according to the following rules:
    EV → AF
    A, E, I, O, U → A
    Q → G
    Z → S
    M → N
    KN → N
    K → C
    SCH → SSS
    PH → FF
    After vowels: remove H, convert W → A
  4. Remove S at the end
  5. Convert AY at the end → Y
  6. Remove A at the end
  7. Trim to 6 characters (optional step).

Examples

CASPARAVAS → Kasparavichus, Kasperovich, Kaspirovich.
CATNACAV → Katnikov, Tsitnikov, Tsotnikov.
LANSANC → Lenchenko, Leonchenko, Linchenko, Lunchenko, Lyamzenko.
PRADSC → Parish, Prohodsky, Prudsky, Prudsky, Prudsky.
STADNACAV → Stadnikov.

NYSIIS converts to the same code a little more than two surnames.

Daitch-Mokotoff Soundex


This algorithm was developed in 1985 by two genealogies - Harry Mokotoff and Randy Deutsch, trying to achieve the best, relatively original Soundex, results when working with Eastern European (including Russian) surnames.
This algorithm has little in common with the original Soundex, except that the result is still a sequence of numbers, but now the first letter is also encoded.

It has significantly more complex conversion rules - now not only single characters, but also sequences of several characters are involved in the formation of the resulting code. In addition, the result of the form 023689 provides about 600 thousandvarious code variations, which, coupled with complicated rules, reduces the number of "extra" ones, i.e. "False positive" words in the resulting set.

Transformations are performed according to the following table (the order of transformations corresponds to the order of letter combinations in the table):
Source CombinationsAt the beginningBehind the vowelRest
AI, AJ, AY, EI, EY, EJ, OI, OJ, OY, UI, UJ, UY01
Au07
IA, IE, IO, IU1
EU11
A, UE, E, I, O, U, Y0
J111
SCHTSCH, SCHTSH, SCHTCH, SHTCH, SHCH, SHTSH, STCH, STSCH, STRZ, STRS, STSH, SZCZ, SZCS244
SHT, SCHT, SCHD, ST, SZT, SHD, SZD, SD24343
CSZ, CZS, CS, CZ, DRZ, DRS, DSH, DS, DZH, DZS, DZ, TRZ, TRS, TRCH, TSH, TTSZ, TTZ, TZS, TSZ, SZ, TTCH, TCH, TTSCH, ZSCH, ZHSH, SCH, SH, TTS, TC, TS, TZ, ZH, ZS444
SC244
DT, D, TH, T333
CHS, KS, X55454
S, z444
CH, CK, C, G, KH, K, Q555
MN, NM6666
M, N666
FB, B, PH, PF, F, P, V, W777
H55
L888
R999
“Alternative” letter combinations (used to generate several alternative codes from the source word):
CH → TCH
CK → TSK
C → TZ
J → DZH

Examples

095747Arkhiptsev , Arkhiptsov , Arkhipychev , Artsybasov, Artsybashev, Archibasov
095757 → Arkhipkov, Arkhiptsev , Arkhiptsov , Arkhipychev
584360 → Galstyan, Galustyan, Gilshtein, Glistin, Gluzdan, Holstein, Goldshtein, Golstunlust, Kalustun, Kalustun, Kalustun, Kalustun, Kalustun, Kalustun, Kalustun, Kalustun, Kalustun, Kalustun, Kalustun, Kalustun, Kalustun, Kalustun, Kalustun, Kalustun.

This algorithm converts an average of 5 surnames to the same code.

Subsequently, Alexander Bader and Stephen Morse developed the Beider-Morse Name Matching Algorithm , aimed at reducing the number of "false positive" values ​​relative to the Daitch-Mokotoff Soundex when working with Jewish (Ashkenazi) surnames.

Metaphone


The Metaphone algorithm (1990) has slightly better characteristics , which differs from previous algorithms in a slightly different approach to the encoding process: it converts the original word taking into account the rules of the English language, using noticeably more complex rules, and significantly less information is lost, since the letters do not are divided into groups. The resulting code is a set of characters from the set 0BFHJKLMNPRSTWXY , at the beginning of a word there can also be vowels from the set AEIOU .

Metaphone Code Calculation Algorithm
  1. Delete all duplicate adjacent letters, with the exception of the letter C.
  2. Convert the beginning of a word according to the following rules:
    KN → N
    GN → N
    PN → N
    AE → E
    WR → R
  3. Delete the letter B at the end if it comes after M.
  4. Replace C according to the following rules
    On X: CIA → XIA, SCH → SKH, CH → XH
    On S: CI → SI, CE → SE, CY → SY
    On K: C → K
  5. Replace D according to the following rules
    On J: DGE → JGE, DGY → JGY, DGI → JGY
    On T: D → T
  6. Replace GH → H if this letter combination is not at the end and not before the vowel.
  7. Replace GN → N and GNED → NED if these letter combinations are at the end.
  8. Replace G according to the following rules
    On J: GI → JI, GE → JE, GY → JY
    by K: G → K
  9. Remove all H coming after the vowels, but not before the vowels.
  10. We carry out the following transformations according to the rules:
    CK → K
    PH → F
    Q → K
    V → F
    Z → S
  11. Replace S with X:
    SH → XH
    SIO → XIO
    SIA → XIA
  12. Replace T according to the following rules
    On X: TIA → XIA, TIO → XIO
    On 0: TH → 0
    Delete: TCH → CH
  13. At the beginning of the word, convert WH → W. If after W there is no vowel, then delete W.
  14. If X is at the beginning of a word, then convert X → S, otherwise X → KS
  15. Remove all Y that are not in front of the vowels.
  16. Delete all vowels except the initial one.

Examples

AKXN → Agashin, Akachenok, Akishin, Aksionenko, Aksionov, Akchunayev, Akshanov, Akshentsev, Akshinsky, Akshintsev, Akshonov.
FSLX → Vasilishin, Vasilchak, Vasilchenko, Vasilchik, Vasilchikov, Vasilchenko, Vasilchuk, Vasilyushchenko.
SRFM → Serafimov, Serafimsky, Serafimchuk, Tsereifman.

On average, 6 names have the same Metaphone code value.

Double metaphone


Double Metaphone (2000) is somewhat different from other phonetic algorithms, generating not one but two codes from the source word (both up to 4 characters long) - one reflects the main pronunciation of the word, and the other an alternative version. It has a large number of different rules, taking into account, among other things, the different origin of the words, paying attention to East European, Italian, Chinese words and so on. The transformation rules are quite numerous, I will not publish them, and those who wish can read about them in the article of Dr Dobbs magazine .

Examples

JXRFGuicharov .
KKRF → Gagarov, Kagarov, Kacharovsky, Kacherovsky , Kachurivsky, Kachurov, Kachurovsky, Kicherov, Kokarev, Kokurov, Kokourov, Kocharov, Kochurov, Kukarev, Tsakirov, Tsokurov, Tsugrov.
KXRFGisharov , Gocharov, Kacherov, Kacherovsky , Kasharevsky, Kocharov, Kochev, Kocheryaev, Kochuraev, Kosharev, Kosher.
PNFSBanovsky , Bakhnovsky, Binevsky, Binyavsky, Buinovsky, Buyanovsky, Panevsky, Panovsky, Panovsky, Penevsky, Pinevsky, Piunovsky, Pikhnovsky.

Double Metaphone matches an average of 8-9 last names to the same code.

Russian Metaphone


In 2002, the 8th issue of the Programmer magazine published an article by Peter Kankowski telling about his adaptation of the English version of the Metaphone algorithm to severe Siberian frosts, bears and balalaikas. This algorithm converts the source words in accordance with the rules and norms of the Russian language, taking into account the phonetic sound of unstressed vowels and possible "merging" of consonants in pronunciation. It shows very good results in practice, despite the fact that it is based on fairly simple rules. All letters are divided into sound groups - vowels and consonants ( vowels and consonants соответственно в английской терминологии), глухие и звонкие. Звонкие согласные преобразуются в соответствующие им парные глухие, объединяются «сливающиеся» при произношении последовательности букв, и проводятся некоторые другие манипуляции. Ниже я приведу немного доработанный вариант, который, в отличие от оригинала Петра Каньковски, привносит правила, связанные с фонетической эквивалентностью Ц и ТС или ДС, и не сжимает окончания — байты экономить — это не наша задача.

Алгоритм вычисления кода русского Metaphone
  1. Для всех гласных букв проделать следующие операции.
    ЙО, ИО, ЙЕ, ИЕ → И
    О, Ы, Я → А
    Е, Ё, Э → И
    Ю → У
  2. For all consonants followed by any consonant except L , M , H or P , or for consonants at the end of a word, stun:
    B → P S
    → S
    D → T
    B → F
    G → K
  3. Glue TS and DS in C :
    TS → C
As a result, the algorithm does its job very well - the result set contains really phonetically similar words. And at the same time, there are quite a few unnecessary words, mainly due to the fact that vowels are not ignored, but converted and used in the final code. However, there are some words that, despite their phonetic similarity, do not fall into the result set due to too “strict” algorithm rules.

In the case of Adolf Schwarzenegger, the result of the Russian Metaphone algorithm will be:

image

Thus, the algorithm in this case reflects the real phonetic similarity of the two names.

Examples

VITAFSKY → Vitavsky, Vitovsky.
VITINBIRK → Wittenberg, Wittenberg.
NASANAF → Nasanov, Nasonov, Nassonov, Nosonov.
PIRMAKAF → Permakov, Permyakov, Permyakov.

This algorithm converts on average 1-2 surnames to the same code.

Caverphone


The Caverphone algorithm was developed in 2002 as part of one of the New Zealand projects for comparing data in old and new electoral lists, therefore it is most focused on local pronunciation, although it gives quite acceptable results for Russian surnames.

Caverphone Code Calculation Algorithm
  1. Convert first or last name to lower case (the algorithm is case sensitive).
  2. Delete the letter e at the end.
  3. Convert the beginning of a word to the following table (relevant for local New Zealand names and surnames). In this case, the number 2 means the time stamp for the consonant, which will subsequently be deleted.
    coughroughtoughenoughgnmb
    cou2frou2ftou2fenou2f2nm2
  4. Replace characters in the following table:
    cqcicecytchcqxvdgtiotiadphbshz
    2qsisesy2chkkkf2gsiosiatfhps2s
  5. Replace all vowels at the beginning of the word with A , otherwise 3 . Thus, the number 3 serves as a time stamp for the vowel letter, which will be used in subsequent transformations, and then will be deleted. After that, it is necessary to make replacements according to the following tables (conventions: s + are several consecutive characters, ^ h is the character at the beginning of the line, w $ is the character at the end of the line):
    j^ y3^ yy3gh3ghgs +t +p +k +f +m +
    yY3A33kh322kSTPKFM
    n +w3wh3w $w^ hhr3r $rl3l $l
    NW3Wh332A2R332L332
  6. Delete all digits 2 . If at the end of the words remained number 3 , then replace it with A . Then delete all 3 digits .
  7. Trim the word to 10 characters, or add to 10 characters with units.

Examples

KPRLN11111 → Gabrelyan, Gabrielyan, Gabrielyan, Kaparulin, Kapralin, Kaprelyan.
MSRFK11111 → Meiserovich, Misarovich, Misyurevich.
PLLF111111 → Balalaev, Balaliev, Balaluev, Bilaliev, Bilalov, Bilyalov, Bolelov, Palilov, Polilov, Polulyakhov.

Caverphone matches about 4-5 names to the same code.

Total


Most of these algorithms are implemented in many languages, including C, C ++, Java, C # and PHP. Some of them, such as Soundex and Metaphone, are integrated or implemented as plug-ins for many popular DBMSs, and are also used as part of full-fledged search engines, for example, Apache Lucene . The scope of their application is quite specific, because a significant increase in user convenience can be achieved only by searching for names, but nevertheless their competent use is a plus for search engines.

References

  1. Java code for the article. Yandex.Disk
  2. Implementations of Soundex, Refined Soundex, Metaphone, Double Metaphone, Caverphone in Java.
    Apache commons codec
  3. NYSIIS implementation in Java. Egothor Project
  4. Реализация Daitch-Mokotoff Soundex на Java. http://joshualevy.tripod.com/genealogy/dmsoundex/dmsoundex.zip
  5. Описание Soundex. http://en.wikipedia.org/wiki/Soundex
  6. Описание Daitch-Mokotoff Soundex. http://www.jewishgen.org/infofiles/soundex.html
  7. Описание NYSIIS.
    http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System
  8. Описание Metaphone. http://en.wikipedia.org/wiki/Metaphone
  9. Описание Double Metaphone. http://www.drdobbs.com/article/184401251
  10. Описание русского Metaphone. http://forum.aeroion.ru/topic461.html
  11. Описание Caverphone. http://en.wikipedia.org/wiki/Caverphone
  12. Онлайн-демо Soundex. http://www.gedpage.com/soundex.html
  13. NYSIIS Online Demo. http://www.dropby.com/NYSIIS.html
  14. Online demo of Daitch-Mokotoff Soundex. http://stevemorse.org/census/soundex.html
  15. Metaphone Online Demo. http://www.searchforancestors.com/utility/metaphone.php

Also popular now: