Compare similar strings

The task of comparing similar strings is often encountered in practice: I personally recently encountered it when trying to import email addresses from one program to another.

For example, one address may look like “Tverskaya oblast, Kashin g, Sovetskaya ul, 1, 5”, and another - like “Tver oblast; the city of Kashin; Sovetskaya street; House number 1; apartment 5". Are these lines similar and how much? Sure, similar. And with the "naked eye" their structure is visible: region - settlement - street - house - apartment. It is logical that for addresses such a breakdown of lines into groups is important; that is, we should not compare “two porridges” from similar words (where one “porridge” consists of the words of the first line, and the second of the words of the second), but rather, carry out “group” comparison of words from the first line with words from the second. The criterion for splitting into groups is also examined: in the first line, the separator of the groups is “,”, and in the second - “; ".

At the same time, there are lines where no explicit grouping is visible. For example, take the “classics”: “When there is no agreement in the comrades, their work will not work out, and it will not work out - just flour.” And the second line: “Prankish monkey, Donkey, Goat and clumsy Bear started playing a quartet. »Obviously the lines are different (and even the moral of these fables is different, although parallels can be found).

The task in question is not new. There are algorithms (sometimes very complex) that try to solve it, and even sometimes successfully solve it. I propose one more algorithm box. When compiling it, I proceeded from the following principles:

  • simplicity of calling the comparison function;
  • ease of implementation;
  • sufficient versatility.

In addition, the algorithm is implemented on VBA Excel, therefore it is very “democratic” and it can be used everywhere: Excel not only exists among the software of various computers “by itself”, but it also exports data from various DBMSs and applications.

So, let's begin.

The comparison function is called StrCompare. It will have 4 arguments, two of which are optional: the first line str1, the second line str2, the group separator of the first line div1 and the group separator of the second line div2. If div1 or div2 are omitted, then the default separator is “|”. “|” Was chosen because it is unlikely to occur in the “average” line, and therefore can be used to compare monolithic (not groupable) lines. Such monolithic strings can also be considered strings consisting of one group. That is, the header of the comparison function looks like this:

Public Function StrCompare(str1 As String, str2 As String, Optional div1 As String = "|", Optional div2 As String = "|") As Single

Single - because the result of the function is a number showing the degree of similarity of the compared strings.

All groups of line 1 are sequentially compared with all groups of line 2 word by word, and the number of word matches in each pair of groups is considered. For each group of line 1, the “best group” from line 2 is finally selected (that is, the group with the most matches). Matches for each pair of words are checked for a word with a minimum length: that is, "street = street", and "g = city". This rule does not apply to numbers: i.e. 200 <> 20. When highlighting words, all the “insignificant characters” within the groups are just word separators, but they themselves are ignored, that is, words can consist only of characters WordSymbols = "0123456789ABBGDEJEZLKLMNOPRSTUFKHCHSHCHYSYEYABCDEFGHIJKLMNOPYRXUVXWVX. It is understood that case is not taken into account.

To search for a matching word in the current group of the second line, the fast half division method is used (but slightly modernized compared to the “classic” one, since matches are checked using the above method). And since the operation of the half division method requires sorted arrays, the fast sorting algorithm is also used.

The result of the StrCompare function will be the result of dividing the number of matching words by the total number of words in lines 1 and 2:

StrCompare = (da * 2) / ((kon1_2 - nach1_2 + 1) * (kon1_1 - nach1_1 + 1) + (kon2_2 - nach2_2 + 1) * (kon2_1 - nach2_1 + 1))

Here, for example, kon1_2 is the final boundary of array 1 (an array of words contained in the groups of the first row) in the 2nd dimension (the 1st dimension is the number of groups, and the 2nd is the number of words in the group).

It is time to introduce the code:

'Функция "интеллектуального" сравнения двух строк. Аргументы:
'строка1, строка2, разделители1, разделители2
Public Function StrCompare(str1 As String, str2 As String, Optional div1 As String = "|", Optional div2 As String = "|") As Single
WordSymbols = "0123456789АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯABCDEFGHIJKLMNOPQRSTUVWXYZ"
Dim massiv1() As String, massiv2() As String, mass1() As String, mass2() As String, m1() As Variant, m2() As Variant 'одномерные массивы групп и двумерные массивы слов
Dim mm1() As String, mm2() As String
Dim nach1_1 As Integer, kon1_1 As Integer, nach1_2 As Integer, kon1_2 As Integer, nach2_1 As Integer, kon2_1 As Integer, nach2_2 As Integer, kon2_2 As Integer
Dim item As String, itemnumber As Integer
Dim yes As Integer, maxyes As Integer, da As Integer
Dim counter As Integer 'счетчик noname
str1 = UCase(str1): str2 = UCase(str2)
massiv1 = Split(str1, div1)
ReDim mass1(LBound(massiv1) To UBound(massiv1), 0 To 1000)
maxk = 0
counter = 0
For i = LBound(massiv1) To UBound(massiv1)
 item = massiv1(i)
 dlina = Len(item)
 slovo = ""
 NewWord = False
 k = 0 'второй индекс для массива слов
 For j = 1 To dlina
  bukva = mid(item, j, 1)
  If (InStr(1, WordSymbols, bukva) > 0) And Not NewWord Then
   NewWord = True
   slovo = slovo + bukva
  Else
   If InStr(1, WordSymbols, bukva) > 0 Then
    slovo = slovo + bukva
   Else
    If (InStr(1, WordSymbols, bukva) = 0) And NewWord Then
     NewWord = False
     mass1(i, k) = slovo
     If k > maxk Then maxk = k
     k = k + 1
     slovo = ""
    End If
   End If
  End If
 Next j
 If NewWord Then
  mass1(i, k) = slovo
  If k > maxk Then maxk = k
 End If
Next i
ReDim Preserve mass1(LBound(massiv1) To UBound(massiv1), 0 To maxk)
'*************************************************************'
massiv2 = Split(str2, div2)
ReDim mass2(LBound(massiv2) To UBound(massiv2), 0 To 1000)
maxk = 0
For i = LBound(massiv2) To UBound(massiv2)
 item = massiv2(i)
 dlina = Len(item)
 slovo = ""
 NewWord = False
 k = 0 'второй индекс для массива слов
 For j = 1 To dlina
  bukva = mid(item, j, 1)
  If (InStr(1, WordSymbols, bukva) > 0) And Not NewWord Then
   NewWord = True
   slovo = slovo + bukva
  Else
   If InStr(1, WordSymbols, bukva) > 0 Then
    slovo = slovo + bukva
   Else
    If (InStr(1, WordSymbols, bukva) = 0) And NewWord Then
     NewWord = False
     mass2(i, k) = slovo
     If k > maxk Then maxk = k
     k = k + 1
     slovo = ""
    End If
   End If
  End If
 Next j
 If NewWord Then
  mass2(i, k) = slovo
  If k > maxk Then maxk = k
 End If
Next i
ReDim Preserve mass2(LBound(massiv2) To UBound(massiv2), 0 To maxk)
' а теперь непосредственно "гибкое" сравнение строк; пример: kon1_2 - конечная граница массива 1 по 2-му измерению
nach1_1 = LBound(mass1, 1)
kon1_1 = UBound(mass1, 1)
nach1_2 = LBound(mass1, 2)
kon1_2 = UBound(mass1, 2)
nach2_1 = LBound(mass2, 1)
kon2_1 = UBound(mass2, 1)
nach2_2 = LBound(mass2, 2)
kon2_2 = UBound(mass2, 2)
For i = nach1_1 To kon1_1
 For j = nach1_2 To kon1_2
  If mass1(i, j) = "" Then
   counter = counter + 1
   mass1(i, j) = "noname" + Trim(Str(counter))
  End If
  'MsgBox ("mass1(" + Trim(Str(i)) + "," + Trim(Str(j)) + ")=" + mass1(i, j))
 Next j
Next i
For i = nach2_1 To kon2_1
 For j = nach2_2 To kon2_2
  If mass2(i, j) = "" Then
   counter = counter + 1
   mass2(i, j) = "noname" + Trim(Str(counter))
  End If
  'MsgBox ("mass2(" + Trim(Str(i)) + "," + Trim(Str(j)) + ")=" + mass2(i, j))
 Next j
Next i
'сортировка "внутренних массивов-групп"
ReDim m2(nach2_2 To kon2_2) As Variant
For i = nach2_1 To kon2_1
 For j = nach2_2 To kon2_2
  m2(j) = mass2(i, j)
 Next j
 Call QuickSort(m2, nach2_2, kon2_2)
 For j = nach2_2 To kon2_2
  mass2(i, j) = m2(j)
 Next j
Next i
'а теперь непосредственно само сравнение: каждая группа строки1 сравнивается с каждой группой строки2 почленно
ReDim mm2(nach2_2 To kon2_2)
da = 0
For k = nach1_1 To kon1_1 'цикл по группам строки1
 maxyes = 0
 For i = nach2_1 To kon2_1 'цикл по группам строки2
  yes = 0
  For j = nach2_2 To kon2_2: mm2(j) = mass2(i, j): Next j 'цикл по членам конкретной группы строки2
  For l = nach1_2 To kon1_2 'цикл по членам конкретной группы строки1
   If BinarySearch(mm2, nach2_2, kon2_2, mass1(k, l)) <> -1 Then yes = yes + 1
  Next l
  If yes > maxyes Then maxyes = yes
 Next i
 da = da + maxyes
Next k
StrChange = (da * 2) / ((kon1_2 - nach1_2 + 1) * (kon1_1 - nach1_1 + 1) + (kon2_2 - nach2_2 + 1) * (kon2_1 - nach2_1 + 1))
'StrChange = da
End Function
Public Sub QuickSort(ByRef vArray() As Variant, inLow As Integer, inHi As Integer)
  Dim pivot   As Variant
  Dim tmpSwap As Variant
  Dim tmpLow  As Integer
  Dim tmpHi   As Integer
  tmpLow = inLow
  tmpHi = inHi
  pivot = vArray((inLow + inHi) \ 2)
  While (tmpLow <= tmpHi)
     While (vArray(tmpLow) < pivot And tmpLow < inHi)
        tmpLow = tmpLow + 1
     Wend
     While (pivot < vArray(tmpHi) And tmpHi > inLow)
        tmpHi = tmpHi - 1
     Wend
     If (tmpLow <= tmpHi) Then
        tmpSwap = vArray(tmpLow)
        vArray(tmpLow) = vArray(tmpHi)
        vArray(tmpHi) = tmpSwap
        tmpLow = tmpLow + 1
        tmpHi = tmpHi - 1
     End If
  Wend
  If (inLow < tmpHi) Then QuickSort vArray, inLow, tmpHi
  If (tmpLow < inHi) Then QuickSort vArray, tmpLow, inHi
End Sub
Public Function BinarySearch(vArray() As String, inLow As Integer, inHi As Integer, key As String) As Integer
  Dim lev As Integer, prav As Integer, mid As Integer
  Dim key_ As String, arritem As String, arritem_ As String
  Dim minlen As Integer, keylen As Integer, arritemlen As Integer
  If key = Trim(Str(Val(key))) Then 'это число
  lev = inLow: prav = inHi
  While lev <= prav
   mid = lev + (prav - lev) \ 2
   arritem = vArray(mid)
   If key < arritem Then
    prav = mid - 1
   ElseIf key > arritem Then
    lev = mid + 1
   Else
    BinarySearch = mid
    Exit Function
   End If
  Wend
  Else
  keylen = Len(key)
  lev = inLow
  prav = inHi
  While lev <= prav
   mid = lev + (prav - lev) \ 2
   arritem = vArray(mid)
   arritemlen = Len(arritem)
   minlen = IIf(keylen < arritemlen, keylen, arritemlen)
   key_ = left(key, minlen)
   arritem_ = left(arritem, minlen)
   If key_ < arritem_ Then
    prav = mid - 1
   ElseIf key_ > arritem_ Then
    lev = mid + 1
   Else
    BinarySearch = mid
    Exit Function
   End If
  Wend
  End If
  BinarySearch = -1
End Function

There is no point in commenting everything, I think: you can navigate by the code. Just analyze the operation of the comparison function on several lines of different nature.

  1. str1 = ”Tver region., Kashin g, Sovetskaya street, 1, 5” str2 = ”Tver region; the city of Kashin; Sovetskaya street; House number 1; apartment 5".
    First, compare the strings without considering the groups:
    StrCompare (str1, str2) gives the result 0.8888889.
    And now, considering:
    StrCompare (str1, str2, ”,“, ”;“) - the result is 0.8.
    As you can see, groups are more strictly related to comparison; in this case, it is important for them that “the house is the house, and the apartment is the apartment”. When ignoring groups, this does not play a role.
  2. str1 = ”
    My grandmother lived a gray goat” str2 = ”Grandmother lived a gray goat” StrCompare (str1, str2) -> 0.6666667
  3. str1 = ”Ivanov Ivan Ivanovich m. Kaluga 1950 ”str2 =” Ivanov I.I.
    01/20/1950 ” StrCompare (str1, str2) -> 0.6153846
  4. str1 = ”When there is no agreement in the comrades, their work will not work out, and it will not work out - it's just flour.” str2 = ”Prankish monkey, Donkey, Goat and clubfoot Bear started to play a quartet.”
    StrCompare (str1, str2) -> 0
  5. str1 = ”In accordance with paragraph 1 of Art. 540 of the Civil Code of the Russian Federation in the case when a subscriber under a power supply agreement is a citizen using energy for domestic consumption, the contract is considered concluded from the moment the subscriber actually connects to the network. | According to Part 1 of Article 153 of the Housing Code of the Russian Federation, citizens are obliged to timely and fully pay fees for housing and utilities. | In the period from "____" _________ 2017 to "____" __________ 2017, the Guaranteeing supplier supplied you with electricity in the amount of ______________________. | In connection with your violation of your obligations to pay for electric energy, which led to the formation of a consumer’s debt to the guaranteeing supplier in the amount of more than 2 settlement periods in relation to the consumer’s premises, at the expense of the Guarantee’s supplier, actions were taken to limit / resume the provision of utility services for electricity supply. | In accordance with paragraph 121 (1) of the Rules for the provision of utility services to owners and users of premises in apartment buildings and residential buildings, approved by the Government Decision of 05.06.2011. No. 354, the expenses of the contractor associated with the introduction of the restriction, the suspension and resumption of the provision of utility services to the consumer-debtor, shall be reimbursed at the expense of the consumer in respect of whom the indicated actions were taken. supplier amount ______________________________________________. | Based on the foregoing, FE TverAtomEnergoSbyt asks you to pay the debt for actions to limit / resume the provision of utility services for electricity supply in the amount of _____________________ rubles. on the following details with the personal account number and the purpose of payment: ”
    str2 = ”“ ____ ”__________ 2017 between JSC AtomEnergoSbyt - the Guaranteeing Provider and _____________________ - The Consumer concluded an energy supply agreement No. ___________________, valid for _________________ years, with the condition of its further extension (clause 8.1 of the agreement, article 540 of the Civil Code of the Russian Federation ), in accordance with paragraph 1.1. which the guaranteeing supplier undertook to sell electric energy (power), as well as independently or through third parties to provide electric power transmission services and services, the provision of which is an integral part of the process of supplying electric energy to the consumer, and the Buyer undertakes to pay for the purchased electric energy (power) . | In connection with the violation by the Consumer of its obligations to pay for electric energy (paragraph 5.2. energy supply agreement No. __________________ dated ___________), which led to the formation of consumer debt to the guaranteeing supplier in the amount of more than one billing period, with respect to the consumer’s electric grid facilities _________________ actions were taken to limit / resume the energy consumption regime in accordance with the Full and (or) partial restriction of the mode of consumption of electric energy, approved by Decree of the Government of the Russian Federation dated 04.05.2012 No. 442 (hereinafter - the Rules). | According to paragraph 24 of the Rules, the consumer is obliged to compensate the contractor for the costs of payment of actions to introduce restrictions and the subsequent restoration of the regime of electric energy consumption. | The cost of the expenses of the Guaranteeing supplier for payment of actions to introduce a restriction and the subsequent restoration of the regime of electric energy consumption is ______________________________________________. | Based on the foregoing, OP TverAtomEnergoSbyt asks you to pay the costs of the Guaranteeing supplier for actions to limit / resume the regime of consumption of electric energy in the amount of _______________ rub. for the following details indicating the number of the contract and the purpose of payment: | Purpose of payment: payment of restrictions / resumption of the regime of consumption of electric energy under the contract No. ____________ ” OP “TverAtomEnergoSbyt” asks you to pay the costs of the Guaranteeing supplier for actions to limit / resume the regime of electric energy consumption in the amount of _______________ rubles. for the following details indicating the number of the contract and the purpose of payment: | Purpose of payment: payment of restrictions / resumption of the regime of consumption of electric energy under the contract No. ____________ ” OP “TverAtomEnergoSbyt” asks you to pay the costs of the Guaranteeing supplier for actions to limit / resume the regime of electric energy consumption in the amount of _______________ rubles. for the following details indicating the number of the contract and the purpose of payment: | Purpose of payment: payment of restrictions / resumption of the regime of consumption of electric energy under the contract No. ____________ ”
    Here str1 and str2 are fragments of very similar documents (energy supply agreements for individuals and legal entities, respectively). For a “rough assessment” of document similarity, you can use a comparison without the StrCompare groups (str1, str2, ”*”, ”*”) (the “|” character is not suitable in this case, because it is used in the original lines to break them groups), which reveals a decent resemblance of 0.75 (i.e., documents are clearly of the same nature!). And for concretization of similarities we use the partition into groups: StrCompare (str1, str2, ”|”, ”|”) (or just StrCompare (str1, str2)). Result: 0.3790227.

And now, perhaps the most interesting example. The plot of the fable about the crow and the fox has been known since Aesop's time. Using StrCompare, compare two fables: a classic version by I.A. Krylova and less known from A.P. Sumarokova:
str1 = ”How many times have they repeated to the world, That flattery is vile, harmful; but just not for the future, And in the heart the flatterer will always find a corner. Somewhere to Vorone, God sent a piece of cheese; Piling onto the Spruce Crow, Breakfast was quite ready, Yes, thoughtful, and I kept the cheese in my mouth. To that misfortune, the Fox ran close; Suddenly, the cheesy spirit of Fox stopped: Fox sees cheese, Fox captivated cheese. A tiptoe to the tree is tiptoeing; He twists his tail, does not take his eyes off the Crow, and speaks so sweetly, breathing a little: “Dear, how good! What a neck, what an eye! To tell, so, right, fairy tales! What pearshki! what a sock! And, true, the angel must have a voice! Sing, light, do not be ashamed! What if, sister, With such a beauty you’re a craftswoman singing, - if you would have been a king-bird! ”Veshchunina turned her head from praise, With joy, his breath closed in his craw,
str2 = ”And the birds keep to the craft of man. A crow once carried away a cheese to a cous And sat on an oak. Yes, just have not eaten a tiny bit. The Fox saw a piece in her mouth. And she thinks: “I will give the raven juice! Although I won’t get there, I’ll get this piece, Oak no matter how tall. ” “It's great,” says Fox, “Boyfriend, Funnel, named sister!” You are a beautiful bird! What kind of scissors, what a sock, And you can tell you without hypocrisy, What are you more than anything else, my little light, good! And the parrot is nothing in front of you, soul, More than a hundred times your peacock feathers are more beautiful! ”(Praise is not pleasant to us). “Oh, if you still knew how to sing, So there wouldn’t be a bird like you in the world!” The crow opened its neck wider to be a nightingale, “And to the cheese,” he thinks, “and then I eat.” At this very moment, it’s not a matter of feast here for me! ”I opened my mouth and waited for fasting. He only sees the end of the Lisitsyn tail.
The reason is that cheese is no longer there. The cheese fell out of the company, - Fox for lunch. ”

StrCompare (str1, str2) gives the result 0.5590062 - so there is a similarity between the plot!

Also popular now: