darthunix October 28, 2017 at 17:38

Looking for misspelled names in PostgreSQL

It all started with the fact that I needed to develop a patient search for one internal medical system. The logic of the work was that if we did not find a person in the system, then we need to create it (and duplicate patients cannot be produced). In this regard, one of the sub-tasks was the implementation of people search taking into account typos in their names. Well, since I love PostgreSQL (and when you have a hammer in your hands, everything looks like nails), it's not difficult to guess what I decided to implement a typo search on ...

Typically, this problem can be solved in two ways: using fuzzy search or phonetic algorithms. Being a lazy person and a holy believer that everything had long been ~~stolen~~ invented before us, I turned to the PostgreSQL documentation. PostgreSQL currently has two modules that can help with typos: pg_trgm and fuzzystrmatch .

pg_trgm works with trigrams; it can search by substring and fuzzy search. As indices it works with gist and gin .
fuzzystrmatch can count the Levenshtein distance between words and three phonetic algorithms: Soundex , Metaphone and Double Metaphone . The pitfalls are, firstly, that the Levenshtein function in this module does not allow you to create an index for an arbitrary search query. Secondly, all phonetic algorithms are implemented for the Latin alphabet.

In this regard, I started looking for a solution to the problem where it is lighter, namely from the pg_trgm module.

Trigrams

For simplicity of the model, consider the info table with the patient ID, his last name, first name and patronymic. Since we want gist / gin indexes, we first need to know an important but unpleasant moment: one gist / gin index - one column of the table. It cannot be created, for example, by concatenating several columns. Therefore, below the cat will be created:

pg_trgm extension
patient table storing the name in the form of jsonb (with checks for the existence and filling of keys)
immutable function for constructing a trigram index, converting the name from jsonb to text

Boring SQL code

create extension pg_trgm;
create table info (
  patid integer, fullname jsonb, 
  constraint info_pk primary key (patid),
  constraint fullname_exists check (
    fullname ? 'lname'::text and 
    fullname ? 'fname'::text and 
    fullname ? 'sname'::text
  ),
  constraint fullname_notnull check (
    (fullname ->> 'lname'::text) is not null and 
    (fullname ->> 'fname'::text) is not null
  )
);
create function fullname(in_fullname jsonb)
returns text
language plpgsql
immutable
as $$
begin
  return regexp_replace(
    lower(
      trim(
        coalesce(in_fullname->>'lname', '') || ' ' ||
        coalesce(in_fullname->>'fname', '') || ' ' ||
        coalesce(in_fullname->>'sname', '')
      )
    ),
    'ё',
    'е',
    'g'
   );
exception
  when others then raise exception '%', sqlerrm;
end;
$$;

We insert into the table about 300 thousand records of the full name and proceed.

Trigrams and GIST

So, first we check the gist index for fuzzy trigram search.

Even more boring SQL code

create index info_gist_idx on info 
using gist (fullname(fullname) gist_trgm_ops);
CREATE INDEX
Time: 15054,102 ms 
explain (analyze, buffers) 
select patid, fullname(fullname) <-> 'смерно дени анато' as dist
from info order by dist limit 10; 
  Limit  (cost=0.28..4.35 rows=10 width=8) (actual time=157.378..157.688 rows=10 loops=1)
  Buffers: shared hit=5743
  ->  Index Scan using info_gist_idx on info  (cost=0.28..126822.96 rows=312084 width=8) (actual time=157.371..157.655 rows=10 loops=1)
        Order By: (fullname(fullname) <-> 'смерно дени анато'::text)
        Buffers: shared hit=5743
Planning time: 0.225 ms
Execution time: 158.223 ms
(7 rows)

The index was created 15 seconds, size 45 MB, search by an incomplete name with typos - 158 ms.

Trigrams and GIN

Next, consider the gin index for fuzzy trigram search.

Do you think previous SQL spoilers were boring?

create index info_trgm_idx on info 
using gin(fullname(fullname) gin_trgm_ops);
CREATE INDEX
Time: 10163,401 ms
explain (analyze, buffers) 
select patid, similarity(fullname(fullname), 'смерно дени анато' ) as sml
from info where true
and fullname(fullname) % 'смерно дени анато'
order by sml desc limit 10;
 Limit  (cost=1180.22..1180.25 rows=10 width=8) (actual time=133.086..133.117 rows=8 loops=1)
  Buffers: shared hit=5741
  ->  Sort  (cost=1180.22..1181.00 rows=312 width=8) (actual time=133.080..133.090 rows=8 loops=1)
        Sort Key: (similarity(fullname(fullname), 'смерно дени анато'::text)) DESC
        Sort Method: quicksort  Memory: 25kB
        Buffers: shared hit=5741
        ->  Bitmap Heap Scan on info  (cost=26.70..1173.48 rows=312 width=8) (actual time=132.828..133.048 rows=8 loops=1)
              Recheck Cond: (fullname(fullname) % 'смерно дени анато'::text)
              Heap Blocks: exact=7
              Buffers: shared hit=5741
              ->  Bitmap Index Scan on info_gist_idx  (cost=0.00..26.62 rows=312 width=0) (actual time=132.699..132.699 rows=8 loops=1)
                    Index Cond: (fullname(fullname) % 'смерно дени анато'::text)
                    Buffers: shared hit=5734
Planning time: 0.573 ms
Execution time: 133.225 ms
(15 rows)

Index creation 10 sec, size 18 MB, search by incomplete name with typos - 133 ms.

To be honest, the results are so-so - because I have a Chinese ssd disk from the glorious city of Shenzhen on my laptop. Therefore, we will try to speed up the process by combining fuzzy and full-text search.

Trigrams and full-text search

The idea is extremely simple - to collect from all spellings of surnames, names and patronymics a separate table-dictionary. First, we cut the input search string into tokens, look for each of the tokens in the dictionary table through a fuzzy search, select all possible spelling variants of each token from there, put them in tsquery and do a full-text search on the tsvector index of the info table. The benefit of this plan is that the speed of a fuzzy search by trigrams depends on the width of the line and their number in the column with text. Obviously, the full name dictionary will be more compact than the original column in the info table, which means that the search will be faster. There is only one drawback of the method - when you add each new patient, you will have to update the dictionary if the tokens from the name are not found in it. For verification we need to collect rum from sourceindex for constructing a tsvector index by name in the info table. Rum is a modified version of the gin index, which stores additional information in the leaves. In our case, we use the rum_tsvector_ops operator class, which stores positional information about the token in the index. Therefore, unlike gin, we can use an index-only tsquery query of the form

(смернов|смирнов|чернов)<->(денис)<->(анатольевич)

without going to the table for more information about the order of the token in the tuple. Moreover, the recommendation for gin is the physical existence of the tsvector column, since all found pointers to tuples will have to be double-checked in the table. And if physically there is no tsvector column (you built it with a function for the index), then for each tuple you will have to perform an additional tsvector calculation. In general, rum in this story will be much more productive.

Everest in the world of boring SQL

create extension rum;
create index info_rum_idx on info 
using rum (
  to_tsvector('simple'::regconfig, fullname(fullname)) rum_tsvector_ops
);
CREATE INDEX
Time: 7.545s (7 seconds)
create table patname (
  lex text,
  constraint patname_uniq_idx unique (lex)
);
create index patname_fuzzy_idx on patname using gin (lex gin_trgm_ops);
CREATE INDEX
Time: 0.596s
insert into patname (lex)
select word from ts_stat($$
  select to_tsvector('simple', fullname(fullname)) from info
$$);
explain (analyze, buffers)
with fio as (
  select lexeme as lex, positions[1] as pos 
  from unnest(to_tsvector('simple','смернов дин онатол'))
),
query as(
  select to_tsquery('simple', string_agg(q.tq,'&')) as q from (
    select f.pos,  '('||string_agg(p.lex,'|')||')' as tq from fio as f
    join patname as p on p.lex % f.lex
    group by f.pos
  ) as q
)
select to_tsvector('simple'::regconfig, fullname(fullname)) 
  <=> (select q from query) as rank, * from info
where to_tsvector('simple'::regconfig, fullname(fullname)) 
  @@ (select q from query)
order by to_tsvector('simple'::regconfig, fullname(fullname)) 
  <=> (select q from query);
 Sort  (cost=6453.71..6457.61 rows=1560 width=100) (actual time=68.201..68.202 rows=1 loops=1)
  Sort Key: ((to_tsvector('simple'::regconfig, fullname(info.fullname)) <=> $3))
  Sort Method: quicksort  Memory: 25kB
  Buffers: shared hit=536
  CTE fio
    ->  Function Scan on unnest  (cost=0.00..0.10 rows=10 width=34) (actual time=0.023..0.034 rows=3 loops=1)
  CTE query
    ->  Aggregate  (cost=1484.60..1484.86 rows=1 width=32) (actual time=11.829..11.830 rows=1 loops=1)
          Buffers: shared hit=325
          ->  HashAggregate  (cost=1484.30..1484.48 rows=10 width=34) (actual time=11.640..11.644 rows=2 loops=1)
                Group Key: f.pos
                Buffers: shared hit=325
                ->  Nested Loop  (cost=16.58..1480.53 rows=755 width=19) (actual time=2.940..11.442 rows=62 loops=1)
                      Buffers: shared hit=325
                      ->  CTE Scan on fio f  (cost=0.00..0.20 rows=10 width=34) (actual time=0.028..0.053 rows=3 loops=1)
                      ->  Bitmap Heap Scan on patname p  (cost=16.58..147.28 rows=75 width=17) (actual time=1.905..3.717 rows=21 loops=3)
                            Recheck Cond: (lex % f.lex)
                            Rows Removed by Index Recheck: 321
                            Heap Blocks: exact=275
                            Buffers: shared hit=325
                            ->  Bitmap Index Scan on patname_fuzzy_idx  (cost=0.00..16.57 rows=75 width=0) (actual time=1.277..1.277 rows=342 loops=3)
                                  Index Cond: (lex % f.lex)
                                  Buffers: shared hit=50
  InitPlan 3 (returns $3)
    ->  CTE Scan on query  (cost=0.00..0.02 rows=1 width=32) (actual time=0.004..0.006 rows=1 loops=1)
  InitPlan 4 (returns $4)
    ->  CTE Scan on query query_1  (cost=0.00..0.02 rows=1 width=32) (actual time=11.834..11.839 rows=1 loops=1)
          Buffers: shared hit=325
  ->  Bitmap Heap Scan on info  (cost=31.99..4885.97 rows=1560 width=100) (actual time=68.184..68.187 rows=1 loops=1)
        Recheck Cond: (to_tsvector('simple'::regconfig, fullname(fullname)) @@ $4)
        Heap Blocks: exact=1
        Buffers: shared hit=536
        ->  Bitmap Index Scan on info_rum_idx  (cost=0.00..31.60 rows=1560 width=0) (actual time=67.847..67.847 rows=1 loops=1)
              Index Cond: (to_tsvector('simple'::regconfig, fullname(fullname)) @@ $4)
              Buffers: shared hit=517
Planning time: 5.012 ms
Execution time: 68.583 ms
(37 rows)

In total, a full-text search index was created for 7 seconds (size 13 MB), a token dictionary index was created in 0.6 seconds (size 5.8 MB), and search was 68 ms. Of the disadvantages, the selectivity is worse than the previous options.

Phonetic Algorithms

Having tried the fuzzy search options from the pg_trmg module, I decided to look again at fuzzystrmatch. I did not come up with how to index the Levenshtein function, but the phonetic algorithms were extremely interesting to me. As mentioned above, out of the box in PostgreSQL, phonetic functions are implemented only for the Latin alphabet and are imprisoned for English names. An Internet search of their Russian implementations led me to a wonderful article.on Habré, where the working Metaphone algorithm for Russian names (consisting of Russian letters) was described. Only one thing saddened - although it was simple, it was somehow completely sad to implement this logic on plpgsql, or something on some Python ... And then I remembered that plpython3u is unsafe (functions on it can access file system with the permissions of the postgres process), but a perfectly working language in PostgreSQL. And it would be a sin not to use it. Therefore, I wrote two immutable functions:

phoneme on plpython3u, which turns a token into a phoneme ("smirnov" into "smirnaf") according to the algorithm from an article in Habré
metaphone on plpgsql, which turns not just one token into a phoneme, but a whole text into a set of phonemes. In fact, it's just a binding over the phoneme function.

Metaphone and btree

Next, try to create a regular btree index on the metaphone function and evaluate the speed.

Pass by, there is nothing interesting

create or replace function phoneme (in_lexeme text)
returns text
language plpython3u
immutable
as $$
import re
class Lexeme:
    def __init__(self, body):
        """
        :type body: str
        """
        self.body = body.lower().strip()
        # Правила замены гласных
        self._vowels = {"(?:йо|ио|йе|ие)": "и", "[оыя]": "а", "[еёэ]": "и", "[ю]": "у"}
        # Правила оглушения согласных
        self._consonants = {"б": "п", "з": "с", "д": "т", "в": "ф"}
        # Согласная должна быть оглушена, если за ней следует один из символов списка _deafening_chars
        self._deafening_chars = ["б", "в", "г", "д", "ж", "з", "й"]
        # Удаляемые символы
        self._removable_chars = {"[ъь]": ""}
    def _remove_double_chars(self):
        return Lexeme("".join((char for num, char in enumerate(self.body) if char != self.body[num - 1])))
    def _deafen_consonants(self):
        modified_body = ""
        for num, char in enumerate(self.body):
            if char in self._consonants and (
                num < len(self.body) - 1 and self.body[num + 1] in self._deafening_chars
                or num == len(self.body) - 1
            ):
                modified_body += self._consonants[char]
            else:
                modified_body += char
        return Lexeme(modified_body)
    @staticmethod
    def _regexp_replace(text, char_dict):
        modified_body = text
        for item in char_dict:
            modified_body = re.sub(item, char_dict[item], modified_body)
        return Lexeme(modified_body)
    def _replace_vowels(self):
        return self._regexp_replace(self.body, self._vowels)
    def _remove_chars(self):
        return self._regexp_replace(self.body, self._removable_chars)
    def metaphone(self):
        return self._remove_chars()._replace_vowels()._deafen_consonants()._remove_double_chars().body
return Lexeme(in_lexeme).metaphone()
$$;
create or replace function metaphone (in_phonemes text)
returns text
language plpgsql
immutable
as $$
begin
  return (
    select string_agg(q.lex,' ') from (
      select phoneme(lexeme) as lex from unnest(to_tsvector('simple', in_phonemes))
      order by positions
    ) as q
  );
exception when others then
  raise '%', SQLERRM using errcode = SQLSTATE;
end;
$$;
create index info_metaphone_idx on info (
  metaphone(fullname(fullname)) text_pattern_ops
);
CREATE INDEX
Time: 114.757s (a minute)
explain (analyze, buffers)
select patid, fullname from info 
where metaphone(fullname(fullname)) like 
regexp_replace(metaphone('смернов дин онатол'),'\s','%','g')||'%' 
limit 10;
 Limit  (cost=76.03..1388.96 rows=10 width=96) (actual time=22.452..129.944 rows=3 loops=1)
  Buffers: shared hit=239
  ->  Bitmap Heap Scan on info  (cost=76.03..4146.10 rows=31 width=96) 
         (actual time=22.447..129.927 rows=3 loops=1) 
        Filter: (metaphone(fullname(fullname)) ~~ 'смирнаф%дин%анатал%'::text)
        Rows Removed by Filter: 244
        Heap Blocks: exact=234
        Buffers: shared hit=239
        ->  Bitmap Index Scan on info_metaphone_idx  (cost=0.00..76.02 rows=1560 width=0) 
               (actual time=0.061..0.061 rows=247 loops=1) 
              Index Cond: ((metaphone(fullname(fullname)) ~>=~ 'смирнаф'::text) AND 
                (metaphone(fullname(fullname)) ~<~ 'смирнах'::text))
              Buffers: shared hit=5
Planning time: 1.012 ms
Execution time: 129.977 ms
(12 rows)
Time: 131,802 ms

The index was created for 114 seconds, size 22 MB (it seems that I wrote not the most optimal function on python in terms of performance), the request is 131 ms. The index works only on a small part of the substring, and then the filter works because of the "%". Poorly.

Metaphone and trigrams

Let's try to build the trigram index based on the metaphone function created on plpython3u. But we will use it now not for a fuzzy search, but for searching a substring.

How to reduce the time of request folk remedies using ...

create index info_metaphone_trgm_idx on info 
using gin (metaphone(fullname(fullname)) gin_trgm_ops);
CREATE INDEX
Time: 124.713s (2 minutes)
explain (analyze, buffers)
select patid, fullname from info 
where metaphone(fullname(fullname)) like 
'%'||regexp_replace(metaphone('смернов дин онатол'),'\s','%','g')||'%' limit 10;
 Limit  (cost=92.24..134.98 rows=10 width=96) (actual time=9.562..10.638 rows=3 loops=1)
  Buffers: shared hit=103
  ->  Bitmap Heap Scan on info  (cost=92.24..224.74 rows=31 width=96) (actual time=9.554..10.617 rows=3 loops=1)
        Recheck Cond: (metaphone(fullname(fullname)) ~~ '%смирнаф%дин%анатал%'::text)
        Heap Blocks: exact=2
        Buffers: shared hit=103
        ->  Bitmap Index Scan on info_metaphone_trgm_idx  (cost=0.00..92.23 rows=31 width=0) (actual time=8.354..8.354 rows=3 loops=1)
              Index Cond: (metaphone(fullname(fullname)) ~~ '%смирнаф%дин%анатал%'::text)
              Buffers: shared hit=101
Planning time: 2.029 ms
Execution time: 10.726 ms
(11 rows)
Time: 14,480 ms

Index creation time - 124 sec, size 15 Mb (hello my crooked hands and plpython3u), search - 14 ms.

Summary

Let's summarize the various search options with typos
UPDATE 1: added the implementation of Metaphone from movEAX .
UPDATE 2: added Metaphone implementation on plpgsql from Ivan Milovanov (telegram - milovanov)

Metaphone on plpgsql


create or replace function phoneme (in_lexeme text)
returns text
language plpgsql
immutable
as $$
declare
  res varchar(100) DEFAULT '';
begin
  res := lower(in_lexeme);
  res := regexp_replace(res,'[ъь]','','g');
  res := regexp_replace(res,'(йо|ио|йе|ие)','и','g');
  res := regexp_replace(res,'[оыя]','а','g');
  res := regexp_replace(res,'[еёэ]','и','g');
  res := regexp_replace(res,'ю','у','g');
  res := regexp_replace(res,'б([псткбвгджзфхцчшщ]|$)','п\1','g');
  res := regexp_replace(res,'з([псткбвгджзфхцчшщ]|$)','с\1','g');
  res := regexp_replace(res,'д([псткбвгджзфхцчшщ]|$)','т\1','g');
  res := regexp_replace(res,'в([псткбвгджзфхцчшщ]|$)','ф\1','g');
  res := regexp_replace(res,'г([псткбвгджзфхцчшщ]|$)','к\1','g');  
  res := regexp_replace(res,'дс','ц','g');
  res := regexp_replace(res,'тс','ц','g');
  res := regexp_replace(res,'(.)\1','\1','g');
  return res;
exception
  when others then raise exception '%', sqlerrm;
end;
$$;

Type of search	Index Creation Time	Index size	Typo Search Speed	Remarks
Trigrams gist	15 sec	45 Mb	158 ms
Gin trigrams	10 sec	18 Mb	133 ms
Trigrams and full-text search	7.6 sec	18.8 Mb	68 ms	Worse selectivity, you need to maintain a dictionary of tokens
Metaphone btree	114 sec	22 Mb	131 ms	Insecure language plpython3u
Metaphone Trigrams	124 sec	15 Mb	14 ms	Insecure language plpython3u
Implementing Metaphone Trigrams from movEAX	77.8 sec	16 Mb	14 ms	Insecure language plpython3u
Implementation of Ivan Milovanov at plpgsql	72.0 sec	16 Mb	14 ms

UPDATE 3: when the index contains “smirnaf dinis anatalievich” , the letter “in” in the middle name is not stunned (since there is a vowel after it). If you look at the substring metaphone ('Anatole') , then the letter “c” is not over the vowel, but at the end it will be stunned. To get around this problem, the mquery function is written below and the search is performed by the expression

select metaphone('смирнов денис анатольевич') similar to mquery('Смернов дини онатольев');

Mqery function


create or replace function mquery(in_fullname text)
returns text
language plpgsql
immutable
as $$
declare
  res text;
begin
  res := metaphone(in_fullname);
  res := regexp_replace(res, '(б|п)', '(б|п)', 'g');
  res := regexp_replace(res, '(з|с)', '(з|с)', 'g');
  res := regexp_replace(res, '(д|т)', '(д|т)', 'g');
  res := regexp_replace(res, '(в|ф)', '(в|ф)', 'g');
  res := regexp_replace(res, '(г|к)', '(г|к)', 'g');
  res := regexp_replace(res, '\s', '%', 'g');
  return '%'||res||'%';
exception
  when others then raise exception '%', sqlerrm;
end;
$$;

Since in my case the system will be focused on reading rather than writing (maximum adding a pair of patients per minute), then my option is Metaphone with trigrams. If someone has ideas on how to optimize a function in Python in terms of speed, then unsubscribe in the comments, I will add data to the tests.

Tags: