Looking for misspelled names in PostgreSQL

    It all started with the fact that I needed to develop a patient search for one internal medical system. The logic of the work was that if we did not find a person in the system, then we need to create it (and duplicate patients cannot be produced). In this regard, one of the sub-tasks was the implementation of people search taking into account typos in their names. Well, since I love PostgreSQL (and when you have a hammer in your hands, everything looks like nails), it's not difficult to guess what I decided to implement a typo search on ...



    Typically, this problem can be solved in two ways: using fuzzy search or phonetic algorithms. Being a lazy person and a holy believer that everything had long been stolen invented before us, I turned to the PostgreSQL documentation. PostgreSQL currently has two modules that can help with typos: pg_trgm and fuzzystrmatch .

    1. pg_trgm works with trigrams; it can search by substring and fuzzy search. As indices it works with gist and gin .
    2. fuzzystrmatch can count the Levenshtein distance between words and three phonetic algorithms: Soundex , Metaphone and Double Metaphone . The pitfalls are, firstly, that the Levenshtein function in this module does not allow you to create an index for an arbitrary search query. Secondly, all phonetic algorithms are implemented for the Latin alphabet.

    In this regard, I started looking for a solution to the problem where it is lighter, namely from the pg_trgm module.

    Trigrams


    For simplicity of the model, consider the info table with the patient ID, his last name, first name and patronymic. Since we want gist / gin indexes, we first need to know an important but unpleasant moment: one gist / gin index - one column of the table. It cannot be created, for example, by concatenating several columns. Therefore, below the cat will be created:

    • pg_trgm extension
    • patient table storing the name in the form of jsonb (with checks for the existence and filling of keys)
    • immutable function for constructing a trigram index, converting the name from jsonb to text

    Boring SQL code
    create extension pg_trgm;
    create table info (
      patid integer, fullname jsonb, 
      constraint info_pk primary key (patid),
      constraint fullname_exists check (
        fullname ? 'lname'::text and 
        fullname ? 'fname'::text and 
        fullname ? 'sname'::text
      ),
      constraint fullname_notnull check (
        (fullname ->> 'lname'::text) is not null and 
        (fullname ->> 'fname'::text) is not null
      )
    );
    create function fullname(in_fullname jsonb)
    returns text
    language plpgsql
    immutable
    as $$
    begin
      return regexp_replace(
        lower(
          trim(
            coalesce(in_fullname->>'lname', '') || ' ' ||
            coalesce(in_fullname->>'fname', '') || ' ' ||
            coalesce(in_fullname->>'sname', '')
          )
        ),
        'ё',
        'е',
        'g'
       );
    exception
      when others then raise exception '%', sqlerrm;
    end;
    $$;


    We insert into the table about 300 thousand records of the full name and proceed.

    Trigrams and GIST


    So, first we check the gist index for fuzzy trigram search.

    Even more boring SQL code
    create index info_gist_idx on info 
    using gist (fullname(fullname) gist_trgm_ops);
    CREATE INDEX
    Time: 15054,102 ms 
    explain (analyze, buffers) 
    select patid, fullname(fullname) <-> 'смерно дени анато' as dist
    from info order by dist limit 10; 
      Limit  (cost=0.28..4.35 rows=10 width=8) (actual time=157.378..157.688 rows=10 loops=1)
      Buffers: shared hit=5743
      ->  Index Scan using info_gist_idx on info  (cost=0.28..126822.96 rows=312084 width=8) (actual time=157.371..157.655 rows=10 loops=1)
            Order By: (fullname(fullname) <-> 'смерно дени анато'::text)
            Buffers: shared hit=5743
    Planning time: 0.225 ms
    Execution time: 158.223 ms
    (7 rows)
    


    The index was created 15 seconds, size 45 MB, search by an incomplete name with typos - 158 ms.

    Trigrams and GIN


    Next, consider the gin index for fuzzy trigram search.

    Do you think previous SQL spoilers were boring?
    create index info_trgm_idx on info 
    using gin(fullname(fullname) gin_trgm_ops);
    CREATE INDEX
    Time: 10163,401 ms
    explain (analyze, buffers) 
    select patid, similarity(fullname(fullname), 'смерно дени анато' ) as sml
    from info where true
    and fullname(fullname) % 'смерно дени анато'
    order by sml desc limit 10;
     Limit  (cost=1180.22..1180.25 rows=10 width=8) (actual time=133.086..133.117 rows=8 loops=1)
      Buffers: shared hit=5741
      ->  Sort  (cost=1180.22..1181.00 rows=312 width=8) (actual time=133.080..133.090 rows=8 loops=1)
            Sort Key: (similarity(fullname(fullname), 'смерно дени анато'::text)) DESC
            Sort Method: quicksort  Memory: 25kB
            Buffers: shared hit=5741
            ->  Bitmap Heap Scan on info  (cost=26.70..1173.48 rows=312 width=8) (actual time=132.828..133.048 rows=8 loops=1)
                  Recheck Cond: (fullname(fullname) % 'смерно дени анато'::text)
                  Heap Blocks: exact=7
                  Buffers: shared hit=5741
                  ->  Bitmap Index Scan on info_gist_idx  (cost=0.00..26.62 rows=312 width=0) (actual time=132.699..132.699 rows=8 loops=1)
                        Index Cond: (fullname(fullname) % 'смерно дени анато'::text)
                        Buffers: shared hit=5734
    Planning time: 0.573 ms
    Execution time: 133.225 ms
    (15 rows)


    Index creation 10 sec, size 18 MB, search by incomplete name with typos - 133 ms.

    To be honest, the results are so-so - because I have a Chinese ssd disk from the glorious city of Shenzhen on my laptop. Therefore, we will try to speed up the process by combining fuzzy and full-text search.

    Trigrams and full-text search


    The idea is extremely simple - to collect from all spellings of surnames, names and patronymics a separate table-dictionary. First, we cut the input search string into tokens, look for each of the tokens in the dictionary table through a fuzzy search, select all possible spelling variants of each token from there, put them in tsquery and do a full-text search on the tsvector index of the info table. The benefit of this plan is that the speed of a fuzzy search by trigrams depends on the width of the line and their number in the column with text. Obviously, the full name dictionary will be more compact than the original column in the info table, which means that the search will be faster. There is only one drawback of the method - when you add each new patient, you will have to update the dictionary if the tokens from the name are not found in it. For verification we need to collect rum from sourceindex for constructing a tsvector index by name in the info table. Rum is a modified version of the gin index, which stores additional information in the leaves. In our case, we use the rum_tsvector_ops operator class, which stores positional information about the token in the index. Therefore, unlike gin, we can use an index-only tsquery query of the form
    (смернов|смирнов|чернов)<->(денис)<->(анатольевич) 
    without going to the table for more information about the order of the token in the tuple. Moreover, the recommendation for gin is the physical existence of the tsvector column, since all found pointers to tuples will have to be double-checked in the table. And if physically there is no tsvector column (you built it with a function for the index), then for each tuple you will have to perform an additional tsvector calculation. In general, rum in this story will be much more productive.

    Everest in the world of boring SQL
    create extension rum;
    create index info_rum_idx on info 
    using rum (
      to_tsvector('simple'::regconfig, fullname(fullname)) rum_tsvector_ops
    );
    CREATE INDEX
    Time: 7.545s (7 seconds)
    create table patname (
      lex text,
      constraint patname_uniq_idx unique (lex)
    );
    create index patname_fuzzy_idx on patname using gin (lex gin_trgm_ops);
    CREATE INDEX
    Time: 0.596s
    insert into patname (lex)
    select word from ts_stat($$
      select to_tsvector('simple', fullname(fullname)) from info
    $$);
    explain (analyze, buffers)
    with fio as (
      select lexeme as lex, positions[1] as pos 
      from unnest(to_tsvector('simple','смернов дин онатол'))
    ),
    query as(
      select to_tsquery('simple', string_agg(q.tq,'&')) as q from (
        select f.pos,  '('||string_agg(p.lex,'|')||')' as tq from fio as f
        join patname as p on p.lex % f.lex
        group by f.pos
      ) as q
    )
    select to_tsvector('simple'::regconfig, fullname(fullname)) 
      <=> (select q from query) as rank, * from info
    where to_tsvector('simple'::regconfig, fullname(fullname)) 
      @@ (select q from query)
    order by to_tsvector('simple'::regconfig, fullname(fullname)) 
      <=> (select q from query);
     Sort  (cost=6453.71..6457.61 rows=1560 width=100) (actual time=68.201..68.202 rows=1 loops=1)
      Sort Key: ((to_tsvector('simple'::regconfig, fullname(info.fullname)) <=> $3))
      Sort Method: quicksort  Memory: 25kB
      Buffers: shared hit=536
      CTE fio
        ->  Function Scan on unnest  (cost=0.00..0.10 rows=10 width=34) (actual time=0.023..0.034 rows=3 loops=1)
      CTE query
        ->  Aggregate  (cost=1484.60..1484.86 rows=1 width=32) (actual time=11.829..11.830 rows=1 loops=1)
              Buffers: shared hit=325
              ->  HashAggregate  (cost=1484.30..1484.48 rows=10 width=34) (actual time=11.640..11.644 rows=2 loops=1)
                    Group Key: f.pos
                    Buffers: shared hit=325
                    ->  Nested Loop  (cost=16.58..1480.53 rows=755 width=19) (actual time=2.940..11.442 rows=62 loops=1)
                          Buffers: shared hit=325
                          ->  CTE Scan on fio f  (cost=0.00..0.20 rows=10 width=34) (actual time=0.028..0.053 rows=3 loops=1)
                          ->  Bitmap Heap Scan on patname p  (cost=16.58..147.28 rows=75 width=17) (actual time=1.905..3.717 rows=21 loops=3)
                                Recheck Cond: (lex % f.lex)
                                Rows Removed by Index Recheck: 321
                                Heap Blocks: exact=275
                                Buffers: shared hit=325
                                ->  Bitmap Index Scan on patname_fuzzy_idx  (cost=0.00..16.57 rows=75 width=0) (actual time=1.277..1.277 rows=342 loops=3)
                                      Index Cond: (lex % f.lex)
                                      Buffers: shared hit=50
      InitPlan 3 (returns $3)
        ->  CTE Scan on query  (cost=0.00..0.02 rows=1 width=32) (actual time=0.004..0.006 rows=1 loops=1)
      InitPlan 4 (returns $4)
        ->  CTE Scan on query query_1  (cost=0.00..0.02 rows=1 width=32) (actual time=11.834..11.839 rows=1 loops=1)
              Buffers: shared hit=325
      ->  Bitmap Heap Scan on info  (cost=31.99..4885.97 rows=1560 width=100) (actual time=68.184..68.187 rows=1 loops=1)
            Recheck Cond: (to_tsvector('simple'::regconfig, fullname(fullname)) @@ $4)
            Heap Blocks: exact=1
            Buffers: shared hit=536
            ->  Bitmap Index Scan on info_rum_idx  (cost=0.00..31.60 rows=1560 width=0) (actual time=67.847..67.847 rows=1 loops=1)
                  Index Cond: (to_tsvector('simple'::regconfig, fullname(fullname)) @@ $4)
                  Buffers: shared hit=517
    Planning time: 5.012 ms
    Execution time: 68.583 ms
    (37 rows)


    In total, a full-text search index was created for 7 seconds (size 13 MB), a token dictionary index was created in 0.6 seconds (size 5.8 MB), and search was 68 ms. Of the disadvantages, the selectivity is worse than the previous options.

    Phonetic Algorithms


    Having tried the fuzzy search options from the pg_trmg module, I decided to look again at fuzzystrmatch. I did not come up with how to index the Levenshtein function, but the phonetic algorithms were extremely interesting to me. As mentioned above, out of the box in PostgreSQL, phonetic functions are implemented only for the Latin alphabet and are imprisoned for English names. An Internet search of their Russian implementations led me to a wonderful article.on Habré, where the working Metaphone algorithm for Russian names (consisting of Russian letters) was described. Only one thing saddened - although it was simple, it was somehow completely sad to implement this logic on plpgsql, or something on some Python ... And then I remembered that plpython3u is unsafe (functions on it can access file system with the permissions of the postgres process), but a perfectly working language in PostgreSQL. And it would be a sin not to use it. Therefore, I wrote two immutable functions:

    • phoneme on plpython3u, which turns a token into a phoneme ("smirnov" into "smirnaf") according to the algorithm from an article in Habré
    • metaphone on plpgsql, which turns not just one token into a phoneme, but a whole text into a set of phonemes. In fact, it's just a binding over the phoneme function.

    Metaphone and btree


    Next, try to create a regular btree index on the metaphone function and evaluate the speed.
    Pass by, there is nothing interesting
    create or replace function phoneme (in_lexeme text)
    returns text
    language plpython3u
    immutable
    as $$
    import re
    class Lexeme:
        def __init__(self, body):
            """
            :type body: str
            """
            self.body = body.lower().strip()
            # Правила замены гласных
            self._vowels = {"(?:йо|ио|йе|ие)": "и", "[оыя]": "а", "[еёэ]": "и", "[ю]": "у"}
            # Правила оглушения согласных
            self._consonants = {"б": "п", "з": "с", "д": "т", "в": "ф"}
            # Согласная должна быть оглушена, если за ней следует один из символов списка _deafening_chars
            self._deafening_chars = ["б", "в", "г", "д", "ж", "з", "й"]
            # Удаляемые символы
            self._removable_chars = {"[ъь]": ""}
        def _remove_double_chars(self):
            return Lexeme("".join((char for num, char in enumerate(self.body) if char != self.body[num - 1])))
        def _deafen_consonants(self):
            modified_body = ""
            for num, char in enumerate(self.body):
                if char in self._consonants and (
                    num < len(self.body) - 1 and self.body[num + 1] in self._deafening_chars
                    or num == len(self.body) - 1
                ):
                    modified_body += self._consonants[char]
                else:
                    modified_body += char
            return Lexeme(modified_body)
        @staticmethod
        def _regexp_replace(text, char_dict):
            modified_body = text
            for item in char_dict:
                modified_body = re.sub(item, char_dict[item], modified_body)
            return Lexeme(modified_body)
        def _replace_vowels(self):
            return self._regexp_replace(self.body, self._vowels)
        def _remove_chars(self):
            return self._regexp_replace(self.body, self._removable_chars)
        def metaphone(self):
            return self._remove_chars()._replace_vowels()._deafen_consonants()._remove_double_chars().body
    return Lexeme(in_lexeme).metaphone()
    $$;
    create or replace function metaphone (in_phonemes text)
    returns text
    language plpgsql
    immutable
    as $$
    begin
      return (
        select string_agg(q.lex,' ') from (
          select phoneme(lexeme) as lex from unnest(to_tsvector('simple', in_phonemes))
          order by positions
        ) as q
      );
    exception when others then
      raise '%', SQLERRM using errcode = SQLSTATE;
    end;
    $$;
    create index info_metaphone_idx on info (
      metaphone(fullname(fullname)) text_pattern_ops
    );
    CREATE INDEX
    Time: 114.757s (a minute)
    explain (analyze, buffers)
    select patid, fullname from info 
    where metaphone(fullname(fullname)) like 
    regexp_replace(metaphone('смернов дин онатол'),'\s','%','g')||'%' 
    limit 10;
     Limit  (cost=76.03..1388.96 rows=10 width=96) (actual time=22.452..129.944 rows=3 loops=1)
      Buffers: shared hit=239
      ->  Bitmap Heap Scan on info  (cost=76.03..4146.10 rows=31 width=96) 
             (actual time=22.447..129.927 rows=3 loops=1) 
            Filter: (metaphone(fullname(fullname)) ~~ 'смирнаф%дин%анатал%'::text)
            Rows Removed by Filter: 244
            Heap Blocks: exact=234
            Buffers: shared hit=239
            ->  Bitmap Index Scan on info_metaphone_idx  (cost=0.00..76.02 rows=1560 width=0) 
                   (actual time=0.061..0.061 rows=247 loops=1) 
                  Index Cond: ((metaphone(fullname(fullname)) ~>=~ 'смирнаф'::text) AND 
                    (metaphone(fullname(fullname)) ~<~ 'смирнах'::text))
                  Buffers: shared hit=5
    Planning time: 1.012 ms
    Execution time: 129.977 ms
    (12 rows)
    Time: 131,802 ms


    The index was created for 114 seconds, size 22 MB (it seems that I wrote not the most optimal function on python in terms of performance), the request is 131 ms. The index works only on a small part of the substring, and then the filter works because of the "%". Poorly.

    Metaphone and trigrams


    Let's try to build the trigram index based on the metaphone function created on plpython3u. But we will use it now not for a fuzzy search, but for searching a substring.

    How to reduce the time of request folk remedies using ...
    create index info_metaphone_trgm_idx on info 
    using gin (metaphone(fullname(fullname)) gin_trgm_ops);
    CREATE INDEX
    Time: 124.713s (2 minutes)
    explain (analyze, buffers)
    select patid, fullname from info 
    where metaphone(fullname(fullname)) like 
    '%'||regexp_replace(metaphone('смернов дин онатол'),'\s','%','g')||'%' limit 10;
     Limit  (cost=92.24..134.98 rows=10 width=96) (actual time=9.562..10.638 rows=3 loops=1)
      Buffers: shared hit=103
      ->  Bitmap Heap Scan on info  (cost=92.24..224.74 rows=31 width=96) (actual time=9.554..10.617 rows=3 loops=1)
            Recheck Cond: (metaphone(fullname(fullname)) ~~ '%смирнаф%дин%анатал%'::text)
            Heap Blocks: exact=2
            Buffers: shared hit=103
            ->  Bitmap Index Scan on info_metaphone_trgm_idx  (cost=0.00..92.23 rows=31 width=0) (actual time=8.354..8.354 rows=3 loops=1)
                  Index Cond: (metaphone(fullname(fullname)) ~~ '%смирнаф%дин%анатал%'::text)
                  Buffers: shared hit=101
    Planning time: 2.029 ms
    Execution time: 10.726 ms
    (11 rows)
    Time: 14,480 ms


    Index creation time - 124 sec, size 15 Mb (hello my crooked hands and plpython3u), search - 14 ms.

    Summary


    Let's summarize the various search options with typos
    UPDATE 1: added the implementation of Metaphone from movEAX .
    UPDATE 2: added Metaphone implementation on plpgsql from Ivan Milovanov (telegram - milovanov)
    Metaphone on plpgsql
    
    create or replace function phoneme (in_lexeme text)
    returns text
    language plpgsql
    immutable
    as $$
    declare
      res varchar(100) DEFAULT '';
    begin
      res := lower(in_lexeme);
      res := regexp_replace(res,'[ъь]','','g');
      res := regexp_replace(res,'(йо|ио|йе|ие)','и','g');
      res := regexp_replace(res,'[оыя]','а','g');
      res := regexp_replace(res,'[еёэ]','и','g');
      res := regexp_replace(res,'ю','у','g');
      res := regexp_replace(res,'б([псткбвгджзфхцчшщ]|$)','п\1','g');
      res := regexp_replace(res,'з([псткбвгджзфхцчшщ]|$)','с\1','g');
      res := regexp_replace(res,'д([псткбвгджзфхцчшщ]|$)','т\1','g');
      res := regexp_replace(res,'в([псткбвгджзфхцчшщ]|$)','ф\1','g');
      res := regexp_replace(res,'г([псткбвгджзфхцчшщ]|$)','к\1','g');  
      res := regexp_replace(res,'дс','ц','g');
      res := regexp_replace(res,'тс','ц','g');
      res := regexp_replace(res,'(.)\1','\1','g');
      return res;
    exception
      when others then raise exception '%', sqlerrm;
    end;
    $$;
    


    Type of search
    Index Creation Time
    Index size
    Typo Search Speed
    Remarks
    Trigrams gist
    15 sec
    45 Mb
    158 ms
    Gin trigrams
    10 sec
    18 Mb
    133 ms
    Trigrams and full-text search
    7.6 sec
    18.8 Mb
    68 ms
    Worse selectivity, you need to maintain a dictionary of tokens
    Metaphone btree
    114 sec
    22 Mb
    131 ms
    Insecure language plpython3u
    Metaphone Trigrams
    124 sec
    15 Mb
    14 ms
    Insecure language plpython3u
    Implementing Metaphone Trigrams from movEAX
    77.8 sec
    16 Mb
    14 ms
    Insecure language plpython3u
    Implementation of Ivan Milovanov at plpgsql
    72.0 sec
    16 Mb
    14 ms

    UPDATE 3: when the index contains “smirnaf dinis anatalievich” , the letter “in” in the middle name is not stunned (since there is a vowel after it). If you look at the substring metaphone ('Anatole') , then the letter “c” is not over the vowel, but at the end it will be stunned. To get around this problem, the mquery function is written below and the search is performed by the expression
    select metaphone('смирнов денис анатольевич') similar to mquery('Смернов дини онатольев');

    Mqery function
    
    create or replace function mquery(in_fullname text)
    returns text
    language plpgsql
    immutable
    as $$
    declare
      res text;
    begin
      res := metaphone(in_fullname);
      res := regexp_replace(res, '(б|п)', '(б|п)', 'g');
      res := regexp_replace(res, '(з|с)', '(з|с)', 'g');
      res := regexp_replace(res, '(д|т)', '(д|т)', 'g');
      res := regexp_replace(res, '(в|ф)', '(в|ф)', 'g');
      res := regexp_replace(res, '(г|к)', '(г|к)', 'g');
      res := regexp_replace(res, '\s', '%', 'g');
      return '%'||res||'%';
    exception
      when others then raise exception '%', sqlerrm;
    end;
    $$;
    



    Since in my case the system will be focused on reading rather than writing (maximum adding a pair of patients per minute), then my option is Metaphone with trigrams. If someone has ideas on how to optimize a function in Python in terms of speed, then unsubscribe in the comments, I will add data to the tests.

    Also popular now: