![](http://habrastorage.org/getpro/habr/avatars/074/602/98d/07460298dbe1e7e6dcbdd70f3d434b2a.jpg)
Looking for misspelled names in PostgreSQL
It all started with the fact that I needed to develop a patient search for one internal medical system. The logic of the work was that if we did not find a person in the system, then we need to create it (and duplicate patients cannot be produced). In this regard, one of the sub-tasks was the implementation of people search taking into account typos in their names. Well, since I love PostgreSQL (and when you have a hammer in your hands, everything looks like nails), it's not difficult to guess what I decided to implement a typo search on ...
![](https://habrastorage.org/webt/sj/rm/sq/sjrmsqmnovu_xaq--unhed2a3qs.jpeg)
Typically, this problem can be solved in two ways: using fuzzy search or phonetic algorithms. Being a lazy person and a holy believer that everything had long beenstolen invented before us, I turned to the PostgreSQL documentation. PostgreSQL currently has two modules that can help with typos: pg_trgm and fuzzystrmatch .
In this regard, I started looking for a solution to the problem where it is lighter, namely from the pg_trgm module.
For simplicity of the model, consider the info table with the patient ID, his last name, first name and patronymic. Since we want gist / gin indexes, we first need to know an important but unpleasant moment: one gist / gin index - one column of the table. It cannot be created, for example, by concatenating several columns. Therefore, below the cat will be created:
We insert into the table about 300 thousand records of the full name and proceed.
So, first we check the gist index for fuzzy trigram search.
The index was created 15 seconds, size 45 MB, search by an incomplete name with typos - 158 ms.
Next, consider the gin index for fuzzy trigram search.
Index creation 10 sec, size 18 MB, search by incomplete name with typos - 133 ms.
To be honest, the results are so-so - because I have a Chinese ssd disk from the glorious city of Shenzhen on my laptop. Therefore, we will try to speed up the process by combining fuzzy and full-text search.
The idea is extremely simple - to collect from all spellings of surnames, names and patronymics a separate table-dictionary. First, we cut the input search string into tokens, look for each of the tokens in the dictionary table through a fuzzy search, select all possible spelling variants of each token from there, put them in tsquery and do a full-text search on the tsvector index of the info table. The benefit of this plan is that the speed of a fuzzy search by trigrams depends on the width of the line and their number in the column with text. Obviously, the full name dictionary will be more compact than the original column in the info table, which means that the search will be faster. There is only one drawback of the method - when you add each new patient, you will have to update the dictionary if the tokens from the name are not found in it. For verification we need to collect rum from sourceindex for constructing a tsvector index by name in the info table. Rum is a modified version of the gin index, which stores additional information in the leaves. In our case, we use the rum_tsvector_ops operator class, which stores positional information about the token in the index. Therefore, unlike gin, we can use an index-only tsquery query of the form
In total, a full-text search index was created for 7 seconds (size 13 MB), a token dictionary index was created in 0.6 seconds (size 5.8 MB), and search was 68 ms. Of the disadvantages, the selectivity is worse than the previous options.
Having tried the fuzzy search options from the pg_trmg module, I decided to look again at fuzzystrmatch. I did not come up with how to index the Levenshtein function, but the phonetic algorithms were extremely interesting to me. As mentioned above, out of the box in PostgreSQL, phonetic functions are implemented only for the Latin alphabet and are imprisoned for English names. An Internet search of their Russian implementations led me to a wonderful article.on Habré, where the working Metaphone algorithm for Russian names (consisting of Russian letters) was described. Only one thing saddened - although it was simple, it was somehow completely sad to implement this logic on plpgsql, or something on some Python ... And then I remembered that plpython3u is unsafe (functions on it can access file system with the permissions of the postgres process), but a perfectly working language in PostgreSQL. And it would be a sin not to use it. Therefore, I wrote two immutable functions:
Next, try to create a regular btree index on the metaphone function and evaluate the speed.
The index was created for 114 seconds, size 22 MB (it seems that I wrote not the most optimal function on python in terms of performance), the request is 131 ms. The index works only on a small part of the substring, and then the filter works because of the "%". Poorly.
Let's try to build the trigram index based on the metaphone function created on plpython3u. But we will use it now not for a fuzzy search, but for searching a substring.
Index creation time - 124 sec, size 15 Mb (hello my crooked hands and plpython3u), search - 14 ms.
Let's summarize the various search options with typos
UPDATE 1: added the implementation of Metaphone from movEAX .
UPDATE 2: added Metaphone implementation on plpgsql from Ivan Milovanov (telegram - milovanov)
UPDATE 3: when the index contains “smirnaf dinis anatalievich” , the letter “in” in the middle name is not stunned (since there is a vowel after it). If you look at the substring metaphone ('Anatole') , then the letter “c” is not over the vowel, but at the end it will be stunned. To get around this problem, the mquery function is written below and the search is performed by the expression
Since in my case the system will be focused on reading rather than writing (maximum adding a pair of patients per minute), then my option is Metaphone with trigrams. If someone has ideas on how to optimize a function in Python in terms of speed, then unsubscribe in the comments, I will add data to the tests.
![](https://habrastorage.org/webt/sj/rm/sq/sjrmsqmnovu_xaq--unhed2a3qs.jpeg)
Typically, this problem can be solved in two ways: using fuzzy search or phonetic algorithms. Being a lazy person and a holy believer that everything had long been
- pg_trgm works with trigrams; it can search by substring and fuzzy search. As indices it works with gist and gin .
- fuzzystrmatch can count the Levenshtein distance between words and three phonetic algorithms: Soundex , Metaphone and Double Metaphone . The pitfalls are, firstly, that the Levenshtein function in this module does not allow you to create an index for an arbitrary search query. Secondly, all phonetic algorithms are implemented for the Latin alphabet.
In this regard, I started looking for a solution to the problem where it is lighter, namely from the pg_trgm module.
Trigrams
For simplicity of the model, consider the info table with the patient ID, his last name, first name and patronymic. Since we want gist / gin indexes, we first need to know an important but unpleasant moment: one gist / gin index - one column of the table. It cannot be created, for example, by concatenating several columns. Therefore, below the cat will be created:
- pg_trgm extension
- patient table storing the name in the form of jsonb (with checks for the existence and filling of keys)
- immutable function for constructing a trigram index, converting the name from jsonb to text
Boring SQL code
create extension pg_trgm;
create table info (
patid integer, fullname jsonb,
constraint info_pk primary key (patid),
constraint fullname_exists check (
fullname ? 'lname'::text and
fullname ? 'fname'::text and
fullname ? 'sname'::text
),
constraint fullname_notnull check (
(fullname ->> 'lname'::text) is not null and
(fullname ->> 'fname'::text) is not null
)
);
create function fullname(in_fullname jsonb)
returns text
language plpgsql
immutable
as $$
begin
return regexp_replace(
lower(
trim(
coalesce(in_fullname->>'lname', '') || ' ' ||
coalesce(in_fullname->>'fname', '') || ' ' ||
coalesce(in_fullname->>'sname', '')
)
),
'ё',
'е',
'g'
);
exception
when others then raise exception '%', sqlerrm;
end;
$$;
We insert into the table about 300 thousand records of the full name and proceed.
Trigrams and GIST
So, first we check the gist index for fuzzy trigram search.
Even more boring SQL code
create index info_gist_idx on info
using gist (fullname(fullname) gist_trgm_ops);
CREATE INDEX
Time: 15054,102 ms
explain (analyze, buffers)
select patid, fullname(fullname) <-> 'смерно дени анато' as dist
from info order by dist limit 10;
Limit (cost=0.28..4.35 rows=10 width=8) (actual time=157.378..157.688 rows=10 loops=1)
Buffers: shared hit=5743
-> Index Scan using info_gist_idx on info (cost=0.28..126822.96 rows=312084 width=8) (actual time=157.371..157.655 rows=10 loops=1)
Order By: (fullname(fullname) <-> 'смерно дени анато'::text)
Buffers: shared hit=5743
Planning time: 0.225 ms
Execution time: 158.223 ms
(7 rows)
The index was created 15 seconds, size 45 MB, search by an incomplete name with typos - 158 ms.
Trigrams and GIN
Next, consider the gin index for fuzzy trigram search.
Do you think previous SQL spoilers were boring?
create index info_trgm_idx on info
using gin(fullname(fullname) gin_trgm_ops);
CREATE INDEX
Time: 10163,401 ms
explain (analyze, buffers)
select patid, similarity(fullname(fullname), 'смерно дени анато' ) as sml
from info where true
and fullname(fullname) % 'смерно дени анато'
order by sml desc limit 10;
Limit (cost=1180.22..1180.25 rows=10 width=8) (actual time=133.086..133.117 rows=8 loops=1)
Buffers: shared hit=5741
-> Sort (cost=1180.22..1181.00 rows=312 width=8) (actual time=133.080..133.090 rows=8 loops=1)
Sort Key: (similarity(fullname(fullname), 'смерно дени анато'::text)) DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=5741
-> Bitmap Heap Scan on info (cost=26.70..1173.48 rows=312 width=8) (actual time=132.828..133.048 rows=8 loops=1)
Recheck Cond: (fullname(fullname) % 'смерно дени анато'::text)
Heap Blocks: exact=7
Buffers: shared hit=5741
-> Bitmap Index Scan on info_gist_idx (cost=0.00..26.62 rows=312 width=0) (actual time=132.699..132.699 rows=8 loops=1)
Index Cond: (fullname(fullname) % 'смерно дени анато'::text)
Buffers: shared hit=5734
Planning time: 0.573 ms
Execution time: 133.225 ms
(15 rows)
Index creation 10 sec, size 18 MB, search by incomplete name with typos - 133 ms.
To be honest, the results are so-so - because I have a Chinese ssd disk from the glorious city of Shenzhen on my laptop. Therefore, we will try to speed up the process by combining fuzzy and full-text search.
Trigrams and full-text search
The idea is extremely simple - to collect from all spellings of surnames, names and patronymics a separate table-dictionary. First, we cut the input search string into tokens, look for each of the tokens in the dictionary table through a fuzzy search, select all possible spelling variants of each token from there, put them in tsquery and do a full-text search on the tsvector index of the info table. The benefit of this plan is that the speed of a fuzzy search by trigrams depends on the width of the line and their number in the column with text. Obviously, the full name dictionary will be more compact than the original column in the info table, which means that the search will be faster. There is only one drawback of the method - when you add each new patient, you will have to update the dictionary if the tokens from the name are not found in it. For verification we need to collect rum from sourceindex for constructing a tsvector index by name in the info table. Rum is a modified version of the gin index, which stores additional information in the leaves. In our case, we use the rum_tsvector_ops operator class, which stores positional information about the token in the index. Therefore, unlike gin, we can use an index-only tsquery query of the form
(смернов|смирнов|чернов)<->(денис)<->(анатольевич)
without going to the table for more information about the order of the token in the tuple. Moreover, the recommendation for gin is the physical existence of the tsvector column, since all found pointers to tuples will have to be double-checked in the table. And if physically there is no tsvector column (you built it with a function for the index), then for each tuple you will have to perform an additional tsvector calculation. In general, rum in this story will be much more productive.Everest in the world of boring SQL
create extension rum;
create index info_rum_idx on info
using rum (
to_tsvector('simple'::regconfig, fullname(fullname)) rum_tsvector_ops
);
CREATE INDEX
Time: 7.545s (7 seconds)
create table patname (
lex text,
constraint patname_uniq_idx unique (lex)
);
create index patname_fuzzy_idx on patname using gin (lex gin_trgm_ops);
CREATE INDEX
Time: 0.596s
insert into patname (lex)
select word from ts_stat($$
select to_tsvector('simple', fullname(fullname)) from info
$$);
explain (analyze, buffers)
with fio as (
select lexeme as lex, positions[1] as pos
from unnest(to_tsvector('simple','смернов дин онатол'))
),
query as(
select to_tsquery('simple', string_agg(q.tq,'&')) as q from (
select f.pos, '('||string_agg(p.lex,'|')||')' as tq from fio as f
join patname as p on p.lex % f.lex
group by f.pos
) as q
)
select to_tsvector('simple'::regconfig, fullname(fullname))
<=> (select q from query) as rank, * from info
where to_tsvector('simple'::regconfig, fullname(fullname))
@@ (select q from query)
order by to_tsvector('simple'::regconfig, fullname(fullname))
<=> (select q from query);
Sort (cost=6453.71..6457.61 rows=1560 width=100) (actual time=68.201..68.202 rows=1 loops=1)
Sort Key: ((to_tsvector('simple'::regconfig, fullname(info.fullname)) <=> $3))
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=536
CTE fio
-> Function Scan on unnest (cost=0.00..0.10 rows=10 width=34) (actual time=0.023..0.034 rows=3 loops=1)
CTE query
-> Aggregate (cost=1484.60..1484.86 rows=1 width=32) (actual time=11.829..11.830 rows=1 loops=1)
Buffers: shared hit=325
-> HashAggregate (cost=1484.30..1484.48 rows=10 width=34) (actual time=11.640..11.644 rows=2 loops=1)
Group Key: f.pos
Buffers: shared hit=325
-> Nested Loop (cost=16.58..1480.53 rows=755 width=19) (actual time=2.940..11.442 rows=62 loops=1)
Buffers: shared hit=325
-> CTE Scan on fio f (cost=0.00..0.20 rows=10 width=34) (actual time=0.028..0.053 rows=3 loops=1)
-> Bitmap Heap Scan on patname p (cost=16.58..147.28 rows=75 width=17) (actual time=1.905..3.717 rows=21 loops=3)
Recheck Cond: (lex % f.lex)
Rows Removed by Index Recheck: 321
Heap Blocks: exact=275
Buffers: shared hit=325
-> Bitmap Index Scan on patname_fuzzy_idx (cost=0.00..16.57 rows=75 width=0) (actual time=1.277..1.277 rows=342 loops=3)
Index Cond: (lex % f.lex)
Buffers: shared hit=50
InitPlan 3 (returns $3)
-> CTE Scan on query (cost=0.00..0.02 rows=1 width=32) (actual time=0.004..0.006 rows=1 loops=1)
InitPlan 4 (returns $4)
-> CTE Scan on query query_1 (cost=0.00..0.02 rows=1 width=32) (actual time=11.834..11.839 rows=1 loops=1)
Buffers: shared hit=325
-> Bitmap Heap Scan on info (cost=31.99..4885.97 rows=1560 width=100) (actual time=68.184..68.187 rows=1 loops=1)
Recheck Cond: (to_tsvector('simple'::regconfig, fullname(fullname)) @@ $4)
Heap Blocks: exact=1
Buffers: shared hit=536
-> Bitmap Index Scan on info_rum_idx (cost=0.00..31.60 rows=1560 width=0) (actual time=67.847..67.847 rows=1 loops=1)
Index Cond: (to_tsvector('simple'::regconfig, fullname(fullname)) @@ $4)
Buffers: shared hit=517
Planning time: 5.012 ms
Execution time: 68.583 ms
(37 rows)
In total, a full-text search index was created for 7 seconds (size 13 MB), a token dictionary index was created in 0.6 seconds (size 5.8 MB), and search was 68 ms. Of the disadvantages, the selectivity is worse than the previous options.
Phonetic Algorithms
Having tried the fuzzy search options from the pg_trmg module, I decided to look again at fuzzystrmatch. I did not come up with how to index the Levenshtein function, but the phonetic algorithms were extremely interesting to me. As mentioned above, out of the box in PostgreSQL, phonetic functions are implemented only for the Latin alphabet and are imprisoned for English names. An Internet search of their Russian implementations led me to a wonderful article.on Habré, where the working Metaphone algorithm for Russian names (consisting of Russian letters) was described. Only one thing saddened - although it was simple, it was somehow completely sad to implement this logic on plpgsql, or something on some Python ... And then I remembered that plpython3u is unsafe (functions on it can access file system with the permissions of the postgres process), but a perfectly working language in PostgreSQL. And it would be a sin not to use it. Therefore, I wrote two immutable functions:
- phoneme on plpython3u, which turns a token into a phoneme ("smirnov" into "smirnaf") according to the algorithm from an article in Habré
- metaphone on plpgsql, which turns not just one token into a phoneme, but a whole text into a set of phonemes. In fact, it's just a binding over the phoneme function.
Metaphone and btree
Next, try to create a regular btree index on the metaphone function and evaluate the speed.
Pass by, there is nothing interesting
create or replace function phoneme (in_lexeme text)
returns text
language plpython3u
immutable
as $$
import re
class Lexeme:
def __init__(self, body):
"""
:type body: str
"""
self.body = body.lower().strip()
# Правила замены гласных
self._vowels = {"(?:йо|ио|йе|ие)": "и", "[оыя]": "а", "[еёэ]": "и", "[ю]": "у"}
# Правила оглушения согласных
self._consonants = {"б": "п", "з": "с", "д": "т", "в": "ф"}
# Согласная должна быть оглушена, если за ней следует один из символов списка _deafening_chars
self._deafening_chars = ["б", "в", "г", "д", "ж", "з", "й"]
# Удаляемые символы
self._removable_chars = {"[ъь]": ""}
def _remove_double_chars(self):
return Lexeme("".join((char for num, char in enumerate(self.body) if char != self.body[num - 1])))
def _deafen_consonants(self):
modified_body = ""
for num, char in enumerate(self.body):
if char in self._consonants and (
num < len(self.body) - 1 and self.body[num + 1] in self._deafening_chars
or num == len(self.body) - 1
):
modified_body += self._consonants[char]
else:
modified_body += char
return Lexeme(modified_body)
@staticmethod
def _regexp_replace(text, char_dict):
modified_body = text
for item in char_dict:
modified_body = re.sub(item, char_dict[item], modified_body)
return Lexeme(modified_body)
def _replace_vowels(self):
return self._regexp_replace(self.body, self._vowels)
def _remove_chars(self):
return self._regexp_replace(self.body, self._removable_chars)
def metaphone(self):
return self._remove_chars()._replace_vowels()._deafen_consonants()._remove_double_chars().body
return Lexeme(in_lexeme).metaphone()
$$;
create or replace function metaphone (in_phonemes text)
returns text
language plpgsql
immutable
as $$
begin
return (
select string_agg(q.lex,' ') from (
select phoneme(lexeme) as lex from unnest(to_tsvector('simple', in_phonemes))
order by positions
) as q
);
exception when others then
raise '%', SQLERRM using errcode = SQLSTATE;
end;
$$;
create index info_metaphone_idx on info (
metaphone(fullname(fullname)) text_pattern_ops
);
CREATE INDEX
Time: 114.757s (a minute)
explain (analyze, buffers)
select patid, fullname from info
where metaphone(fullname(fullname)) like
regexp_replace(metaphone('смернов дин онатол'),'\s','%','g')||'%'
limit 10;
Limit (cost=76.03..1388.96 rows=10 width=96) (actual time=22.452..129.944 rows=3 loops=1)
Buffers: shared hit=239
-> Bitmap Heap Scan on info (cost=76.03..4146.10 rows=31 width=96)
(actual time=22.447..129.927 rows=3 loops=1)
Filter: (metaphone(fullname(fullname)) ~~ 'смирнаф%дин%анатал%'::text)
Rows Removed by Filter: 244
Heap Blocks: exact=234
Buffers: shared hit=239
-> Bitmap Index Scan on info_metaphone_idx (cost=0.00..76.02 rows=1560 width=0)
(actual time=0.061..0.061 rows=247 loops=1)
Index Cond: ((metaphone(fullname(fullname)) ~>=~ 'смирнаф'::text) AND
(metaphone(fullname(fullname)) ~<~ 'смирнах'::text))
Buffers: shared hit=5
Planning time: 1.012 ms
Execution time: 129.977 ms
(12 rows)
Time: 131,802 ms
The index was created for 114 seconds, size 22 MB (it seems that I wrote not the most optimal function on python in terms of performance), the request is 131 ms. The index works only on a small part of the substring, and then the filter works because of the "%". Poorly.
Metaphone and trigrams
Let's try to build the trigram index based on the metaphone function created on plpython3u. But we will use it now not for a fuzzy search, but for searching a substring.
How to reduce the time of request folk remedies using ...
create index info_metaphone_trgm_idx on info
using gin (metaphone(fullname(fullname)) gin_trgm_ops);
CREATE INDEX
Time: 124.713s (2 minutes)
explain (analyze, buffers)
select patid, fullname from info
where metaphone(fullname(fullname)) like
'%'||regexp_replace(metaphone('смернов дин онатол'),'\s','%','g')||'%' limit 10;
Limit (cost=92.24..134.98 rows=10 width=96) (actual time=9.562..10.638 rows=3 loops=1)
Buffers: shared hit=103
-> Bitmap Heap Scan on info (cost=92.24..224.74 rows=31 width=96) (actual time=9.554..10.617 rows=3 loops=1)
Recheck Cond: (metaphone(fullname(fullname)) ~~ '%смирнаф%дин%анатал%'::text)
Heap Blocks: exact=2
Buffers: shared hit=103
-> Bitmap Index Scan on info_metaphone_trgm_idx (cost=0.00..92.23 rows=31 width=0) (actual time=8.354..8.354 rows=3 loops=1)
Index Cond: (metaphone(fullname(fullname)) ~~ '%смирнаф%дин%анатал%'::text)
Buffers: shared hit=101
Planning time: 2.029 ms
Execution time: 10.726 ms
(11 rows)
Time: 14,480 ms
Index creation time - 124 sec, size 15 Mb (hello my crooked hands and plpython3u), search - 14 ms.
Summary
Let's summarize the various search options with typos
UPDATE 1: added the implementation of Metaphone from movEAX .
UPDATE 2: added Metaphone implementation on plpgsql from Ivan Milovanov (telegram - milovanov)
Metaphone on plpgsql
create or replace function phoneme (in_lexeme text)
returns text
language plpgsql
immutable
as $$
declare
res varchar(100) DEFAULT '';
begin
res := lower(in_lexeme);
res := regexp_replace(res,'[ъь]','','g');
res := regexp_replace(res,'(йо|ио|йе|ие)','и','g');
res := regexp_replace(res,'[оыя]','а','g');
res := regexp_replace(res,'[еёэ]','и','g');
res := regexp_replace(res,'ю','у','g');
res := regexp_replace(res,'б([псткбвгджзфхцчшщ]|$)','п\1','g');
res := regexp_replace(res,'з([псткбвгджзфхцчшщ]|$)','с\1','g');
res := regexp_replace(res,'д([псткбвгджзфхцчшщ]|$)','т\1','g');
res := regexp_replace(res,'в([псткбвгджзфхцчшщ]|$)','ф\1','g');
res := regexp_replace(res,'г([псткбвгджзфхцчшщ]|$)','к\1','g');
res := regexp_replace(res,'дс','ц','g');
res := regexp_replace(res,'тс','ц','g');
res := regexp_replace(res,'(.)\1','\1','g');
return res;
exception
when others then raise exception '%', sqlerrm;
end;
$$;
Type of search | Index Creation Time | Index size | Typo Search Speed | Remarks |
Trigrams gist | 15 sec | 45 Mb | 158 ms | |
Gin trigrams | 10 sec | 18 Mb | 133 ms | |
Trigrams and full-text search | 7.6 sec | 18.8 Mb | 68 ms | Worse selectivity, you need to maintain a dictionary of tokens |
Metaphone btree | 114 sec | 22 Mb | 131 ms | Insecure language plpython3u |
Metaphone Trigrams | 124 sec | 15 Mb | 14 ms | Insecure language plpython3u |
Implementing Metaphone Trigrams from movEAX | 77.8 sec | 16 Mb | 14 ms | Insecure language plpython3u |
Implementation of Ivan Milovanov at plpgsql | 72.0 sec | 16 Mb | 14 ms |
UPDATE 3: when the index contains “smirnaf dinis anatalievich” , the letter “in” in the middle name is not stunned (since there is a vowel after it). If you look at the substring metaphone ('Anatole') , then the letter “c” is not over the vowel, but at the end it will be stunned. To get around this problem, the mquery function is written below and the search is performed by the expression
select metaphone('смирнов денис анатольевич') similar to mquery('Смернов дини онатольев');
Mqery function
create or replace function mquery(in_fullname text)
returns text
language plpgsql
immutable
as $$
declare
res text;
begin
res := metaphone(in_fullname);
res := regexp_replace(res, '(б|п)', '(б|п)', 'g');
res := regexp_replace(res, '(з|с)', '(з|с)', 'g');
res := regexp_replace(res, '(д|т)', '(д|т)', 'g');
res := regexp_replace(res, '(в|ф)', '(в|ф)', 'g');
res := regexp_replace(res, '(г|к)', '(г|к)', 'g');
res := regexp_replace(res, '\s', '%', 'g');
return '%'||res||'%';
exception
when others then raise exception '%', sqlerrm;
end;
$$;
Since in my case the system will be focused on reading rather than writing (maximum adding a pair of patients per minute), then my option is Metaphone with trigrams. If someone has ideas on how to optimize a function in Python in terms of speed, then unsubscribe in the comments, I will add data to the tests.