Fuzzy search in relational databases

    To search for the necessary information on websites and in mobile applications, a search is often used for words or phrases that the user freely enters from the keyboard (and does not select, for example, from the list). Naturally, the user can make mistakes and typos. In this case, full-text search, full-text indexes that are implemented in most databases do not give the expected result and are practically useless. Such functionality is increasingly being implemented based on elasticsearch.

    Solutions using elasticsearch have one major drawback - there is a very high probability of a mismatch in the main database, such as PostgreSQL, MySQL, mongodb, and elasticsearch, which stores search indexes.

    An ideal option would be to have a “bridge” that would take on the function of reconciling data in the event that the database for the search engine would be unavailable during the update of the main database. But I have not yet found the implementation of such a bridge. For example, in one of the bundle projects mongodb and lucene, this problem is mentioned.
    On a normal shutdown of a LuMongo node, all segments committed and are distributed to existing nodes. This allows for rolling shutdowns of the nodes to update them. On unexpected shutdown the segments will fail to the existing nodes without committing. These indexes could require rollback or repair. Currently this is not handled automatically but it will be in future releases using Lucene's built in index repair. Since the documents are stored in MongoDB and not in the index, another possible solution could be fetching the documents for a corrupted segment and reindexing them. MongoDB also provides seamless failover through replication. MongoDB's replication is data center aware backups across datacenters are possible.
    How is this problem solved in practice? No way. If the data is not very large, then the database is simply reindexed by a timer. If the database is large and it is often impossible to reindex it, then everything remains as it is, inconsistent, it is just a little more difficult to identify this inconsistency.

    I hope that stable "bridges" that will steadily update indexes in elasticsearch or lucene appear sooner or later. Now there is a need to find a working solution.

    One option is to use a single database for storing data and for searching. Regarding the use of elasticsearch as such a single base, in almost all discussions on the forums there was a consensus that such a solution would not work. So I started looking for a database in which it would be possible to create full-text indexes that support fuzzy search. Since the main engine for such indexes, lucene, was developed in java, the circle of databases in which I looked for such an opportunity was clearly outlined.

    As it turned out, there are at least two solutions that use the lucene library and are at the poduction ready application level: these are orientdb and h2.

    In orientdb, working with fuzzy full-text search is very simple:

    create class russian
    create property russian.message string
    create index russian.message on russian(message)
      fulltext engine lucene metadata {
        "analyzer": "org.apache.lucene.analysis.ru.RussianAnalyzer"
      }
    select * from russian where message lucene 'Харбахрб~0.5' limit 2
    

    In h2 it’s a bit more complicated because an index is a separate table with which you need to associate the main table. But a little more complicated it does not mean difficult.

    CREATE ALIAS IF NOT EXISTS FTL_INIT FOR 
    "org.h2.fulltext.FullTextLucene.init";
    CALL FTL_INIT();
    DROP TABLE IF EXISTS TEST;
    CREATE TABLE TEST(ID INT PRIMARY KEY, FIRST_NAME VARCHAR, 
    LAST_NAME VARCHAR);
    CALL FTL_CREATE_INDEX('PUBLIC', 'TEST', NULL);
    INSERT INTO TEST VALUES(1, 'John', 'Wayne');
    INSERT INTO TEST VALUES(2, 'Elton', 'John');
    SELECT * FROM FTL_SEARCH_DATA('John', 0, 0);
    SELECT * FROM FTL_SEARCH_DATA('LAST_NAME:John', 0, 0);
    

    apapacy@gmail.com
    April 22, 2018

    Also popular now: