Postgres in Chinese or setting up Full Text Search in Postgres for Chinese

    image

    A client contacted us with a request to update PostgreSQL to the latest version, and at the same time to teach it Chinese.
    More precisely, to optimize the process of full-text search in Chinese, because the whole thing slowed down mercilessly.

    The following describes how we did this.
    Let's get down to business right away.


    The first step is to build and install Simple Chinese Word Segmentation (SCWS) and the zhparser
    SCWS extension itself
    wget -q -O - http://www.xunsearch.com/scws/down/scws-1.2.2.tar.bz2 | tar xf -
    cd scws-1.2.2 ; ./configure ; make install
    

    Now zhparser
    github.com/amutu/zhparser - there is at the same time a full-fledged instruction, and we do it.
    git clone https://github.com/amutu/zhparser.git
    SCWS_HOME=/usr/local make && make install
    

    If it was successful, enable the extension in the desired database
    psql -U postgres -d test_dbname -c 'CREATE EXTENSION zhparser'
    

    Next, go to the desired database and create the configuration
    test_dbname=# CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser);
    test_dbname=# ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
    

    Let's say we already have a table with data.
    CREATE TABLE messages (
        title       text,
        body        text
    );
    

    And we need to search the columns of title and body. Let there be such a record
    INSERT INTO messages VALUES('批发新', '款新婴幼');
    

    To do this, we need to add the tsv column with the tsvector type for tokens.
    test_dbname=# ALTER TABLE messages ADD COLUMN tsv tsvector;
    

    We hang on it the gin index (https://ru.wikipedia.org/wiki/GIN)
    test_dbname=# CREATE INDEX tsv_idx ON messages USING gin(tsv);
    

    Now, create a trigger for automatically updating tokens in the tsv column using our public.testzhcfg configuration that we created earlier.
    test_dbname=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE  ON messages FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger( tsv, 'public.testzhcfg', title, body );
    

    And the last step.
    We need to update the tsv column with our new trigger. (and as we see it is activated when insert and update), for this we through UPDATE rewrite the current values ​​of the title and body fields, the same values.
    test_dbname=# UPDATE messages SET title=title, body=body;
    

    You can try
    SELECT title, body FROM messages WHERE tsv @@ to_tsquery ('批 & 款');

    Questions, suggestions and suggestions write in the comments.
    Thanks for attention!

    Also popular now: