Postgres in Chinese or setting up Full Text Search in Postgres for Chinese
A client contacted us with a request to update PostgreSQL to the latest version, and at the same time to teach it Chinese.
More precisely, to optimize the process of full-text search in Chinese, because the whole thing slowed down mercilessly.
The following describes how we did this.
Let's get down to business right away.
The first step is to build and install Simple Chinese Word Segmentation (SCWS) and the zhparser
SCWS extension itself
wget -q -O - http://www.xunsearch.com/scws/down/scws-1.2.2.tar.bz2 | tar xf -
cd scws-1.2.2 ; ./configure ; make install
Now zhparser
github.com/amutu/zhparser - there is at the same time a full-fledged instruction, and we do it.
git clone https://github.com/amutu/zhparser.git
SCWS_HOME=/usr/local make && make install
If it was successful, enable the extension in the desired database
psql -U postgres -d test_dbname -c 'CREATE EXTENSION zhparser'
Next, go to the desired database and create the configuration
test_dbname=# CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser);
test_dbname=# ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
Let's say we already have a table with data.
CREATE TABLE messages (
title text,
body text
);
And we need to search the columns of title and body. Let there be such a record
INSERT INTO messages VALUES('批发新', '款新婴幼');
To do this, we need to add the tsv column with the tsvector type for tokens.
test_dbname=# ALTER TABLE messages ADD COLUMN tsv tsvector;
We hang on it the gin index (https://ru.wikipedia.org/wiki/GIN)
test_dbname=# CREATE INDEX tsv_idx ON messages USING gin(tsv);
Now, create a trigger for automatically updating tokens in the tsv column using our public.testzhcfg configuration that we created earlier.
test_dbname=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON messages FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger( tsv, 'public.testzhcfg', title, body );
And the last step.
We need to update the tsv column with our new trigger. (and as we see it is activated when insert and update), for this we through UPDATE rewrite the current values of the title and body fields, the same values.
test_dbname=# UPDATE messages SET title=title, body=body;
You can try
SELECT title, body FROM messages WHERE tsv @@ to_tsquery ('批 & 款');
Questions, suggestions and suggestions write in the comments.
Thanks for attention!