Integrating a Russian stemming algorithm into SQLite fts3

In this article I want to share my experience of integrating a stemming extension into the SQLite source code. Everything was done on Ubuntu 11.10.

Problem


SQLite's fts3 includes a simple stemmer that implements Porter's stemming algorithm, but there is no implementation for Russian words. That is, a MATCH on one form of a Russian word will not find records that contain a different form of the same word.

Compilation preparation


What is needed

  • sqlite3 sources from the repository;
  • our C-language stemmer (see below);
  • optional readline library (libreadline), if you need a history of input commands for the console client.


The rest of the article assumes that the sqlite3 sources are in $HOME/SQLite.

Stemmer Code

Russian characters are encoded in UTF-8.
The stemmer uses the built-in Porter stemmer for Latin words and implements a similar algorithm for Russian words.
The code was originally written in C++ and loaded as an SQLite extension. I modified it so that it compiles with a C compiler, so it is far from beautiful or strict. Here is the result:
fts3_porter_ext.c
Put the stemmer in $HOME/SQLite/ext/fts3/fts3_porter_ext.c
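
To give a feel for what suffix stripping on UTF-8 Russian text looks like, here is a minimal sketch. It is not the code from fts3_porter_ext.c: the function name toy_russian_stem and the short list of endings are assumptions made for illustration, and a real stemmer applies many more rules in several steps.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy list of noun endings, UTF-8 encoded, longer endings first.
 * Illustrative subset only; NOT the list used by the article's stemmer. */
static const char *const kEndings[] = { "ами", "ов", "и", "ы", "а", "я", "ь" };

/* Strips at most one ending from a NUL-terminated UTF-8 word, in place.
 * Returns the new length in bytes. Comparing raw bytes is safe here because
 * a UTF-8 suffix match always falls on a character boundary. */
size_t toy_russian_stem(char *word) {
    assert(word != NULL);
    size_t len = strlen(word);
    for (size_t i = 0; i < sizeof kEndings / sizeof kEndings[0]; i++) {
        size_t elen = strlen(kEndings[i]);
        /* Keep at least two Cyrillic letters (4 bytes) of stem. */
        if (len >= elen + 4 &&
            memcmp(word + len - elen, kEndings[i], elen) == 0) {
            word[len - elen] = '\0';
            return len - elen;
        }
    }
    return len;
}
```

With this sketch, the two forms "отель" and "отели" (chosen here as a hypothetical example) are both reduced to the same stem, which is what lets a MATCH on one form find the other.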

File editing

Makefile.in

Edit the file $HOME/SQLite/Makefile.in.
  • Add fts3_porter_ext.lo to the LIBOBJS0 variable;
  • Add $(TOP)/ext/fts3/fts3_porter_ext.c to the SRC variable;
  • Write the build rule for fts3_porter_ext.lo (in the Makefile the command line must start with a tab character):
    fts3_porter_ext.lo: $(TOP)/ext/fts3/fts3_porter_ext.c $(HDR) $(EXTHDR)
        $(LTCOMPILE) -DSQLITE_CORE -c $(TOP)/ext/fts3/fts3_porter_ext.c

fts3.c

Edit $HOME/SQLite/ext/fts3/fts3.c.
After the line
void sqlite3Fts3PorterTokenizerModule(sqlite3_tokenizer_module const**ppModule);
add the line
void sqlite3Fts3PorterTokenizerModule1(sqlite3_tokenizer_module const**ppModule);

After the line
sqlite3Fts3PorterTokenizerModule(&pPorter);
add the initialization of our module:
const sqlite3_tokenizer_module *pPorter1 = 0;
sqlite3Fts3PorterTokenizerModule1(&pPorter1);

Finally, after
|| sqlite3Fts3HashInsert(pHash, "porter", 7, (void *)pPorter)
add our module to the hash of built-in tokenizers:
|| sqlite3Fts3HashInsert(pHash, "russian", 8, (void *)pPorter1)
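
The numeric arguments 7 and 8 in these calls appear to be the key length including the terminating NUL byte (strlen plus one), which is consistent with both names. A minimal sketch of that arithmetic follows; tokenizer_key_len is a hypothetical helper written for this note, not part of SQLite.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not an SQLite function): length of a tokenizer name
 * as it is passed in the calls above, counting the terminating NUL byte. */
int tokenizer_key_len(const char *name) {
    assert(name != NULL);
    return (int)strlen(name) + 1;
}
```

If you register the tokenizer under a different name, the length argument would need to change accordingly under this assumption.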

mkfts3amal.tcl

Edit $HOME/SQLite/ext/fts3/mkfts3amal.tcl.
After the line
fts3_tokenizer1.c
add
fts3_porter_ext.c

mksqlite3c.tcl

Edit $HOME/SQLite/tool/mksqlite3c.tcl.
After the line
fts3_tokenizer1.c
add
fts3_porter_ext.c


Compilation


Run the following (it is better to replace --prefix=$HOME with something more sensible; this will be the installation path):
cd $HOME/SQLite && mkdir build && cd build && ../configure --prefix=$HOME CFLAGS='-DSQLITE_SOUNDEX -DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS' && make

Now check that our stemmer is in sqlite3.c:
grep fts3_porter_ext.c sqlite3.c

The output should look something like this:
/************** Begin file fts3_porter_ext.c *********************************/
/************** End of fts3_porter_ext.c *************************************/

Now install sqlite3 on the computer:
sudo make install


Usage


When creating fts3 tables, specify our stemmer, for example like this:
CREATE VIRTUAL TABLE tag_fti USING fts3(name, tokenize=russian);
Now MATCH queries against the tag_fti table will use our stemmer.



Summary


We end up with two files, sqlite3.c and sqlite3.h, that can be included in our projects.
There is no need to load extension modules.
We also get a console client that correctly handles queries to fts3 tables created by our applications. The converse also holds: tables created by the console client will be handled correctly by our applications.
I will be glad if this article proves useful to someone.

Upd: corrected links
