
Organize relevant search across heterogeneous data using Sphinx

Formulation of the problem
The project currently has 2 zones:
- geographic area, implemented on the basis of Google Maps, which displays the geographic features that users put on the map (markers, routes and areas);
- information zone, which is a large hierarchically organized directory containing information materials.
The solution of the problem
The whole solution is described for a bunch of PHP5 (Symfony), MySQL, Sphinx. I will not describe how to install Sphinx , this information can be read on the official website. I can only say that under Mac OS X it is easy to install using macports .
We have such a database model (I simplified it so that it is closer to the point) with some set of entries: We

configure sphinx to index and display search results:
- #articles
- source article
- {
- type = mysql
- sql_host = localhost
- sql_user = root
- sql_pass = root
- sql_db = ili_lv
- sql_sock = /tmp/mysql/mysql.sock
- sql_query_range = SELECT MIN(id), MAX(id) FROM article
- sql_range_step = 500
- sql_query_pre = SET NAMES utf8
- sql_query = \
- SELECT id * 10 + 1 as id, category_id, 1 as row_type,\
- UNIX_TIMESTAMP(created_at) as created_at, title, descr \
- FROM article WHERE id >= $start AND id <= $end
- sql_attr_uint = category_id
- sql_attr_uint = row_type
- sql_attr_timestamp = created_at
- sql_query_info = SELECT title, descr \
- FROM article WHERE id = ($id - 1) / 10
- }
- #categories
- source category
- {
- #аналогичный блок параметров подключения к БД
- #...
- sql_query_range = SELECT MIN(id), MAX(id) FROM category
- sql_range_step = 500
- sql_query_pre = SET NAMES utf8
- sql_query = \
- SELECT id * 10 + 2 as id, tree_parent as category_id, 2 as row_type,\
- UNIX_TIMESTAMP(created_at) as created_at, title, descr \
- FROM category WHERE id >= $start AND id <= $end
- sql_attr_uint = category_id
- sql_attr_uint = row_type
- sql_attr_timestamp = created_at
- sql_query_info = SELECT title, descr \
- FROM category WHERE id = ($id - 2) / 10
- }
- #geo_objects
- source geo_object
- {
- #аналогичный блок параметров подключения к БД
- #...
- sql_query_range = SELECT MIN(id), MAX(id) FROM geo_object
- sql_range_step = 500
- sql_query_pre = SET NAMES utf8
- sql_query = \
- SELECT id * 10 + 3 as id, 0 as category_id, 3 as row_type,\
- UNIX_TIMESTAMP(created_at) as created_at, title, descr \
- FROM geo_object WHERE id >= $start AND id <= $end
- sql_attr_uint = category_id
- sql_attr_uint = row_type
- sql_attr_timestamp = created_at
- sql_query_info = SELECT title, descr \
- FROM geo_object WHERE id = ($id - 3) / 10
- }
- index site_search
- {
- source = category
- source = geo_object
- source = article
-
- path = /var/data/sphinx/site_search
- docinfo = extern
- morphology = stem_en, stem_ru
- html_strip = 0
- charset_type = utf-8
- min_word_len = 2
- }
* This source code was highlighted with Source Code Highlighter.
A little more about the configuration options. The source sections, as the name implies, specify the data stores from where the Sphinx indexed information will be retrieved. Such repositories can be databases, text files, html files, xml and even mailboxes. This section also describes which storage fields will be indexed, in what format indexing will be performed (single or batch sampling) and a number of other parameters. In my case, 3 source is described, all of them lead to the same MySQL database, but to different tables.
The configuration formats are similar, I will describe the source article .
- sql_query_range = SELECT MIN(id), MAX(id) FROM article
- sql_range_step = 500
* This source code was highlighted with Source Code Highlighter.
With these lines, we “tell” Sphinx to select from the table not by a full select, but by portions of 500 records, so as not to create excessive load during indexing.- sql_query = \
- SELECT id * 10 + 1 as id, category_id, 1 as row_type,\
- UNIX_TIMESTAMP(created_at) as created_at, title, descr \
- FROM article WHERE id >= $start AND id <= $end
* This source code was highlighted with Source Code Highlighter.
This is the mask of the request sent by Sphinx when indexing data. 3 points are important here:- A set of fields for indexing is determined, in our case these are id, text fields and filter fields;
- The first field is used by Sphinx as id in the generated index. Because id from different tables may coincide, then such a method of generating a unique id is applied;
- The row_type field allows you to determine what type each of the stored records in the Sphinx index is.
The following is a description of the attributes that can be used as filters.
- sql_attr_uint = category_id
- sql_attr_uint = row_type
- sql_attr_timestamp = created_at
* This source code was highlighted with Source Code Highlighter.
Well, the last parameter is the request mask, which will retrieve the information we need by the found id:
- sql_query_info = SELECT title, descr \
- FROM geo_object WHERE id = ($id - 1) / 10
* This source code was highlighted with Source Code Highlighter.
Further in the configuration file the most important is described - the indexing parameters of the sources we indicated using the index section .
- source = category
- source = geo_object
- source = article
* This source code was highlighted with Source Code Highlighter.
A very important point - the index can be formed from several sources. As shown above, data from three tables is merged into the index. Imagine how you would have to sweat to organize such a search using the database! Here we just can make a request to this index, while receiving its ranked results.
Lines
- path = /var/data/sphinx/site_search
- docinfo = extern
* This source code was highlighted with Source Code Highlighter.
The parameters for storing the index and the full path to it are indicated. What is another Sphinx charm - it supports English and Russian morphology “out of the box”, allowing you to bring query words to normal form. If necessary, this functionality can be expanded .
- morphology = stem_en, stem_ru
* This source code was highlighted with Source Code Highlighter.
The remaining three parameters are responsible for cutting out html tags, index encoding and minimum word length, respectively.
Further it remains only to start indexing.
- muxx:~ muxx$ sudo searchd --stop
- Sphinx 0.9.8.1-release (r1533)
- Copyright (c) 2001-2008, Andrew Aksyonoff
- using config file '/usr/local/etc/sphinx.conf'...
- stop: succesfully sent SIGTERM to pid 5677
- muxx:~ muxx$ sudo indexer --all
- Sphinx 0.9.8.1-release (r1533)
- Copyright (c) 2001-2008, Andrew Aksyonoff
- using config file '/usr/local/etc/sphinx.conf'...
- indexing index 'site_search'...
- collected 759 docs, 0.0 MB
- sorted 0.0 Mhits, 100.0% done
- total 759 docs, 22171 bytes
- total 0.028 sec, 785871.25 bytes/sec, 26903.45 docs/sec
- muxx:~ muxx$ sudo searchd
- Sphinx 0.9.8.1-release (r1533)
- Copyright (c) 2001-2008, Andrew Aksyonoff
- using config file '/usr/local/etc/sphinx.conf'...
- creating server socket on 127.0.0.1:3312
- muxx:~ muxx$ search мой сложный запрос
- Sphinx 0.9.8.1-release (r1533)
- Copyright (c) 2001-2008, Andrew Aksyonoff
- using config file '/usr/local/etc/sphinx.conf'...
- index 'site_search': query 'мой сложный запрос ': returned 0 matches of 0 total in 0.000 sec
- words:
- 1. 'мо': 0 documents, 0 hits
- 2. 'сложн': 0 documents, 0 hits
- 3. 'запрос': 0 documents, 0 hits
- muxx:~ muxx$
* This source code was highlighted with Source Code Highlighter.
In the listing above, we first stop the daemon in case it is running. Then we perform indexing. You can see how fast the indexing speed of Sphinx is.
// The comments suggested that you can index without stopping the daemon with the sudo indexer --rotate --all command .
Then run the daemon and run a test request. Sphinx shows how it splits a query and normalizes the words in it. In my example, it worked fine, but didn’t find anything :)
After making sure that the daemon is working, you can work with Sphinx from Symfony.
Install the sfSphinxPlugin plugin , connect it in the configurations:
- $this->enablePlugins(array('sfSphinxPlugin'));
* This source code was highlighted with Source Code Highlighter.
and write a small example request to the daemon:
- $sphinx = new sfSphinxClient($options);
-
- //устанавливаем числовые фильтры, если они заданы
- if ($request->getParameter('category_id'))
- $sphinx->setFilter('category_id', array($request->getParameter('category_id')));
- if ($request->getParameter('row_type'))
- $sphinx->setFilter('row_type', array($request->getParameter('row_type')));
- $dateRange = $request->getParameter('date');
- if ($dateRange['from'] || $dateRange['to'])
- {
- $sphinx->setFilterRange('created_at',
- !empty($dateRange['from']) ? strtotime($dateRange['from']) : '',
- !empty($dateRange['to']) ? strtotime($dateRange['to']) : '');
- }
- $this->results = $sphinx->Query($request->getParameter('s'), 'site_search');
- if ($this->results === false)
- {
- $this->message = 'Запрос не выполнен: ' . $sphinx->GetLastError();
- }
- else
- //если все путём, то достаем информацию по id индекса
- //и выводим ее в template
- $this->items = $this->retrieveResultRows($this->results);
* This source code was highlighted with Source Code Highlighter.
I hope that from my description you can appreciate all the charms of Sphinx, I didn’t talk about all its features, you can already learn the rest yourself if you wish.
PS: A request to those who have enough karma - create a Sphinx blog, I would transfer the article there.
PS2: Thanks everyone! The blog is created, the topic is moved there.