Organize relevant search across heterogeneous data using Sphinx

    In one of the current projects, the task of searching for data of various types arose, which was successfully solved with the help of the proven Sphinx search engine , but first things first. 

    Formulation of the problem


    The project currently has 2 zones:
    1. geographic area, implemented on the basis of Google Maps, which displays the geographic features that users put on the map (markers, routes and areas);
    2. information zone, which is a large hierarchically organized directory containing information materials.
    It was necessary to solve the problem of simultaneous text search by 3 types of objects: geographical objects, information zone categories and information zone materials - with the ability to filter by the publication date of objects and the categories to which they relate.

    The solution of the problem


    The whole solution is described for a bunch of PHP5 (Symfony), MySQL, Sphinx. I will not describe how to install Sphinx , this information can be read on the official website. I can only say that under Mac OS X it is easy to install using macports .
    We have such a database model (I simplified it so that it is closer to the point) with some set of entries: We
    configure sphinx to index and display search results:
    1. #articles
    2. source article
    3. {
    4.      type              = mysql
    5.      sql_host          = localhost
    6.      sql_user          = root
    7.      sql_pass          = root
    8.      sql_db             = ili_lv
    9.      sql_sock          = /tmp/mysql/mysql.sock
    10.      sql_query_range    = SELECT MIN(id), MAX(id) FROM article
    11.      sql_range_step     = 500
    12.      sql_query_pre     = SET NAMES utf8
    13.      sql_query         = \
    14.             SELECT id * 10 + 1 as id, category_id, 1 as row_type,\
    15.             UNIX_TIMESTAMP(created_at) as created_at, title, descr \
    16.             FROM article WHERE id >= $start AND id <= $end
    17.      sql_attr_uint     = category_id
    18.      sql_attr_uint     = row_type
    19.      sql_attr_timestamp = created_at
    20.      sql_query_info     = SELECT title, descr \
    21.                          FROM article WHERE id = ($id - 1) / 10
    22. }
    23. #categories
    24. source category
    25. {
    26.   #аналогичный блок параметров подключения к БД
    27.   #...
    28.      sql_query_range    = SELECT MIN(id), MAX(id) FROM category
    29.      sql_range_step     = 500
    30.      sql_query_pre     = SET NAMES utf8
    31.      sql_query         = \
    32.             SELECT id * 10 + 2 as id, tree_parent as category_id, 2 as row_type,\
    33.             UNIX_TIMESTAMP(created_at) as created_at, title, descr \
    34.             FROM category WHERE id >= $start AND id <= $end
    35.      sql_attr_uint     = category_id
    36.      sql_attr_uint     = row_type
    37.      sql_attr_timestamp = created_at
    38.      sql_query_info     = SELECT title, descr \
    39.                          FROM category WHERE id = ($id - 2) / 10
    40. }
    41. #geo_objects
    42. source geo_object
    43. {
    44.   #аналогичный блок параметров подключения к БД
    45.   #...
    46.      sql_query_range    = SELECT MIN(id), MAX(id) FROM geo_object
    47.      sql_range_step     = 500
    48.      sql_query_pre     = SET NAMES utf8
    49.      sql_query         = \
    50.             SELECT id * 10 + 3 as id, 0 as category_id, 3 as row_type,\
    51.             UNIX_TIMESTAMP(created_at) as created_at, title, descr \
    52.             FROM geo_object WHERE id >= $start AND id <= $end
    53.      sql_attr_uint     = category_id
    54.      sql_attr_uint     = row_type
    55.      sql_attr_timestamp = created_at
    56.      sql_query_info     = SELECT title, descr \
    57.                          FROM geo_object WHERE id = ($id - 3) / 10
    58. }
    59. index site_search
    60. {
    61.      source             = category
    62.      source             = geo_object
    63.      source             = article
    64.     
    65.      path              = /var/data/sphinx/site_search
    66.      docinfo            = extern
    67.      morphology         = stem_en, stem_ru
    68.      html_strip         = 0
    69.      charset_type      = utf-8
    70.      min_word_len      = 2
    71. }
    * This source code was highlighted with Source Code Highlighter.

    A little more about the configuration options. The source sections, as the name implies, specify the data stores from where the Sphinx indexed information will be retrieved. Such repositories can be databases, text files, html files, xml and even mailboxes. This section also describes which storage fields will be indexed, in what format indexing will be performed (single or batch sampling) and a number of other parameters. In my case, 3 source is described, all of them lead to the same MySQL database, but to different tables.
    The configuration formats are similar, I will describe the source article .
    1.      sql_query_range    = SELECT MIN(id), MAX(id) FROM article
    2.      sql_range_step     = 500
    * This source code was highlighted with Source Code Highlighter.
    With these lines, we “tell” Sphinx to select from the table not by a full select, but by portions of 500 records, so as not to create excessive load during indexing.

    1.      sql_query         = \
    2.             SELECT id * 10 + 1 as id, category_id, 1 as row_type,\
    3.             UNIX_TIMESTAMP(created_at) as created_at, title, descr \
    4.             FROM article WHERE id >= $start AND id <= $end
    * This source code was highlighted with Source Code Highlighter.
     This is the mask of the request sent by Sphinx when indexing data. 3 points are important here:
    • A set of fields for indexing is determined, in our case these are id, text fields and filter fields;
    • The first field is used by Sphinx as id in the generated index. Because id from different tables may coincide, then such a method of generating a unique id is applied;
    • The row_type field allows you to determine what type each of the stored records in the Sphinx index is.

    The following is a description of the attributes that can be used as filters.
    1.      sql_attr_uint     = category_id
    2.      sql_attr_uint     = row_type
    3.      sql_attr_timestamp = created_at
    * This source code was highlighted with Source Code Highlighter.

    Well, the last parameter is the request mask, which will retrieve the information we need by the found id:
    1.     sql_query_info     = SELECT title, descr \
    2.                          FROM geo_object WHERE id = ($id - 1) / 10
    * This source code was highlighted with Source Code Highlighter.

    Further in the configuration file the most important is described - the indexing parameters of the sources we indicated using the index section .
    1.      source             = category
    2.      source             = geo_object
    3.      source             = article
    * This source code was highlighted with Source Code Highlighter.

    A very important point - the index can be formed from several sources. As shown above, data from three tables is merged into the index. Imagine how you would have to sweat to organize such a search using the database! Here we just can make a request to this index, while receiving its ranked results.

    Lines
    1.      path             = /var/data/sphinx/site_search
    2.      docinfo            = extern
    * This source code was highlighted with Source Code Highlighter.
    The parameters for storing the index and the full path to it are indicated.

    What is another Sphinx charm - it supports English and Russian morphology “out of the box”, allowing you to bring query words to normal form. If necessary, this functionality can be expanded
    1.     morphology         = stem_en, stem_ru
    * This source code was highlighted with Source Code Highlighter.

    The remaining three parameters are responsible for cutting out html tags, index encoding and minimum word length, respectively.

    Further it remains only to start indexing.
    1. muxx:~ muxx$ sudo searchd --stop
    2. Sphinx 0.9.8.1-release (r1533)
    3. Copyright (c) 2001-2008, Andrew Aksyonoff
    4. using config file '/usr/local/etc/sphinx.conf'...
    5. stop: succesfully sent SIGTERM to pid 5677
    6. muxx:~ muxx$ sudo indexer --all
    7. Sphinx 0.9.8.1-release (r1533)
    8. Copyright (c) 2001-2008, Andrew Aksyonoff
    9. using config file '/usr/local/etc/sphinx.conf'...
    10. indexing index 'site_search'...
    11. collected 759 docs, 0.0 MB
    12. sorted 0.0 Mhits, 100.0% done
    13. total 759 docs, 22171 bytes
    14. total 0.028 sec, 785871.25 bytes/sec, 26903.45 docs/sec
    15. muxx:~ muxx$ sudo searchd
    16. Sphinx 0.9.8.1-release (r1533)
    17. Copyright (c) 2001-2008, Andrew Aksyonoff
    18. using config file '/usr/local/etc/sphinx.conf'...
    19. creating server socket on 127.0.0.1:3312
    20. muxx:~ muxx$ search мой сложный запрос
    21. Sphinx 0.9.8.1-release (r1533)
    22. Copyright (c) 2001-2008, Andrew Aksyonoff
    23. using config file '/usr/local/etc/sphinx.conf'...
    24. index 'site_search': query 'мой сложный запрос ': returned 0 matches of 0 total in 0.000 sec
    25. words:
    26. 1. 'мо': 0 documents, 0 hits
    27. 2. 'сложн': 0 documents, 0 hits
    28. 3. 'запрос': 0 documents, 0 hits
    29. muxx:~ muxx$
    * This source code was highlighted with Source Code Highlighter.

    In the listing above, we first stop the daemon in case it is running. Then we perform indexing. You can see how fast the indexing speed of Sphinx is.

    // The comments suggested that you can index without stopping the daemon with the sudo indexer --rotate --all command .

    Then run the daemon and run a test request. Sphinx shows how it splits a query and normalizes the words in it. In my example, it worked fine, but didn’t find anything :)

    After making sure that the daemon is working, you can work with Sphinx from Symfony.
    Install the sfSphinxPlugin plugin , connect it in the configurations:
    1. $this->enablePlugins(array('sfSphinxPlugin'));
    * This source code was highlighted with Source Code Highlighter.

    and write a small example request to the daemon:
    1.   $sphinx = new sfSphinxClient($options);
    2.         
    3.   //устанавливаем числовые фильтры, если они заданы
    4.   if ($request->getParameter('category_id'))
    5.      $sphinx->setFilter('category_id', array($request->getParameter('category_id')));
    6.   if ($request->getParameter('row_type'))
    7.      $sphinx->setFilter('row_type', array($request->getParameter('row_type')));
    8.   $dateRange = $request->getParameter('date');
    9.   if ($dateRange['from'] || $dateRange['to'])
    10.   {
    11.      $sphinx->setFilterRange('created_at',
    12.                              !empty($dateRange['from']) ? strtotime($dateRange['from']) : '',
    13.                              !empty($dateRange['to'])  ? strtotime($dateRange['to'])  : '');
    14.   }
    15.   $this->results = $sphinx->Query($request->getParameter('s'), 'site_search');
    16.   if ($this->results === false)
    17.   {
    18.      $this->message = 'Запрос не выполнен: ' . $sphinx->GetLastError();
    19.   }
    20.   else
    21.      //если все путём, то достаем информацию по id индекса
    22.      //и выводим ее в template
    23.      $this->items = $this->retrieveResultRows($this->results);
    * This source code was highlighted with Source Code Highlighter.

    I hope that from my description you can appreciate all the charms of Sphinx, I didn’t talk about all its features, you can already learn the rest yourself if you wish.

    PS: A request to those who have enough karma - create a Sphinx blog, I would transfer the article there.

    PS2: Thanks everyone! The blog is created, the topic is moved there.

    Also popular now: