Searching MediaWiki with Sphinx

Hello, reader!
Some time ago I was tasked with deploying MediaWiki on a corporate network.
The main problem with this deployment was searching the information stored in the wiki.
In this article I would like to describe how to make the Sphinx search engine work with MediaWiki.
I am writing this because I could not find Russian-language documentation or a reasonably decent guide that would help my colleagues start using this wonderful search engine quickly and easily.
Or maybe I just don't know how to use Google...
Why this is needed
The goal of this deployment in our organization is to move the corporate knowledge base into a format that is more convenient to read, correct, and extend.
For context, our company carries out end-to-end document management automation projects.
The customers are large, and the solutions are complex and sometimes non-standard.
The wiki is meant to hold articles not only about projects, but also about technical solutions, their particular features, and innovative methods and technologies.
We also plan to use it as a source of information for new employees: they have to study a fairly large amount of material, and access to it currently leaves much to be desired.
As mentioned above, the popular MediaWiki engine was taken as the basis. The key problem I anticipated from the very beginning was finding information.
Everyone knows that the standard search is really bad, so the natural question was how to fix that.
Preparation
Everything is deployed on Windows Server 2012 R2 64-bit, with IIS as the web server:

All components were the latest versions at the time of installation. In the screenshot the SphinxSearch extension is already connected; I describe how to do that a little further down.
Download the search engine itself from the official site. I chose 2.1.9-release (July 2014).
You also need to download the SphinxSearch extension for MediaWiki.
I took it from the Wikimedia Git repository.
Version 0.9.0 was current at the time.
Install and configure the Sphinx search engine
After downloading the engine, I unpacked it into C:\inetpub\wwwroot\mw\sphinx.
The next step is to prepare the config. I took the sphinx.conf.in file as a basis and ended up with the following working config, which I give here with comments.
# data source definition for the main index
source src_wiki_main
{
type = mysql
# data source
sql_host= 127.0.0.1 # localhost does not work because of the specifics of the Win7+ branch
sql_user= mwuser
sql_pass=
sql_db=
sql_port= 3306 # optional, default is 3306
# pre-query, executed before the main fetch query, so that the database encoding is understood
sql_query_pre= SET NAMES utf8
# main document fetch query - change the table names if you are using a prefix
# This query and the ones that follow were provided by the extension developer.
sql_query= SELECT page_id, page_title, page_namespace, page_is_redirect, old_id, old_text FROM page, revision, text WHERE rev_id=page_latest AND old_id=rev_text_id
# attribute columns
sql_attr_uint= page_namespace
sql_attr_uint= page_is_redirect
sql_attr_uint= old_id
# collect all category ids for category filtering
sql_attr_multi = uint category from query; SELECT cl_from, page_id AS category FROM categorylinks, page WHERE page_title=cl_to AND page_namespace=14
# used by command-line search utility to display document information
sql_query_info= SELECT page_title, page_namespace FROM page WHERE page_id=$id
}
# data source definition for the incremental index
source src_wiki_incremental : src_wiki_main
{
# adjust this query based on the time you run the full index
# in this case, full index runs at 7 AM UTC
sql_query= SELECT page_id, page_title, page_namespace, page_is_redirect, old_id, old_text FROM page, revision, text WHERE rev_id=page_latest AND old_id=rev_text_id AND page_touched>=DATE_FORMAT(CURDATE(), '%Y%m%d070000')
# The search type must be plain
type = plain
}
# main index definition
index wiki_main
{
type = plain
# which document source to index
source= src_wiki_main
# this is path and index file name without extension
# you may need to change this path or create this folder
path= C:/inetpub/wwwroot/mw/sphinx/data/wiki_main
# docinfo (ie. per-document attribute values) storage strategy
docinfo= extern
# morphology
morphology= stem_en, stem_ru
# stopwords file
#stopwords= /var/data/sphinx/stopwords.txt
# minimum word length
min_word_len= 1
# allow wildcard (*) searches
min_infix_len = 1
enable_star = 1
# charset encoding type
charset_type= utf-8
# charset definition and case folding rules "table"
# This enables search over Russian-language content. It does not work by default without this magic.
charset_table= 0..9, A..Z->a..z, a..z, \
U+C0->a, U+C1->a, U+C2->a, U+C3->a, U+C4->a, U+C5->a, U+C6->a, \
U+C7->c,U+E7->c, U+C8->e, U+C9->e, U+CA->e, U+CB->e, U+CC->i, \
U+CD->i, U+CE->i, U+CF->i, U+D0->d, U+D1->n, U+D2->o, U+D3->o, \
U+D4->o, U+D5->o, U+D6->o, U+D8->o, U+D9->u, U+DA->u, U+DB->u, \
U+DC->u, U+DD->y, U+DE->t, U+DF->s, \
U+E0->a, U+E1->a, U+E2->a, U+E3->a, U+E4->a, U+E5->a, U+E6->a, \
U+E7->c,U+E7->c, U+E8->e, U+E9->e, U+EA->e, U+EB->e, U+EC->i, \
U+ED->i, U+EE->i, U+EF->i, U+F0->d, U+F1->n, U+F2->o, U+F3->o, \
U+F4->o, U+F5->o, U+F6->o, U+F8->o, U+F9->u, U+FA->u, U+FB->u, \
U+FC->u, U+FD->y, U+FE->t, U+FF->s, U+410..U+42F->U+430..U+44F, \
U+430..U+44F, U+0400->U+0435, U+0401->U+0435, U+0402->U+0452, \
U+0452, U+0403->U+0433, U+0404->U+0454, U+0454, U+0405->U+0455, \
U+0455, U+0406->U+0456, U+0407->U+0456, U+0457->U+0456, U+0456, \
U+0408..U+040B->U+0458..U+045B, U+0458..U+045B, U+040C->U+043A, \
U+040D->U+0438, U+040E->U+0443, U+040F->U+045F, U+045F, \
U+0450->U+0435, U+0451->U+0435, U+0453->U+0433, U+045C->U+043A, \
U+045D->U+0438, U+045E->U+0443, U+0460->U+0461, U+0461, U+0462->U+0463, \
U+0463, U+0464->U+0465, U+0465, U+0466->U+0467, U+0467, U+0468->U+0469, \
U+0469, U+046A->U+046B, U+046B, U+046C->U+046D, U+046D, U+046E->U+046F, \
U+046F, U+0470->U+0471, U+0471, U+0472->U+0473, U+0473, U+0474->U+0475, \
U+0476->U+0475, U+0477->U+0475, U+0475, U+0478->U+0479, U+0479, \
U+047A->U+047B, U+047B, U+047C->U+047D, U+047D, U+047E->U+047F, U+047F, \
U+0480->U+0481, U+0481, U+048A->U+0438, U+048B->U+0438, U+048C->U+044C, \
U+048D->U+044C, U+048E->U+0440, U+048F->U+0440, U+0490->U+0433, \
U+0491->U+0433, U+0490->U+0433, U+0491->U+0433, U+0492->U+0433, \
U+0493->U+0433, U+0494->U+0433, U+0495->U+0433, U+0496->U+0436, \
U+0497->U+0436, U+0498->U+0437, U+0499->U+0437, U+049A->U+043A, \
U+049B->U+043A, U+049C->U+043A, U+049D->U+043A, U+049E->U+043A, \
U+049F->U+043A, U+04A0->U+043A, U+04A1->U+043A, U+04A2->U+043D, \
U+04A3->U+043D, U+04A4->U+043D, U+04A5->U+043D, U+04A6->U+043F, \
U+04A7->U+043F, U+04A8->U+04A9, U+04A9, U+04AA->U+0441, U+04AB->U+0441, \
U+04AC->U+0442, U+04AD->U+0442, U+04AE->U+0443, U+04AF->U+0443, U+04B0->U+0443, \
U+04B1->U+0443, U+04B2->U+0445, U+04B3->U+0445, U+04B4->U+04B5, U+04B5, \
U+04B6->U+0447, U+04B7->U+0447, U+04B8->U+0447, U+04B9->U+0447, U+04BA->U+04BB, \
U+04BB, U+04BC->U+04BD, U+04BE->U+04BD, U+04BF->U+04BD, U+04BD, U+04C0->U+04CF, \
U+04CF, U+04C1->U+0436, U+04C2->U+0436, U+04C3->U+043A, U+04C4->U+043A, \
U+04C5->U+043B, U+04C6->U+043B, U+04C7->U+043D, U+04C8->U+043D, U+04C9->U+043D, \
U+04CA->U+043D, U+04CB->U+0447, U+04CC->U+0447, U+04CD->U+043C, U+04CE->U+043C, \
U+04D0->U+0430, U+04D1->U+0430, U+04D2->U+0430, U+04D3->U+0430, U+04D4->U+00E6, \
U+04D5->U+00E6, U+04D6->U+0435, U+04D7->U+0435, U+04D8->U+04D9, U+04DA->U+04D9, \
U+04DB->U+04D9, U+04D9, U+04DC->U+0436, U+04DD->U+0436, U+04DE->U+0437, \
U+04DF->U+0437, U+04E0->U+04E1, U+04E1, U+04E2->U+0438, U+04E3->U+0438, \
U+04E4->U+0438, U+04E5->U+0438, U+04E6->U+043E, U+04E7->U+043E, U+04E8->U+043E, \
U+04E9->U+043E, U+04EA->U+043E, U+04EB->U+043E, U+04EC->U+044D, U+04ED->U+044D, \
U+04EE->U+0443, U+04EF->U+0443, U+04F0->U+0443, U+04F1->U+0443, U+04F2->U+0443, \
U+04F3->U+0443, U+04F4->U+0447, U+04F5->U+0447, U+04F6->U+0433, U+04F7->U+0433, \
U+04F8->U+044B, U+04F9->U+044B, U+04FA->U+0433, U+04FB->U+0433, U+04FC->U+0445, \
U+04FD->U+0445, U+04FE->U+0445, U+04FF->U+0445, U+0410..U+0418->U+0430..U+0438, \
U+0419->U+0438, U+0430..U+0438, U+041A..U+042F->U+043A..U+044F, U+043A..U+044F,
}
# incremental index definition
index wiki_incremental : wiki_main
{
type = plain
path= C:/inetpub/wwwroot/mw/sphinx/data/wiki_incremental
}
# indexer settings
indexer
{
# memory limit (default is 32M)
mem_limit= 64M
}
# searchd settings
searchd
{
# IP address and port on which search daemon will bind and accept
listen= 127.0.0.1:9312
# searchd run info is logged here - create or change the folder
log= C:/inetpub/wwwroot/mw/sphinx/log/searchd.log
# all the search queries are logged here
query_log= C:/inetpub/wwwroot/mw/sphinx/log/query.log
# client read timeout, seconds
read_timeout= 5
# maximum amount of children to fork
max_children= 30
# a file which will contain searchd process ID
pid_file= C:/inetpub/wwwroot/mw/sphinx/log/searchd.pid
# maximum amount of matches this daemon would ever retrieve
# from each index and serve to client
max_matches= 1000
workers = threads
}
# --eof--
This completes the Sphinx configuration.
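The charset_table in the config above is easier to read with a small model in mind. The sketch below is a simplified Python illustration of how such rules can be interpreted, not Sphinx's actual implementation: the helper names (parse_char, parse_range, parse_rule, build_table, fold) are hypothetical, and only the rule forms that appear in the config are handled. A rule such as U+410..U+42F->U+430..U+44F folds uppercase Cyrillic to lowercase, a rule without "->" keeps characters as they are, and characters that are not listed at all act as word separators.

```python
def parse_char(token):
    """Turn 'U+410' or a literal character such as 'a' into a code point."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    return ord(token)


def parse_range(text):
    """Turn 'A..Z' or 'U+410' into an inclusive (start, end) pair."""
    if ".." in text:
        start, end = text.split("..")
        return parse_char(start), parse_char(end)
    code = parse_char(text)
    return code, code


def parse_rule(rule):
    """Expand one charset_table rule into a {source: target} dictionary."""
    source, _, target = rule.partition("->")
    src_start, src_end = parse_range(source)
    # A rule without '->' keeps the characters as they are.
    dst_start, _ = parse_range(target) if target else (src_start, src_end)
    return {
        chr(code): chr(dst_start + code - src_start)
        for code in range(src_start, src_end + 1)
    }


def build_table(rules):
    """Merge several rules into one lookup table."""
    table = {}
    for rule in rules:
        table.update(parse_rule(rule))
    return table


def fold(text, table):
    """Fold known characters; anything outside the table becomes a separator."""
    return "".join(table.get(ch, " ") for ch in text)


# A small subset of the rules from the config above.
RULES = ["0..9", "A..Z->a..z", "a..z", "U+410..U+42F->U+430..U+44F", "U+430..U+44F"]
TABLE = build_table(RULES)
print(fold("Wiki Поиск 2014", TABLE))  # wiki поиск 2014
```

Without the Cyrillic rules in the table, every Russian letter would fall into the "separator" case, which is why Russian-language search does not work with a Latin-only table.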
Installing the search service
Now let's install the service.
To do this, run the following on the command line:
C:/inetpub/wwwroot/mw/sphinx/bin/searchd --install --config C:/inetpub/wwwroot/mw/sphinx/bin/sphinx.conf --servicename SphinxSearch
Everything should complete without errors, and the service should appear under Administrative Tools - Services with the name SphinxSearch.
Do not start it yet: the data has not been indexed, so starting the service now would produce an error.
Note that the paths use forward slashes (/), not backslashes (\). Otherwise an error occurs when the search engine accesses its log and PID files.
Also note that I keep the conf file in the binaries folder (bin), so that the path to the config does not have to be typed when running the tools from the console.
When installing the service, however, it is better to specify the path to the config explicitly.
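If the paths are copied from Explorer, they arrive with backslashes. A minimal Python sketch of converting them before pasting into sphinx.conf, using the standard pathlib module; the function name to_forward_slashes is just an illustrative choice:

```python
from pathlib import PureWindowsPath


def to_forward_slashes(path):
    """Return a Windows path written with '/' separators."""
    return PureWindowsPath(path).as_posix()


print(to_forward_slashes(r"C:\inetpub\wwwroot\mw\sphinx\log\searchd.log"))
# C:/inetpub/wwwroot/mw/sphinx/log/searchd.log
```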
Now, at the command prompt, change to the binaries folder (bin) and run
indexer --all
The result looks like this:
Sphinx 2.1.9-release (r4761)
Copyright (c) 2001-2014, Andrew Aksyonoff
Copyright (c) 2008-2014, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file './sphinx.conf'...
indexing index 'wiki_main'...
collected 159 docs, 0.5 MB
collected 0 attr values
sorted 0.0 Mvalues, 100.0% done
sorted 1.6 Mhits, 100.0% done
total 159 docs, 494176 bytes
total 0.596 sec, 827807 bytes/sec, 266.34 docs/sec
indexing index 'wiki_incremental'...
collected 159 docs, 0.5 MB
collected 0 attr values
sorted 0.0 Mvalues, 100.0% done
sorted 1.6 Mhits, 100.0% done
total 159 docs, 494176 bytes
total 0.584 sec, 844808 bytes/sec, 271.81 docs/sec
total 4 reads, 0.005 sec, 2107.7 kb/call avg, 1.4 msec/call avg
total 38 writes, 0.022 sec, 479.7 kb/call avg, 0.5 msec/call avg
That's it, the index has been created.
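If the indexer is later run unattended, it can be handy to pull the per-index totals out of this output. Below is a rough Python sketch that does so with regular expressions. It assumes the output format shown above stays the same, and the names TOTAL_LINE, INDEX_LINE and summarize are hypothetical; the sample text is a shortened copy of the output above.

```python
import re

# Matches summary lines such as "total 159 docs, 494176 bytes".
TOTAL_LINE = re.compile(r"^total (\d+) docs, (\d+) bytes$", re.MULTILINE)
# Matches lines such as "indexing index 'wiki_main'...".
INDEX_LINE = re.compile(r"^indexing index '([^']+)'\.\.\.$", re.MULTILINE)


def summarize(output):
    """Map each index name to its (docs, bytes) totals."""
    names = INDEX_LINE.findall(output)
    totals = [(int(docs), int(size)) for docs, size in TOTAL_LINE.findall(output)]
    return dict(zip(names, totals))


SAMPLE = """\
indexing index 'wiki_main'...
collected 159 docs, 0.5 MB
total 159 docs, 494176 bytes
total 0.596 sec, 827807 bytes/sec, 266.34 docs/sec
indexing index 'wiki_incremental'...
collected 159 docs, 0.5 MB
total 159 docs, 494176 bytes
total 0.584 sec, 844808 bytes/sec, 271.81 docs/sec
"""
print(summarize(SAMPLE))
```

A document count of zero for wiki_main would be a quick hint that the database connection settings or the fetch query need another look.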
Checking that the search engine works
As shown above, the index has been created. We are still in the binaries folder on the command line. Now start the SphinxSearch service and run something like:
search wiki
I got this result:
Sphinx 2.1.9-release (r4761)
Copyright (c) 2001-2014, Andrew Aksyonoff
Copyright (c) 2008-2014, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file './sphinx.conf'...
index 'wiki_main': query 'wiki ': returned 13 matches of 13 total in 0.004 sec
displaying matches:
1. document=76, weight=1719, page_namespace=0, page_is_redirect=0, old_id=929, c
ategory=()
page_title=???????_CompanyNameWiki
page_namespace=0
2. document=77, weight=1670, page_namespace=0, page_is_redirect=0, old_id=1136,
category=()
page_title=FAQ_CompanyNameWiki
page_namespace=0
3. document=79, weight=1670, page_namespace=0, page_is_redirect=0, old_id=864, c
ategory=()
page_title=CompanyNameWiki:_?????
page_namespace=0
4. document=81, weight=1670, page_namespace=12, page_is_redirect=0, old_id=939,
category=()
page_title=C???????_?????_??????
page_namespace=12
5. document=128, weight=1670, page_namespace=0, page_is_redirect=0, old_id=1075,
category=()
page_title=?????
page_namespace=0
6. document=1, weight=1648, page_namespace=0, page_is_redirect=0, old_id=1091, c
ategory=()
page_title=?????????_????????
page_namespace=0
7. document=4, weight=1648, page_namespace=0, page_is_redirect=0, old_id=10, cat
egory=()
page_title=?????????_????????
page_namespace=0
8. document=5, weight=1648, page_namespace=0, page_is_redirect=0, old_id=181, ca
tegory=()
page_title=?????????_?????????_????_(???????_??????)
page_namespace=0
9. document=2, weight=1608, page_namespace=8, page_is_redirect=0, old_id=1135, c
ategory=()
page_title=Sidebar
page_namespace=8
10. document=12, weight=1608, page_namespace=0, page_is_redirect=0, old_id=719,
category=()
page_title=?????????_CRM
page_namespace=0
11. document=71, weight=1608, page_namespace=0, page_is_redirect=0, old_id=701,
category=()
page_title=??????_???????
page_namespace=0
12. document=80, weight=1608, page_namespace=12, page_is_redirect=0, old_id=862,
category=()
page_title=?????????_CompanyNameWiki
page_namespace=12
13. document=129, weight=1608, page_namespace=0, page_is_redirect=0, old_id=1085
, category=()
page_title=????
page_namespace=0
words:
1. 'wiki': 13 documents, 37 hits
index 'wiki_incremental': query 'wiki ': returned 13 matches of 13 total in 0.00
0 sec
displaying matches:
1. document=76, weight=1719, page_namespace=0, page_is_redirect=0, old_id=929, c
ategory=()
page_title=???????_CompanyNameWiki
page_namespace=0
2. document=77, weight=1670, page_namespace=0, page_is_redirect=0, old_id=1136,
category=()
page_title=FAQ_CompanyNameWiki
page_namespace=0
3. document=79, weight=1670, page_namespace=0, page_is_redirect=0, old_id=864, c
ategory=()
page_title=CompanyNameWiki:_?????
page_namespace=0
4. document=81, weight=1670, page_namespace=12, page_is_redirect=0, old_id=939,
category=()
page_title=C???????_?????_??????
page_namespace=12
5. document=128, weight=1670, page_namespace=0, page_is_redirect=0, old_id=1075,
category=()
page_title=?????
page_namespace=0
6. document=1, weight=1648, page_namespace=0, page_is_redirect=0, old_id=1091, c
ategory=()
page_title=?????????_????????
page_namespace=0
7. document=4, weight=1648, page_namespace=0, page_is_redirect=0, old_id=10, cat
egory=()
page_title=?????????_????????
page_namespace=0
8. document=5, weight=1648, page_namespace=0, page_is_redirect=0, old_id=181, ca
tegory=()
page_title=?????????_?????????_????_(???????_??????)
page_namespace=0
9. document=2, weight=1608, page_namespace=8, page_is_redirect=0, old_id=1135, c
ategory=()
page_title=Sidebar
page_namespace=8
10. document=12, weight=1608, page_namespace=0, page_is_redirect=0, old_id=719,
category=()
page_title=?????????_CRM
page_namespace=0
11. document=71, weight=1608, page_namespace=0, page_is_redirect=0, old_id=701,
category=()
page_title=??????_???????
page_namespace=0
12. document=80, weight=1608, page_namespace=12, page_is_redirect=0, old_id=862,
category=()
page_title=?????????_CompanyNameWiki
page_namespace=12
13. document=129, weight=1608, page_namespace=0, page_is_redirect=0, old_id=1085
, category=()
page_title=????
page_namespace=0
words:
1. 'wiki': 13 documents, 37 hits
Because of an encoding mismatch, the Russian letters are shown as "?????". But the main thing is that results are returned. So the search works!
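I did not track down which layer replaced the letters in my setup, but the effect itself is easy to reproduce. The Python sketch below shows one plausible way such question marks appear: text is pushed through an encoding that has no Cyrillic letters, and every unrepresentable character is replaced with "?". The sample title is made up for the illustration.

```python
title = "Поиск"

# Latin-1 has no Cyrillic letters, so each one is replaced with '?'.
garbled = title.encode("latin-1", errors="replace").decode("latin-1")
print(garbled)  # ?????

# UTF-8 can represent every letter, so the round trip is lossless.
intact = title.encode("utf-8").decode("utf-8")
print(intact)  # Поиск
```

Note that the number of question marks matches the number of letters, which is the same pattern as in the page_title lines above.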
That's all: we have installed Sphinx, indexed our database, and have a working search engine!
Index Update Automation
For the search to be fully functional, the index must also be updated regularly: articles keep being added, and they have to show up in the search results too.
To do this, create a task in Task Scheduler that regularly (every 5 minutes in my case) runs a bat file with the following contents:
c:\inetpub\wwwroot\mw\sphinx\bin\indexer --all --config c:\inetpub\wwwroot\mw\sphinx\bin\sphinx.conf --rotate
I set the task to run as the local administrator. Beforehand, that account must be explicitly granted rights to the entire sphinx folder.
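The incremental source in the config above only picks up pages whose page_touched value is not older than 07:00 of the current day. MediaWiki stores such timestamps as 14-digit YYYYMMDDHHMMSS strings, and DATE_FORMAT(CURDATE(), '%Y%m%d070000') builds a string of the same shape. The Python sketch below mirrors that expression to show which pages would qualify; the function name incremental_cutoff and the sample dates are made up for the illustration.

```python
from datetime import date


def incremental_cutoff(today):
    """Return today's 07:00:00 as a 14-digit YYYYMMDDHHMMSS string."""
    return today.strftime("%Y%m%d") + "070000"


cutoff = incremental_cutoff(date(2014, 7, 15))
print(cutoff)  # 20140715070000

# The timestamps have a fixed width, so plain string comparison
# orders them chronologically.
print("20140715093000" >= cutoff)  # True: touched after the cutoff
print("20140714235959" >= cutoff)  # False: touched before the cutoff
```

If the full index is rebuilt at a different hour, the "070000" part of the query should be adjusted to match.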
Enabling Sphinx search in MediaWiki
Now the search engine needs to be connected to MediaWiki. Otherwise MediaWiki does not know that it should search with Sphinx rather than with its built-in mechanism.
Open the LocalSettings.php file (it is located in the MediaWiki folder) and add:
#Sphinx search
$wgSearchType = 'SphinxMWSearch';
require_once "$IP/extensions/SphinxSearch/SphinxSearch.php";
$wgSphinxSearch_host = "127.0.0.1";
$wgSphinxSearch_port = 9312;
$wgSphinxSearch_matches = 50;
$wgEnableSphinxPrefixSearch = true;
$wgFooterIcons['poweredby']['sphinxsearch'] = array(
'src' => "$wgScriptPath/extensions/SphinxSearch/skins/images/Powered_by_sphinx.png",
'url' => 'http://www.mediawiki.org/wiki/Extension:SphinxSearch',
'alt' => 'Search Powered by Sphinx',
);
Create a new folder named SphinxSearch in the extensions folder and place the downloaded extension files in it.
vedmaka left an important note: after installing Sphinx, go to http://sphinxsearch.com/downloads/archive/, download the sources of the matching version, and put the sphinxapi.php file into the SphinxSearch extension directory.
Save the changes and restart the site through IIS Manager. Then check the search by hand through the MediaWiki web page. Everything should work.
[Screenshot: suggestions shown while typing in the search box]
[Screenshot: the search results page]
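If the wiki search returns nothing at this point, one quick thing to rule out is that searchd is not reachable at the host and port set in LocalSettings.php. The Python sketch below only checks that a TCP connection can be opened; it does not speak the Sphinx protocol. The function name searchd_is_listening is hypothetical, and the defaults are taken from the settings above.

```python
import socket


def searchd_is_listening(host="127.0.0.1", port=9312, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("searchd reachable:", searchd_is_listening())
```

A False result usually means the SphinxSearch service is stopped or is listening on a different address than the one MediaWiki is configured to use.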


Conclusion
As a result, we got a better search across the materials in our wiki.
By default, results are sorted with SPH_SORT_RELEVANCE.
If desired, this can be changed by explicitly setting $wgSphinxSearch_sortby in LocalSettings.php.
More information on the available sorting options can be found in the corresponding section of the documentation.
In this article I drew not only on my own experience but also on information collected while getting this search engine to work.
I did not cover the errors that may occur along the way, since I thought it more useful to share a working configuration and the sequence of actions that leads to a working solution. And there were plenty of errors: missing file permissions, the "wrong" slashes, and a non-working Sphinx configuration bundled with the extension.