Setting up Sphinx search for an online store

Information on Sphinx is not as much as we would like. Excess article does not hurt.
The first steps in mastering Sphinx helped me to make articles. Creating an exploratory search engine on Sphinx + php and an example of Sphinx search on a real project - Tecdoc auto parts store .


Some time on my site worked search through LIKE for each word of the query. I wanted more, and here are the cases that will now be handled correctly:


  • Word forms. The issuance of "screws" and "screws" should be the same.
  • Search by word fragment.
  • Search for non-integer numbers. Separator point and comma.
  • Letter yo
  • Typical mistakes. For example, "shock absorber."
  • Synonyms. Regulator and ESC.
  • Tongue. mAh and mAh, B and V, AAA in Latin and Cyrillic.
  • Word of letters and numbers. 10x15x4, 6000mAh

Source section and additional sorting


Issuance must first contain positions available, then temporarily absent, then archived. And all these three groups should be sorted by relevance. To do this, you must specify the attributes. In my case, these are the clearance and in_stock fields of the source section sphinx.conf


sql_query = \
    SELECTid, `art`, `name`, `clearance`, `in_stock` \
    FROM items_zip WHERE show_flag=1
sql_attr_bool = clearance
sql_attr_uint = in_stock

These fields will be used in generating the output in PHP. I will describe below.


The index section in sphinx.conf


morphology = stem_enru
Morphology solves my first task. Searching for 'bearings', 'bearing', 'bearings' will lead to a single result.


Stems (stem_enru) are faster, lemmas (lemmatize_ru) are more accurate. I tried only stammy. The choice will affect your wordforms replacement dictionary. Want to change - will have to rewrite.


min_word_len = 1 We index
words of any length.


html_strip = 1
Remove html tags


min_infix_len = 1 The
search will be on a word fragment. Let's index fragments up to 1 letter. Since I have a base of less than 10,000 items, I don’t save on the index.


expand_keywords = 1
Automatically leads the query to the form "(running | running | = running)". min_infix_len and expand_keywords will cause the RV 2205 request to issue the RV2205. By the way, a dash is a separator equivalent to a space. So the RV-2205 will issue the same RV2205.


charset_table = 0..9, A..Z-> a..z, _, a..z, U + 410..U + 42F-> U + 430..U + 44F, U + 430..U + 44F, U + 401-> U + 0435, U + 451-> U + 0435
We present the Latin and Cyrillic characters in lower case. We replace it with e.


blend_chars = +, &, U + 2C, U + 2E
I have a lot of non-integer numbers. They need to be indexed completely. U + 2C and U + 2E is a period and comma. For example, 1.25 will be indexed as '1.25', '1' and '25'.


regexp_filter = (\ d +) \, (\ d +) => \ 1. \ 2
Decimals in numbers can be separated by dots and commas: "1.75", "1.75". Let's get everything to the point


Synonyms and Typos


Units of measurement can be written in Russian or English: mm-mm, mAh-mAh, mW-mW. We add to the dictionary of synonyms, the path to which is specified in wordforms: "moc> mah". I choose the language for the index according to my own preferences.


The ~ sign indicates to apply the replacement after the morphology handler. This allows you not to write all the word forms and instead of the rules for 'peel', 'peel', 'peel' write "~ cork> body"


My list is complete:


~регулятор > esc 
регуль > esc
мач > mah
~корк > кузов
~корпус  > кузов
~пищалк > buzz
~бузер > buzz
~буззер > buzz
~зуммер > buzz
~зумер > buzz
~бальс > бальз
~двигатель > мотор
~электродвигатель  > мотор
li-po > lipo
~аммортизатор > амортизатор 
~зарядк > зарядн
серво > сервопривод
серва > сервопривод
vtx > видеопередатчик
~антен > антенн
lollipop > lolipop
battery > аккумулятор
~пульт > аппаратур
~безколлекторн > бесколлекторн
~пиньен > пиньон
mkF > мкФ
бек > BEC
бэк > BEC
~термоусадк > термоусадочн
LED > светодиод
~светодиодн > светодиод 
driver > драйвер
~пакет > сумк
~пропеллер > лопаст
ААА > AAA
АА > AA
М > M
mm > мм
мВт > mW
В > V
А > A
deans > t-plug
tplug > t-plug  

Sticking letters to numbers


Sometimes numbers are part of the name (for example LCD5208D), but more often the characteristic (100mAh, 10x15x4mm). Separate all numbers from letters and index them.


This will solve several problems:


  • Someone will look for a '10x15x4' bearing, someone will look for a '15x10x4' bearing. Indexed numbers will lead to the correct issue.
  • Units of measurement may or may not be separated by a space from the number: "1.75mm", "1.75 mm".
  • For titles this is also useful. The correct output will be on the three recording options LCD-5208, LCD 5208 and LCD5208

Before writing a regular expression for separating numbers, you need to unify the delimiters. It is important to remember that regular expressions are all executed sequentially.


We remove X, He and a star in sizes such as 10x15x4 M3x10:


regexp_filter = (\d+)[x\x{0445}\*] => \1 x

Drop tails:


regexp_filter = (\d*\.?\d+)(\D+) => \1 \2

And heads:


regexp_filter = (\D+)(\d*\.?\d+) => \1 \2

Let's drop the "mm", as they are often not listed in the product name.
Make the stop.txt file and write it in stopwords.
Content:


мм
mm

Now a little about PHP


Sphinxapi will sooner or later be depricated. We will use Sphinxql. For this you need to connect to the database. In my case Sphinx connected via hosting it looks like this:


$opt  = array(
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
    PDO::ATTR_EMULATE_PREPARES   => TRUE,
);
$dsn = 'mysql:host=127.0.0.1;port=9306;';
$this->pdo = new PDO($dsn, DB_USER, DB_PASS, $opt);  

And all communication with Spinxql is one SELECT sending the filtered query text


$stmt = $this->pdo->prepare("SELECT `id`, WEIGHT() as `w`, in_stock>0 AS stock  FROM `items` WHERE MATCH ('".$search."') ORDER BY clearance ASC, stock DESC, w DESC LIMIT ".$limit." OPTION field_weights=(name=10, art=3, cat_names=3, model_names=3)");

SphinxQL does not understand the expression in the ORDER BY sorting section, so WEIGHT () and in_stock> 0 had to be placed in the fields. By the way, the default LIMIT is only 20.


Sorting will first give the position available, then temporarily absent, then archived. And all these three groups will be sorted by relevance (weight).


Through field_weights we set what fields will have big weight.


Having executed request, we will receive the sorted array id. But, unfortunately, the selection of data through WHERE id IN () will break this sorting. We'll have to form your request for each id.


At the debugging stage, the query "SHOW META" helps immediately after the SELECT query. Especially for checking wordforms dictionary and regular expression filters. You can see a list of keywords into which the query has decomposed.


Complement sql_query


We sell parts. I decided to add to the index the name of the product category and the name of the model for which the spare part is intended. But each product can be tied to several categories at once and be suitable for several models. And I discovered the GROUP_CONCAT function. It allows you to get data by grouping into a string. For example, the categories.name field will contain all categories of the selected items_zip.id separated by spaces.


SELECT items_zip.id, `art`, items_zip.`name`, `clearance`, `in_stock`,
   GROUP_CONCAT(DISTINCT categories.name SEPARATOR ' ') AS cat_names,
   GROUP_CONCAT(DISTINCT items.family SEPARATOR ' ') AS model_names
FROM items_zip LEFTJOIN items_cat ON items_cat.item_id=items_zip.id
    LEFTJOIN categories ON categories.id=items_cat.cat_id
    LEFTJOIN zip_comp ON zip_comp.zip_id=items_zip.id
    LEFTJOIN items ON zip_comp.model_id=items.id
WHERE items_zip.show_flag=1GROUPBY items_zip.id

Also popular now: