Full-Text Search in InnoDB
Hello, Habr readers!
Full-text search in InnoDB is a well-known headache for many MySQL/InnoDB developers. For those who are not familiar with the problem: the MyISAM storage engine does support full-text search, but MyISAM itself has historical limitations that are critical for some projects. The more advanced InnoDB engine has no full-text search at all. So developers have to put up with either the limitations of MyISAM or the lack of search in InnoDB. I want to describe the ways of organizing a full-fledged search in InnoDB without magic, using only standard tools, and to compare the speed of each method.
As an example, let's take a small table with 10,000 records:
CREATE TABLE users (
  id INT(11) NOT NULL AUTO_INCREMENT,
  login VARCHAR(255) DEFAULT NULL,
  `password` VARCHAR(255) DEFAULT NULL,
  name VARCHAR(255) DEFAULT NULL,
  surname VARCHAR(255) DEFAULT NULL,
  email VARCHAR(255) NOT NULL,
  country VARCHAR(255) DEFAULT NULL,
  city VARCHAR(255) DEFAULT NULL,
  PRIMARY KEY (id)
)
ENGINE = INNODB
In this table we store the data of the site's users. The site has a user search form where an arbitrary query such as “Tolstoy Yasnaya Polyana” can be entered. To process such a query, the search must cover several fields at once: login, name, surname, city, country. The query may be a single word (a name or a city) or a set of words separated by spaces. The problem is that this set of words has to be searched in several fields at once, which is hard to do in InnoDB without additional tools; a sketch of the naive approach is shown below.
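To illustrate the difficulty: a naive InnoDB-only query for “Tolstoy Yasnaya Polyana” has to check every word against every field by hand and still gives no relevance ranking (an illustrative sketch; only two of the three words are shown for brevity):
SELECT * FROM users
WHERE (login LIKE '%Tolstoy%' OR name LIKE '%Tolstoy%' OR surname LIKE '%Tolstoy%'
       OR city LIKE '%Tolstoy%' OR country LIKE '%Tolstoy%')
  AND (login LIKE '%Yasnaya%' OR name LIKE '%Yasnaya%' OR surname LIKE '%Yasnaya%'
       OR city LIKE '%Yasnaya%' OR country LIKE '%Yasnaya%')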
There are several relatively simple ways to organize full-text search in InnoDB:
- Using the Mirror Table in MyISAM
- Using a Mirror Table in MyISAM with Cached Data
- Using a Keyword Table in MyISAM
- Query parsing and direct search in InnoDB
- Using third-party solutions
Let's consider each of them in more detail.
Using the Mirror Table in MyISAM
The first proposed method is to create an additional table in MyISAM. As you know, MyISAM supports full-text search quite well, and we can take advantage of that. All the searchable data from the main table (users) will be copied into this additional table, and triggers will keep it in sync. In the new table we add the fields login, name, surname, city, country. Thus we create a “mirror” of the main table and work with it instead. To enable full-text search, we add a FULLTEXT index across all five fields together:
CREATE TABLE search (
  id INT(11) DEFAULT NULL,
  login VARCHAR(255) DEFAULT NULL,
  name VARCHAR(255) DEFAULT NULL,
  surname VARCHAR(255) DEFAULT NULL,
  country VARCHAR(255) DEFAULT NULL,
  city VARCHAR(255) DEFAULT NULL,
  FULLTEXT INDEX IX_search (city, country, login, name, surname)
)
ENGINE = MYISAM
To synchronize data between the main table and the mirror table, we set up triggers on users for insert, update and delete:
Insert trigger:
CREATE
TRIGGER `insert`
AFTER INSERT
ON users
FOR EACH ROW
BEGIN
  INSERT INTO search (`id`, `login`, `name`, `surname`, `country`, `city`) VALUES (
    NEW.`id`,
    NEW.`login`,
    NEW.`name`,
    NEW.`surname`,
    NEW.`country`,
    NEW.`city`
  );
END
Update trigger:
CREATE
TRIGGER `update`
AFTER UPDATE
ON users
FOR EACH ROW
BEGIN
  DELETE FROM `search` WHERE `id` = NEW.`id`;
  INSERT INTO `search` (`id`, `login`, `name`, `surname`, `country`, `city`) VALUES (
    NEW.`id`,
    NEW.`login`,
    NEW.`name`,
    NEW.`surname`,
    NEW.`country`,
    NEW.`city`
  );
END
And a simple delete trigger:
CREATE
TRIGGER `delete`
AFTER DELETE
ON users
FOR EACH ROW
BEGIN
  DELETE FROM `search` WHERE `id` = OLD.`id`;
END
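Note that when these triggers are created from the standard mysql console client, the statement delimiter has to be changed temporarily so that the semicolons inside BEGIN ... END are not treated as the end of the CREATE TRIGGER statement (a standard pattern, shown here only as a reminder using the delete trigger):
DELIMITER //
CREATE
TRIGGER `delete`
AFTER DELETE
ON users
FOR EACH ROW
BEGIN
  DELETE FROM `search` WHERE `id` = OLD.`id`;
END//
DELIMITER ;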
The search is performed using the following query:
SELECT `users`.* FROM `users`
INNER JOIN `search` ON `search`.`id` = `users`.`id`
WHERE
  MATCH (`search`.`city`, `search`.`country`, `search`.`login`, `search`.`name`, `search`.`surname`) AGAINST ('Vladimir Tupin Saint Petersburg' IN BOOLEAN MODE) > 0
ORDER BY MATCH (`search`.`city`, `search`.`country`, `search`.`login`, `search`.`name`, `search`.`surname`) AGAINST ('Vladimir Tupin Saint Petersburg' IN BOOLEAN MODE) DESC
Here the search is performed in the search table, the result is sorted by relevance, and as output we get the corresponding rows from the users table.
The main advantage of this approach is the flexibility of the search: we can add further indexes and compose new search combinations (country + city, or login + name + surname), so we are free to form new field sets and relevance rules, for example:
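A minimal sketch of such an additional combination, assuming we also want a narrower location-only search (the index name and the sample query string are illustrative, not from the original setup):
ALTER TABLE search ADD FULLTEXT INDEX IX_search_location (country, city);

SELECT `users`.* FROM `users`
INNER JOIN `search` ON `search`.`id` = `users`.`id`
WHERE
  MATCH (`search`.`country`, `search`.`city`) AGAINST ('Saint Petersburg Russia' IN BOOLEAN MODE) > 0
ORDER BY MATCH (`search`.`country`, `search`.`city`) AGAINST ('Saint Petersburg Russia' IN BOOLEAN MODE) DESC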
The disadvantage of this method (as of all the methods that create a “mirror”) is redundant data storage. It is therefore advisable to use it with small amounts of data, as in our example.
Using a Mirror Table in MyISAM with Cached Data
The second method also involves creating a data mirror, but here we store the data in a single field. In our task the search always runs over the same group of fields, so we can combine them into one text field separated by spaces. Thus one text value corresponds to the whole set of searchable fields of a row in the users table. Create a search table with two fields, id and text: id corresponds to the id of the main table (users), and text holds our “cached” data.
CREATE TABLE search (
  id INT(11) DEFAULT NULL,
  `text` TEXT DEFAULT NULL,
  FULLTEXT INDEX IX_search_text (`text`)
)
ENGINE = MYISAM
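If the users table already contains data, the mirror has to be filled once before the triggers take over (a one-off backfill sketch; it assumes the search table starts out empty):
INSERT INTO search (`id`, `text`)
SELECT `id`,
       LOWER(CONCAT_WS(' ', `name`, `surname`, `login`, `country`, `city`))
FROM users;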
Synchronization is also carried out using triggers:
Adding:
CREATE
TRIGGER `insert`
AFTER INSERT
ON users
FOR EACH ROW
BEGIN
  INSERT INTO search (`id`, `text`) VALUES (NEW.`id`,
    LOWER(
      CONCAT_WS(' ',
        NEW.`name`,
        NEW.`surname`,
        NEW.`login`,
        NEW.`country`,
        NEW.`city`
      )
    )
  );
END
Change:
CREATE
TRIGGER `update`
AFTER UPDATE
ON users
FOR EACH ROW
BEGIN
  DELETE FROM search WHERE `id` = NEW.`id`;
  INSERT INTO search (`id`, `text`) VALUES (NEW.`id`,
    LOWER(
      CONCAT_WS(' ',
        NEW.`name`,
        NEW.`surname`,
        NEW.`login`,
        NEW.`country`,
        NEW.`city`
      )
    )
  );
END
Removal:
CREATE
TRIGGER `delete`
AFTER DELETE
ON users
FOR EACH ROW
BEGIN
  DELETE FROM search WHERE `id` = OLD.`id`;
END
The search query looks like this:
SELECT `users`.* FROM `users`
INNER JOIN `search` ON `search`.`id` = `users`.`id`
WHERE
  MATCH (`search`.`text`) AGAINST ('Vladimir Tupin Saint Petersburg' IN BOOLEAN MODE) > 0
ORDER BY MATCH (`search`.`text`) AGAINST ('Vladimir Tupin Saint Petersburg' IN BOOLEAN MODE) DESC
This method is not as flexible as the previous one; however, as we will see later, it wins in speed when there are many different queries.
Using a Keyword Table in MyISAM
The third method is based on building a list of “keywords” (search tags). The keywords are the values of the fields of the users table. For example, for a user with the fields
(id=2144; login=leo; name=Лев; surname=Толстой; city='Ясная Поляна'; country=Россия; email=leo@tolstoy.ru; password=;)
the keywords will be («leo»; «Лев»; «Толстой»; «Ясная Поляна»; «Россия»).
We write all these words into a separate MyISAM table with two fields, id and text: id matches the id of the main table (users), and text holds a keyword tag. Each user from the users table gets 5 rows in the new search table, so we end up with a table of tags for each user.
CREATE TABLE search (
  id INT(11) DEFAULT NULL,
  `text` VARCHAR(255) DEFAULT NULL,
  FULLTEXT INDEX IX_search_text (`text`)
)
ENGINE = MYISAM
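As with the previous method, existing rows have to be turned into tags once before the triggers take over (a one-off backfill sketch, assuming the search table starts out empty):
INSERT INTO search (`id`, `text`)
          SELECT `id`, `login`   FROM users
UNION ALL SELECT `id`, `name`    FROM users
UNION ALL SELECT `id`, `surname` FROM users
UNION ALL SELECT `id`, `country` FROM users
UNION ALL SELECT `id`, `city`    FROM users;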
Data synchronization is again handled by triggers:
Creation:
CREATE
TRIGGER `insert`
AFTER INSERT
ON users
FOR EACH ROW
BEGIN
  INSERT INTO search (`id`, `text`) VALUES
    (NEW.`id`, NEW.`login`),
    (NEW.`id`, NEW.`name`),
    (NEW.`id`, NEW.`surname`),
    (NEW.`id`, NEW.`country`),
    (NEW.`id`, NEW.`city`);
END
Change:
CREATE
TRIGGER `update`
AFTER UPDATE
ON users
FOR EACH ROW
BEGIN
  DELETE FROM search WHERE `id` = NEW.`id`;
  INSERT INTO search (`id`, `text`) VALUES
    (NEW.`id`, NEW.`login`),
    (NEW.`id`, NEW.`name`),
    (NEW.`id`, NEW.`surname`),
    (NEW.`id`, NEW.`country`),
    (NEW.`id`, NEW.`city`);
END
Removal:
CREATE
TRIGGER `delete`
AFTER DELETE
ON users
FOR EACH ROW
BEGIN
  DELETE FROM search WHERE `id` = OLD.`id`;
END
Search query:
SELECT `users`.* FROM `users`
INNER JOIN `search` ON `search`.`id` = `users`.`id`
WHERE
  MATCH (`search`.`text`) AGAINST ('Vladimir Tupin Saint Petersburg' IN BOOLEAN MODE) > 0
GROUP BY `search`.`id`
ORDER BY COUNT(*) DESC
Note that while relevance was previously determined by the built-in MyISAM search engine, here we determine it ourselves. The search returns only the tags that match the query, and the more matching tags a user has, the higher that user ends up in the result set.
This example has a drawback: records with an equal number of matching tags keep their natural order, which is not always correct in terms of relevance.
However, this method has high potential for further development. First, we can add the sum of the relevance scores from MATCH ... AGAINST to the ORDER BY clause, which eliminates the drawback described above. Second, we can add a weight field to the tag table and associate a weight value with each field of the main table; sorting can then take the significance (weight) of individual fields into account, letting us emphasize certain fields without hurting search quality. A sketch of such a weighted variant is shown below.
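A minimal sketch of the weighted variant, assuming a weight column is added to the tag table and filled by the triggers (the column name, the weight values in the comment and the query are illustrative, not from the original setup):
ALTER TABLE search ADD COLUMN weight INT NOT NULL DEFAULT 1;

-- e.g. the triggers could insert surname and name tags with weight 3,
-- login with weight 2, country and city with weight 1

SELECT `users`.* FROM `users`
INNER JOIN `search` ON `search`.`id` = `users`.`id`
WHERE
  MATCH (`search`.`text`) AGAINST ('Vladimir Tupin Saint Petersburg' IN BOOLEAN MODE) > 0
GROUP BY `search`.`id`
ORDER BY SUM(`search`.`weight` * MATCH (`search`.`text`) AGAINST ('Vladimir Tupin Saint Petersburg' IN BOOLEAN MODE)) DESC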
Query parsing and direct search in InnoDB
The fourth method is the harshest one and, unlike the previous ones, does not use MyISAM at all. There are no additional tables and no triggers; we simply search the existing table. First we need to index all the fields that will be searched.
CREATE TABLE users (
  id INT(11) NOT NULL,
  login VARCHAR(255) DEFAULT NULL,
  `password` VARCHAR(255) DEFAULT NULL,
  name VARCHAR(255) DEFAULT NULL,
  surname VARCHAR(255) DEFAULT NULL,
  email VARCHAR(255) NOT NULL,
  country VARCHAR(255) DEFAULT NULL,
  city VARCHAR(255) DEFAULT NULL,
  PRIMARY KEY (id),
  INDEX city (city),
  INDEX country (country),
  INDEX email (email),
  INDEX login (login),
  INDEX name (name),
  INDEX password (password),
  INDEX surname (surname)
)
ENGINE = INNODB
In InnoDB we can only search with the LIKE operator, but for it to work reasonably the query has to be broken into words; otherwise queries consisting of several words will return no results. To split the query into words and assemble the search condition, we write a function:
CREATE
FUNCTION search (str VARCHAR(255))
RETURNS VARCHAR(255) CHARSET cp1251
BEGIN
  DECLARE output VARCHAR(255) DEFAULT '';
  DECLARE temp_str VARCHAR(255);
  DECLARE first_part VARCHAR(255) DEFAULT "CONCAT_WS(' ', `name`, `surname`, `login`, `country`, `city`) LIKE '%";
  DECLARE last_part VARCHAR(255) DEFAULT "%'";
  WHILE LENGTH(str) != 0 DO
    -- take the next word from the query
    SET temp_str = SUBSTRING_INDEX(str, ' ', 1);
    IF temp_str = str
    THEN
      SET str = '';
    ELSE
      SET str = SUBSTRING(str, LENGTH(temp_str) + 2);
    END IF;
    -- join the per-word conditions (the connective was missing in the source; AND requires every word to match)
    IF output != '' THEN
      SET output = CONCAT(output, ' AND ');
    END IF;
    SET output = CONCAT(output, first_part, temp_str, last_part);
  END WHILE;
  RETURN output;
END
The function returns a fragment of the generated search query that you just need to substitute and execute:
SET @WHERE = CONCAT('SELECT * FROM `users` WHERE ', search('Habra Khabrovich'));
PREPARE prepared FROM @WHERE;
EXECUTE prepared;
You can also use temporary tables; they make processing the query results noticeably more convenient.
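A hypothetical sketch of that variant (the table name found is illustrative): the generated condition is used to materialize the matches into a temporary table, which can then be joined, paged or filtered further.
SET @WHERE = CONCAT('CREATE TEMPORARY TABLE found SELECT * FROM `users` WHERE ', search('Habra Khabrovich'));
PREPARE prepared FROM @WHERE;
EXECUTE prepared;
DEALLOCATE PREPARE prepared;

SELECT * FROM found ORDER BY surname, name LIMIT 10;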
Using third-party solutions
There are a number of third-party full-text search solutions; the most popular are Sphinx and projects based on Apache Lucene. Using them makes little sense with small amounts of data (as in our example), and sometimes it is simply impossible because of external restrictions (the hosting provider, an evil admin, one's own crooked hands, and so on).
Comparison
Let's compare the speed of typical queries for the full-text search methods shown above (excluding the third-party solutions). We will use 50 queries of varying complexity and a PHP script that calculates the average search time for each of the methods. To bring the measurements closer to real conditions, we also run a second, control pass with the same search queries; this shows how well each method benefits from MySQL's caching mechanisms.
Comparison of search query speed in the MySQL database for the InnoDB table using the various methods:
| Method | Average time per query (sec.) | Average time per repeated query (sec.) |
| --- | --- | --- |
| Using the Mirror Table in MyISAM | 0.029738 | 0.011974 |
| Using a Mirror Table in MyISAM with Cached Data | 0.025652 | 0.012027 |
| Using a Keyword Table in MyISAM | 0.027876 | 0.008866 |
| Query parsing and direct search in InnoDB | 0.136091 | 0.09541 |
As expected, direct LIKE search in InnoDB turned out to be the slowest and loses significantly to all the others. This method can of course be optimized further, but that is unlikely to give a substantial gain in speed.
The three remaining methods performed at roughly the same level. As practice shows, with a large number of identical queries the keyword (tag) table in MyISAM gives a tangible advantage, while with a large number of different queries the second approach, the cached mirror, wins. If some fields differ greatly in size from the others (article content, news text), the first method, the mirror table, is more effective.
Creating MyISAM mirrors is worth doing for small tables (10-50 thousand records); if the table is larger and technical constraints allow it, it is better to use third-party engines (Sphinx, Apache Lucene).