Full-Text Search in InnoDB
Hello, Habr readers!
Full-text search in InnoDB is a well-known headache for many MySQL/InnoDB developers. For those who are not familiar with the problem: the MyISAM storage engine does support full-text search, but MyISAM itself has historical limitations that are critical for some projects. The more advanced InnoDB engine has no full-text search at all. So developers have to put up with either the limitations of MyISAM or the lack of search in InnoDB. I want to describe the ways of organizing a full-fledged search in InnoDB without magic, using only standard tools, and to compare the speed of each method.
As an example, let's take a small table with 10,000 records:
CREATE TABLE users (
  id INT(11) NOT NULL AUTO_INCREMENT,
  login VARCHAR(255) DEFAULT NULL,
  `password` VARCHAR(255) DEFAULT NULL,
  name VARCHAR(255) DEFAULT NULL,
  surname VARCHAR(255) DEFAULT NULL,
  email VARCHAR(255) NOT NULL,
  country VARCHAR(255) DEFAULT NULL,
  city VARCHAR(255) DEFAULT NULL,
  PRIMARY KEY (id)
)
ENGINE = INNODB
In this table we store the data of the site's users. The site has a user search form where an arbitrary query such as “Tolstoy Yasnaya Polyana” can be entered. To process such a query, the search must cover several fields at once: login, name, surname, city, country. The query may be a single word (a name or a city) or a set of words separated by spaces. The problem is that this set of words has to be searched in several fields at once, which is hard to do in InnoDB without additional tools; a sketch of the naive approach is shown below.
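To illustrate the difficulty: a naive InnoDB-only query for “Tolstoy Yasnaya Polyana” has to check every word against every field by hand and still gives no relevance ranking (an illustrative sketch; only two of the three words are shown for brevity):
SELECT * FROM users
WHERE (login LIKE '%Tolstoy%' OR name LIKE '%Tolstoy%' OR surname LIKE '%Tolstoy%'
       OR city LIKE '%Tolstoy%' OR country LIKE '%Tolstoy%')
  AND (login LIKE '%Yasnaya%' OR name LIKE '%Yasnaya%' OR surname LIKE '%Yasnaya%'
       OR city LIKE '%Yasnaya%' OR country LIKE '%Yasnaya%')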
There are several relatively simple ways to organize full-text search in InnoDB:
- Using the Mirror Table in MyISAM
- Using a Mirror Table in MyISAM with Cached Data
- Using a Keyword Table in MyISAM
- Query parsing and direct search in InnoDB
- Using third-party solutions
Let's consider each of them in more detail.
Using the Mirror Table in MyISAM
The first proposed method is to create an additional table in MyISAM. As you know, MyISAM supports full-text search quite well, and we can take advantage of that. All the searchable data from the main table (users) will be copied into this additional table, and triggers will keep it in sync. In the new table we add the fields login, name, surname, city, country. Thus we create a “mirror” of the main table and work with it instead. To enable full-text search, we add a FULLTEXT index across all five fields together:
CREATE TABLE search (
  id INT(11) DEFAULT NULL,
  login VARCHAR(255) DEFAULT NULL,
  name VARCHAR(255) DEFAULT NULL,
  surname VARCHAR(255) DEFAULT NULL,
  country VARCHAR(255) DEFAULT NULL,
  city VARCHAR(255) DEFAULT NULL,
  FULLTEXT INDEX IX_search (city, country, login, name, surname)
)
ENGINE = MYISAM
To synchronize data between the main table and the mirror table, we set up triggers on users for insert, update and delete:
Insert trigger:
CREATE
TRIGGER `insert`
AFTER INSERT
ON users
FOR EACH ROW
BEGIN
  INSERT INTO search (`id`, `login`, `name`, `surname`, `country`, `city`) VALUES (
    NEW.`id`,
    NEW.`login`,
    NEW.`name`,
    NEW.`surname`,
    NEW.`country`,
    NEW.`city`
  );
END
Update trigger:
CREATE
TRIGGER `update`
AFTER UPDATE
ON users
FOR EACH ROW
BEGIN
  DELETE FROM `search` WHERE `id` = NEW.`id`;
  INSERT INTO `search` (`id`, `login`, `name`, `surname`, `country`, `city`) VALUES (
    NEW.`id`,
    NEW.`login`,
    NEW.`name`,
    NEW.`surname`,
    NEW.`country`,
    NEW.`city`
  );
END
And a simple delete trigger:
CREATE
TRIGGER `delete`
AFTER DELETE
ON users
FOR EACH ROW
BEGIN
  DELETE FROM `search` WHERE `id` = OLD.`id`;
END
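Note that when these triggers are created from the standard mysql console client, the statement delimiter has to be changed temporarily so that the semicolons inside BEGIN ... END are not treated as the end of the CREATE TRIGGER statement (a standard pattern, shown here only as a reminder using the delete trigger):
DELIMITER //
CREATE
TRIGGER `delete`
AFTER DELETE
ON users
FOR EACH ROW
BEGIN
  DELETE FROM `search` WHERE `id` = OLD.`id`;
END//
DELIMITER ;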
The search is performed using the following query:
SELECT `users`.* FROM `users`
INNER JOIN `search` ON `search`.`id` = `users`.`id`
WHERE
  MATCH (`search`.`city`, `search`.`country`, `search`.`login`, `search`.`name`, `search`.`surname`) AGAINST ('Vladimir Tupin Saint Petersburg' IN BOOLEAN MODE) > 0
ORDER BY MATCH (`search`.`city`, `search`.`country`, `search`.`login`, `search`.`name`, `search`.`surname`) AGAINST ('Vladimir Tupin Saint Petersburg' IN BOOLEAN MODE) DESC
Here the search is performed in the search table, the result is sorted by relevance, and as output we get the corresponding rows from the users table.
The main advantage of this approach is the flexibility of the search: we can add further indexes and compose new search combinations (country + city, or login + name + surname), so we are free to form new field sets and relevance rules, for example:
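A minimal sketch of such an additional combination, assuming we also want a narrower location-only search (the index name and the sample query string are illustrative, not from the original setup):
ALTER TABLE search ADD FULLTEXT INDEX IX_search_location (country, city);

SELECT `users`.* FROM `users`
INNER JOIN `search` ON `search`.`id` = `users`.`id`
WHERE
  MATCH (`search`.`country`, `search`.`city`) AGAINST ('Saint Petersburg Russia' IN BOOLEAN MODE) > 0
ORDER BY MATCH (`search`.`country`, `search`.`city`) AGAINST ('Saint Petersburg Russia' IN BOOLEAN MODE) DESC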
The disadvantage of this method (as of all the methods that create a “mirror”) is redundant data storage. It is therefore advisable to use it with small amounts of data, as in our example.
Using a Mirror Table in MyISAM with Cached Data
The second method also involves creating a data mirror, but here we store the data in a single field. In our task the search always runs over the same group of fields, so we can combine them into one text field separated by spaces. Thus one text value corresponds to the whole set of searchable fields of a row in the users table. Create a search table with two fields, id and text: id corresponds to the id of the main table (users), and text holds our “cached” data.
CREATE TABLE search (
  id INT(11) DEFAULT NULL,
  `text` TEXT DEFAULT NULL,
  FULLTEXT INDEX IX_search_text (`text`)
)
ENGINE = MYISAM
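If the users table already contains data, the mirror has to be filled once before the triggers take over (a one-off backfill sketch; it assumes the search table starts out empty):
INSERT INTO search (`id`, `text`)
SELECT `id`,
       LOWER(CONCAT_WS(' ', `name`, `surname`, `login`, `country`, `city`))
FROM users;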
Synchronization is also carried out using triggers:
Adding:
CREATE
TRIGGER `insert`
AFTER INSERT
ON users
FOR EACH ROW
BEGIN
  INSERT INTO search (`id`, `text`) VALUES (NEW.`id`,
    LOWER(
      CONCAT_WS(' ',
        NEW.`name`,
        NEW.`surname`,
        NEW.`login`,
        NEW.`country`,
        NEW.`city`
      )
    )
  );
END
Change:
CREATE
TRIGGER `update`
AFTER UPDATE
ON users
FOR EACH ROW
BEGIN
  DELETE FROM search WHERE `id` = NEW.`id`;
  INSERT INTO search (`id`, `text`) VALUES (NEW.`id`,
    LOWER(
      CONCAT_WS(' ',
        NEW.`name`,
        NEW.`surname`,
        NEW.`login`,
        NEW.`country`,
        NEW.`city`
      )
    )
  );
END
Removal:
CREATE
TRIGGER `delete`
AFTER DELETE
ON users
FOR EACH ROW
BEGIN
  DELETE FROM search WHERE `id` = OLD.`id`;
END
The search query looks like this:
SELECT `users`.* FROM `users`
INNER JOIN `search` ON `search`.`id` = `users`.`id`
WHERE
  MATCH (`search`.`text`) AGAINST ('Vladimir Tupin Saint Petersburg' IN BOOLEAN MODE) > 0
ORDER BY MATCH (`search`.`text`) AGAINST ('Vladimir Tupin Saint Petersburg' IN BOOLEAN MODE) DESC
This method is not as flexible as the previous one; however, as we will see later, it wins in speed when there are many different queries.
Using a Keyword Table in MyISAM
The third method is based on building a list of “keywords” (search tags). The keywords are the values of the fields of the users table. For example, for a user with the fields
(id=2144; login=leo; name=Лев; surname=Толстой; city='Ясная Поляна'; country=Россия; email=leo@tolstoy.ru; password=;)
the keywords will be («leo»; «Лев»; «Толстой»; «Ясная Поляна»; «Россия»).
We write all these words into a separate MyISAM table with two fields, id and text: id matches the id of the main table (users), and text holds a keyword tag. Each user from the users table gets 5 rows in the new search table, so we end up with a table of tags for each user.
CREATE TABLE search (
  id INT(11) DEFAULT NULL,
  `text` VARCHAR(255) DEFAULT NULL,
  FULLTEXT INDEX IX_search_text (`text`)
)
ENGINE = MYISAM
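As with the previous method, existing rows have to be turned into tags once before the triggers take over (a one-off backfill sketch, assuming the search table starts out empty):
INSERT INTO search (`id`, `text`)
          SELECT `id`, `login`   FROM users
UNION ALL SELECT `id`, `name`    FROM users
UNION ALL SELECT `id`, `surname` FROM users
UNION ALL SELECT `id`, `country` FROM users
UNION ALL SELECT `id`, `city`    FROM users;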
Data synchronization is again handled by triggers:
Creation:
CREATE
TRIGGER `insert`
AFTER INSERT
ON users
FOR EACH ROW
BEGIN
  INSERT INTO search (`id`, `text`) VALUES
    (NEW.`id`, NEW.`login`),
    (NEW.`id`, NEW.`name`),
    (NEW.`id`, NEW.`surname`),
    (NEW.`id`, NEW.`country`),
    (NEW.`id`, NEW.`city`);
END
Change:
CREATE
TRIGGER `update`
AFTER UPDATE
ON users
FOR EACH ROW
BEGIN
  DELETE FROM search WHERE `id` = NEW.`id`;
  INSERT INTO search (`id`, `text`) VALUES
    (NEW.`id`, NEW.`login`),
    (NEW.`id`, NEW.`name`),
    (NEW.`id`, NEW.`surname`),
    (NEW.`id`, NEW.`country`),
    (NEW.`id`, NEW.`city`);
END
Removal:
CREATE
TRIGGER `delete`
AFTER DELETE
ON users
FOR EACH ROW
BEGIN
  DELETE FROM search WHERE `id` = OLD.`id`;
END
Search query:
SELECT `users`.* FROM `users`
INNER JOIN `search` ON `search`.`id` = `users`.`id`
WHERE
  MATCH (`search`.`text`) AGAINST ('Vladimir Tupin Saint Petersburg' IN BOOLEAN MODE) > 0
GROUP BY `search`.`id`
ORDER BY COUNT(*) DESC
Note that while relevance was previously determined by the built-in MyISAM search engine, here we determine it ourselves. The search returns only the tags that match the query, and the more matching tags a user has, the higher that user ends up in the result set.
This example has a drawback: records with an equal number of matching tags keep their natural order, which is not always correct in terms of relevance.
However, this method has high potential for further development. First, we can add the sum of the relevance scores from MATCH ... AGAINST to the ORDER BY clause, which eliminates the drawback described above. Second, we can add a weight field to the tag table and associate a weight value with each field of the main table; sorting can then take the significance (weight) of individual fields into account, letting us emphasize certain fields without hurting search quality. A sketch of such a weighted variant is shown below.
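A minimal sketch of the weighted variant, assuming a weight column is added to the tag table and filled by the triggers (the column name, the weight values in the comment and the query are illustrative, not from the original setup):
ALTER TABLE search ADD COLUMN weight INT NOT NULL DEFAULT 1;

-- e.g. the triggers could insert surname and name tags with weight 3,
-- login with weight 2, country and city with weight 1

SELECT `users`.* FROM `users`
INNER JOIN `search` ON `search`.`id` = `users`.`id`
WHERE
  MATCH (`search`.`text`) AGAINST ('Vladimir Tupin Saint Petersburg' IN BOOLEAN MODE) > 0
GROUP BY `search`.`id`
ORDER BY SUM(`search`.`weight` * MATCH (`search`.`text`) AGAINST ('Vladimir Tupin Saint Petersburg' IN BOOLEAN MODE)) DESC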
Query parsing and direct search in InnoDB
The fourth method is the harshest one and, unlike the previous ones, does not use MyISAM at all. There are no additional tables and no triggers; we simply search the existing table. First we need to index all the fields that will be searched.
CREATE TABLE users (
  id INT(11) NOT NULL,
  login VARCHAR(255) DEFAULT NULL,
  `password` VARCHAR(255) DEFAULT NULL,
  name VARCHAR(255) DEFAULT NULL,
  surname VARCHAR(255) DEFAULT NULL,
  email VARCHAR(255) NOT NULL,
  country VARCHAR(255) DEFAULT NULL,
  city VARCHAR(255) DEFAULT NULL,
  PRIMARY KEY (id),
  INDEX city (city),
  INDEX country (country),
  INDEX email (email),
  INDEX login (login),
  INDEX name (name),
  INDEX password (password),
  INDEX surname (surname)
)
ENGINE = INNODB
In InnoDB we can only search with the LIKE operator, but for it to work reasonably the query has to be broken into words; otherwise queries consisting of several words will return no results. To split the query into words and assemble the search condition, we write a function:
CREATE
FUNCTION search (str VARCHAR(255))
RETURNS VARCHAR(255) CHARSET cp1251
BEGIN
  DECLARE output VARCHAR(255) DEFAULT '';
  DECLARE temp_str VARCHAR(255);
  DECLARE first_part VARCHAR(255) DEFAULT "CONCAT_WS(' ', `name`, `surname`, `login`, `country`, `city`) LIKE '%";
  DECLARE last_part VARCHAR(255) DEFAULT "%'";
  WHILE LENGTH(str) != 0 DO
    -- take the next word from the query
    SET temp_str = SUBSTRING_INDEX(str, ' ', 1);
    IF temp_str = str
    THEN
      SET str = '';
    ELSE
      SET str = SUBSTRING(str, LENGTH(temp_str) + 2);
    END IF;
    -- join the per-word conditions (the connective was missing in the source; AND requires every word to match)
    IF output != '' THEN
      SET output = CONCAT(output, ' AND ');
    END IF;
    SET output = CONCAT(output, first_part, temp_str, last_part);
  END WHILE;
  RETURN output;
END
The function returns a fragment of the generated search query that you just need to substitute and execute:
SET @WHERE = CONCAT('SELECT * FROM `users` WHERE ', search('Habra Khabrovich'));
PREPARE prepared FROM @WHERE;
EXECUTE prepared;
You can also use temporary tables; they make processing the query results noticeably more convenient.
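A hypothetical sketch of that variant (the table name found is illustrative): the generated condition is used to materialize the matches into a temporary table, which can then be joined, paged or filtered further.
SET @WHERE = CONCAT('CREATE TEMPORARY TABLE found SELECT * FROM `users` WHERE ', search('Habra Khabrovich'));
PREPARE prepared FROM @WHERE;
EXECUTE prepared;
DEALLOCATE PREPARE prepared;

SELECT * FROM found ORDER BY surname, name LIMIT 10;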
Using third-party solutions
There are a number of third-party full-text search solutions; the most popular are Sphinx and projects based on Apache Lucene. Using them makes little sense with small amounts of data (as in our example), and sometimes it is simply impossible because of external restrictions (the hosting provider, an evil admin, one's own crooked hands, and so on).
Comparison
Let's compare the speed of typical queries for the full-text search methods shown above (excluding the third-party solutions). We will use 50 queries of varying complexity and a PHP script that calculates the average search time for each of the methods. To bring the measurements closer to real conditions, we also run a second, control pass with the same search queries; this shows how well each method benefits from MySQL's caching mechanisms.
Comparison of search query speed in the MySQL database for the InnoDB table using the various methods:
| Method | Average time per query (sec.) | Average time per repeated query (sec.) |
| --- | --- | --- |
| Using the Mirror Table in MyISAM | 0.029738 | 0.011974 |
| Using a Mirror Table in MyISAM with Cached Data | 0.025652 | 0.012027 |
| Using a Keyword Table in MyISAM | 0.027876 | 0.008866 |
| Query parsing and direct search in InnoDB | 0.136091 | 0.09541 |
As expected, direct LIKE search in InnoDB turned out to be the slowest and loses significantly to all the others. This method can of course be optimized further, but that is unlikely to give a substantial gain in speed.
The three remaining methods performed at roughly the same level. As practice shows, with a large number of identical queries the keyword (tag) table in MyISAM gives a tangible advantage, while with a large number of different queries the second approach, the cached mirror, wins. If some fields differ greatly in size from the others (article content, news text), the first method, the mirror table, is more effective.
Creating MyISAM mirrors is worth doing for small tables (10-50 thousand records); if the table is larger and technical constraints allow it, it is better to use third-party engines (Sphinx, Apache Lucene).