Speeding up the selection of arbitrary MySQL records
Recently the audience has once again taken up the question of selecting random rows from a table. There are plenty of optimization tricks out there, and I probably won't show you anything new; I'll just recall the main optimization methods: simplifying the query and indexing. Without the usual preamble about freelancers, straight to the point ;)
I create a table:
CREATE TABLE `a` (
  `id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `md5` char(32) NOT NULL,
  PRIMARY KEY (`id`)
);
INSERT INTO `a` (`id`) VALUES (null),(null),(null),(null)... and so on, 163712 times ;)
UPDATE `a` SET md5 = MD5(`id`);
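By the way, if typing out 163712 VALUES is not appealing, here is a sketch of an alternative (my addition, not part of the original recipe): seed one row and keep doubling the table with INSERT ... SELECT, then apply the same MD5 fill as above.
-- seed one row, then double the row count on each repetition;
-- 18 doublings give 2^18 = 262144 rows, in the same ballpark as 163712
INSERT INTO `a` (`md5`) VALUES ('');
INSERT INTO `a` (`md5`) SELECT `md5` FROM `a`;  -- repeat until the table is big enough
UPDATE `a` SET `md5` = MD5(`id`);               -- same fill as above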
A table of this size is enough to test effectiveness on my antediluvian computer.
Here is a plain ORDER BY RAND() selection:
SELECT * FROM `a` ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.3345 sec)
SELECT * FROM `a` ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.2538 sec)
SELECT * FROM `a` ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.2299 sec)
I create an indexed field so I don't have to full-scan the whole table:
ALTER TABLE `a` ADD `f` SMALLINT(3) UNSIGNED NOT NULL, ADD INDEX (`f`);
UPDATE `a` SET `f` = RAND()*1000;
The number 1000 is the main "accelerating" factor: it is the factor by which I shrink the slice of the table that the ordinary ORDER BY RAND() has to sort. With a table of 163712 rows you should get about 164 rows per value of f. Let's check:
SELECT COUNT(1) FROM `a` WHERE `f` = 123; -> 169
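To see the overall spread rather than a single bucket, here is a quick check of the smallest, largest and average bucket size (my addition, same table as above):
SELECT MIN(c), MAX(c), AVG(c) FROM (SELECT COUNT(*) AS c FROM `a` GROUP BY `f`) AS t;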
Random is random: a perfectly even distribution would be nice, but that's fantasy (although, you know, you could take the first characters of MD5(`id`) and convert them to INT, it doesn't get much more uniform than that). In practice I saw anywhere from about 100 to 200 rows per value of f. If this becomes ineffective over time, you can always increase the factor and get, say, 25-75 rows per index value. The main thing is that each bucket holds at least as many rows as we need to pull out at random. Column f can be regenerated periodically, say after every 1000 reads of the table. New rows can get f = 0 on insert, which will not hurt sample quality much, or you can assign random values on insert as well.
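A short sketch of that maintenance, assuming the application controls its own INSERTs and keeps the 1000-bucket factor used above:
-- re-shuffle the whole column periodically (from cron, or after ~1000 reads)
UPDATE `a` SET `f` = FLOOR(RAND() * 1000);
-- or give new rows a random bucket right away instead of f = 0
INSERT INTO `a` (`md5`, `f`) VALUES (MD5('new row'), FLOOR(RAND() * 1000));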
I make a test sample of 10 rows using what I indexed:
SELECT * FROM `a` WHERE `f` = 231 ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.0801 sec)
SELECT * FROM `a` WHERE `f` = 231 ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.0017 sec)
Aha, the second run was a re-sort of a result already sitting in the MySQL query cache. If repeated results are not a problem, this faster behavior is fine as is, but it is better to change the value of f on every request. I repeat the test, changing f:
SELECT * FROM `a` WHERE `f` = 396 ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.0147 sec)
SELECT * FROM `a` WHERE `f` = 753 ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.0020 sec)
SELECT * FROM `a` WHERE `f` = 189 ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.0019 sec)
SELECT * FROM `a` WHERE `f` = 945 ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.0235 sec)
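(A side note, not part of the original test: if you want to be sure the query cache isn't flattering these numbers while benchmarking, MySQL of that era accepts the SQL_NO_CACHE hint.)
SELECT SQL_NO_CACHE * FROM `a` WHERE `f` = 396 ORDER BY RAND() LIMIT 10;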
Overall, sampling performance has increased roughly 120-fold, and without any real complications. I see quite a few conveniences in this solution:
- no need to dig into the wilds later and rack your brains ("damn, how did I do this... ah yes, that stored procedure");
- simple integration ;) the query code was complicated by just one extra condition;
- and third, extensibility: add your own conditions and you shrink the scanned range, thereby increasing the sampling speed, for example:
SELECT * FROM `a` WHERE `f` IN (100,500) ORDER BY RAND() LIMIT 10;
A more real-life example
For this example, let's take the top comment from a neighboring post; its problem can be solved like this. We emulate a table of RSS feed items by adding a feed field that holds the feed number:
ALTER TABLE `a` ADD `feed` TINYINT(1) UNSIGNED NOT NULL, ADD INDEX (`feed`);
UPDATE `a` SET feed = RAND()*9;
And now the actual trick:
(SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 0 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 1 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 2 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 3 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 4 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 5 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 6 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 7 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 8 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 9 ORDER BY RAND() LIMIT 10)
ORDER BY feed; -> (97 rows, Query took 0.7973 sec)
With another f -> (99 rows, Query took 0.0093 sec)
With another f -> (98 rows, Query took 0.0197 sec)
Here it makes sense to select a few more than 10 rows per feed, to be safe ;) and filter out the excess in PHP. It also makes sense to pick the value of f in PHP, so as not to send two queries to the MySQL server. Although this is not critical; the following also works very quickly:
SET @rnd = FLOOR(RAND() * 1000); SELECT * FROM `a` WHERE `f` = @rnd ORDER BY RAND() LIMIT 10;
As you can see, optimization does not have to mean complication (in this article I optimized for speed). Now a question for you to think about, and for me to lay out in a future article: how can the quality of random samples be optimized, first of all by getting rid of repetitions? ;)
Regards,
Mayam.