
Quick selection of random values from large MySQL tables by condition
The task of choosing random rows from a table quite often arises for developers.
In case MySQL is used, it is usually solved in the following way: Such code works extremely slowly for large tables. If the query does not need to use WHERE or the table is small, there are effective solutions, for example habrahabr.ru/post/54176 or habrahabr.ru/post/55864 . But I did not find any ready-made solutions for a large table and the need to filter by condition, receiving new values at each request, so the description of my method is under a cat.
As it turned out, MySQL cannot efficiently select random rows using ORDER BY rand () LIMIT N, where it is necessary to filter rows by condition (although the same MSSQL does an excellent job of selecting random rows from a table with a large number of records).
So, solving the “head-on” problem, the request (in the table 5 million records): The request took 41.3544 seconds, which is unacceptably long. It is impossible to find the maximum and minimum id, and then select random id from the gap in this case: due to the WHERE clause, id are no longer in order and discharged. My solution is the following: a random_seed table is added, containing the id and random_seed fields, filled with random numbers, an index is added to this column, and the index is added to the column by which the selection will take place.
Now, in order to select random strings by condition, the query needs to be changed as follows (in the table 5 million records): The query took 0.0460 seconds, which is more than an acceptable result. The variable random_from_php is generated by the code that calls the request, which provides a random set of values for each request, this number will provide a selection of new random numbers. The random_seed table should have as many values as the table from which you want to take random rows + N records, where N is the maximum possible value of random_from_php. A real example from my work is the selection of random categories of goods from different combinations (total 4,000,000 records):
Advantages of the method described above:
+ The fastest possible way to select random rows from the table by condition
+ No need to re-generate random numbers for each row in the table
+ Request all the necessary values does not iterate, in one query
Cons:
- The need to enter an additional table
- The need to change familiar queries
In case MySQL is used, it is usually solved in the following way: Such code works extremely slowly for large tables. If the query does not need to use WHERE or the table is small, there are effective solutions, for example habrahabr.ru/post/54176 or habrahabr.ru/post/55864 . But I did not find any ready-made solutions for a large table and the need to filter by condition, receiving new values at each request, so the description of my method is under a cat.
SELECT *
FROM users
WHERE role_id=5
ORDER BY rand()
LIMIT 10
As it turned out, MySQL cannot efficiently select random rows using ORDER BY rand () LIMIT N, where it is necessary to filter rows by condition (although the same MSSQL does an excellent job of selecting random rows from a table with a large number of records).
So, solving the “head-on” problem, the request (in the table 5 million records): The request took 41.3544 seconds, which is unacceptably long. It is impossible to find the maximum and minimum id, and then select random id from the gap in this case: due to the WHERE clause, id are no longer in order and discharged. My solution is the following: a random_seed table is added, containing the id and random_seed fields, filled with random numbers, an index is added to this column, and the index is added to the column by which the selection will take place.
SELECT *
FROM users
WHERE role_id=5
ORDER BY rand()
LIMIT 10
Now, in order to select random strings by condition, the query needs to be changed as follows (in the table 5 million records): The query took 0.0460 seconds, which is more than an acceptable result. The variable random_from_php is generated by the code that calls the request, which provides a random set of values for each request, this number will provide a selection of new random numbers. The random_seed table should have as many values as the table from which you want to take random rows + N records, where N is the maximum possible value of random_from_php. A real example from my work is the selection of random categories of goods from different combinations (total 4,000,000 records):
SELECT
u1.*
FROM
users u1,
random_seed rs
WHERE
u1.role_id=5 AND u1.id=(rs.id+random_from_php)
ORDER BY
rs.random_seed
LIMIT 10
Normal: request | “Accelerated” request: |
SELECT oc1. * FROM object_category oc1 WHERE oc1.region_id = 6 ORDER BY RAND () LIMIT 10 | SELECT oc1. * FROM object_category oc1, random_seed rs WHERE oc1.id = (rs.id + 564756) AND oc1.region_id = 6 ORDER BY rs.random_seed LIMIT 10 |
Lead time: | |
1.726s | 0.007s |
1.851s | 0.010s |
1.803s | 0.006s |
1.784s | 0.008s |
Advantages of the method described above:
+ The fastest possible way to select random rows from the table by condition
+ No need to re-generate random numbers for each row in the table
+ Request all the necessary values does not iterate, in one query
Cons:
- The need to enter an additional table
- The need to change familiar queries