MySQL Select random strings in one query

What do we have?


There is a weak laptop, a table with several million rows, and you need to choose a different number of random rows in one query. Further samples do not interest us.

The table (test) has the following structure:
  • - pk_id (primary key)
  • - id (field filled with different numbers)
  • - value (field filled with rand ())

The primary key has no holes and starts at 1 .

Production methods


  1. ORDER BY rand + LIMIT

    image

    Getting one line:
    SELECT pk_id FROM test ORDER BY rand() LIMIT 1
    

    The average runtime in MySQL is 6.150 seconds.

    Let's try to take 100 records.
    SELECT pk_id FROM test ORDER BY rand() LIMIT 100
    

    The average runtime is 6.170-6.180 seconds.
    That is, the time difference between receiving 1 and 100 random strings is not significant.

  2. COUNT * rand ()

    image

    Getting one line:
    SELECT t.pk_id FROM test as t,
    (SELECT ROUND(COUNT(pk_id)*rand()) as rnd FROM test LIMIT 1) t
    WHERE t.pk_id = rnd
    

    Via
    ROUND(COUNT(pk_id)*rand())
    
    we get a random number from 0 to the number of rows in the table.
    Next, we assign the alias "rnd" to our random number and use it in WHERE for an equivalent comparison with pk_id.
    The average execution time is 1.04 seconds.
    Next, you need to slightly modify this request so that you can stretch several lines.
    Add a few more fields to our subquery and change the WHERE check from "=" to IN
    SELECT t.pk_id FROM test as t,
    (SELECT ROUND(COUNT(pk_id)*rand()) as rnd,
    ROUND(COUNT(pk_id)*rand()) as rnd2,
    ROUND(COUNT(pk_id)*rand()) as rnd3
    FROM test LIMIT 1) t
    WHERE t.pk_id IN (rnd,rnd2,rnd3)
    

    Average lead time is 1.163 seconds.
    With an increase in the number of rows received, the query execution time significantly increases.
    About 100 lines, it's even scary to think :)

  3. INFORMATION_SCHEMA + LIMIT

    image

    Getting one line:
    SELECT t.pk_id FROM test as t,
    (SELECT ROUND((SELECT table_rows as tr FROM information_schema.tables WHERE table_name = 'test') *rand()) as rnd 
    FROM test LIMIT 1) tmp
    WHERE t.pk_id = rnd
    

    Using the subquery, we get the number of rows in the table 'test', without using the aggregate function COUNT and further comparison occurs as in method 2. The
    average execution time is 0.042 seconds.
    The minimum noticed execution time is 0.003 seconds.
    Let's try to get 100 lines:
    SELECT t.pk_id FROM test as t,
    (SELECT ROUND((SELECT table_rows as tr FROM information_schema.tables WHERE table_name = 'test') *rand()) as rnd 
    FROM test LIMIT 100) tmp
    WHERE t.pk_id in (rnd)
    ORDER BY pk_id
    

    Change in the WHERE "=" to change the IN and the limit of rows returned by the subquery to 100.
    The average running time - 0.047 seconds
    time to produce 1000 records - 0.053 seconds
    Time to 10,000 records ~ 0.21 cekundy
    And finally, 100 000 entries take as 1.9 seconds
    Less of The approach is that in the resulting number of rows from INFORMATION_SCHEMA is slightly larger than COUNT (*) and therefore when returning 100,000 rows, 7-8 rows are lost. At 1-100, this is practically absent (The larger the table, the less chance). But you can always take 1-2 lines more for reinsurance :)

  4. MAX * rand ()

    Getting one line:
    SELECT t.pk_id FROM test as t,
    (SELECT ROUND((SELECT MAX(pk_id) FROM test) *rand()) as rnd 
    FROM test LIMIT 1) tmp
    WHERE t.pk_id = rnd
    

    Average lead time - 0.001 seconds
    Getting 100 rows:
    SELECT t.pk_id FROM test as t,
    (SELECT ROUND((SELECT MAX(pk_id) FROM test) *rand()) as rnd 
    FROM test LIMIT 100) tmp
    WHERE t.pk_id in (rnd)
    ORDER BY pk_id
    

    Average lead time is 0.003 seconds.

    The fastest way, compared to the previous ones.


conclusions


  • The first method is good in that in any case it will return a random string to you, regardless of the holes in the fields and their initial value, but the slowest
  • The second method is much better in tables where there are no holes. It works 6 times faster than the first method (on returning one line).
  • The third method can be used at your own peril and risk (which is very insignificant), because you can lose a row (s) with a rand () value as close to 1. As possible, the return speed of one row differs by 150 times compared to the first method.
    If it was not 100 lines that returned, but 99, then you can send a request to the server again.
  • The fourth method is the fastest and 6000 times faster ORDER BY rand ()


UPD: In the case of holes in the table, when returning one row in the second and third way, you can do an unequal check and> = and add LIMIT 1. Then the value will be returned even if it fell into the “hole”
Thanks xel for this remark.
UPD2: Added 4 receiving method. Thanks smagen for the idea.

Also popular now: