Postgres: sampling N random entries

While working on a project, I needed to put together a kind of testing system. The task was formulated like this:

  • out of N rows in the database, select m (3-5) random rows in a series of k samples (typically k = 2).

And now the same thing in plain language: you need to select 3-5 random rows from the table, twice. There must be no duplicates between the samples, and the selection must be random.

The first thing that comes to mind:

SELECT *
FROM data_set
WHERE id NOT IN (1, 2, 3, 4, 5)
ORDER BY random()
LIMIT 5;

And it even works. The problem is the price of such a solution: ORDER BY random() forces Postgres to read the entire table, compute random() for every row, and sort them all just to return five.
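A quick way to see this cost (my own check, not from the original text) is to ask Postgres for the query plan; it will typically show a sequential scan of the whole table followed by a sort, with the LIMIT applied only at the very end:

EXPLAIN
SELECT *
FROM data_set
WHERE id NOT IN (1, 2, 3, 4, 5)
ORDER BY random()
LIMIT 5;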

So I had to appeal to a "higher intelligence", which produced a hint of a solution.

WITH RECURSIVE r AS ( 
    WITH b AS (SELECT min(id), max(id) FROM table1) 
    ( 
        SELECT id, min, max, array[]::integer[] AS a, 0 AS n 
        FROM table1, b 
        WHERE id > min + (max - min) * random() 
        LIMIT 1 
    ) UNION ALL ( 
        SELECT t.id, min, max, a || t.id, r.n + 1 AS n 
        FROM table1 AS t, r 
        WHERE 
            t.id > min + (max - min) * random() AND 
            t.id <> all(a) AND 
            r.n + 1 < 10 
        LIMIT 1 
    ) 
) 
SELECT t.id FROM table1 AS t, r WHERE r.id = t.id;

But the solution is, to put it mildly, hard to understand, and dragging obscure code into the project did not feel right. So it was decided to take the proven route and find a proper explanation of how recursive queries actually work.

Well, the overall logic became more or less clear: n times we pick a single row at a random offset from the minimum id (the primary key). This way the query touches only a handful of rows instead of enumerating the entire contents of the table.
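To make the idea concrete, here is a minimal sketch of one such pick (my own illustration, not part of the quoted solution), using the same table1 and integer id: jump to a random point between min(id) and max(id) and take the first row above it.

SELECT t.id
FROM table1 AS t,
     (SELECT min(id) + (max(id) - min(id)) * random() AS threshold
      FROM table1) AS b
WHERE t.id > b.threshold
ORDER BY t.id
LIMIT 1;

Repeating this a few times, while skipping the ids already picked, is exactly what the recursive CTE above automates.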

But "it was smooth on paper." When trying to use "as is" (adjusted for table / field names), "surprises" surfaced:

  1. The array a in the anchor part of the CTE starts out empty, so duplicates can appear in the final sample. Say the anchor query returns id == 5; then the recursive member (the UNION ALL block) runs and at some step returns the same id == 5, because the previous value never made it into the array a and the t.id <> all(a) check passes.
  2. The number of values in the output was never more than requested (the r.n + 1 < 10 condition), but it was easily less, even when the table was guaranteed to contain enough rows. Sometimes the query returned an empty result (zero rows).

The first "feature" turned out to be relatively easy to deal with: it was enough to change the array initialization slightly, like this:

SELECT id, min, max, array[]::integer[] || id AS a, 0 AS n 

The second point took more head-scratching. The catch sits in the very "heart" of the algorithm: picking a row from a range. The recursive query tries to select a row that satisfies the condition:

id > min + (max - min) * random()

But when random() returns 1, this condition degenerates into:

id > max

Naturally, in that case the query finds nothing and the recursion stops. And if this happens on the very first pass, the output is empty, even though the table obviously contains suitable records.

The first impulse was to slightly change the condition and bring it roughly to this form:

id >= min + (max - min) * random()

This "solution", so to speak, guaranteed at least one row in the output, but it still did not guarantee the requested number of rows. I had to think further.

After long deliberation, many attempts, and liters of stimulants, the following variant was born:

The query:
WITH RECURSIVE r AS (
  WITH b AS (
    SELECT
      min(t.id),
      (
        SELECT t.id
        FROM table1 AS t
        ORDER BY t.id DESC
        LIMIT 1
        OFFSET 9
      ) max
    FROM table1 AS t
  )
  (
    SELECT
      id, min, max, array[]::integer[] || id AS a, 0 AS n
    FROM table1 AS t, b
    WHERE
      id >= min + ((max - min) * random())::int
    LIMIT 1
  ) UNION ALL (
    SELECT t.id, min, max, a || t.id, r.n + 1 AS n
    FROM table1 AS t, r
    WHERE
      t.id >= min + ((max - min) * random())::int AND
      t.id <> all(a) AND
      r.n + 1 < 10
    LIMIT 1
  )
)
SELECT t.* FROM table1 AS t, r WHERE r.id = t.id;

The whole trick is in the subquery that computes max. To guarantee N elements in the output, we have to start from the worst-case scenario: random() returning 1 N times in a row. That means we need N-1 more ids above the value we treat as max. How do we get that in a query? One option: sort all records by id in descending order and step down by N rows. And since >= is used in both the anchor and the recursive conditions, the offset can be one less, i.e. N-1.
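Pulled out on its own, that subquery reads like this (same as in the query above, where the sample size is 10); it returns the 10th-largest id, so even if random() lands at the very top of the range on every pass, there are still ten distinct ids satisfying id >= max:

-- skip the (sample size - 1) largest ids and return the next one
SELECT t.id
FROM table1 AS t
ORDER BY t.id DESC
LIMIT 1
OFFSET 9;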

Good: the main difficulties are resolved, and only the "details" remain. The query in its current form is of little use, because a real selection has to respect additional conditions. In the simplest case, we need to exclude the ids of the records selected in the previous step; on top of that, there may be extra conditions on the rows themselves (is_active = true, is_deleted = false, ...). After some thought, it becomes clear that these conditions have to go into all the essential parts of the query (almost every subquery), as in the following template:

The template:
WITH RECURSIVE r AS (
  WITH b AS (
    SELECT
      min(t.{pk}),
      (
        SELECT t.{pk}
        FROM {table} AS t
        WHERE {exclude} t.is_active
        ORDER BY t.{pk} DESC
        LIMIT 1
        OFFSET {limit} - 1
      ) max
    FROM {table} AS t WHERE {exclude} t.is_active
  )
  (
    SELECT
      t.{pk}, min, max, array[]::integer[] || t.{pk} AS a, 0 AS n
    FROM {table} AS t, b
    WHERE
      t.{pk} >= min + ((max - min) * random())::int AND
      {exclude}
      t.is_active
    LIMIT 1
  ) UNION ALL (
    SELECT t.{pk}, min, max, a || t.{pk}, r.n + 1 AS n
    FROM {table} AS t, r
    WHERE
      t.{pk} >= min + ((max - min) * random())::int AND
      t.{pk} <> all(a) AND
      r.n + 1 < {limit} AND
      {exclude}
      t.is_active
    LIMIT 1
  )
)
SELECT {fields} FROM {table} AS t, r WHERE r.{pk} = t.{pk}


The named parameters in curly brackets should be replaced as follows:

  • {table} - the name of the table;
  • {pk} - the name of the primary key field;
  • {fields} - the list of fields to select (you can also specify "*");
  • {exclude} - a condition (or set of conditions) that excludes records from the selection, ending with AND because the template places it right before t.is_active. For example, "t.id NOT IN (1,2,3,4) AND";
  • {limit} - the number of records in the final selection.


And finally, the last question on the list, but not the least in importance: is the game worth the candle? What is the "profit" from this contraption? Let's check.

First, create a table in which the experiments will be conducted.

DROP TABLE IF EXISTS ttbl;
CREATE TABLE IF NOT EXISTS ttbl
(
  id serial NOT NULL,
  is_active BOOL NOT NULL DEFAULT true,
  CONSTRAINT ttbl_pkey PRIMARY KEY (id)
)
WITH (OIDS=FALSE);

Now fill it with data. The id values must not run strictly sequentially but should have "holes": not "1, 2, 3, 4, 5, ..." but something like "1, 4, 5, 8, ...". To do this, we sketch a simple script.

The script:
import random

import psycopg2

DB_TABLE = 'ttbl'
PK_NAME = 'id'
DATABASES = {
    'NAME': 'test_db',
    'USER': 'user_test',
    'PASSWORD': 'R#Q9Jw*ZEKWOhBX+EP|3/xGkQn3',
    'HOST': 'localhost',
    'PORT': '',
}

TOTAL = 10000000  # upper bound for the generated id values
step = 10000      # width of one batch of base values
dev = 8           # random offset (0..dev-1) added to each base value

# open the connection and cursor
conn = psycopg2.connect(
    dbname=DATABASES['NAME'],
    user=DATABASES['USER'],
    password=DATABASES['PASSWORD'],
    host=DATABASES['HOST'],
    port=DATABASES['PORT'] or None,
)
cur = conn.cursor()

current = 0
while current <= TOTAL:
    # base values inside the batch go 0, 10, 20, ...; each gets a random 0..7 offset,
    # so the resulting ids keep "holes" between them
    data = set()
    for x in range(current, current + step, 10):
        data.add(x + int(random.random() * dev))
    cur.execute(
        "INSERT INTO {t_table} ({t_pk}) VALUES {t_items};".format(
            t_table=DB_TABLE,
            t_pk=PK_NAME,
            t_items='(' + '), ('.join(map(str, data)) + ')'
        )
    )
    current += step
    print(current)

conn.commit()


As you can see from the code, each INSERT adds about a thousand records. The base values grow in steps of 10, and each of them gets a random offset in the range 0-7 (dev). So for two consecutive base values of x, say 340 and 350, the table can receive any values from the ranges 340-347 and 350-357 (342, 347, 340, ...; 351, 355, 356, ...).
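A quick way to confirm the "holes" (a hypothetical sanity check of my own, not from the original text):

-- the ids should come back spaced roughly 10 apart, with random gaps between them
SELECT id FROM ttbl ORDER BY id LIMIT 10;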

In total, the table ended up with:
select count(id) from ttbl;

1,001,000 entries

A fairly decent amount. Now let's try making selections. Clearly a single run is not representative, so we will do a series of consecutive runs: for definiteness, 5 runs of each type of query. We will tabulate the results and calculate the average.
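One possible way to collect the timings (an assumption about methodology; the original does not say how the measurements were taken) is psql's built-in timer:

\timing on
-- psql now prints "Time: ... ms" after every statement;
-- run each of the two queries below five times and average the reported values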

The simple query first:
select t.*
from ttbl t
where 
  t.id not in (1, 3, 10, 89, 99, 22, 24, 25, 28, 30) 
  AND t.is_active
order by random()
limit 5;

Results:
  Run   Time, ms
   1       697
   2       605
   3       624
   4       593
   5       611

As you can see, the average query execution time is about* 600 ms.

And now, the "monster":
WITH RECURSIVE r AS (
  WITH b AS (
    SELECT
    min(t.id),
    (
      SELECT t.id
      FROM ttbl AS t
      WHERE
        t.id NOT IN (1, 3, 10, 89, 99, 22, 24, 25, 28, 30)
        AND t.is_active
      ORDER BY t.id DESC
      LIMIT 1
      OFFSET 5 - 1
    ) max
    FROM ttbl AS t
    WHERE 
      t.id NOT IN (1, 3, 10, 89, 99, 22, 24, 25, 28, 30)
      AND t.is_active
  )
  (
    SELECT
      id, min, max, array[]::integer[] || id AS a, 0 AS n
    FROM ttbl AS t, b
    WHERE
      id >= min + ((max - min) * random())::int AND
      t.id NOT IN (1, 3, 10, 89, 99, 22, 24, 25, 28, 30) AND
      t.is_active
    LIMIT 1
  ) UNION ALL (
    SELECT t.id, min, max, a || t.id, r.n + 1 AS n
    FROM ttbl AS t, r
    WHERE
      t.id >= min + ((max - min) * random())::int AND
      t.id <> all( a ) AND
      r.n + 1 < 5 AND
      t.id NOT IN (1, 3, 10, 89, 99, 22, 24, 25, 28, 30) AND
      t.is_active
    LIMIT 1
  )
)
SELECT t.* FROM ttbl AS t, r WHERE r.id = t.id


Results:
  Run   Time, ms
   1        15
   2        17
   3         8
   4        12
   5        12

The average time is about* 15 ms.

The total difference is about one and a half orders of magnitude (40-50 times). It was worth it.

The queries were also checked from a cold start (after restarting the machine / the Postgres daemon). Although the absolute run times changed, the ratio stayed roughly the same: the recursive query was always several tens of times faster than the head-on solution.

Notes


* "About", because the exact value does not really matter here: there are deviations caused by the Postgres cache, the operating system, the hardware, other running software, and so on.
