gleb_l April 19, 2013 at 21:40

MS SQL: pseudo-random data generation using newID (). Features and pitfalls

From the sandbox

It is known that the built-in function newID () is widely used by developers not only for its intended purpose - that is, to generate unique primary keys, but also as a means to generate arrays of pseudorandom data.
As part of the built-in functions, newID () is actually the only one that is not only non-deterministic, but also super-non-deterministic, because unlike everyone else, it is able to produce a new value for each new line, and not the same for the whole batch - which makes it extremely useful for such mass generation. In addition to newID (), newSequentialID () also has this property, but its use elsewhere, except in setting the default value of columns of type uniqueidentifier, is prohibited.
You don’t have to go far for examples - below is the code:

SELECT TOP 100 ABS(CHECKSUM(NEWID())) % 1000
FROM sysobjects A
CROSS JOIN sysobjects B

or this one (if checksum seems like a time-consuming operation):

SELECT TOP 100 ABS(CONVERT(INT, (CONVERT(BINARY(4), (NEWID()))))) % 1000
FROM sysobjects A
CROSS JOIN sysobjects B

It will generate a table of 100 random integers in the range from 0 to 999.

For floating numbers, you can use the property of the rand () function to initialize the generator with an integer:

SELECT TOP 100 RAND(CHECKSUM(NEWID()))
FROM sysobjects A
CROSS JOIN sysobjects B

In this case, rand () is used essentially simply as a converter from the int32 range to the range [0..1). Statistical verification of the distribution quality by this method on the number of records of the order of a million shows that it is not inferior to the standard use of rand (), initialized once, and then used in the loop. Therefore - you can safely use.

Another interesting option is the generation of normally distributed data. Here we will use the Box-Muller method :

SELECT TOP 1000
      COS(2 * PI() * RAND(BINARY_CHECKSUM(NEWID()))) *
      SQRT(-2 * LOG(RAND(BINARY_CHECKSUM(NEWID()))))
FROM sysobjects A
CROSS JOIN sysobjects B
CROSS JOIN sysobjects C

Those who wish can verify that the generated distribution is very close to normal by plotting.

All this works well and allows you to very quickly generate at least a dozen million records without using “head-on” solutions such as loops, cursors, or even inserting records one by one into the database from the application layer. You only need to make sure that the tables that you use as the row source have sufficient capacity, and either increase the number of CROSS JOINs, or use table variables with the desired number of rows as the source.

However, the topic is not only about this. In the vast majority of cases, the generated rows are materialized, that is, inserted into a constant or temporary table, or into a table variable. If this is the case, then you can skip reading further - materialized data will work perfectly. However, there are cases where the above statements are used in subqueries. And here, the seemingly inexplicable features of the behavior of the SQL engine appear. Let's look at them with examples, and then try to analyze why this happens and how to deal with it:

First, just write statement with newID () in a subquery and run it several times in a loop:

declare @c int = 0
while @c < 5
begin
  SELECT * 
  FROM (
      SELECT ABS(CONVERT(INT, CONVERT(BINARY(4), NEWID()))) % 5 AS RNDIDX
      FROM (
	SELECT 1 AS ID
	) ROWSRC
   ) SUBQ
  set @c = @c + 1
end

The code works as expected - it produces 5 result sets, each strictly has one record with a number in the range from 0 to 4. I don’t give a screenshot of the results - when everything is all right, there is little sense in them.

Now more interesting. We are trying to tune the result from SUBQ to some other table. You can create it, or you can make subquery turn into subquery - the result of this will not change. We write:

declare @c int = 0
while @c < 5
begin
  SELECT * 
  FROM (
      SELECT ABS(CONVERT(INT, CONVERT(BINARY(4), NEWID()))) % 5 AS RNDIDX
      FROM (
	SELECT 1 AS ID
	) ROWSRC
  ) SUBQ
  INNER JOIN (
	SELECT 0 AS VAL 
	UNION ALL SELECT 1 
	UNION ALL SELECT 2 
	UNION ALL SELECT 3 
	UNION ALL SELECT 4
  ) NUM ON SUBQ.RNDIDX = NUM.VAL
  set @c = @c + 1
end

We look at the screenshot of the result of the execution - and slowly crawl under the chair - the number of lines in each result set is not exactly 1. Somewhere is empty (this can be explained at least somehow - INNER JOIN did not work because RNDIDX left the range [0. .4] (which is unbelievable in itself!)), But somewhere - more than one (!) Record.

Now we make an innocent change - change INNER to LEFT:

declare @c int = 0
while @c < 5
begin
  SELECT * 
  FROM (
      SELECT ABS(CONVERT(INT, CONVERT(BINARY(4), NEWID()))) % 5 AS RNDIDX
      FROM (
	SELECT 1 AS ID
	) ROWSRC
  ) SUBQ
  LEFT JOIN (
	SELECT 0 AS VAL 
	UNION ALL SELECT 1 
	UNION ALL SELECT 2 
	UNION ALL SELECT 3 
	UNION ALL SELECT 4
  ) NUM ON SUBQ.RNDIDX = NUM.VAL
  set @c = @c + 1
end

We perform - everything began to work correctly (!) - check plz yourself, I did not take a screenshot for the correct operation. Note that since for any RNDIDX value from the range [0..4] that is capable of outputting a SUBQ subquery, there is always a VAL value from the NUM subquery, from the point of view of logic, the result of LEFT and INNER JOIN should be the same. However, in fact this is not so!

Another test - we return INNER, but add TOP / ORDER BY in the first subquery. Why - more on that later, let's just try:

declare @c int = 0
while @c < 5
begin
  SELECT * 
  FROM (
      SELECT TOP 1 ABS(CONVERT(INT, CONVERT(BINARY(4), NEWID()))) % 5 AS RNDIDX
      FROM (
		SELECT 1 AS ID
		) ROWSRC
	  ORDER BY RNDIDX		
  ) SUBQ
  INNER JOIN (
	SELECT 0 AS VAL 
	UNION ALL SELECT 1 
	UNION ALL SELECT 2 
	UNION ALL SELECT 3 
	UNION ALL SELECT 4
  ) NUM ON SUBQ.RNDIDX = NUM.VAL
  set @c = @c + 1
end

Everything works correctly again! Mystic!

Googling, we find out that SQL-developers from all over the world periodically encounter such behavior - examples are here , or here

People assume that subquery materialization helps. Indeed, if we rewrite the example, first selecting the explicit records in the temporary table, and then only having finished it, everything works fine. Why is the replacement of INNER with LEFT or the addition of TOP / ORDER BY where it is not needed? All this is because - in one case there is a materialization of subquery results, in the other - no. An analysis of the plan of a more detailed case, for example, this one can show a more obvious difference:

DECLARE @B TABLE (VAL INT)
INSERT INTO @B
VALUES (0), (1), (2), (3), (4)
SELECT * 
FROM (
      SELECT ABS(CONVERT(INT, CONVERT(BINARY(4), NEWID()))) % 5 AS RNDIDX
      FROM @B
) SUBQ
INNER JOIN @B B ON SUBQ.RNDIDX = B.VAL
Смотрим естественно, только выборку (query 2), заполнение таблицы @B нам ни к чему:

We see that the query splices two streams of rows before calculating the value of the column depending on newID (). This can happen because the SQL engine considers that the value returned by newID (), although non-deterministic, does not change throughout the batch. However, this is not the case - and most likely therefore the request does not work correctly. Now we change INNER to LEFT, and look at the plan:

Yeah, LEFT JOIN forced the SQL engine to execute Compute Scalar before merging the threads, so our query began to work correctly.

Finally, check the version with TOP / ORDER BY:

Actually, the diagnosis is clear. MS SQL does not take into account the peculiarities of newID (), and accordingly, it makes incorrect plans, relying on the constant value returned by the function in the batch scope. There is a work around for this feature - to force the SQL engine to materialize the selection in any way before using it in dependent queries. How you will materialize is up to you, but it is probably best to use table variables, especially if the subsample size is small. Otherwise, the result, to put it mildly, is not 100% guaranteed; in addition, there is no guarantee that one day you yourself or someone else will not review the code by throwing out “unnecessary” TOP / ORDER BY or wisely replacing LEFT with INNER.

Actually, that's all. Have a nice SQL programming!

Tags:

MS SQL: pseudo-random data generation using newID (). Features and pitfalls

Also popular now: