Sometimes while you are first building and populating a data warehouse, or after it is populated with millions of records, you may want to get a better sampling of records to review instead of a simple TOP 1000. You may want to do this for some true statistical analysis or just to spot check some records.
There are a few different ways to get a random (or random-ish) sampling of rows with various pros and cons for ease of use and performance. In this post I will discuss some of the methods I have come across and why they work (or don’t work) the way they do. These techniques should all work in SQL Server 2005 and up.
SELECT FirstName, LastNameFROM Person.PersonTABLESAMPLE (10 PERCENT) ;
This is a quick and dirty way to get a sampling of rows, but you’ll note it’s a “sampling” and is not truly random at the row level. The reason for this distinction is that the TABLESAMPLE clause is actually randomly selecting pages and then returning all the rows from that page. This means that you will end up with clustered rows depending on how the rows are dispersed throughout pages. For the same reason, this method will usually return different numbers of records each time it is run.
The benefits of using this command are that it can be added on to just about any select statement (it can’t be used with derived tables, table valued functions, tables from linked-servers, rowset functions, or OPENXML) and will add relatively minimum overhead.
SELECT TOP 1000 *
ORDER BY NEWID()
A disadvantage to this method is that SQL must generate a GUID at runtime for every row returned. Interestingly, in SQL Server you are not able to use the RAND() function for this purpose because RAND() will generate the same “random” value for every row returned in the batch – whereas NEWID() will generate a unique value for each row.
You may find yourself in a situation where you want to make sure that your sampling is repeatable if you re-run the query, or if you pass the query off to someone else. In order to maintain repeatability, you can try one of the following techniques.
In order to have the result set repeatable, you can add the REPEATABLE option to the TABLESAMPLE clause along with an arbitrary seed value. The syntax is:
SELECT FirstName, LastName
TABLESAMPLE (10 PERCENT)
REPEATABLE (205) ;
oid the RAND() function because it will insert the same “random” number for each record in one INSERT statement. Instead, we can again use the NEWID() function. But for the purpose of a RandomNumber column, we don’t want a GUID, we want a numeric value – preferably a decimal value between 0 and 1 so we can use it to select a percentage of rows. To do this, you can use something like:
CAST(CHECKSUM(NEWID()) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)
- Use CHECKSUM to generate an integer value from the NEWID() GUID.
- Use a bitwise AND (&) operator to compare the GUID checksum to the hexadecimal value 0x7fffffff (which, I found out, is the max signed value for the Int datatype). This ensures that the numerator will always be a positive integer <= the maximum signed Int value (2,147,483,647).
- Divide by the max signed Int value (2,147,483,647 again) to produce a decimal value < 1.
SELECT *FROM MyTableWHERE RandomNumber <= 0.1
FROM MyTableWHERE RandomNumber <=0.1AND MyIdentityColumn <= [CurrentMaxIdValue]