Better Pseudorandom Numbers

Pseudorandom number generators (PRNGs) assist simulations by giving different values tied to choices a person might make. True randomness cannot happen with deterministic software. Following in the footsteps of Donald Knuth (or directly), using the term “random” here to discuss deterministic values is easier, even if less precise than “pseudorandom”.

Present a large enough set of numbers, typically at least several thousand, to the Chi-Squared test for randomness and it will give the mathematical probability that the generated set has random qualities. Chi-Squared is not perfect, but it will give a good enough answer.

PRNGs built in to a programming language may have varying quality as Chi-Squared can show. But, a programmer could use even a good PRNG in a way that reduces its quality. When generating a random set for simulation, carelessly fitting a large range of possibilities into a smaller range of desired values can have devastating consequences on the quality of the set. Using Chi-Squared to test randomness of several approaches will show the best of several possible techniques.

Chi-Squared

When identifying a specific set of numbers, know the minimum, maximum, and range. The minimum and maximum are the smallest and largest values allowed. Difference is max minus min. Range is difference plus one, that is, for the integers from 1 to 10, min is 1, max is 10 and the range is 10-1+1=10. In other words, there are 10 numbers between 1 and 10 inclusive. While this seems obvious, consider the range -3 to +3: the difference is 3-(-3)=6, but the range is 3-(-3)+1=7.

Recall that A*B means multiply A by B, and A/B means divide A by B. Also, function names are frequently abbreviated, such as sqrt() for “square-root”, use parentheses to surround their comma-separated arguments, and typically return a value of some kind. Given a set of randomly generated numbers (N) in a specific range (R), such as positive numbers (greater than 0) less than 10, 100, or 1,000:

  1. Count how many times each number in the range appeared.
  2. Sum the squares of the counts (S).
  3. Scale the sum of squares by their expected frequency and subtract the number of samples: Chi-Squared = (S * R / N) - N.
  4. Chi-Squared result should be equal to the range (R) and no farther away than 2 * sqrt(R).

If the PRNG generated numbers in the range from 1 to 100 (R=100), the Chi-Squared result should equal 100, plus or minus 20 — that is, any Chi-Squared value lower than 80 or more than 120 is bad for that generated set. Anything closer to 100 is good. If the range was 1000, values between 937 and 1063 would be acceptable, with 1000 the ideal.

Test ${RANDOM}

In bash(1), the shell offers access to a PRNG through the shell variable, RANDOM. At the command line, enter:

$ echo ${RANDOM}

That gives a random number in the range {0..32767}. This is only 15 bits, so if bigger numbers are needed this wouldn’t be a good PRNG. Why 15 bits, not 16? Because the 16th bit has the sign, positive or negative,and only positive and zero come from ${RANDOM}.

Even so, how good is this generator? Here’s a script to test PRNGs:

#!/bin/bash

# randtest
#
# Generate random numbers by an algorithm in randgen(), then submit
# those numbers to a Chi-Squared test.

# Use bash's RANDOM shell variable
R=32768                 # Range of the called PRNG
prng() {                # Bash $RANDOM
  echo ${RANDOM}
}

randgen() {
  local i num

  for (( i = 0; i < N; i++ ))
  do
    num=$(prng)
    (( v[num]++ ))
  done
}

chisquared() {
  local i

  # Sum the squares of the counts
  for i in ${v[*]}
  do
    (( S += i * i ))
  done

  (( chisq = S * R / N - N ))
}

if (( "$#" < 1 ))
then
  echo "Missing argument!"                           1>&2
  echo "Usage: $(basename $0) N"                     1>&2
  echo "where N is qty to generate, minimum=10,000." 1>&2
  exit 1
fi

N=$(( $1 < 10000 ? 10000 : $1 ))

randgen

chisquared

gawk 'END {
  worst = 2 * sqrt('${R}');
  off = '${chisq}' - '${R}';

  printf("%d diff nums, ",  '${#v[*]}');
  printf("Chi2=%d, ",       '${chisq}');
  printf("Best=%d, ",       '${R}');
  printf("Worst=+/-%.0f, ", worst);
  printf("Off=%+4.0f ",     off);
  printf("(%+6.1f%%)\n",    (1 - (off < 0 ? -off : off) / worst) * 100);
}' </dev/null

Name the script file ~/bin/randtest and use chmod u+x on it to give yourself permission to run it.

The first function, named prng(), holds the PRNG to test. An R variable is set to the PRNG’s range. Bash‘s man page states this is restricted to the values {0..32767}, which is a range of 32,768 possible numbers. The prng() function outputs the current value of the shell variable, ${RANDOM}. Can’t get much easier.

The next function, randgen(), declares two local variables, i and num. Variable i will be the loop counter while variable num will hold the latest random number. When the function ends, local variables disappear from memory, restoring memory space to the system.

A for() loop starts by initializing, putting 0 into i, then tests whether the current value of i (still 0) is less than the variable N. Bash‘s arithmetic expression feature, using double parentheses, allows N to gives its value without requiring the $ prefix, but you can use the $ if it’s helpful. N is set as a global variable so it’s not defined within the function. It’s the quantity of random numbers to generate and will be set from the command line. If i‘s value, now 0, is less than N, run everything between the do and the done. After that code finishes, add one (increment) to i in the third section of the for() loop line, which makes it 1. Retest to see if that new value is still less than N. If so, rerun the do...done code. Keep going like that and eventually i will be the same value as N, so the loop will quit. When the loop quits, the randgen() function ends and returns.

The loop’s do...done body, executed once for each value of i, will call the prng() function and store the result into the num variable. Now, num becomes an index to an array variable named v[] for values. Arrays variable names are designated with brackets. The v[] array represents every random number picked. The subscript — the number inside the brackets — is the random number. The value stored in that element of v counts how many times the random generator selected that number. The ++ operator increments that counter. Using the double parens calculates without using more repetitive assignment syntax, such as i=i+1. Instead of using the num variable, the $(prng) function call, embedded in a subshell, could have been placed directly in the brackets, such as

(( v[$(prng)]++ ))

Doing this would eliminate the local num variable. Why use num? To allow inserting a debugging statement showing what number generated before committing it to the v[] array increment.

The next function is chisquared(). This uses a different loop style, for...in, to run through all the random numbers generated. Using the asterisk in the brackets, ${v[*]}, presents every hit count in the v[] array, space separated, to the loop. Each value now in i goes to the loop body where the it is squared then summed into S. When the loop finishes, scale that sum (S) across the range (R) per the total samples (N), and subtract the number of samples. Store that result in variable, chisq.

With all the functions defined, run the script’s main execution sequence. First check whether there was a command line argument. The bash variable, $#, has the number of command line arguments. If it’s less than 1, the user either forgot to give the quantity to generate (N) or wants to know how to use the program. Give the user a reminder. Using 1>&2 at the end of each echo line sends messages to the error output so it doesn’t mix with actual data output, then quits with a bad exit code number 1, meaning an error happened. A good exit code is 0. Think of it as answering the question, “Was there an error?” with false (0). Any non-zero exit code is a variation on true, meaning there was an error.

With a command line argument given, the largest random number to generate, execution continues. Sanity test that count. The conditional operator (?:), known as a ternary operator because it has three parts, tests a condition written before the ?. That question: Is the first command line argument ($1) less than a threshold?

From Knuth’s advice, 10,000 seems a reasonable minimal number of PRNG iterations. If less than that minimum, use what is given between the ? and : characters. Otherwise, any larger number is valid so use it after the : character. Assign the result to N.

N is a global variable for how many random numbers to generate. All the functions can use it. Call the randgen() function, then call the chisquared() function. Display the results for the values generated.

Bash‘s arithmetic is integer only, so it has none of the typical floating point functions, such as sqrt(). The bc(1) program calculates nicely, but its output scaling is troublesome. Delivering the shell script’s results to gawk(1) makes both floating point and sqrt() available and offers reliable formatting.

Usually, gawk takes a file name or a stream of data from the standard input, redirectable from the keyboard, and writes its data to the standard output, redirectable from the screen. Because this data neither comes from a file nor streams through stdin, gawk needs to know to just run a little program. The easiest way to do that is use gawk‘s END rule.

END code, like all gawk rules, is surrounded by braces {...}. It executes when all other rules finish the input stream. But a script like this has no stream, just a bunch of variables already set. Instead, redirecting input from /dev/null tells gawk the stream ends right away. That triggers immediate execution of the END rule.

Two values appear more than once in the output but were not calculated by the script. The Chi-Squared should get no farther away from the best value, R, than 2 * sqrt(R), the worst. How much Chi-Squared differs from the best is the offset. END calculates those two first.

Notice a little trick: command line gawk programs, as opposed to files, appear inside apostrophes to hide certain syntax from the shell. After the first apostrophe before END, the program returns to the shell when another apostrophe appears after sqrt‘s open parenthesis, just before ${R}. Then comes another apostrophe. The second apostrophe gives control back to the shell, which resolves ${R} to its range value. The third apostrophe makes gawk continue, using that shell variable’s value as a gawk constant. This close-open trick appears several times in the script to resolve the shell variables that the rest of the shell script filled.

The first printf(1) line displays the quantity of different random numbers using the ${#v[*]} notation. Recall that v[] is an array holding all the random number counts with the random number itself acting as the array index. The asterisk selects all the elements. Using the # gives the element count in the array, which is how many different random numbers generated.

The second shell variable ${chisq} shows the calculated Chi-Squared value. The third shows the best Chi-Squared value, ${R}, that the PRNG run could deliver, representing the perfect score. The most that Chi-Squared is allowed to deviate from the best score and still be considered at least somewhat acceptable comes from sqrt(). That limit stores in the variable named worst, displayed by the fourth printf(). A Chi-Squared difference from the best is shown in the offset variable (off), displayed by the fifth printf().

The sixth printf() takes the absolute value of the offset to turn it into a percentile independent of whether it is greater or smaller than the best. But, gawk does not have an absolute function, so it is created using the conditional operator (?:). Coming before the ? is a conditional test. If the test is true, use what comes between the ? and the : and move on. If it’s false, use what comes between the : and the end of the operation. Typically it’s best to put this expression into parentheses to separate it from the rest of the formula, as in this case. For absolute, a negative offset delivers its positive value. An already positive offset is left alone. Related the absolute value to the worst and then turn it into a fitness percentage.

The smallest offset possible is 0, which means the Chi-Squared was identical to the range — the best. When the offset is 100% of the worst allowed, that’s still acceptable, if weak. In other words, it could be more or less than the best, but not too much more and not too much less. The closer Chi-Squared is to the best, the smaller the offset. The smaller the offset, the closer the percentage gets to the 100%. It can’t go higher than +100%, but it could go not only down to 0% — the worst acceptable — but could get so bad that it’s in negative territory. That’s unacceptable. But, Chi-Squared can hiccup, so only a few such are allowed when the majority are acceptable.

Bash ${RANDOM} Tests

Here are two runs testing bash‘s RANDOM variable:

$ for n in {1..10}; do ./randtest 10000; done
8659 diff nums, Chi2=32362, Best=32768, Worst=+/-362, Off=-406 ( -12.1%)
8569 diff nums, Chi2=33142, Best=32768, Worst=+/-362, Off=+374 (  -3.3%)
8597 diff nums, Chi2=32893, Best=32768, Worst=+/-362, Off=+125 ( +65.5%)
8675 diff nums, Chi2=32270, Best=32768, Worst=+/-362, Off=-498 ( -37.6%)
8651 diff nums, Chi2=32552, Best=32768, Worst=+/-362, Off=-216 ( +40.3%)
8533 diff nums, Chi2=33483, Best=32768, Worst=+/-362, Off=+715 ( -97.5%)
8640 diff nums, Chi2=32775, Best=32768, Worst=+/-362, Off=  +7 ( +98.1%)
8586 diff nums, Chi2=33076, Best=32768, Worst=+/-362, Off=+308 ( +14.9%)
8536 diff nums, Chi2=33417, Best=32768, Worst=+/-362, Off=+649 ( -79.3%)
8603 diff nums, Chi2=32978, Best=32768, Worst=+/-362, Off=+210 ( +42.0%)

$ for n in {1..10}; do ./randtest 100000; done
31250 diff nums, Chi2=32418, Best=32768, Worst=+/-362, Off=-350 (  +3.3%)
31214 diff nums, Chi2=32755, Best=32768, Worst=+/-362, Off= -13 ( +96.4%)
31253 diff nums, Chi2=32726, Best=32768, Worst=+/-362, Off= -42 ( +88.4%)
31194 diff nums, Chi2=32848, Best=32768, Worst=+/-362, Off= +80 ( +77.9%)
31268 diff nums, Chi2=32844, Best=32768, Worst=+/-362, Off= +76 ( +79.0%)
31125 diff nums, Chi2=32377, Best=32768, Worst=+/-362, Off=-391 (  -8.0%)
31194 diff nums, Chi2=33127, Best=32768, Worst=+/-362, Off=+359 (  +0.8%)
31271 diff nums, Chi2=32714, Best=32768, Worst=+/-362, Off= -54 ( +85.1%)
31201 diff nums, Chi2=33018, Best=32768, Worst=+/-362, Off=+250 ( +30.9%)
31203 diff nums, Chi2=33069, Best=32768, Worst=+/-362, Off=+301 ( +16.9%)

The first run generated 10,000 different random numbers while the second run generated 100,000. The 10,000 run averaged about 8,600 different numbers with nearly 1400 repeats. A 10,000 run represents about 30% of the 32,768 range, but about 86% of the unique numbers were used. The 100,000 run is slightly more than 3x the range, giving a better chance that all 32,768 different numbers would get used.

In the 10,000 test, five negative runs (1st, 2nd, 4th, 6th, and 9th) showed entirely outside the allowable range. Negative percents show the PRNG way out of whack by being far outside the allowed worst. Sometimes Chi-Squared fouls up, but this is too many. The remaining five were acceptable, but one (7th) was very close to perfect.

The 100,000 test did much better. Only one miss (6th). Two other near misses (1st and 7th) raise questions. The rest of the runs were very promising, but with two (9th & 10th) closer to worst. Still, the $RANDOM shell variable isn’t bad, just not great.

C rand() Algorithm

A widely implemented algorithm is in C programming language libraries. Implementations vary (see a typical 15-bit version and a 32-bit version or direct). For more bits than 15, generate multiple values for so many bits, then shift those bits into position to combine as one number.

Try this definition:

# Use C's implementation (15-bit): See PJ Plauger, Std C Lib, pg 359
R=32768
seed=$(date '+%N')
prng() {
  (( seed = seed * 1103515245 + 12345 ))
  echo $(( (seed >> 16) & (R - 1) ))
}

Just add it to the script and comment out the earlier prng(). Again, while a 15-bit range isn’t wonderful, it’ll do for testing. Language libraries, like C and C++, use a 31-bit range.

The seed is taken once when the program runs, coming from the current nanosecond of the current second, it picks a number from 0 to 999,999,999. Obviously that cycles each second, but it’s not likely predictable for arbitrary program runs. In the test loop, such a seed selects for each of the 10 runs.

$ for n in {1..10}; do ./randtest 10000; done
8611 diff nums, Chi2=32860, Best=32768, Worst=+/-362, Off= +92 ( +74.6%)
8670 diff nums, Chi2=32349, Best=32768, Worst=+/-362, Off=-419 ( -15.7%)
8677 diff nums, Chi2=32336, Best=32768, Worst=+/-362, Off=-432 ( -19.3%)
8609 diff nums, Chi2=32860, Best=32768, Worst=+/-362, Off= +92 ( +74.6%)
8608 diff nums, Chi2=32781, Best=32768, Worst=+/-362, Off= +13 ( +96.4%)
8639 diff nums, Chi2=32611, Best=32768, Worst=+/-362, Off=-157 ( +56.6%)
8596 diff nums, Chi2=32821, Best=32768, Worst=+/-362, Off= +53 ( +85.4%)
8612 diff nums, Chi2=32683, Best=32768, Worst=+/-362, Off= -85 ( +76.5%)
8568 diff nums, Chi2=33162, Best=32768, Worst=+/-362, Off=+394 (  -8.8%)
8596 diff nums, Chi2=32860, Best=32768, Worst=+/-362, Off= +92 ( +74.6%)

$ for n in {1..10}; do ./randtest 100000; done
31195 diff nums, Chi2=33238, Best=32768, Worst=+/-362, Off=+470 ( -29.8%)
31229 diff nums, Chi2=32766, Best=32768, Worst=+/-362, Off=  -2 ( +99.4%)
31246 diff nums, Chi2=32713, Best=32768, Worst=+/-362, Off= -55 ( +84.8%)
31225 diff nums, Chi2=32873, Best=32768, Worst=+/-362, Off=+105 ( +71.0%)
31188 diff nums, Chi2=32669, Best=32768, Worst=+/-362, Off= -99 ( +72.7%)
31227 diff nums, Chi2=32880, Best=32768, Worst=+/-362, Off=+112 ( +69.1%)
31225 diff nums, Chi2=32717, Best=32768, Worst=+/-362, Off= -51 ( +85.9%)
31295 diff nums, Chi2=32104, Best=32768, Worst=+/-362, Off=-664 ( -83.4%)
31166 diff nums, Chi2=33077, Best=32768, Worst=+/-362, Off=+309 ( +14.7%)
31225 diff nums, Chi2=32968, Best=32768, Worst=+/-362, Off=+200 ( +44.8%)

In the 10,000 run, three of them (2nd, 3rd, & 9th) were negative. The seven remaining runs were nicely within range. Of those, one (6th) was farther away than most.

The 100,000 run showed two negatives (1st & 8th), but the rest were in range. Of those, one was clearly poorer than the rest (9th), but the three were clearly good (2nd, 3rd, & 7th).

From these samples, the Linear Congruential algorithm specified here seems to behave a little better than the one using bash‘s RANDOM. Still, these don’t look reliable for serious use. For less demanding simulations and random samplings, maybe.

True Random Number Generator

TRNGs collect noise data from hardware. On Linux, this is accessible from the special device, /dev/random. Reading from /dev/random gives a stream of truly random bits, up to 512 bytes at a time. But the bits are not always available. They depend on what the system is doing. If no bits are available, /dev/random blocks reads — that is, the reader is stuck waiting — until enough new values come.

Another special device, /dev/urandom, delivers up to 32 megabytes at a time from a TRNG bit stream when they’re available, supplemented by PRNG-created bytes when they’re not available. This semi-TRNG method does not block the reader, at the price of less randomness.

Blocking from /dev/random can be a problem. You typically don’t want to wait for it unless you need it for a security application, but those don’t have a problem with the 512-byte limit. For lower-quality uses /dev/urandom works just fine. Do not use /dev/urandom when high-quality randomness is required.

Because /dev/random and /dev/urandom are byte streams, delivering their bytes continuously until reading finishes or they’re all gone, reading a limited quantity from the shell works as follows:

# Using ${1} as N, get an N-byte wide number from urandom(4)
geturandom () {
  # NB: /dev/urandom returns up to 32MB of data. Use od(1) to limit it.
  echo $(( $(od -An -tu${1} -N${1} /dev/urandom) ))
}

This shell function will grab bytes from urandom and turn it into a decimal number. Pass the number of bytes needed as an argument. It works like this:

  1. od(1) turns binary data from /dev/urandom into readable numeric form, typically octal, hence the name, “Octal Dump”.
  2. Regular output of an octal (base-8) or hexadecimal (base-16) dump produces the byte offset number, the address. The -An option suppresses that so only the byte value will show.
  3. To show decimal (base-10) instead of the typical octal, set the type (-t) to unsigned decimal for the byte count passed in argument 1.
  4. Read only the number of bytes (-N) specified by that same byte count in argument 1.

The following examples use 2-byte (16-bit) numbers in a 65,536 range:

$ geturandom 2
50148

$ geturandom 2
58869

$ geturandom 2
28391

Here are 4-byte (32-bit) numbers in a 4,294,967,296 range:

$ geturandom 4
4194066069

$ geturandom 4
276452463

$ geturandom 4
3203788291

The same can be done with /dev/random — just change the function name and the /dev name. Beware that too much use could cause it to block:

$ getrandom 2
22022

$ getrandom 2
4627

$ getrandom 2
24003

Same with 4-byte (32-bit) numbers:

$ getrandom 4
4100038244

$ getrandom 4
395724824

$ getrandom 4
3428645410

Semi-Random vs True Random

So, what does the Chi-Squared look like for these generators? Plug in the geturandom() function and change prng() like this:

# Use the kernel's /dev/random or /dev/urandom generator for 16 bits
R=65536
prng() {
  geturandom 2
}

Here are the results of two runs that were very much slower:

$ for n in {1..10}; do ./randtest 10000; done
9277 diff nums, Chi2=65628, Best=65536, Worst=+/-512, Off= +92 ( +82.0%)
9235 diff nums, Chi2=66297, Best=65536, Worst=+/-512, Off=+761 ( -48.6%)
9308 diff nums, Chi2=65091, Best=65536, Worst=+/-512, Off=-445 ( +13.1%)
9244 diff nums, Chi2=65995, Best=65536, Worst=+/-512, Off=+459 ( +10.4%)
9249 diff nums, Chi2=66061, Best=65536, Worst=+/-512, Off=+525 (  -2.5%)
9280 diff nums, Chi2=65405, Best=65536, Worst=+/-512, Off=-131 ( +74.4%)
9292 diff nums, Chi2=65418, Best=65536, Worst=+/-512, Off=-118 ( +77.0%)
9278 diff nums, Chi2=65536, Best=65536, Worst=+/-512, Off=  +0 (+100.0%)
9275 diff nums, Chi2=65536, Best=65536, Worst=+/-512, Off=  +0 (+100.0%)
9280 diff nums, Chi2=65405, Best=65536, Worst=+/-512, Off=-131 ( +74.4%)

$ for n in {1..10}; do ./randtest 100000; done
51420 diff nums, Chi2=65078, Best=65536, Worst=+/-512, Off=-458 ( +10.5%)
51343 diff nums, Chi2=65150, Best=65536, Worst=+/-512, Off=-386 ( +24.6%)
51342 diff nums, Chi2=65259, Best=65536, Worst=+/-512, Off=-277 ( +45.9%)
51216 diff nums, Chi2=65713, Best=65536, Worst=+/-512, Off=+177 ( +65.4%)
51099 diff nums, Chi2=65968, Best=65536, Worst=+/-512, Off=+432 ( +15.6%)
51273 diff nums, Chi2=65657, Best=65536, Worst=+/-512, Off=+121 ( +76.4%)
51300 diff nums, Chi2=64959, Best=65536, Worst=+/-512, Off=-577 ( -12.7%)
51380 diff nums, Chi2=65221, Best=65536, Worst=+/-512, Off=-315 ( +38.5%)
51204 diff nums, Chi2=65536, Best=65536, Worst=+/-512, Off=  +0 (+100.0%)
51256 diff nums, Chi2=65592, Best=65536, Worst=+/-512, Off= +56 ( +89.1%)

Both the 10,000 and 100,000 runs did very well with only two negatives in the 10,000 run and only one negative in the 100,000 run. The 10,000 run of 16-bit urandom values showed two poorer runs (3rd & 4th) but generally good. The 100,000 run had four at the lower end (1st, 2nd, 5th, & 8th) of the allowed range.

To access the TRNG, change geturandom() in the prng() to use getrandom(). At two bytes per access, it won’t exceed the 512-byte limit, but if the system is relatively idle there could be some extra wait time to the test. Here are the results for getrandom():

$ for n in {1..10}; do ./randtest 10000; done
9272 diff nums, Chi2=65563, Best=65536, Worst=+/-512, Off= +27 ( +94.7%)
9259 diff nums, Chi2=65772, Best=65536, Worst=+/-512, Off=+236 ( +53.9%)
9272 diff nums, Chi2=65471, Best=65536, Worst=+/-512, Off= -65 ( +87.3%)
9299 diff nums, Chi2=65300, Best=65536, Worst=+/-512, Off=-236 ( +53.9%)
9285 diff nums, Chi2=65340, Best=65536, Worst=+/-512, Off=-196 ( +61.7%)
9264 diff nums, Chi2=65667, Best=65536, Worst=+/-512, Off=+131 ( +74.4%)
9251 diff nums, Chi2=65838, Best=65536, Worst=+/-512, Off=+302 ( +41.0%)
9292 diff nums, Chi2=65274, Best=65536, Worst=+/-512, Off=-262 ( +48.8%)
9292 diff nums, Chi2=65196, Best=65536, Worst=+/-512, Off=-340 ( +33.6%)
9238 diff nums, Chi2=66087, Best=65536, Worst=+/-512, Off=+551 (  -7.6%)

$ for n in {1..10}; do ./randtest 100000; done
51270 diff nums, Chi2=65689, Best=65536, Worst=+/-512, Off=+153 ( +70.1%)
51355 diff nums, Chi2=65264, Best=65536, Worst=+/-512, Off=-272 ( +46.9%)
51292 diff nums, Chi2=65276, Best=65536, Worst=+/-512, Off=-260 ( +49.2%)
51357 diff nums, Chi2=65159, Best=65536, Worst=+/-512, Off=-377 ( +26.4%)
51277 diff nums, Chi2=65293, Best=65536, Worst=+/-512, Off=-243 ( +52.5%)
51195 diff nums, Chi2=65692, Best=65536, Worst=+/-512, Off=+156 ( +69.5%)
51260 diff nums, Chi2=65942, Best=65536, Worst=+/-512, Off=+406 ( +20.7%)
51173 diff nums, Chi2=65660, Best=65536, Worst=+/-512, Off=+124 ( +75.8%)
51282 diff nums, Chi2=65595, Best=65536, Worst=+/-512, Off= +59 ( +88.5%)
51417 diff nums, Chi2=64647, Best=65536, Worst=+/-512, Off=-889 ( -73.6%)

In both the 10,000 and 100,000 runs, only one failed (10th) in each case. The rest were within Chi-Squared allowance. These fails could be the Chi-Squared hiccup.

Very little delay happened while the /dev/urandom and /dev/random versions ran. There were delays. To see this behavior, insert a call to date(1) in randgen() right after the do before the prng() call. Then, run it as follows:

for n in {1..10}; do ./randtest 10000; done | uniq -c | less

This lists the counts for each second showing how many random numbers were generated during that second. Ignore first and last entries before the Chi-Squared announcement. They will be smaller because of fewer generated numbers during the first and last fractions of a second. The full seconds in between should have counts about the same, with small variations due to system busyness. If one is significantly fewer, the random generator waited for an entropy refill.

Getting All the Bits

Random generators deliver bits that are either truly random (TRNG) coming from unpredictable sources, or pseudorandom (PRNG) having mathematical characteristics that appear random while not entirely unpredictable. Combining bits delivers numbers larger than 1, but the grouped bits may show mathematical characteristics less random than other bits in the same combined number. Problems appear most commonly in the least significant bits — the ones on the right. Linear Congruential generators tend to suffer this. Yet, in restricting the set of bits to a small range of numbers, such as rolling a die restricts the numbers to the range 1-6, or guess a number between 1 and 100 games, the most common mistake is using the MOD operator to isolate the big range generated into the small range desired. MOD keeps the least significant bits and throws away the most significant.

Arithmetic division delivers two answers: a quotient and a remainder. Regular floating point division combines the two into the answer, putting the quotient into the integer part of the answer and the remainder into the fractional part. Some arithmetic libraries — and calculators — provide separate DIV (quotient only) and MOD (remainder only) operations. These may be represented by a special symbol or word, but for now, call them DIV and MOD. Full division takes the remainder, represents it as a fraction of the divisor with the remainder as the numerator, then shows the decimal fraction or the rational fraction, depending on the calculator.

For example, three divided by two has the quotient 1 and the remainder 1, so the result could be represented in three ways:

  • 1 and 1/2 (mixed number using a rational fraction)
  • 1.5 (mixed using a decimal fraction)
  • 1R1 (R-notation: quotient, “R” separator, and remainder)

To scale a large number into a small subset of numbers, divide the large number by the range, throw out the quotient, and keep the remainder. Think of it this way: Even numbers are those divided by two without a remainder, that is, the remainder is zero. Odd numbers have a remainder of one. No matter what number divides by 2, the remainder will be only one of those two values. Similarly, divide any number by 6. Only six remainders can exist: 0, 1, 2, 3, 4, and 5. This suggests a rule:

There are as many remainders as the divisor with the smallest always 0 and the largest always one less than the divisor.

To roll a six-sided die, divide a random number by 6 and the remainder will be one of six possible values. Because a six-sided die typically ranges from 1 to 6, not the remainders 0 to 5, add one. So, the rule for scaling a large number into a smaller range:

scalednum = (num MOD range) + minimum

For any number, fit it into a desired range by dividing by that range. Throw out the quotient. Keep the remainder. Add the minimum possible value of the range.

For example, pick a number between 1 and 10:

  1. Range = diff + 1; diff = max - min; min = 1, max = 10; therefore: diff = 10 - 1 = 9; range = 9 + 1 = 10 numbers
  2. NumScaled = ( randomnum MOD 10 ) + 1

For another example, pick a number between -3 and +3:

  1. Range = +3 - -3 = 6 + 1 = 7 numbers
  2. NumScaled = ( randomnum MOD 7 ) + -3

In other words, a random number divided by 7 results in numbers with the range of 0 to 6 — seven different numbers. Add -3 to the smallest possible generated number (0) gives -3. Add that same -3 to the largest possible generated number (6) gives +3. Works pretty good.

Some generators deliver poor results in the least significant bits. If those bits are less random, using MOD isolates those bits by throwing away the more random bits on the left. Scaling methods using MOD seem simple, but they can wreak havoc. There must be a better way.

One solution takes bits from the left side or from the middle. For example, a six-sided die only requires three bits because those bits can store up to eight values (0 to 7). Say that 16 bits are available. Throw away the rightmost four bits by dividing by 16 to shift the bits four positions to the right. The original rightmost four bits fall off the cliff. That leaves 12 bits, or 4,096 values. Maybe those remaining 12 are more random than the discarded four. Dividing by 64 shifts six bits to the right, discarding the rightmost six bits. That leaves 10 bits, allowing 1,024 possible values when working with unsigned numbers. Working with 32-bit or 64-bit numbers, or even more if your generator offers more, allows a bigger range of desired numbers from the bits remaining.

But, throwing out low bits reduces the range of the generator. If the Chi-Squared of the generator is good, why not use the whole range?

Too Many Numbers

Another problem with using MOD to narrow the desired range is that extra numbers remain. For example, say the PRNG range is 16-bit (65,536), 32-bit (4,294,967,296), or even 64-bit (18,446,744,073,709,551,616). To simulate a die throw, divide the whole range possible by 6:

16-bit:                    10,922 R 4

32-bit:               715,827,882 R 4

64-bit: 3,074,457,345,618,258,602 R 4

For the 16-bit PRNG, there will be 10,922 of each possible number in the range of 6, but there will be 4 left over. If there was 0 left over you’d be able to use the full generator range. In other words, each of these generators gives numbers evenly distributed for the six values desired almost up to the end of the range, but then there’s a remainder: 4. That means the generator would give an extra 0, 1, 2, and 3, but not 4 or 5, throwing off the even distribution.

Another example: Pick a number from 1 to 100. A 16-bit generator gives 36 extra, 32-bit gives 96 extra, and 64-bit gives 16 extra. Those extras throw off the equal likelihood of numbers in the range.

One solution is to discover that the latest random number generated is outside the desired range. So, if R is the range of the generator, but N is the range you desire, prevent use of any generated number larger than R - (R MOD N). For the die throw, N is 6, preventing use of the last four values. For a six-sided die, a 16-bit generator should only give 65,532 numbers, not 65,536.

A generator gives its full range numbers. When one number is outside the desired range limit, reject it and get the next random number. Keep rejecting until the latest one falls in the desired range. That maintains the even distribution.

Buckets of Numbers

So far, two useful rules:

  1. To avoid the possible bad bits on the right side, don’t use MOD to throw the bits away. Use all the bits to get the desired range.
  2. To avoid an uneven distribution of numbers, don’t use the excess numbers in the generated range. Use the whole generated range but reject the portion at the end that exceeds the desired range.

To get the most out of a good generator, using all the bits means using the generator’s entire range except for the excess numbers according to the desired range. To accomplish this, don’t throw bits away to make the generator’s number conform with the desired range. Instead, subdivide the generator’s range into as many groups as the desired range. Each group, known as a bucket, holds all the numbers corresponding with one value in the desired range. The 3 bucket holds all the numbers that become a 3 and the 7 bucket holds all the numbers that become a 7.

Take the six-sided die example again. A 16-bit generator has 65,536 possible values. Dividing that by 6 gives 10,922R4. That means there will be 10,922 values in the generator’s total range for each of the desired buckets, 0-5, and four left that must be rejected to keep the distribution even. The buckets look like this:

0:           0..10,921

1:      10,922..21,843

2:      21,844..32,766

3:      32,766..43,687

4:      43,688..54,609

5:      54,610..65,531

Reject: 65,532..65,535

That accounts for every possible positive number in unsigned 16 bits. Any number generated will fall into one of those buckets. Say 22,222 comes up. Divide the number by the bucket size and the quotient becomes the bucket number. So, 22222 / 10922 is 2 with a remainder of 378, so quotient 2 is the bucket number. Recalling the minimum is 1, so, 2 + 1 means the die landed with 3 up. BTW, the remainder is which one of the 10,922 numbers in that bucket got picked, the 378th in bucket 2 got picked.

If R is the generator’s range and N is the desired range, the quantity in each bucket is the quotient (DIV) of R / N and the leftover to reject is the remainder (MOD) of R / N.

For example, pick a number N from 1 to 100. Using a 16-bit generator (R = 65,536):

The quotient:  R / N = 65536 / 100   = 655 numbers in each bucket
The remainder: R / N = 65536 MOD 100 =  36 numbers left to reject

Got a 32-bit generator? A 64-bit generator? That means each bucket holds more values, but the method is the same.

An Unrestricted Random Number Generator

Combining all this, here is a shell function using geturandom():

# Using ${1} as max, get a random number from 0..max
getrandomnum () {
  local rand_size=4                           # Limit generator to 32-bits
  local rand_max=$(( 2 ** (8 * ${rand_size}) ))

  local num_rand=$(( ${rand_max} + 1 ))       # Total values generated
  local num_bkts=$(( ${1} + 1 ))              # Total values requested

  local bkt_size=$(( num_rand / num_bkts ))   # Each bkt has this many nums
  local excess=$(( num_rand % num_bkts ))     # Generated leftover nums

  local x=$(geturandom ${rand_size})          # Generate a number
  while (( x >= (num_rand - excess) ))        # Does it fit in a bkt?
  do
    x=$(geturandom ${rand_size})              # Regen if in the excess set
  done

  echo $(( x / ${bkt_size} ))                 # Conform to requested range
}

Call this general function with an argument identifying the maximum number to generate. All variables defined here are local so they do not take memory from other parts of the program this function is in.

The generator draws from a 32-bit range (rand_size set to 4 bytes). This gives rand_max the maximum number possible for that many bytes from the generator. If a 64-bit range is needed, it may not be good enough to change rand_size to 8.

The bash installation may have a 32-bit or 64-bit limit, either signed or unsigned. Easiest way to find out what that limit is and whether it’s signed or unsigned is to run some quick command line tests:

$ echo $(( 2 ** 32 ))
4294967296

$ echo $(( 2 ** 64 ))
0

$ echo $(( 2 ** 63 - 1 ))
9223372036854775807

If the unsigned 32-bit test works, as the first example above, move on to the unsigned 64-bit test, the second example. If that works unsigned 64-bit integers are available. But this example failed. Third test is a signed 64-bit test. It drops the exponent by one and then subtracts one because signed 64-bit integers use 63 bits for the number and 1 bit for the sign (2’s complement notation). The third example succeeded, so this bash uses signed 64-bit integers.

When needing bigger integers, consider rewriting this function to use bc(1), which does decimal, not binary, arithmetic. For example:

$ echo '2 ^ 64' | bc
18446744073709551616

With rand_max set, add one to set num_rand to the total possible values the generator can give. Using 32 bits gives the 4-billion number above. Using the maximum argument passed in to the function in ${1}, set the number of buckets, then define how many of the total possible values fit into each bucket. Note the excess left over. Those numbers reject so the distribution will be even.

Then, the real work begins. Grab a number from geturandom() — or getrandom(), as preferred. If that number is bigger than the limit, falling into the excess territory, reject it and grab another. Keep doing so until the number fits the even distribution limit. Because the excess is so much smaller than the useful range, this loop might not cycle even once and, if so, maybe only once.

With the fit number selected, divide by the bucket size to restrict it to the desired max.

Here are five example runs using ./getrannum 65535:

28542
4350
24971
47832
49801

Using ./getrannum 6:

3
1
4
5
0

Using ./getrannum 100:

24
6
80
17
63

To use unsigned 32-bits, try using ./getrannum $(echo '2^32-1' | bc):

2795549804
4213141588
3027597758
1362896132
2801758545

To use signed 32-bits, ./getrannum $(echo '2^31-1' | bc):

1551147516
2061547205
820704577
909295193
1191660005

When using random numbers, it’s important to use a generator with good Chi-Squared results, to use all the numbers available in the generator, and to have a distribution that does not allow excess numbers. The code shown here easily translates to other programming languages.

Leave a Comment