Clemson Home  >  CCIT HomeSkip NavigationA-Z Index    Calendar    CU Safety    Map    Webcams    Phonebook    

Pseudorandom Numbers and Condor

 Many Condor applications involve running large numbers of replications of simulations or randomized algorithms with different streams of "random" numbers. If the streams are not independent, then the results of the computations may be compromised. In this article, we describe how to generate multiple streams of pseudorandom numbers (PRNs) that minimize the risk of correlation between the streams. The article is written from the perspective of a Unix/Linux C programmer, but the principles are applicable across all platforms and implementation environments, although the details of what functions are called and their arguments may differ. 

Pseudorandom Numbers

Functions that purport to generate "random" numbers generally don't. Instead, they generate predictable but irregular pseudorandom sequences that tend to be distributed over their range similarly to some random distribution (most commonly a uniform distribution). Because the sequences are predictable, they are also reproducible--i.e., given the same starting value (the seed), a pseudorandom number generator (PRNG) will generate the same sequence every time it is used. To generate unpredictable (but still deterministic) sequences, the generators must be initialized with unpredictable seeds.

A PRNG has an internal state that is changed at each call. A subset of bits from that state is extracted and returned as the next value in the sequence. Once the internal state returns to a state visited before, the sequence of states repeats. The number of steps in the sequence before the initial state is revisited is the period of the generator. The period may depend on the seed value. The number of possible internal states is generally larger than the number of values that can be returned; thus, the same value can be returned for multiple states. The programmer often transforms the value again to fall in the range of interest and follow the desired distribution.

In the ISO C standard, the standard library provides the function rand(), which returns a sequence of psuedorandom integers between 0 and a number given by the macro RAND_MAX (defined in the standard header file stdlib.h). The function srand() takes an unsigned integer seed value as an argument. The programmer calls srand() to seed the PRNG, then calls rand() repetitively to generate the sequence. (If srand() is not called, the seed is initialized to 1.)

The existence of a global internal state typically makes PRNGs non-reentrant and non-threadsafe, and it means that the programmer can't extract multiple, interleaved streams of PRNs (at least not with the same statistical properties). One solution to this problem is to store the state in the calling program instead of in global memory. A companion function to rand() and srand() defined by the POSIX standard (but not the C standard) is rand_r(). Each call to this function returns the next pseudorandom number in the sequence and stores the internal state in a memory location in the calling program specified by its argument.

C's rand() function and other functions of its ilk may have statistical qualities that are not acceptable for many applications. One issue is that if rand() returns a nonnegative integer value and has an unsigned integer state, then the state contains only one bit more than the value range. For numerical algorithms, Monte Carlo simulations, etc., other PRNG methods provide statistically better sequences. The drand48() family uses a 48-bit state and an 31-bit value, and the POSIX random() family uses a user-definable state defaulting to 31 long integers and an integer value. Other methods may offer additional statistical benefits. But the general ideas presented here apply to most common PRNGs.

Once development and testing are complete, the user is likely to want to generate sequences based on unpredictable seeds. A common technique is to read the system clock and use its value (or its low-order bits) as a seed.

Misusing PRNGs in Condor

In a Condor run of simulations or randomized algorithms, the PRNGs run independently in each job. There are some specialized parallel and multi-stream PRNGs, but their requirement to share data among the processes using them makes them unsuitable for Condor.

Obviously, using rand() in a batch of Condor jobs without seeding the generators or seeding them all with the same value will cause all streams to be perfectly correlated and make the replication useless. Less obviously, the most common ways to seed the streams can result in unexpected correlations that invalidate any statistical analysis that assumes the streams are independent.

Suppose that if seeded with s0, the PRNG generates the stream

       p0, p1, p2, ...
and if seeded with s1, it generates the stream
      p1, p2, p3, ...

and so on. If it happens that process a is seeded with s0 and process b is seeded with s2, say, then the streams in the two processes will be in lock step, two values apart in the sequences. Any statistical analysis of the results that assumes the streams are independent will be invalid.

Independent Streams of PRNs in Condor

One cure for the problems described above is for processes to read non-overlapping portions of the sequence. But the irregularity of the sequence makes it difficult to choose seed values that guarantee that the resulting streams do not overlap. The simplest ways to guarantee non-overlapping streams involve simulating the generation of all values to be used by all jobs in the batch and split the resulting sequence into non-overlapping segments. Fortunately, this procedure is generally much less costly than the actual processing, and requires only a small, fixed amount of memory. The procedure requires that the programmer know an upper bound on the length of the stream that each job requires and that the bound on the total length of all streams not exceed the period of the generator.

If the PRNG makes its internal state available to the programmer, then the generation step can be carried out offline. If the internal state is hidden, then the individual jobs in the batch must carry out portions of the simulation themselves. In the following pages, we describe how to implement each of these strategies.

Pre-computed Seeds for Independent Streams

Generating Independent Streams Online



Maintained by CITI web services                    Copyright ©2008 Clemson University, Clemson, S.C. 29634, (864) 656-331