## Reservoir Sampling, Improved

### February 14, 2014

[ Today’s exercise was written by guest author Paul Hofstra, who holds a PhD in Theoretical Nuclear Physics from the Free University in Amsterdam and is now retired from a major oil company. Guest authors are always welcome; contact me at the link in the menu bar if you are interested in writing for Programming Praxis. ]

Reservoir sampling is a method for selecting a random sample of size *n* from a stream of *a priori* unknown (and possibly very large) size. Jeffrey Scott Vitter gives several interesting algorithms for solving this problem; the algorithm that we used in a recent exercise is his Algorithm R, and he also gives Algorithm X and Algorithm Z. Algorithm R begins by placing the first *n* input items in an array called the reservoir, then for each succeeding input item chooses a random number that determines whether the new item should replace one of the items in the reservoir. Algorithm X and Algorithm Z are faster because, instead of generating a random number for each input item, they use the random numbers to determine how many input items to skip; in the algorithms stated below, *s* is the number of items to skip and *t* is the number of items already processed. The expected number of skipped items is approximately *t* / (*n* − 1) − 1. The probability function for the number of skipped records is *f*(*s*) = (*n* / (*t* − *n*)) · (*t* − *n*)^{s+1} / (*t* + 1)^{s+1} and the distribution function for the probability that *S* ≤ *s* is *F*(*s*) = 1 − (*t* + 1 − *n*)^{s+1} / (*t* + 1)^{s+1}, where *a*^{b} denotes the rising factorial *a* (*a* + 1) … (*a* + *b* − 1).
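To make the notation concrete, here is a minimal sketch of *f* and *F* in Python (the exercise does not prescribe a language); the helper `rising` and the argument order are mine, not Vitter's.

```python
from math import prod

def rising(a, b):
    """Rising factorial a^{b} = a (a+1) ... (a+b-1)."""
    return prod(a + i for i in range(b))

def f(s, n, t):
    """Probability that exactly s records are skipped, given a
    reservoir of size n and t records already processed."""
    return n / (t - n) * rising(t - n, s + 1) / rising(t + 1, s + 1)

def F(s, n, t):
    """Probability that at most s records are skipped."""
    return 1 - rising(t + 1 - n, s + 1) / rising(t + 1, s + 1)
```

As a sanity check, *f*(0) = *F*(0) = *n* / (*t* + 1), the probability that the very next record enters the reservoir, and the values of *f* accumulate to *F*.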

For Algorithm X, generate a uniform random number *r* between 0 and 1 and determine *S* as the smallest value for which *F*(*S*) ≥ *r*. Then skip *S* records, replace a randomly chosen record in the reservoir with the next record, and continue until the input is exhausted.
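A sketch of Algorithm X along these lines, again in Python; the function name is mine, and the sequential search for the smallest *S* maintains the ratio inside *F* incrementally rather than recomputing it from scratch.

```python
import random

def algorithm_x(stream, n):
    """Reservoir sample of size n via Vitter's Algorithm X: a skip
    count s is drawn by inverting F, rather than drawing one random
    number per input record."""
    stream = iter(stream)
    reservoir = []
    for item in stream:
        reservoir.append(item)
        if len(reservoir) == n:
            break
    if len(reservoir) < n:
        return reservoir          # input had fewer than n items
    t = n                         # records processed so far
    while True:
        r = random.random()
        # Find the smallest s with F(s) >= r, i.e. quot <= 1 - r, where
        # quot = (t+1-n)^{s+1} / (t+1)^{s+1} is updated incrementally.
        s, quot = 0, (t + 1 - n) / (t + 1)
        while quot > 1 - r:
            s += 1
            quot *= (t + 1 - n + s) / (t + 1 + s)
        try:
            for _ in range(s):    # skip s records
                next(stream)
            item = next(stream)   # this record enters the reservoir
        except StopIteration:
            return reservoir
        reservoir[random.randrange(n)] = item
        t += s + 1
```

For example, `algorithm_x(range(1000), 10)` returns ten distinct items drawn from the stream, and a stream shorter than *n* is returned in full.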

Algorithm Z improves on Algorithm X by rejection sampling. Two new functions *g* and *h* and a constant *c* are defined so that *h*(*s*) ≤ *f*(*s*) ≤ *c* *g*(*s*), where *c* does not depend on *s*. The functions are as close to *f* as possible, the integral *G* of *g* has a simple inverse, and both *h* and *g* are much faster to calculate than *f*. Now calculate *X* = *G*^{−1}(*random*) and *S* = ⌊*X*⌋. The rejection method rejects *S* if a random number *r* > *f*(*S*) / (*c* *g*(*X*)). To avoid the expensive calculation of *f*(*S*), *r* is first compared to *h*(*S*) / (*c* *g*(*X*)), and *S* is accepted if *r* is smaller; otherwise, *f*(*S*) must be calculated and compared. The functions *h*, *g*, *G*^{−1} and the constant *c* are: *h*(*s*) = (*n* / (*t* + 1)) · ((*t* − *n* + 1) / (*t* + *s* − *n* + 1))^{n+1}, *g*(*s*) = (*n* / (*t* + *s*)) · (*t* / (*t* + *s*))^{n}, *G*^{−1}(*y*) = *t* ((1 − *y*)^{−1/n} − 1), and *c* = (*t* + 1) / (*t* − *n* + 1). Since the rejection method pays off only when *t* is large, Vitter recommends using Algorithm X when *t* ≤ *T* · *n*, with *T* = 22, and Algorithm Z thereafter.

Your task is to write functions to implement Algorithm X and Algorithm Z. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.