Reservoir Sampling, Improved

February 14, 2014

[ Today’s exercise was written by guest author Paul Hofstra, who holds a PhD in Theoretical Nuclear Physics from the Free University in Amsterdam and is now retired from a major oil company. Guest authors are always welcome; contact me at the link in the menu bar if you are interested in writing for Programming Praxis. ]

Reservoir sampling is a method to get a random sample of size n from a stream of a priori unknown (and possibly very large) size. Jeffrey Scott Vitter gives several interesting algorithms for solving this problem; the algorithm that we used in a recent exercise is his Algorithm R, and he also gives Algorithm X and Algorithm Z. Algorithm R begins by placing the first n input items in an array called the reservoir, then for each succeeding input item chooses a random number that determines if the new item should replace one of the items in the reservoir. Algorithm X and Algorithm Z are faster because, instead of generating a random number for each input item, they use the random numbers to determine how many input items to skip; in the algorithms stated below, s is the number of items to skip and t is the number of items already processed. The expected number of skipped items is t / (n − 1) − 1. The probability function for the number of skipped records is f(s) = (n / (tn)) · ((tn)s+1) / (t + 1)s+1) and the distribution function for the probability Ss is F(s) = 1 − ((t + 1 − n) s+1) / (t + 1)s+1), where ab = a (a + 1) … (a + b − 1).

For Algorithm X, generate a uniform random number r between 0 and 1 and determine S as the lowest value for which F(S) ≥ r. Then skip S records, swap the next record with a record in the reservoir, and continue until the input is exhausted.

Algorithm Z improves Algorithm X by rejection sampling. Two new functions g and h and a constant c are defined so that h(s) ≤ f(s) ≤ c g(s), where c does not depend on s. The functions are as close to f as possible, and the integral of g(G) has a simple inverse, making h(s) much faster to calculate than f(s). Now calculate X = G−1(random) and S = ⌊X⌋. The rejection method rejects s if a random number r > f(S) / (c g(X)). To avoid the expensive calculation of f(S), r is first compared to h(S) / (c g(X)), and S is accepted if r is smaller; otherwise, f(S) must be calculated and compared. The calculations of h, g and c are: h(s) = n / (t + 1) · ((tn + 1) / (t + sn + 1))n+1, g(s) = n / (t + s) · (t / (t + s))n, G−1(y) = t((1 − y)−1/n − 1), and c = (t + 1) / (tn + 1). Since this is fastest when t is large, Vitter recommends using Algorithm X when tT · n with T = 22.

Your task is to write functions to implement Algorithm X and Algorithm Z. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2