Reservoir Sampling, Improved

February 14, 2014

[ Today’s exercise was written by guest author Paul Hofstra, who holds a PhD in Theoretical Nuclear Physics from the Free University in Amsterdam and is now retired from a major oil company. Guest authors are always welcome; contact me at the link in the menu bar if you are interested in writing for Programming Praxis. ]

Reservoir sampling is a method to get a random sample of size n from a stream of a priori unknown (and possibly very large) size. Jeffrey Scott Vitter gives several interesting algorithms for solving this problem; the algorithm that we used in a recent exercise is his Algorithm R, and he also gives Algorithm X and Algorithm Z. Algorithm R begins by placing the first n input items in an array called the reservoir, then for each succeeding input item chooses a random number that determines whether the new item should replace one of the items in the reservoir. Algorithm X and Algorithm Z are faster because, instead of generating a random number for each input item, they use the random numbers to determine how many input items to skip; in the algorithms stated below, s is the number of items to skip and t is the number of items already processed. The expected number of skipped items is t / (n − 1) − 1. The probability function for the number of skipped records is f(s) = (n / (t − n)) · (t − n)^(s+1) / (t + 1)^(s+1), and the distribution function for the probability S ≤ s is F(s) = 1 − (t + 1 − n)^(s+1) / (t + 1)^(s+1); the superscripts in these two formulas denote rising factorial powers, a^(b) = a (a + 1) … (a + b − 1).
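
For reference, here is a minimal Python sketch of Algorithm R as just described; the function name algorithm_r and the use of Python's random module are choices of this illustration, not part of the exercise:

    import random

    def algorithm_r(stream, n):
        """Algorithm R: keep a uniform random sample of n items from a stream."""
        reservoir = []
        for t, item in enumerate(stream):
            if t < n:
                reservoir.append(item)          # the first n items fill the reservoir
            else:
                j = random.randrange(t + 1)     # one random number per input item
                if j < n:                       # replace a slot with probability n/(t+1)
                    reservoir[j] = item
        return reservoir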

For Algorithm X, generate a uniform random number r between 0 and 1 and determine S as the lowest value for which F(S) ≥ r. Then skip S records, replace a randomly chosen record in the reservoir with the next record, and continue until the input is exhausted.
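
Below is one possible Python sketch of Algorithm X, assuming the input is any iterable; the helper name skip_x and the overall structure are this sketch's own, and the sequential search follows the distribution function F given above:

    import random

    def skip_x(n, t):
        # Find the smallest s with F(s) >= r by sequential search, using
        # 1 - F(s) = product over i = 1 .. s+1 of (t + i - n) / (t + i).
        v = random.random()                     # v plays the role of 1 - r
        s = 0
        quot = (t + 1 - n) / (t + 1)
        while quot > v:
            s += 1
            quot *= (t + s + 1 - n) / (t + s + 1)
        return s

    def algorithm_x(stream, n):
        it = iter(stream)
        reservoir = []
        try:
            for _ in range(n):                  # the first n items fill the reservoir
                reservoir.append(next(it))
            t = n                               # number of items processed so far
            while True:
                s = skip_x(n, t)
                for _ in range(s):              # skip s items
                    next(it)
                reservoir[random.randrange(n)] = next(it)
                t += s + 1
        except StopIteration:
            return reservoir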

Algorithm Z improves Algorithm X by rejection sampling. Two new functions g and h and a constant c are defined so that h(s) ≤ f(s) ≤ c · g(s), where c does not depend on s. Both functions are as close to f as possible, the integral G of g has a simple inverse, and h(s) is much faster to calculate than f(s). Now calculate X = G^(−1)(random) and S = ⌊X⌋. The rejection method rejects S if a random number r > f(S) / (c · g(X)). To avoid the expensive calculation of f(S), r is first compared to h(S) / (c · g(X)), and S is accepted if r is smaller; otherwise, f(S) must be calculated and compared. The definitions of h, g, G^(−1) and c are: h(s) = n / (t + 1) · ((t − n + 1) / (t + s − n + 1))^(n+1), g(s) = n / (t + s) · (t / (t + s))^n, G^(−1)(y) = t · ((1 − y)^(−1/n) − 1), and c = (t + 1) / (t − n + 1). Since the rejection method is fastest when t is large, Vitter recommends using Algorithm X while t ≤ T · n, with T = 22, and switching to the rejection method thereafter.
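
And here is a hedged Python sketch of Algorithm Z assembled from the definitions above; skip_z, the repeated skip_x helper, and the threshold parameter T are names chosen for this sketch, and f(S) is evaluated by its product form only when the cheap squeeze test with h fails:

    import math
    import random

    def skip_x(n, t):
        # Algorithm X skip: sequential search of F (used while t is small).
        v = random.random()
        s = 0
        quot = (t + 1 - n) / (t + 1)
        while quot > v:
            s += 1
            quot *= (t + s + 1 - n) / (t + s + 1)
        return s

    def skip_z(n, t):
        # Rejection sampling: draw X from the density g, set S = floor(X), and
        # accept S with probability f(S) / (c * g(X)), using h as a cheap squeeze.
        c = (t + 1) / (t - n + 1)
        while True:
            u = random.random()
            x = t * ((1.0 - u) ** (-1.0 / n) - 1.0)          # X = G^(-1)(u)
            s = math.floor(x)
            r = random.random()
            g = n / (t + x) * (t / (t + x)) ** n
            h = n / (t + 1) * ((t - n + 1) / (t + s - n + 1)) ** (n + 1)
            if r <= h / (c * g):                             # h(S) <= f(S), so accept
                return s
            f = n / (t + s + 1)                              # otherwise compute f(S) exactly
            for i in range(1, s + 1):
                f *= (t + i - n) / (t + i)
            if r <= f / (c * g):
                return s

    def algorithm_z(stream, n, T=22):
        it = iter(stream)
        reservoir = []
        try:
            for _ in range(n):                               # fill the reservoir
                reservoir.append(next(it))
            t = n
            while True:
                s = skip_x(n, t) if t <= T * n else skip_z(n, t)
                for _ in range(s):                           # skip s items
                    next(it)
                reservoir[random.randrange(n)] = next(it)
                t += s + 1
        except StopIteration:
            return reservoir

For example, algorithm_z(range(1000000), 10) should return a list of ten items, with every input value equally likely to appear in the sample.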

Your task is to write functions to implement Algorithm X and Algorithm Z. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

