## Reservoir Sampling, Improved

### February 14, 2014

[ Today’s exercise was written by guest author Paul Hofstra, who holds a PhD in Theoretical Nuclear Physics from the Free University in Amsterdam and is now retired from a major oil company. Guest authors are always welcome; contact me at the link in the menu bar if you are interested in writing for Programming Praxis. ]

Reservoir sampling is a method to get a random sample of size *n* from a stream of *a priori* unknown (and possibly very large) size. Jeffrey Scott Vitter gives several interesting algorithms for solving this problem; the algorithm that we used in a recent exercise is his Algorithm R, and he also gives Algorithm X and Algorithm Z. Algorithm R begins by placing the first *n* input items in an array called the reservoir, then for each succeeding input item chooses a random number that determines if the new item should replace one of the items in the reservoir. Algorithm X and Algorithm Z are faster because, instead of generating a random number for each input item, they use the random numbers to determine how many input items to skip; in the algorithms stated below, *s* is the number of items to skip and *t* is the number of items already processed. The expected number of skipped items is *t* / (*n* − 1) − 1. The probability function for the number of skipped records is *f*(*s*) = (*n* / (*t* − *n*)) · ((*t*−*n*)^{s+1}) / (*t* + 1)^{s+1}) and the distribution function for the probability *S* ≤ *s* is *F*(*s*) = 1 − ((*t* + 1 − *n*) ^{s+1}) / (*t* + 1)^{s+1}), where *a*^{b} = *a* (*a* + 1) … (*a* + *b* − 1).

For Algorithm X, generate a uniform random number *r* between 0 and 1 and determine *S* as the lowest value for which *F*(*S*) ≥ *r*. Then skip *S* records, swap the next record with a record in the reservoir, and continue until the input is exhausted.

Algorithm Z improves Algorithm X by rejection sampling. Two new functions *g* and *h* and a constant *c* are defined so that *h*(*s*) ≤ *f*(*s*) ≤ *c* *g*(*s*), where *c* does not depend on *s*. The functions are as close to *f* as possible, and the integral of *g*(*G*) has a simple inverse, making *h*(*s*) much faster to calculate than *f*(*s*). Now calculate X = *G*^{−1}(*random*) and *S* = ⌊*X*⌋. The rejection method rejects *s* if a random number *r* > *f*(*S*) / (*c* *g*(*X*)). To avoid the expensive calculation of *f*(*S*), *r* is first compared to *h*(*S*) / (*c* *g*(*X*)), and *S* is accepted if *r* is smaller; otherwise, *f*(*S*) must be calculated and compared. The calculations of *h*, *g* and *c* are: *h*(s) = *n* / (*t* + 1) · ((*t* − *n* + 1) / (*t* + *s* − *n* + 1))^{n+1}, *g*(*s*) = *n* / (*t* + *s*) · (*t* / (*t* + *s*))^{n}, *G*^{−1}(*y*) = *t*((1 − *y*)^{−1/n} − 1), and *c* = (*t* + 1) / (*t* − *n* + 1). Since this is fastest when t is large, Vitter recommends using Algorithm X when *t* ≤ *T* · *n* with *T* = 22.

Your task is to write functions to implement Algorithm X and Algorithm Z. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

This is meant to be a faithful rearrangement in Scheme of Vitter’s final Algorithm Z, adapted for input streams that keep producing some sort of end-of-stream object when there are no more items. The first comment consists of the main logic. Further comments will contain the procedures to compute the number of items to skip, and a couple of examples that double as minimal testing. (The X part of my code was initially all wrong. It’s less wrong after testing.)

;;; Using Gambit-C where:

;;; 0 <= (random-integer n) < n

;;; (random-real) in (0 .. 1)

`(define (sample stream n read eos-object?)`

(let ((reservoir (head stream n read eos-object?))

(m (exact->inexact n)))

(if (= (vector-length reservoir) n)

(let scan ((t m) (W (wandom m)))

(call-with-values

(lambda ()

(if (> t (* 22.0 m))

(zandom t m W)

(values (xandom t m) W)))

(lambda (S W)

(let ((item (skip stream (inexact->exact S) read eos-object?)))

(or (eos-object? item)

(begin

(vector-set! reservoir (random-integer n) item)

(scan (+ t S 1.0) W))))))))

reservoir))

This reads up to a given number of items from the stream and returns them in a vector:

(define (head stream n read eos-object?)

(let loop ((items '()) (k 0))

(if (= k n)

(list->vector (reverse items))

(let ((item (read stream)))

(if (eos-object? item)

(list->vector (reverse items))

(loop (cons item items) (+ k 1)))))))

This skips a given number of items and returns the next item or an end-of-stream object:

(define (skip stream n read eos-object?)

(let loop ((n n))

(let ((item (read stream)))

(if (or (= n 0) (eos-object? item))

item

(loop (- n 1))))))

(To continue.)

Continuing, this is used a number of times to generate a W (seeding Algorithm Z):

(define (wandom n)

(exp (- (/ (log (random-real)) n))))

This is Vitter’s Algorithm X, used until past a certain number of items; this is where I mis-translated badly at first; this returns S (the number of items to skip):

(define (xandom t n)

(let ((V (random-real)) (t (+ t 1.0)))

(do ((S 0.0 (+ S 1.0))

(t t (+ t 1.0))

(quot (/ (- t n) t) (/ (* quot (- (+ t 1.0)

n))

(+ t 1.0))))

((<= quot V) S))))

And this is Algorithm Z, used when past X; I translated the loop as a system of three local procedures that pass control to each other or return both S (the number of items to skip) and W (to seed the next call):

(define (zandom t n W)

(define (enter W)

(let ((X (* t (- W 1.0))))

(test1 (floor X) W (random-real) X)))

(define (test1 S W U X)

(let ((lhs (exp (/ (log (/ (* U (expt (/ (+ t 1.0)

&n

bsp; (+ t (- n) 1.0))

2)

(+ t (- n) 1.0 S))

(+ t X)))

n)))

(rhs (/ (* (/ (+ t X)

(+ t (- n) 1.0 S))

(+ t (- n) 1.0))

t)))

(if (> lhs rhs)

(test2 S W U X)

(values S (/ rhs lhs)))))

(define (test2 S W U X)

(call-with-values

(lambda ()

(if (< n S)

(values t (+ t (- n) 1.0 S))

(values (+ t (- n) S) (+ t 1.0))))

(lambda (denom numer-lim)

(do ((numer (+ t S) (- numer 1.0))

(denom denom (- denom 1.0))

(y (/ (* (/ (* U (+ t 1.0))

(+ t (- n) 1.0))

(+ t S 1.0))

(+ t X))

(/ (* y numer) denom)))

((< numer numer-lim)

(if (> (exp (/ (log y) n))

(/ (+ t X) t))

(enter (wandom n))

(values S (wandom n))))))))

(enter (wandom n)))

(To continue.)

Whew. Two minor instances of cut-and-paste damage in the previous entry. Sorry about those.

Continuing, this creates an increasing stream of integers to use for testing my sampler above; the end-of-stream object is the false boolean value, and to read a value, call the stream object without arguments:

(define (counter b e)

(lambda ()

(and (< b e)

(begin

(set! b (+ b 1))

(- b 1)))))

These can be called to produce a 1-sample and a 10-sample from a stream of 3-digit numbers:

(define (test1)

;; this should use xandom until t > 22.0 and then zandom

(sample (counter 100 1000) 1 (lambda (o) (o)) boolean?))

`(define (test10)`

;; this should use xandom until t > 220.0 and then zandom

(sample (counter 100 1000) 10 (lambda (o) (o)) boolean?))

(I wired the parameter 22 in the sampling algorithm. It should really be a parameter. Maybe one of those parameter objects that Schemes have recently acquired.)

This runs 1-samples from a thousand streams of 3-digit numbers:

(define (thousand)

(include "sort-okeefe.scm")

;; should produce a sorted list of 1000 three-digit numbers

;; which can be eyed for even distribution (with repetition)

;; (seems good to me)

(do ((k 0 (+ k 1))

(s '() (cons (vector-ref (test1) 0) s)))

((= k 1000) (sort s <))))

(End.)