## Reservoir Sampling, Improved

### February 14, 2014

Shown below is an algorithm that Vitter calls a naive implementation of Algorithm Z:

`def algorithm_Z_naive(seq, k, T=5):`

it = iter(seq)

reservoir = list(islice(it, k))

t = k

# do first k*T steps with algorithm R

if T > 0:

for t, item in islice(izip(count(k), it), k*T):

j = randrange(0, t + 1)

if j < k:

reservoir[j] = item

t += 1

while 1: # loop over records in sequence

while 1: # rejection sampling loop

X = t * (random() ** (-1/k) - 1) # use inverse of G

S = int(X)

# first try if h(S) / cg(X) >= random

cgs = (t + 1) / (t - k + 1) * k / (t + X) * (t / (t + X)) ** k

rg = random() * cgs

hs = k / (t + 1) * ((t - k + 1) / (t + S - k + 1)) ** (k + 1)

if rg <= hs:

break

# h(S) / cg(X) < random - now test if f(S) / cg(X) >= random

if rg <= f(S, k, t):

break

try:

next_item = islice(it, S, S + 1).next() # skip S records

reservoir[randrange(0, k)] = next_item

t += S + 1

except StopIteration:

break

return reservoir

You can see samples and run the code at http://programmingpraxis.codepad.org/J8FSvlP2. Benchmarking the various algorithms gives the following results, with 10000000 records and *k* = 10:

`algorithm_R time: 12.986 secs`

algorithm_X time: 2.797 secs

algorithm_Z time: 0.171 secs

algorithm_Z_naive time: 0.174 secs

do_nothing time: 0.272 secs

It is striking that the `do_nothing`

method that iterates over all records is notably slower than Algorithm Z.

Pages: 1 2

This is meant to be a faithful rearrangement in Scheme of Vitter’s final Algorithm Z, adapted for input streams that keep producing some sort of end-of-stream object when there are no more items. The first comment consists of the main logic. Further comments will contain the procedures to compute the number of items to skip, and a couple of examples that double as minimal testing. (The X part of my code was initially all wrong. It’s less wrong after testing.)

;;; Using Gambit-C where:

;;; 0 <= (random-integer n) < n

;;; (random-real) in (0 .. 1)

`(define (sample stream n read eos-object?)`

(let ((reservoir (head stream n read eos-object?))

(m (exact->inexact n)))

(if (= (vector-length reservoir) n)

(let scan ((t m) (W (wandom m)))

(call-with-values

(lambda ()

(if (> t (* 22.0 m))

(zandom t m W)

(values (xandom t m) W)))

(lambda (S W)

(let ((item (skip stream (inexact->exact S) read eos-object?)))

(or (eos-object? item)

(begin

(vector-set! reservoir (random-integer n) item)

(scan (+ t S 1.0) W))))))))

reservoir))

This reads up to a given number of items from the stream and returns them in a vector:

(define (head stream n read eos-object?)

(let loop ((items '()) (k 0))

(if (= k n)

(list->vector (reverse items))

(let ((item (read stream)))

(if (eos-object? item)

(list->vector (reverse items))

(loop (cons item items) (+ k 1)))))))

This skips a given number of items and returns the next item or an end-of-stream object:

(define (skip stream n read eos-object?)

(let loop ((n n))

(let ((item (read stream)))

(if (or (= n 0) (eos-object? item))

item

(loop (- n 1))))))

(To continue.)

Continuing, this is used a number of times to generate a W (seeding Algorithm Z):

(define (wandom n)

(exp (- (/ (log (random-real)) n))))

This is Vitter’s Algorithm X, used until past a certain number of items; this is where I mis-translated badly at first; this returns S (the number of items to skip):

(define (xandom t n)

(let ((V (random-real)) (t (+ t 1.0)))

(do ((S 0.0 (+ S 1.0))

(t t (+ t 1.0))

(quot (/ (- t n) t) (/ (* quot (- (+ t 1.0)

n))

(+ t 1.0))))

((<= quot V) S))))

And this is Algorithm Z, used when past X; I translated the loop as a system of three local procedures that pass control to each other or return both S (the number of items to skip) and W (to seed the next call):

(define (zandom t n W)

(define (enter W)

(let ((X (* t (- W 1.0))))

(test1 (floor X) W (random-real) X)))

(define (test1 S W U X)

(let ((lhs (exp (/ (log (/ (* U (expt (/ (+ t 1.0)

&n

bsp; (+ t (- n) 1.0))

2)

(+ t (- n) 1.0 S))

(+ t X)))

n)))

(rhs (/ (* (/ (+ t X)

(+ t (- n) 1.0 S))

(+ t (- n) 1.0))

t)))

(if (> lhs rhs)

(test2 S W U X)

(values S (/ rhs lhs)))))

(define (test2 S W U X)

(call-with-values

(lambda ()

(if (< n S)

(values t (+ t (- n) 1.0 S))

(values (+ t (- n) S) (+ t 1.0))))

(lambda (denom numer-lim)

(do ((numer (+ t S) (- numer 1.0))

(denom denom (- denom 1.0))

(y (/ (* (/ (* U (+ t 1.0))

(+ t (- n) 1.0))

(+ t S 1.0))

(+ t X))

(/ (* y numer) denom)))

((< numer numer-lim)

(if (> (exp (/ (log y) n))

(/ (+ t X) t))

(enter (wandom n))

(values S (wandom n))))))))

(enter (wandom n)))

(To continue.)

Whew. Two minor instances of cut-and-paste damage in the previous entry. Sorry about those.

Continuing, this creates an increasing stream of integers to use for testing my sampler above; the end-of-stream object is the false boolean value, and to read a value, call the stream object without arguments:

(define (counter b e)

(lambda ()

(and (< b e)

(begin

(set! b (+ b 1))

(- b 1)))))

These can be called to produce a 1-sample and a 10-sample from a stream of 3-digit numbers:

(define (test1)

;; this should use xandom until t > 22.0 and then zandom

(sample (counter 100 1000) 1 (lambda (o) (o)) boolean?))

`(define (test10)`

;; this should use xandom until t > 220.0 and then zandom

(sample (counter 100 1000) 10 (lambda (o) (o)) boolean?))

(I wired the parameter 22 in the sampling algorithm. It should really be a parameter. Maybe one of those parameter objects that Schemes have recently acquired.)

This runs 1-samples from a thousand streams of 3-digit numbers:

(define (thousand)

(include "sort-okeefe.scm")

;; should produce a sorted list of 1000 three-digit numbers

;; which can be eyed for even distribution (with repetition)

;; (seems good to me)

(do ((k 0 (+ k 1))

(s '() (cons (vector-ref (test1) 0) s)))

((= k 1000) (sort s <))))

(End.)