Reservoir Sampling, Improved

February 14, 2014

Shown below is an algorithm that Vitter calls a naive implementation of Algorithm Z:

def algorithm_Z_naive(seq, k, T=5): it = iter(seq) reservoir = list(islice(it, k)) t = k # do first k*T steps with algorithm R if T > 0: for t, item in islice(izip(count(k), it), k*T): j = randrange(0, t + 1) if j < k: reservoir[j] = item t += 1 while 1: # loop over records in sequence while 1: # rejection sampling loop X = t * (random() ** (-1/k) - 1) # use inverse of G S = int(X) # first try if h(S) / cg(X) >= random cgs = (t + 1) / (t - k + 1) * k / (t + X) * (t / (t + X)) ** k rg = random() * cgs hs = k / (t + 1) * ((t - k + 1) / (t + S - k + 1)) ** (k + 1) if rg <= hs: break # h(S) / cg(X) < random - now test if f(S) / cg(X) >= random if rg <= f(S, k, t): break try: next_item = islice(it, S, S + 1).next() # skip S records reservoir[randrange(0, k)] = next_item t += S + 1 except StopIteration: break return reservoir

You can see samples and run the code at http://programmingpraxis.codepad.org/J8FSvlP2. Benchmarking the various algorithms gives the following results, with 10000000 records and k = 10:

algorithm_R time: 12.986 secs algorithm_X time: 2.797 secs algorithm_Z time: 0.171 secs algorithm_Z_naive time: 0.174 secs do_nothing time: 0.272 secs

It is striking that the do_nothing method that iterates over all records is notably slower than Algorithm Z.

Posted by programmingpraxis

Filed in Exercises

3 Comments »

3 Responses to “Reservoir Sampling, Improved”

Jussi Piitulainen said
February 15, 2014 at 4:26 PM
This is meant to be a faithful rearrangement in Scheme of Vitter’s final Algorithm Z, adapted for input streams that keep producing some sort of end-of-stream object when there are no more items. The first comment consists of the main logic. Further comments will contain the procedures to compute the number of items to skip, and a couple of examples that double as minimal testing. (The X part of my code was initially all wrong. It’s less wrong after testing.)

;;; Using Gambit-C where: ;;; 0 <= (random-integer n) < n ;;; (random-real) in (0 .. 1)
(define (sample stream n read eos-object?)   (let ((reservoir (head stream n read eos-object?))         (m (exact->inexact n)))     (if (= (vector-length reservoir) n)         (let scan ((t m) (W (wandom m)))           (call-with-values               (lambda ()                 (if (> t (* 22.0 m))                     (zandom t m W)                     (values (xandom t m) W)))             (lambda (S W)               (let ((item (skip stream (inexact->exact S) read eos-object?)))                 (or (eos-object? item)                     (begin                       (vector-set! reservoir (random-integer n) item)                       (scan (+ t S 1.0) W))))))))     reservoir))

This reads up to a given number of items from the stream and returns them in a vector:
(define (head stream n read eos-object?)   (let loop ((items '()) (k 0))     (if (= k n)         (list->vector (reverse items))         (let ((item (read stream)))           (if (eos-object? item)               (list->vector (reverse items))               (loop (cons item items) (+ k 1)))))))

This skips a given number of items and returns the next item or an end-of-stream object:
(define (skip stream n read eos-object?)   (let loop ((n n))     (let ((item (read stream)))       (if (or (= n 0) (eos-object? item))           item           (loop (- n 1))))))

(To continue.)
Jussi Piitulainen said
February 15, 2014 at 4:58 PM
Continuing, this is used a number of times to generate a W (seeding Algorithm Z):
(define (wandom n)   (exp (- (/ (log (random-real)) n))))

This is Vitter’s Algorithm X, used until past a certain number of items; this is where I mis-translated badly at first; this returns S (the number of items to skip):
(define (xandom t n)   (let ((V (random-real)) (t (+ t 1.0)))     (do ((S 0.0 (+ S 1.0))          (t t (+ t 1.0))          (quot (/ (- t n) t) (/ (* quot (- (+ t 1.0) n))                                 (+ t 1.0))))         ((<= quot V) S))))

And this is Algorithm Z, used when past X; I translated the loop as a system of three local procedures that pass control to each other or return both S (the number of items to skip) and W (to seed the next call):
(define (zandom t n W)   (define (enter W)     (let ((X (* t (- W 1.0))))       (test1 (floor X) W (random-real) X)))   (define (test1 S W U X)     (let ((lhs (exp (/ (log (/ (* U (expt (/ (+ t 1.0)                                    &n bsp;         (+ t (- n) 1.0))                                           2)                                   (+ t (- n) 1.0 S))                                (+ t X)))                        n)))           (rhs (/ (* (/ (+ t X)                         (+ t (- n) 1.0 S))                      (+ t (- n) 1.0))                   t)))       (if (> lhs rhs)           (test2 S W U X)           (values S (/ rhs lhs)))))   (define (test2 S W U X)     (call-with-values         (lambda ()           (if (< n S)               (values t (+ t (- n) 1.0 S))               (values (+ t (- n) S) (+ t 1.0))))       (lambda (denom numer-lim)         (do ((numer (+ t S) (- numer 1.0))              (denom denom (- denom 1.0))              (y (/ (* (/ (* U (+ t 1.0))                          (+ t (- n) 1.0))                       (+ t S 1.0))                    (+ t X))                 (/ (* y numer) denom)))             ((< numer numer-lim)              (if (> (exp (/ (log y) n))                     (/ (+ t X) t))                  (enter (wandom n))                  (values S (wandom n))))))))   (enter (wandom n)))

(To continue.)
Jussi Piitulainen said
February 15, 2014 at 5:18 PM
Whew. Two minor instances of cut-and-paste damage in the previous entry. Sorry about those.

Continuing, this creates an increasing stream of integers to use for testing my sampler above; the end-of-stream object is the false boolean value, and to read a value, call the stream object without arguments:
(define (counter b e)   (lambda ()     (and (< b e)          (begin            (set! b (+ b 1))            (- b 1)))))

These can be called to produce a 1-sample and a 10-sample from a stream of 3-digit numbers:
(define (test1)   ;; this should use xandom until t > 22.0 and then zandom   (sample (counter 100 1000) 1 (lambda (o) (o)) boolean?))
(define (test10)   ;; this should use xandom until t > 220.0 and then zandom   (sample (counter 100 1000) 10 (lambda (o) (o)) boolean?))
(I wired the parameter 22 in the sampling algorithm. It should really be a parameter. Maybe one of those parameter objects that Schemes have recently acquired.)

This runs 1-samples from a thousand streams of 3-digit numbers:
(define (thousand)   (include "sort-okeefe.scm")   ;; should produce a sorted list of 1000 three-digit numbers   ;; which can be eyed for even distribution (with repetition)   ;; (seems good to me)   (do ((k 0 (+ k 1))        (s '() (cons (vector-ref (test1) 0) s)))       ((= k 1000) (sort s <))))

(End.)

Programming Praxis