Reservoir Sampling, Improved

February 14, 2014

Shown below is an algorithm that Vitter calls a naive implementation of Algorithm Z:

def algorithm_Z_naive(seq, k, T=5):
    it = iter(seq)
    reservoir = list(islice(it, k))
    t = k
    # do first k*T steps with algorithm R
    if T > 0:
        for t, item in islice(izip(count(k), it), k*T):
            j = randrange(0, t + 1)
            if j < k:
                reservoir[j] = item
        t += 1
    while 1: # loop over records in sequence
        while 1: # rejection sampling loop
            X = t * (random() ** (-1/k) - 1) # use inverse of G
            S = int(X)
            # first try if h(S) / cg(X) >= random
            cgs = (t + 1) / (t - k + 1) * k / (t + X) * (t / (t + X)) ** k
            rg = random() * cgs
            hs = k / (t + 1) * ((t - k + 1) / (t + S - k + 1)) ** (k + 1)
            if rg <= hs:
            # h(S) / cg(X) < random - now test if f(S) / cg(X) >= random
            if rg <= f(S, k, t):
            next_item = islice(it, S, S + 1).next() # skip S records
            reservoir[randrange(0, k)] = next_item
            t += S + 1
        except StopIteration:
    return reservoir

You can see samples and run the code at Benchmarking the various algorithms gives the following results, with 10000000 records and k = 10:

algorithm_R          time:    12.986 secs
algorithm_X          time:    2.797 secs
algorithm_Z          time:    0.171 secs
algorithm_Z_naive    time:    0.174 secs
do_nothing           time:    0.272 secs

It is striking that the do_nothing method that iterates over all records is notably slower than Algorithm Z.


Pages: 1 2

3 Responses to “Reservoir Sampling, Improved”

  1. Jussi Piitulainen said

    This is meant to be a faithful rearrangement in Scheme of Vitter’s final Algorithm Z, adapted for input streams that keep producing some sort of end-of-stream object when there are no more items. The first comment consists of the main logic. Further comments will contain the procedures to compute the number of items to skip, and a couple of examples that double as minimal testing. (The X part of my code was initially all wrong. It’s less wrong after testing.)

    ;;; Using Gambit-C where:
    ;;; 0 <= (random-integer n) < n
    ;;; (random-real) in (0 .. 1)

    (define (sample stream n read eos-object?)
      (let ((reservoir (head stream n read eos-object?))
            (m (exact->inexact n)))
        (if (= (vector-length reservoir) n)
            (let scan ((t m) (W (wandom m)))
                  (lambda ()
                    (if (> t (* 22.0 m))
                        (zandom t m W)
                        (values (xandom t m) W)))
                (lambda (S W)
                  (let ((item (skip stream (inexact->exact S) read eos-object?)))
                    (or (eos-object? item)
                          (vector-set! reservoir (random-integer n) item)
                          (scan (+ t S 1.0) W))))))))

    This reads up to a given number of items from the stream and returns them in a vector:

    (define (head stream n read eos-object?)
      (let loop ((items '()) (k 0))
        (if (= k n)
            (list->vector (reverse items))
            (let ((item (read stream)))
              (if (eos-object? item)
                  (list->vector (reverse items))
                  (loop (cons item items) (+ k 1)))))))

    This skips a given number of items and returns the next item or an end-of-stream object:

    (define (skip stream n read eos-object?)
      (let loop ((n n))
        (let ((item (read stream)))
          (if (or (= n 0) (eos-object? item))
              (loop (- n 1))))))

    (To continue.)

  2. Jussi Piitulainen said

    Continuing, this is used a number of times to generate a W (seeding Algorithm Z):

    (define (wandom n)
      (exp (- (/ (log (random-real)) n))))

    This is Vitter’s Algorithm X, used until past a certain number of items; this is where I mis-translated badly at first; this returns S (the number of items to skip):

    (define (xandom t n)
      (let ((V (random-real)) (t (+ t 1.0)))
        (do ((S 0.0 (+ S 1.0))
             (t t (+ t 1.0))
             (quot (/ (- t n) t) (/ (* quot (- (+ t 1.0)
                                    (+ t 1.0))))
            ((<= quot V) S))))

    And this is Algorithm Z, used when past X; I translated the loop as a system of three local procedures that pass control to each other or return both S (the number of items to skip) and W (to seed the next call):

    (define (zandom t n W)
      (define (enter W)
        (let ((X (* t (- W 1.0))))
          (test1 (floor X) W (random-real) X)))
      (define (test1 S W U X)
        (let ((lhs (exp (/ (log (/ (* U (expt (/ (+ t 1.0)
    bsp;         (+ t (- n) 1.0))
                                      (+ t (- n) 1.0 S))
                                   (+ t X)))
              (rhs (/ (* (/ (+ t X)
                            (+ t (- n) 1.0 S))
                         (+ t (- n) 1.0))
          (if (> lhs rhs)
              (test2 S W U X)
              (values S (/ rhs lhs)))))
      (define (test2 S W U X)
            (lambda ()
              (if (< n S)
                  (values t (+ t (- n) 1.0 S))
                  (values (+ t (- n) S) (+ t 1.0))))
          (lambda (denom numer-lim)
            (do ((numer (+ t S) (- numer 1.0))
                 (denom denom (- denom 1.0))
                 (y (/ (* (/ (* U (+ t 1.0))
                             (+ t (- n) 1.0))
                          (+ t S 1.0))
                       (+ t X))
                    (/ (* y numer) denom)))
                ((< numer numer-lim)
                 (if (> (exp (/ (log y) n))
                        (/ (+ t X) t))
                     (enter (wandom n))
                     (values S (wandom n))))))))
      (enter (wandom n)))

    (To continue.)

  3. Jussi Piitulainen said

    Whew. Two minor instances of cut-and-paste damage in the previous entry. Sorry about those.

    Continuing, this creates an increasing stream of integers to use for testing my sampler above; the end-of-stream object is the false boolean value, and to read a value, call the stream object without arguments:

    (define (counter b e)
      (lambda ()
        (and (< b e)
               (set! b (+ b 1))
               (- b 1)))))

    These can be called to produce a 1-sample and a 10-sample from a stream of 3-digit numbers:

    (define (test1)
      ;; this should use xandom until t > 22.0 and then zandom
      (sample (counter 100 1000) 1 (lambda (o) (o)) boolean?))

    (define (test10)
      ;; this should use xandom until t > 220.0 and then zandom
      (sample (counter 100 1000) 10 (lambda (o) (o)) boolean?))

    (I wired the parameter 22 in the sampling algorithm. It should really be a parameter. Maybe one of those parameter objects that Schemes have recently acquired.)

    This runs 1-samples from a thousand streams of 3-digit numbers:

    (define (thousand)
      (include "sort-okeefe.scm")
      ;; should produce a sorted list of 1000 three-digit numbers
      ;; which can be eyed for even distribution (with repetition)
      ;; (seems good to me)
      (do ((k 0 (+ k 1))
           (s '() (cons (vector-ref (test1) 0) s)))
          ((= k 1000) (sort s <))))


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: