Reservoir Sampling

January 31, 2014

Our code is very concise:

(define (sample k xs)
  (let ((resv (list->vector (take k xs))))
    (do ((xs (drop k xs) (cdr xs)) (i k (+ i 1))) ((null? xs) resv)
      (let ((j (randint i)))
        (when (< j k) (vector-set! resv j (car xs)))))))

And here are some examples:

> (define xs '(a b c d e f g h i j k l m n o p q r s t u v w x y z))
> (sample 3 xs)
#(l z r)
> (sample 3 xs)
#(y g h)
> (sample 3 xs)
#(d k i)

We used take and drop from the Standard Prelude, and rand and randint from a previous exercise. You can run the program at http://programmingpraxis.codepad.org/VgxH2mwk.

Advertisement

Pages: 1 2

6 Responses to “Reservoir Sampling”

  1. Paul said

    In Python. Note that it is important to choose a random number inclusive i. Otherwise the first k items in the sequence will be undersampled. What is interesting about this form of reservir sampling is that you do not have to know the number of items in advance. On the downside, it is slow, as you need a random sample for every item.

    from itertools import islice, count, izip
    from random import randrange
    
    def reservoir_sample(seq, k):
        'take a sample with k items fom seq'
        it = iter(seq)
        reservoir = list(islice(it, k))
        for i, item in izip(count(k), it):
            j = randrange(0, i + 1) # 0 <= j <= i !!
            if j < k:
                reservoir[j] = item
        return reservoir
    
  2. Jussi Piitulainen said

    Fun! Python or something. (Same as Paul’s but different.)

    from random import randint
    from itertools import repeat, count
    from functools import partial
    from operator import getitem, gt
    def compose(g, f):
        def gf(a):
            return g(f(a))
        return gf
    def swap(f):
        def t(y,x):
            return f(x,y)
        return t
    def sample(source, m):
        source = iter(source)
        result = list(map(next, repeat(source, m)))
        for item, j in filter(compose(partial(gt, m),
                                      partial(swap(getitem), 1)),
                              zip(source,
                                  map(partial(randint, 0), count(m)))):
            result[j] = item
        return result
    

    Some test results:

    100-sampling [0, 1, ..., 999] 1000 times:
    population mean: 499.5
    mean deviation from population mean: 0.027230000000000528

    2-sampling [3, 1, 4] 999 times:
    335 [4, 1]
    328 [3, 1]
    337 [3, 4]

  3. Paul said

    In this paper various methods for reservoir sampling are given. The simplest is algorithm R and is the same as suggested above. Also 2 more algorithms are given, named X and Z. I give here implementations. Timings for k=10 (or 100) and N=10000000 are R: 13 sec, X: 2.4 sec and Z: 0.17 sec. (I left out the code for R, as that is the same as my last post.

    from __future__ import division
    
    from itertools import islice, count, izip
    from random import randrange, random
    from math import exp, log
    
    def algorithm_X(seq, k):
        it = iter(seq)
        reservoir = list(islice(it, k))
        t = k
        num = 0
        while 1:
            V = random()
            S = 0
            t += 1
            num += 1
            quot = num / t
            while quot > V:
                S += 1
                t += 1
                num += 1
                quot *= num / t
            next(islice(it, S, S), None)
            next_item = next(it, None)
            if next_item:
                M = randrange(0, k)
                reservoir[M] = next_item
            else:
                break
        return reservoir
    
    def algorithm_Z(seq, k, T=22):
        it = iter(seq)
        reservoir = list(islice(it, k))
        t = k
        tresh = T * k
        num = 0
        while t <= tresh:
            V = random()
            S = 0
            t += 1
            num += 1
            quot = num / t
            while quot > V:
                S += 1
                t += 1
                num += 1
                quot *= num / t
            next(islice(it, S, S), None)
            next_item = next(it, None)
            if next_item:
                M = randrange(0, k)
                reservoir[M] = next_item
            else:
                break
        W = exp(-log(random()) / k)
        term = t - k + 1
        while 1:
            U = random()
            X = t * (W - 1)
            S = int(X)
            lhs = exp(log(((U *(((t + 1) / term) ** 2)) * (term + S))/ (t + X)) / k)
            rhs = (((t + X) / (term + S)) * term) / t
            if lhs <= rhs:
                W = rhs / lhs
                break
            y = (((U * (t + 1)) / term) * (t + S + 1)) / (t + X)
            if k < S:
                denom = t
                numer_lim = term + S
            else:
                denom = t - k + S
                numer_lim = t + 1
            for numer in xrange(t + S, numer_lim - 1, -1):
                y *= numer / denom
                denom -= 1
            W = exp(-log(random()) / k)
            if exp(log(y) / k) <= (t + X) / t:
                break
            next(islice(it, S, S), None)
            next_item = next(it, None)
            if next_item:
                M = randrange(0, k)
                reservoir[M] = next_item
                t += S + 1
                term += S + 1
            else:
                break        
        return reservoir
    
  4. Paul said

    There is an error in algorithm_Z of my last post. I corrected it and the full code including test code can be found at here. Timing for Z is now 0.35 sec.

  5. r. clayton said

    In Guile Scheme.

  6. Daniel said

    Here’s a solution in Julia.

    function reservoirSample(iter, k)
        sample = Array(eltype(iter), k)
        i = 1
        for elt in iter
            if i <= k
                sample[i] = elt
            elseif (r = rand(1:i)) <= k
                sample[r] = elt
            end
            i = i + 1
        end
        return sample
    end
    

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: