Reservoir Sampling
January 31, 2014
Our code is very concise:
(define (sample k xs)
(let ((resv (list->vector (take k xs))))
(do ((xs (drop k xs) (cdr xs)) (i k (+ i 1))) ((null? xs) resv)
(let ((j (randint i)))
(when (< j k) (vector-set! resv j (car xs)))))))
And here are some examples:
> (define xs '(a b c d e f g h i j k l m n o p q r s t u v w x y z))
> (sample 3 xs)
#(l z r)
> (sample 3 xs)
#(y g h)
> (sample 3 xs)
#(d k i)
We used take and drop from the Standard Prelude, and rand and randint from a previous exercise. You can run the program at http://programmingpraxis.codepad.org/VgxH2mwk.
In Python. Note that it is important to choose a random number inclusive i. Otherwise the first k items in the sequence will be undersampled. What is interesting about this form of reservir sampling is that you do not have to know the number of items in advance. On the downside, it is slow, as you need a random sample for every item.
from itertools import islice, count, izip from random import randrange def reservoir_sample(seq, k): 'take a sample with k items fom seq' it = iter(seq) reservoir = list(islice(it, k)) for i, item in izip(count(k), it): j = randrange(0, i + 1) # 0 <= j <= i !! if j < k: reservoir[j] = item return reservoirFun! Python or something. (Same as Paul’s but different.)
from random import randint from itertools import repeat, count from functools import partial from operator import getitem, gt def compose(g, f): def gf(a): return g(f(a)) return gf def swap(f): def t(y,x): return f(x,y) return t def sample(source, m): source = iter(source) result = list(map(next, repeat(source, m))) for item, j in filter(compose(partial(gt, m), partial(swap(getitem), 1)), zip(source, map(partial(randint, 0), count(m)))): result[j] = item return resultSome test results:
100-sampling [0, 1, ..., 999] 1000 times:
population mean: 499.5
mean deviation from population mean: 0.027230000000000528
2-sampling [3, 1, 4] 999 times:
335 [4, 1]
328 [3, 1]
337 [3, 4]
In this paper various methods for reservoir sampling are given. The simplest is algorithm R and is the same as suggested above. Also 2 more algorithms are given, named X and Z. I give here implementations. Timings for k=10 (or 100) and N=10000000 are R: 13 sec, X: 2.4 sec and Z: 0.17 sec. (I left out the code for R, as that is the same as my last post.
from __future__ import division from itertools import islice, count, izip from random import randrange, random from math import exp, log def algorithm_X(seq, k): it = iter(seq) reservoir = list(islice(it, k)) t = k num = 0 while 1: V = random() S = 0 t += 1 num += 1 quot = num / t while quot > V: S += 1 t += 1 num += 1 quot *= num / t next(islice(it, S, S), None) next_item = next(it, None) if next_item: M = randrange(0, k) reservoir[M] = next_item else: break return reservoir def algorithm_Z(seq, k, T=22): it = iter(seq) reservoir = list(islice(it, k)) t = k tresh = T * k num = 0 while t <= tresh: V = random() S = 0 t += 1 num += 1 quot = num / t while quot > V: S += 1 t += 1 num += 1 quot *= num / t next(islice(it, S, S), None) next_item = next(it, None) if next_item: M = randrange(0, k) reservoir[M] = next_item else: break W = exp(-log(random()) / k) term = t - k + 1 while 1: U = random() X = t * (W - 1) S = int(X) lhs = exp(log(((U *(((t + 1) / term) ** 2)) * (term + S))/ (t + X)) / k) rhs = (((t + X) / (term + S)) * term) / t if lhs <= rhs: W = rhs / lhs break y = (((U * (t + 1)) / term) * (t + S + 1)) / (t + X) if k < S: denom = t numer_lim = term + S else: denom = t - k + S numer_lim = t + 1 for numer in xrange(t + S, numer_lim - 1, -1): y *= numer / denom denom -= 1 W = exp(-log(random()) / k) if exp(log(y) / k) <= (t + X) / t: break next(islice(it, S, S), None) next_item = next(it, None) if next_item: M = randrange(0, k) reservoir[M] = next_item t += S + 1 term += S + 1 else: break return reservoirThere is an error in algorithm_Z of my last post. I corrected it and the full code including test code can be found at here. Timing for Z is now 0.35 sec.
In Guile Scheme.
Here’s a solution in Julia.
function reservoirSample(iter, k) sample = Array(eltype(iter), k) i = 1 for elt in iter if i <= k sample[i] = elt elseif (r = rand(1:i)) <= k sample[r] = elt end i = i + 1 end return sample end