## Reservoir Sampling

### January 31, 2014

If you want to choose a sample of k out of n items, and the n items are stored in an array, choosing the sample is easy: just generate k random numbers from 0 to n−1, without duplicates, and look up the associated array items. But that’s harder to do if the items are stored in a list, because indexing into a list is O(n) instead of O(1), and it’s impossible if the list is too large to fit in memory. The Standard Prelude includes a function that works when k = 1, but in the general case we must use a streaming algorithm called reservoir sampling.

The algorithm creates an array of length k, the reservoir, and initializes it with the first k elements of the input list. Then the remaining nk elements of list are each considered, one by one, in order. When the ith element of the input list is reached, a random integer j less than i is generated; if j < k, the jth element of the reservoir array is replaced by the ith element of the list. When all elements of the list have been considered, the items that remain in the reservoir array are returned as the k randomly-selected elements of the input list.

Your task is to write a program that performs reservoir sampling. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

### 6 Responses to “Reservoir Sampling”

1. Paul said

In Python. Note that it is important to choose a random number inclusive i. Otherwise the first k items in the sequence will be undersampled. What is interesting about this form of reservir sampling is that you do not have to know the number of items in advance. On the downside, it is slow, as you need a random sample for every item.

```from itertools import islice, count, izip
from random import randrange

def reservoir_sample(seq, k):
'take a sample with k items fom seq'
it = iter(seq)
reservoir = list(islice(it, k))
for i, item in izip(count(k), it):
j = randrange(0, i + 1) # 0 <= j <= i !!
if j < k:
reservoir[j] = item
return reservoir
```
2. Jussi Piitulainen said

Fun! Python or something. (Same as Paul’s but different.)

```from random import randint
from itertools import repeat, count
from functools import partial
from operator import getitem, gt
def compose(g, f):
def gf(a):
return g(f(a))
return gf
def swap(f):
def t(y,x):
return f(x,y)
return t
def sample(source, m):
source = iter(source)
result = list(map(next, repeat(source, m)))
for item, j in filter(compose(partial(gt, m),
partial(swap(getitem), 1)),
zip(source,
map(partial(randint, 0), count(m)))):
result[j] = item
return result
```

Some test results:
``` 100-sampling [0, 1, ..., 999] 1000 times: population mean: 499.5 mean deviation from population mean: 0.027230000000000528```

``` ```

```2-sampling [3, 1, 4] 999 times: 335 [4, 1] 328 [3, 1] 337 [3, 4] ```

3. Paul said

In this paper various methods for reservoir sampling are given. The simplest is algorithm R and is the same as suggested above. Also 2 more algorithms are given, named X and Z. I give here implementations. Timings for k=10 (or 100) and N=10000000 are R: 13 sec, X: 2.4 sec and Z: 0.17 sec. (I left out the code for R, as that is the same as my last post.

```from __future__ import division

from itertools import islice, count, izip
from random import randrange, random
from math import exp, log

def algorithm_X(seq, k):
it = iter(seq)
reservoir = list(islice(it, k))
t = k
num = 0
while 1:
V = random()
S = 0
t += 1
num += 1
quot = num / t
while quot > V:
S += 1
t += 1
num += 1
quot *= num / t
next(islice(it, S, S), None)
next_item = next(it, None)
if next_item:
M = randrange(0, k)
reservoir[M] = next_item
else:
break
return reservoir

def algorithm_Z(seq, k, T=22):
it = iter(seq)
reservoir = list(islice(it, k))
t = k
tresh = T * k
num = 0
while t <= tresh:
V = random()
S = 0
t += 1
num += 1
quot = num / t
while quot > V:
S += 1
t += 1
num += 1
quot *= num / t
next(islice(it, S, S), None)
next_item = next(it, None)
if next_item:
M = randrange(0, k)
reservoir[M] = next_item
else:
break
W = exp(-log(random()) / k)
term = t - k + 1
while 1:
U = random()
X = t * (W - 1)
S = int(X)
lhs = exp(log(((U *(((t + 1) / term) ** 2)) * (term + S))/ (t + X)) / k)
rhs = (((t + X) / (term + S)) * term) / t
if lhs <= rhs:
W = rhs / lhs
break
y = (((U * (t + 1)) / term) * (t + S + 1)) / (t + X)
if k < S:
denom = t
numer_lim = term + S
else:
denom = t - k + S
numer_lim = t + 1
for numer in xrange(t + S, numer_lim - 1, -1):
y *= numer / denom
denom -= 1
W = exp(-log(random()) / k)
if exp(log(y) / k) <= (t + X) / t:
break
next(islice(it, S, S), None)
next_item = next(it, None)
if next_item:
M = randrange(0, k)
reservoir[M] = next_item
t += S + 1
term += S + 1
else:
break
return reservoir
```
4. Paul said

There is an error in algorithm_Z of my last post. I corrected it and the full code including test code can be found at here. Timing for Z is now 0.35 sec.

5. r. clayton said

In Guile Scheme.

6. Daniel said

Here’s a solution in Julia.

```function reservoirSample(iter, k)
sample = Array(eltype(iter), k)
i = 1
for elt in iter
if i <= k
sample[i] = elt
elseif (r = rand(1:i)) <= k
sample[r] = elt
end
i = i + 1
end
return sample
end
```