Reservoir Sampling
January 31, 2014
If you want to choose a sample of k out of n items, and the n items are stored in an array, choosing the sample is easy: just generate k random numbers from 0 to n−1, without duplicates, and look up the associated array items. But that’s harder to do if the items are stored in a list, because indexing into a list is O(n) instead of O(1), and it’s impossible if the list is too large to fit in memory. The Standard Prelude includes a function that works when k = 1, but in the general case we must use a streaming algorithm called reservoir sampling.
The algorithm creates an array of length k, the reservoir, and initializes it with the first k elements of the input list. Then the remaining n − k elements of list are each considered, one by one, in order. When the ith element of the input list is reached, a random integer j less than i is generated; if j < k, the jth element of the reservoir array is replaced by the ith element of the list. When all elements of the list have been considered, the items that remain in the reservoir array are returned as the k randomly-selected elements of the input list.
Your task is to write a program that performs reservoir sampling. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.