Two Stream Selection Questions
February 6, 2015
The first problem has a well-known solution due to Robert Floyd: the first item is selected with probability 1/1, the second item replaces the current selection with probability 1/2, the third item replaces that selection with probability 1/3, and so on, so that the kth item is selected with probability 1/k:
(define (select-equal xs)
(let loop ((n 1) (x #f) (xs xs))
(cond ((null? xs) x)
((< (rand) (/ n))
(loop (+ n 1) (car xs) (cdr xs)))
(else (loop (+ n 1) x (cdr xs))))))
> (select-equal '(A D F A G))
A
This algorithm appears in the Standard Prelude under the name fortune
, which derives rom the unix game of the same name that randomly selects an epigram from a file containing one per line.
The second problem is a simple variant of the first; instead of adding 1 at each step, we add the current weight. Thus, in our sample, A is selected with probability 1/1, D replaces A with probability 2/3, F replaces the current selection with probability 5/8, A replaces the current selection with probability 8/11, and G replaces the current selection with probability 9/20:
(define (select-weighted xs)
(let loop ((n 0) (x #f) (xs xs))
(if (null? xs) x
(let ((y (caar xs)))
(if (< (rand) (/ y (+ n y)))
(loop (+ n y) (cadar xs) (cdr xs))
(loop (+ n y) x (cdr xs)))))))
> (select-weighted '((1 A) (2 D) (5 F) (3 A) (9 G)))
G
We need to demonstrate that the two functions make selections in the expected proportions. Function count
makes a table showing the number of times each item is selected during n trials:
(define (symbol<? a b)
(string<? (symbol->string a) (symbol->string b)))
(define (symbol=? a b)
(string=? (symbol->string a) (symbol->string b)))
(define (count n select input)
(uniq-c symbol=?
(sort symbol<?
(map (select input)
(range n)))))
And here’s the output for the two sets of trials, clearly showing the correct counts:
> (for-each
(lambda (x)
(display (car x)) (display #\tab)
(display (cdr x)) (newline))
(count 50000 select-equal '(A D F A G)))
A 20095
D 9840
F 10119
G 9946
> (for-each
(lambda (x)
(display (car x)) (display #\tab)
(display (cdr x)) (newline))
(count 200000 select-weighted '((1 A) (2 D) (5 F) (3 A) (9 G))))
A 40273
D 20134
F 49914
G 89679
We used range, uniq-c and random numbers from the Standard Prelude. You can run the program at http://programmingpraxis.codepad.org/8yiHSI59.
Examples:
>>> items = (‘red’, ‘blue’, ‘green’)
>>> Counter(random_pick(items) for _ in range(1000))
Counter({‘green’: 353, ‘red’: 324, ‘blue’: 323})
>>> weights = (0.5, 1.0, 1.5)
>>> Counter(random_pick(items, weights) for _ in range(1000))
Counter({‘green’: 519, ‘blue’: 317, ‘red’: 164})
>>> from fractions import *
>>> weights = (Fraction(1,2), Fraction(1,3), Fraction(1,6))
>>> Counter(random_pick(items, weights) for _ in range(1000))
Counter({‘red’: 504, ‘blue’: 344, ‘green’: 152})
In golang: http://xojoc.pw/programming-praxis/reservoir-sampling.html