## Selection, Revisited

### November 30, 2012

The primary `select` function is recursive. The first `if` stops the recursion when the length of the list is small, when it sorts to find the median. Then we calculate the median of the median of five in the variable m, and use that to partition the input, recurring on whichever of the less-than or greater-than pieces contains k; when the pivot is equal to k, recursion stops.

```(define (select lt? xs k)   (define (median5 xs)     (select lt? xs (quotient (+ (length xs) 1) 2)))   (let ((len (length xs)))     (if (< len 10) (list-ref (sort lt? xs) (- k 1))       (let* ((ts (map median5 (split5 xs)))              (m (select lt? ts (quotient len 10))))         (call-with-values           (lambda () (partition lt? xs m))           (lambda (lt eq gt)             (let ((lt-len (length lt)) (eq-len (length eq)))               (cond ((<= k lt-len)                       (select lt? lt k))                     ((< (+ lt-len eq-len) k)                       (select lt? gt (- k lt-len eq-len)))                     (else m)))))))))```

The other pieces are standard. `Partition` scans a list, putting elements in bins as they are less than, greater than, or equal to the pivot. `Split5` repeatedly calls `split` from the Standard Prelude to partition the input into blocks of five.

```(define (partition lt? xs x)   (let loop ((xs xs) (lt (list)) (eq (list)) (gt (list)))     (cond ((null? xs) (values lt eq gt))           ((lt? (car xs) x) (loop (cdr xs) (cons (car xs) lt) eq gt))           ((lt? x (car xs)) (loop (cdr xs) lt eq (cons (car xs) gt)))           (else (loop (cdr xs) lt (cons (car xs) eq) gt)))))```

```(define (split5 xs)   (let loop ((xs xs) (xss (list)))     (if (null? xs) (reverse xss)       (call-with-values         (lambda () (split 5 xs))         (lambda (head tail)           (loop tail (cons head xss)))))))```

Though this algorithm has guaranteed linear run time, it is typically slower than the randomized algorithm because of the time required to calculate the median of the medians of five. Thus, the randomized algorithm is probably the preferred algorithm for most purposes.

You can run the program at http://programmingpraxis.codepad.org/yLPKhxuw.

Pages: 1 2

### 5 Responses to “Selection, Revisited”

1. [...] Pages: 1 2 [...]

2. cosmin said

Even if the algorithm is very clear and straight forward, a careless implementation might contain an annoying bug with duplicates. For example, if at some point in the recursion, all the elements are equal suddenly it is not that easy to split the list in two sub-lists because one of them will be empty. I’ve used a very simple antidote by mapping the element a[i] into (a[i], i) transforming the initial list into a list of distinct pairs. Below is my Python implementation:

```def kth_smallest(k, a):
res = kth_smallest_no_dup(k, [(value, idx) for idx, value in enumerate(a)])
return res[0]

def kth_smallest_no_dup(k, a):
n = len(a)
# solve the particular cases when n is small enough
if n <= 5: return sorted(a)[k - 1]
# split the list in groups of 5 elements and compute their medians
medians = []
for i in xrange((n + 4) // 5):
start, end = 5*i, min(5*(i + 1), n)
b = sorted(a[start:end])
medians.append(b[(end - start) // 2])
# find the median of the medians
splitElm = kth_smallest_no_dup((1 + len(medians)) // 2, medians)
# split the list in two sub-lists using splitElm as a pivot
a1 = [e for e in a if e <= splitElm]
a2 = [e for e in a if e > splitElm]
# decide in which of the two sub-lists the k-th element is
if k <= len(a1):
return kth_smallest_no_dup(k, a1)
else:
return kth_smallest_no_dup(k - len(a1), a2)

print kth_smallest(4, [1, 1, 1, 1, 1, 1, 3, 2, 7, 4, 9, 6, 5, 8])
```
3. Paul said

Another Python version. A key can be given to determine sort order.

```from random import choice

RANDOM, MEDIANS = range(2)

def split_on_pivot(L, pivot):
lt, eq, gt = [], [], []
lapp, gapp, eapp = lt.append, gt.append, eq.append
for e in L:
if e < pivot:
lapp(e)
elif e > pivot:
gapp(e)
else:
eapp(e)
return lt, eq, gt

def select(data, n, key=None, mode=RANDOM):
"""Find the nth rank ordered element (the least value has rank 1).
data: list of of objects that can be compared
n: rank of data element
key: function of one variable, that determines sort order
mode: RANDOM chooses random pivot
MEDIANS chooses medians of medians as pivot
"""
if not 0 < n <= len(data):
raise ValueError('not enough elements for the given rank')
if key:
decorated = [(key(e), i) for i, e in enumerate(data)]
res = _select(decorated, n, mode)
return data[res[1]]
else:
return _select(data, n, mode)

def _select(data, n, mode=RANDOM):
""" recursive selection
"""
pivot = choice(data) if mode == RANDOM else median_of_medians(data)
lower, equal, higher = split_on_pivot(data, pivot)
len_lower, len_eq = len(lower), len(equal)
if  len_lower < n <= len_lower + len_eq:
return pivot
if n <= len_lower:
return _select(lower, n, mode)
else:
return _select(higher, n - len_lower - len_eq, mode)

def median5(L):
"""returns a "median" of L
Strictly the median should be the average of the middle 2 elements, if n is
even, so this is not a true median for an even number of elements.
This median always returns an element of L
"""
return sorted(L)[len(L)//2]

def median_of_medians(L):
"""" Subdivides L into partitions of 5 elements, and calculated the median
of medians of the partitions
"""
if len(L) <= 5:
return median5(L)
return median_of_medians([median5(L[i:i+5]) for i in xrange(0, len(L), 5)])
```
4. treeowl said

Sub-linear? What do you mean by that? You can’t possibly select the kth item without examining all n items. Even the trivial cases of k=1 and k=n are linear.

5. treeowl said

A question: when you’re trying to find something other than the median, is it best to pivot about an approximate median (as this algorithm does), or is it possible to save time by choosing a different pivot? I couldn’t figure that out.