Selection, Revisited

November 30, 2012

We studied a selection algorithm in a previous exercise. The algorithm is the same as quicksort, except that recursion follows only the partition in which the target of the selection is located. We used a randomized algorithm to select the pivot, which gives an expected O(n) time complexity but a worst-case O(n^2) time complexity. In today’s exercise we examine an algorithm due to Blum, Floyd, Pratt, Rivest and Tarjan from their 1973 paper “Time Bounds for Selection” in Journal of Computer Systems Science that provides guaranteed O(n) time complexity (actually, the time complexity is sub-linear, but the only claim is that it is linear).

The algorithm is the same as the previous exercise except in the selection of the pivot. Instead of a random pivot, the algorithm partitions the input into blocks of five elements, finds the median of each block by comparisons, then chooses the median of the medians (which is computed recursively) as the pivot. Thus, the algorithm is known as “selection by the median of the medians of five.”

Your task is to write a function that selects the kth item from a list. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Posted by programmingpraxis

Filed in Exercises

5 Comments »

5 Responses to “Selection, Revisited”

Selection, Revisited | My Blog said
November 30, 2012 at 3:34 PM
[…] Pages: 1 2 […]

cosmin said

December 2, 2012 at 12:27 PM

Even if the algorithm is very clear and straight forward, a careless implementation might contain an annoying bug with duplicates. For example, if at some point in the recursion, all the elements are equal suddenly it is not that easy to split the list in two sub-lists because one of them will be empty. I’ve used a very simple antidote by mapping the element a[i] into (a[i], i) transforming the initial list into a list of distinct pairs. Below is my Python implementation:

def kth_smallest(k, a):
	res = kth_smallest_no_dup(k, [(value, idx) for idx, value in enumerate(a)])
	return res[0]

def kth_smallest_no_dup(k, a):
	n = len(a)
	# solve the particular cases when n is small enough
	if n <= 5: return sorted(a)[k - 1]
	# split the list in groups of 5 elements and compute their medians
	medians = []
	for i in xrange((n + 4) // 5):
		start, end = 5*i, min(5*(i + 1), n)
		b = sorted(a[start:end])
		medians.append(b[(end - start) // 2])
	# find the median of the medians
	splitElm = kth_smallest_no_dup((1 + len(medians)) // 2, medians)
	# split the list in two sub-lists using splitElm as a pivot
	a1 = [e for e in a if e <= splitElm]
	a2 = [e for e in a if e > splitElm]
	# decide in which of the two sub-lists the k-th element is
	if k <= len(a1):
		return kth_smallest_no_dup(k, a1)
	else:
		return kth_smallest_no_dup(k - len(a1), a2)

print kth_smallest(4, [1, 1, 1, 1, 1, 1, 3, 2, 7, 4, 9, 6, 5, 8])

Paul said

December 3, 2012 at 3:27 PM

Another Python version. A key can be given to determine sort order.

from random import choice

RANDOM, MEDIANS = range(2)

def split_on_pivot(L, pivot):
    lt, eq, gt = [], [], []
    lapp, gapp, eapp = lt.append, gt.append, eq.append
    for e in L:
        if e < pivot:
            lapp(e)
        elif e > pivot:
            gapp(e)
        else:
            eapp(e)
    return lt, eq, gt

def select(data, n, key=None, mode=RANDOM):
    """Find the nth rank ordered element (the least value has rank 1).
     data: list of of objects that can be compared 
     n: rank of data element
     key: function of one variable, that determines sort order
     mode: RANDOM chooses random pivot
           MEDIANS chooses medians of medians as pivot
    """
    if not 0 < n <= len(data):
        raise ValueError('not enough elements for the given rank')
    if key:
        decorated = [(key(e), i) for i, e in enumerate(data)]
        res = _select(decorated, n, mode)
        return data[res[1]]
    else:
        return _select(data, n, mode)
    
def _select(data, n, mode=RANDOM):
    """ recursive selection 
    """
    pivot = choice(data) if mode == RANDOM else median_of_medians(data)
    lower, equal, higher = split_on_pivot(data, pivot)
    len_lower, len_eq = len(lower), len(equal)
    if  len_lower < n <= len_lower + len_eq: 
        return pivot        
    if n <= len_lower:
        return _select(lower, n, mode)
    else:
        return _select(higher, n - len_lower - len_eq, mode)
        
def median5(L):
    """returns a "median" of L
    Strictly the median should be the average of the middle 2 elements, if n is 
    even, so this is not a true median for an even number of elements.
    This median always returns an element of L
    """
    return sorted(L)[len(L)//2]
    
def median_of_medians(L):
    """" Subdivides L into partitions of 5 elements, and calculated the median
         of medians of the partitions
    """
    if len(L) <= 5:
        return median5(L)
    return median_of_medians([median5(L[i:i+5]) for i in xrange(0, len(L), 5)])

treeowl said
December 12, 2012 at 7:11 AM
Sub-linear? What do you mean by that? You can’t possibly select the kth item without examining all n items. Even the trivial cases of k=1 and k=n are linear.
treeowl said
December 12, 2012 at 7:13 AM
A question: when you’re trying to find something other than the median, is it best to pivot about an approximate median (as this algorithm does), or is it possible to save time by choosing a different pivot? I couldn’t figure that out.

S	M	T	W	T	F	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30

Programming Praxis