Selection
December 11, 2009
In statistics, the median is the smallest item that is greater than half the values in a set. Statisticians also define statistics such as quartiles, deciles and percentiles that show the value of an item relative to the other items in a set. A function known as select
allows us to find the kth smallest item in a set; that is, the smallest item for which k of the items in the set are smaller.
A simple algorithm for select
has us first partition the set into two subsets that are respectively smaller and larger than some randomly-chosen item from the set. If the smaller-than set has more than k elements, it must contain the desired element, so search recursively for the desired element in the smaller-than set; otherwise, subtract the size of the smaller-than set k, then search recursively for the desired element in the larger-than set.
Your task is to write a function that implements the selection algorithm. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.
My solution in Ruby:
def select(k,array)
elem = array[array.size / 2]
smaller = array.find_all { |e| e < elem }
larger = array.find_all { |e| e >= elem }
if(smaller.size == k)
return elem
end
if (smaller.size > k)
select(k,smaller)
else
select(k – smaller.size,larger)
end
end
Python solution:
Unefficient Common Lisp, but very direct…
1 (defun select (k li)
2 (when (>= k 0)
3 (nth k (sort li #'<))))
Jos Koot pointed out a bug:
(select 3 '(1 1 2 2 3 3 4 4 5 5 6 6 7 7)) results in an infinite loop. The bug manifests itself whenever the kth element has a duplicate.
Both
select
andpartition
have to change to fix the bug. Here ispartition
, where the(cdr xs)
in the initialization of the loop ensures that at least one element is removed from the list at each iteration, thus forcing progress toward termination:(define (partition xs)
(let ((x (car xs)))
(let loop ((xs (cdr xs)) (lt '()) (ge '()))
(cond ((null? xs) (values lt x ge))
((< (car xs) x)
(loop (cdr xs) (cons (car xs) lt) ge))
(else (loop (cdr xs) lt (cons (car xs) ge)))))))
Select
also has to change, because the partition element is no longer included in the "greater-than" partition:(define (select k xs)
(if (<= (length xs) k)
(error 'select "out of range")
(let loop ((k k) (xs (shuffle xs)))
(let-values (((lt x ge) (partition xs)))
(cond ((< k (length lt)) (loop k lt))
((< (length lt) k) (loop (- k (length lt) 1) ge))
(else x))))))
The variable name
gt
is changed toge
to reflect that equal elements are possible, and are (arbitrarily) placed in the "greater-than" partition. The order of the return elements frompartition
is also changed, so that when reading debug output frompartition
the elements are physically displayed in their proper order, a small point that nonetheless improves the program.My implementation in C
Here is a java method to accomplish the same. I think there is a better solution feasible. Please suggest/comment.
public static int select(int[] a, int k)
{
int length = a.length;
int randomElementIndex = (new Long(Math.round((length-1)*Math.random()))).intValue();
int randomElement = a[randomElementIndex];
//Lower and Upper Limits of k are handled in the following loop.
if (k == 0 || k==length)
{
Arrays.sort(a);
return k==0?a[0]:a[length-1];
}
int[] leftTransientArray = new int[length];
int[] rightTransientArray = new int[length];
int laIndex = 0, raIndex = 0;
for(int i=0; i<length; i++)
{
if (a[i] randomElement)
{
rightTransientArray[raIndex]=a[i];
raIndex++;
}
}
int[] leftArray = Arrays.copyOfRange(leftTransientArray, 0, laIndex);
int[] rightArray = Arrays.copyOfRange(rightTransientArray, 0, raIndex);
if (leftArray.length == k+1)
return randomElement;
else if (leftArray.length <= k )
return select(rightArray, k-leftArray.length);
else
return select(leftArray, k);
}