## Maximum K Items Out Of N

### August 30, 2016

The first solution sorts *n* items, then takes the highest *k* items:

```scheme
(define (max-k k xs)
  (take k (sort > xs)))

> (max-k 5 '(3 9 6 1 5 6 4 2 8 7 6))
(9 8 7 6 6)
```

That’s the solution I would probably use in most production systems; it’s hard to get wrong and likely to be very fast. Due to the sort, the time complexity is O(*n* log *n*).

The second solution uses a priority queue of fixed size *k*: Insert items into the priority queue, then extract them when the input is exhausted. We use the implementation of the *k*th largest heap from a previous exercise:

```scheme
(define (kth-largest-heap k xs)
  (let loop ((xs xs) (k k) (pq pq-empty))
    (if (positive? k)
        (loop (cdr xs) (- k 1) (pq-insert < (car xs) pq))
        (let loop ((xs xs) (pq pq))
          (if (pair? xs)
              (if (< (car xs) (pq-first pq))
                  (loop (cdr xs) pq)
                  (loop (cdr xs) (pq-insert < (car xs) (pq-rest < pq))))
              (let loop ((pq pq) (zs (list)))
                (if (pq-empty? pq)
                    zs
                    (loop (pq-rest < pq) (cons (pq-first pq) zs)))))))))

> (kth-largest-heap 5 '(3 9 6 1 5 6 4 2 8 7 6))
(9 8 7 6 6)
```

That solution is O(*n* log *k*): each priority queue operation takes time log *k*, and each of the *n* items must be processed.
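The same bounded-heap technique can be sketched in Python with the standard heapq module (a min-heap); the function name here is my own, not from the exercise:

```python
import heapq

def max_k_heap(k, xs):
    """Keep the k largest items seen so far in a min-heap of size k."""
    heap = list(xs[:k])
    heapq.heapify(heap)                 # O(k) to build the initial heap
    for x in xs[k:]:                    # n - k remaining items
        if x > heap[0]:                 # compare against smallest of the k largest
            heapq.heapreplace(heap, x)  # pop-then-push, O(log k)
    return sorted(heap, reverse=True)

print(max_k_heap(5, [3, 9, 6, 1, 5, 6, 4, 2, 8, 7, 6]))  # [9, 8, 7, 6, 6]
```

The guard `x > heap[0]` means items smaller than the current k-th largest cost only O(1) each, so in practice many of the n items never touch the heap at all.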

The third solution uses selection to find the *k*th largest item, then scans the input to find all items greater than the *k*th largest item, then adds enough copies of the *k*th largest item to give the output a length of *k*. That’s tricky, but required to handle duplicates properly; for instance, given the input (9 8 7 6 6) with *k* = 4, the correct output is (9 8 7 6), with only one copy of the duplicate 6 included in the output.

```scheme
(define (k-largest k xs)
  (let ((t (select > xs k)))
    (let ((zs (filter (lambda (x) (> x t)) xs)))
      (let ((len (length zs)))
        (append zs (make-list (- k len) t))))))

> (k-largest 5 '(3 9 6 1 5 6 4 2 8 7 6))
(9 8 7 6 6)
```

That solution is O(*n*): selection takes O(*n*), so does the pass that selects items greater than the *k*th largest, and so does the pass that adds copies of the *k*th largest.
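The `select` used above comes from a previous exercise. A Python sketch of the same approach, using a hypothetical quickselect for the selection step (O(*n*) on average, not worst case):

```python
import random

def kth_largest(xs, k):
    """Return the k-th largest element of xs (1-indexed); quickselect, O(n) average."""
    rank = k
    xs = list(xs)
    while True:
        pivot = random.choice(xs)
        greater = [x for x in xs if x > pivot]
        n_equal = sum(1 for x in xs if x == pivot)
        if rank <= len(greater):
            xs = greater                       # answer is among the strictly larger items
        elif rank <= len(greater) + n_equal:
            return pivot                       # the pivot itself has the target rank
        else:
            rank -= len(greater) + n_equal
            xs = [x for x in xs if x < pivot]  # answer is among the strictly smaller items

def k_largest(k, xs):
    t = kth_largest(xs, k)           # the k-th largest value, used as a threshold
    zs = [x for x in xs if x > t]    # all items strictly greater than the threshold
    return zs + [t] * (k - len(zs))  # pad with copies of t to handle duplicates

print(k_largest(5, [3, 9, 6, 1, 5, 6, 4, 2, 8, 7, 6]))  # [9, 8, 7, 6, 6]
```

The padding step is what handles duplicates: items strictly greater than the threshold are kept as-is, and only as many copies of the threshold value are appended as are needed to reach length *k*.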

You can run the program at http://ideone.com/o1mKIJ, where you will also see the code contributed from previous exercises.

In Python. Bucketsort is not guaranteed linear time; perhaps I should have used that magic selection algorithm that partitions into groups of 5 (median of medians).

@Rutger. If you use heapq.nlargest, it is not necessary to heapify first. This only costs time and achieves nothing. Also, heapify does not return anything.
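For example, heapq.nlargest takes any iterable directly, while heapify mutates its argument in place and returns None:

```python
import heapq

xs = [3, 9, 6, 1, 5, 6, 4, 2, 8, 7, 6]

# nlargest needs no prior heapify; it maintains its own bounded heap internally.
print(heapq.nlargest(5, xs))  # [9, 8, 7, 6, 6]

# heapify returns None, so it cannot be chained; it reorders the list in place.
h = list(xs)
result = heapq.heapify(h)
print(result)  # None
print(h[0])    # 1, the minimum, now at the root
```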

My 2 cents, in Common Lisp:

```lisp
(defun max-k-items (k items)
  (subseq (sort (copy-list items) #'>) 0 k))
```

Here’s a solution in C++11.

MaxItems1 sorts the list and gets the top k items. O(n log n).

MaxItems2 uses a heap of k items, with the smallest at the head. O(n log k).

MaxItems3 uses std::nth_element. I used C++ for this problem particularly for that function. However, I noticed in the documentation that it’s O(n) average case, not O(n) worst case.

MaxItems4 uses an implementation of median of medians. I converted the pseudocode on Wikipedia to C++. This is O(n) worst case.
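The commenter’s C++ isn’t reproduced here; below is a compact Python sketch of median-of-medians selection (k-th smallest, 0-indexed), following the same Wikipedia pseudocode. It uses list comprehensions rather than in-place partitioning, trading extra memory for brevity while keeping the O(n) worst-case comparison count:

```python
def median_of_medians_select(xs, k):
    """Return the k-th smallest element of xs (0-indexed); O(n) worst case."""
    if len(xs) <= 5:
        return sorted(xs)[k]
    # Split into groups of 5 and take each group's median.
    medians = [sorted(group)[len(group) // 2]
               for group in (xs[i:i + 5] for i in range(0, len(xs), 5))]
    # Recursively pick the median of the medians as a guaranteed-good pivot.
    pivot = median_of_medians_select(medians, len(medians) // 2)
    less = [x for x in xs if x < pivot]
    equal = [x for x in xs if x == pivot]
    if k < len(less):
        return median_of_medians_select(less, k)
    if k < len(less) + len(equal):
        return pivot
    return median_of_medians_select([x for x in xs if x > pivot],
                                    k - len(less) - len(equal))

xs = [3, 9, 6, 1, 5, 6, 4, 2, 8, 7, 6]
print(median_of_medians_select(xs, 0))   # 1 (minimum)
print(median_of_medians_select(xs, 10))  # 9 (maximum)
```

For the k-largest problem, selecting the (n - k)-th smallest gives the threshold value used by the filter-and-pad step.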

Output:

MaxItems4 in my last post worked and was O(n), but was doing more work than necessary.

The updated version below of the partition5 function (used for the median of medians algorithm) fixes the issue.

I do not think that a solution based on a heap is O(n log k). The time complexity for heapify is O(n) (see Introduction to Algorithms by Cormen et al.). I created a solution using a heap of k items and a solution using a full heap of all the data. I could not see a significant difference between the two; they also gave the same timings as heapq.nlargest. All timings were consistent with O(n). All this in Python. Do other languages behave differently?

Hi Paul. I used a k-element heap for my second solution (MaxItems2). I think the approach is O(n log k) because it 1) loops over the n integers, then for each iteration of the loop it 2) checks the minimum value in the heap and possibly 3a) removes the head of the heap and possibly 3b) inserts a new value into the heap. 1 is O(n), 2 is O(1), 3a is O(log k), and 3b is O(log k), making the entire algorithm O(n log k).

> “The time complexity for heapify is O(n)”

Can you elaborate on this comment? I believe you’re referring to an optimal way of building a heap that’s an alternative to successive insertions. For a k-element heap, the optimal approach would be O(k) and the approach using successive insertions would be O(k log k).

I used successive insertions, so I believe the relevant step to consider is reheapify. Both reheapification upward for insert and reheapification downward for remove are O(log k). For example, in a k-element heap, the height of the tree is log(k). Then for reheapify upward, a new leaf node is swapped with a parent iteratively until it’s in the correct place, which is O(log k). This corresponds to steps 3a and 3b that I described above.

Paul: Heapify is indeed O(n), but you’re not performing heapify on a heap of size n. The correct analysis is that there are n heap operations, each taking log k time, so a total time of O(n log k).

Daniel: A simple analysis of heapify shows n heap operations that each take log n, for a total time complexity of O(n log n). But that bound is not tight. I’m looking at CLRS3e 6.3, but any algorithms text gives a similar analysis. The half of the nodes that are leaf nodes are themselves trivial heaps, and take no time at all. Building the heap above them proceeds by node height h: a sift-down from height h takes O(h), and there are at most n / 2 ** (h+1) nodes at height h, which works out to O(n) if you do the math.
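The math in that last step, sketched as the standard CLRS summation:

```latex
T(n) \;=\; \sum_{h=0}^{\lfloor \log_2 n \rfloor}
           \left\lceil \frac{n}{2^{h+1}} \right\rceil \, O(h)
     \;=\; O\!\left( n \sum_{h=0}^{\infty} \frac{h}{2^{h}} \right)
     \;=\; O(2n) \;=\; O(n)
```

using the identity \(\sum_{h \ge 0} h/2^h = 2\): the cheap sift-downs near the leaves vastly outnumber the expensive ones near the root, so the total is linear.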

@PP and @Daniel. I posted some code and timings on ideone. I understand all arguments that you both gave, but the timings do not show the logk dependence really.

Hi programmingpraxis. The “successive insertions” method I was referring to is the same as Williams’ method described here: https://en.wikipedia.org/wiki/Binary_heap#Building_a_heap

Here’s another solution to the problem. It builds a heap with all n elements, which is O(n), and then gets the top k elements from the heap, which is O(k log n).
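The posted code isn’t reproduced here; a minimal Python sketch of that build-then-extract approach, negating values so heapq’s min-heap behaves as a max-heap:

```python
import heapq

def max_k_full_heap(k, xs):
    """Heapify all n items (O(n)), then pop the k largest (O(k log n))."""
    heap = [-x for x in xs]  # negate to simulate a max-heap with heapq's min-heap
    heapq.heapify(heap)      # O(n) bottom-up heap construction
    return [-heapq.heappop(heap) for _ in range(k)]  # k pops, O(log n) each

print(max_k_full_heap(5, [3, 9, 6, 1, 5, 6, 4, 2, 8, 7, 6]))  # [9, 8, 7, 6, 6]
```

Total time is O(n + k log n), which beats O(n log k) when k is close to n but loses when k is small relative to n.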

Output:

Hi Paul, I believe the dependence becomes clearer after changing n to 6, and also looping through a wider range of k.

https://ideone.com/kqHpJX

That won’t run due to a time limit being exceeded. The only change is that n was increased from 5 to 6, and k loops over (1, 10, 100, 1000, 10000, 100000, 1000000).

> “I understand all arguments that you both gave, but the timings do not show the logk dependence really.”

– Paul @ September 1, 2016 at 3:11 PM

In my last post, I was using “dependence” in reference to your quote above.

Please ignore my last two posts. I’m not sure I understand the issue.

Hi Paul,

I wrote earlier that my heap-based solution, MaxItems2, was O(n log k). My implementation was doing more work than necessary: it emptied the priority queue in sorted order, which required k iterations, each O(log k). I fixed that below, as the problem does not specify an ordering of the output. The array underlying the heap, in its existing order, is now used as the output.
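In Python terms, the fix amounts to returning the heap’s backing list as-is instead of extracting items one by one (assuming unordered output is acceptable; the names here are illustrative, not the commenter’s C++):

```python
import heapq

def max_k_unordered(k, xs):
    """Return the k largest items in heap order, not sorted order."""
    heap = list(xs[:k])
    heapq.heapify(heap)
    for x in xs[k:]:
        if x > heap[0]:
            heapq.heapreplace(heap, x)
    return heap  # skip the k * O(log k) sorted extraction; heap order is fine

print(sorted(max_k_unordered(5, [3, 9, 6, 1, 5, 6, 4, 2, 8, 7, 6])))  # [6, 6, 7, 8, 9]
```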

> “Do other languages behave differently?”

– Paul @ September 1, 2016 at 1:18 PM

The following code sets n to 16,777,216. It then calls MaxItems2 for varying values of k, and reports the elapsed clock ticks. Compiler optimizations were disabled. With n fixed, the complexity is O(log k).

Output: