## Hoare’s Partition, Improved

### November 1, 2016

We begin by writing a timing program, so we can measure our progress. The timing program is just a modification of the test program of the previous exercise, eliminating the initial test of an already-sorted input:

(define (time-quicksort k n) (let ((ary (shuffle (list->vector (range k))))) (do ((n n (- n 1))) ((zero? n)) (quicksort < ary 0 k) (unless (sorted? < ary) (display "ERROR ") (display ary) (newline)) (shuffle ary))))

Timing that function test shuffling and checking as well as sorting, but the shuffle takes the same amount of time each time it is done, so improvements are due solely to the sorting. Here we sort ten arrays of a million elements each, about three-quarters of a second per shuffle, sort and check:

> (time (time-quicksort 1000000 10)) (time (time-quicksort 1000000 ...)) 42 collections 7.235913936s elapsed cpu time, including 0.070033656s collecting 7.246311917s elapsed real time, including 0.070413646s collecting 360026640 bytes allocated, including 346672480 bytes reclaimed

We begin by moving the comparison and swap inline; only the `partition`

function needs to change:

(define (partition lt? ary lo hi) (let ((x (vector-ref ary lo)) (i (- lo 1)) (j hi)) (let forever () (do-until (set! j (- j 1)) (not (lt? x (vector-ref ary j)))) (do-until (set! i (+ i 1)) (not (lt? (vector-ref ary i) x))) (when (< i j) (let ((t (vector-ref ary i))) (vector-set! ary i (vector-ref ary j)) (vector-set! ary j t)) (forever))) (values j ary)))

This simple change already gives us a significant improvement:

> (time (time-quicksort 1000000 10)) (time (time-quicksort 1000000 ...)) 41 collections 5.931337461s elapsed cpu time, including 0.057270243s collecting 5.943404523s elapsed real time, including 0.057907239s collecting 360026000 bytes allocated, including 352237216 bytes reclaimed

The next improvement stops recursion when the sub-array becomes small, because quicksort makes many recursive calls to small sub-arrays where little sorting progress is made, at the cost of a large function-calling overhead. When the sub-array has less items than the *cutoff* value, the recursive call returns without doing anything. Then, after quicksort is complete, the array is nearly-sorted, and calling insertion sort completes the sort. Our `quicksort`

function changes like this:

(define cutoff 15)

(define (quicksort lt? ary lo hi) (if (< cutoff (- hi lo)) (call-with-values (lambda () (partition lt? ary lo hi)) (lambda (p ary) (cond ((< (- p lo) (- hi p)) (quicksort lt? ary lo (+ p 1)) (quicksort lt? ary (+ p 1) hi)) (else (quicksort lt? ary (+ p 1) hi) (quicksort lt? ary lo (+ p 1)))))) ary))

We make *cutoff* a parameter so we can easily experiment with multiple values; in several experiments, values from 10 to 20 seem to work best, so we choose 15. Then, sorting is a matter of calling `quicksort`

to nearly-sort the array, followed by calling `insert-sort`

to complete the sort:

(define (sort lt? ary) (let ((hi (vector-length ary))) (quicksort lt? ary 0 hi) (insert-sort lt? ary 0 hi)))

Here is our version of insertion sort. This quadratic sorting algorithm has very good performance when the input is nearly sorted:

(define (insert-sort lt? ary lo hi) (do ((i (+ lo 1) (+ i 1))) ((= i hi) ary) (let ((t (vector-ref ary i))) (do ((j (- i 1) (- j 1))) ((or (< j lo) (lt? (vector-ref ary j) t)) (vector-set! ary (+ j 1) t)) (vector-set! ary (+ j 1) (vector-ref ary j))))))

We have to change both our test and timing programs:

(define (test-quicksort k n) (let ((ary (list->vector (range k)))) (do ((n n (- n 1))) ((zero? n)) (sort < ary) (unless (sorted? < ary) (display "ERROR ") (display ary) (newline)) (shuffle ary))))

(define (time-quicksort k n) (let ((ary (shuffle (list->vector (range k))))) (do ((n n (- n 1))) ((zero? n)) (sort < ary) (unless (sorted? < ary) (display "ERROR ") (display ary) (newline)) (shuffle ary))))

This saves us a bit more time:

> (time (time-quicksort 1000000 10)) (time (time-quicksort 1000000 ...)) 11 collections 5.448323257s elapsed cpu time, including 0.068221981s collecting 5.462751294s elapsed real time, including 0.068402786s collecting 95498128 bytes allocated, including 84668176 bytes reclaimed

Our last improvement is to change the partitioning element; instead of picking the first element, we choose the median of three elements. That avoids quadratic slowdown in common cases like already-sorted arrays, and even when sorting random arrays, it can improve sort times because bad partitions are less likely. However, though it reduces the likelihood of quadratic slowdown, it does not eliminate the possibility, and the worst case time complexity of quicksort remains O(n^{2}). Here is our function to compute the median of the first element in a sub-array, the middle element in a sub-array, and the last element in a sub-array:

(define (median-of-three lt? ary lo hi) (let* ((mid (quotient (+ lo hi) 2)) (lo-val (vector-ref ary lo)) (mid-val (vector-ref ary mid)) (hi-val (vector-ref ary (- hi 1)))) (if (lt? lo-val mid-val) (if (lt? mid-val hi-val) mid (if (lt? lo-val hi-val) (- hi 1) lo)) (if (lt? lo-val hi-val) lo (if (lt? mid-val hi-val) (- hi 1) mid)))))

We insert the median-of-three calculation into the quicksort algorithm by modifying the `partition`

function to swap (written inline) the median element with the first element, then continuing with the normal partitioning algorithm:

(define (partition lt? ary lo hi) (let ((part (median-of-three lt? ary lo hi))) (let ((t (vector-ref ary part))) (vector-set! ary part (vector-ref ary lo)) (vector-set! ary lo t))) (let ((x (vector-ref ary lo)) (i (- lo 1)) (j hi)) (let forever () (do-until (set! j (- j 1)) (not (lt? x (vector-ref ary j)))) (do-until (set! i (+ i 1)) (not (lt? (vector-ref ary i) x))) (when (< i j) (let ((t (vector-ref ary i))) (vector-set! ary i (vector-ref ary j)) (vector-set! ary j t)) (forever))) (values j ary)))

Timing shows yet another speed improvement:

> (time (time-quicksort 1000000 10)) (time (time-quicksort 1000000 ...)) 10 collections 5.099436403s elapsed cpu time, including 0.044687735s collecting 5.150438980s elapsed real time, including 0.044843144s collecting 86835856 bytes allocated, including 69279472 bytes reclaimed

The speedup is even greater on arrays that are already sorted, because each recursive call splits the array exactly in the middle, and Hoare’s partitioning means there are very few swaps, though quicksort is still not as fast as insertion sort in that case:

> (time (begin (sort < (list->vector (range 1000000))) 'done)) (time (begin (sort < ...) ...)) 5 collections 0.247582180s elapsed cpu time, including 0.042849217s collecting 0.247819495s elapsed real time, including 0.042979872s collecting 44197552 bytes allocated, including 17686944 bytes reclaimed done

> (time (begin (insert-sort < (list->vector (range 1000000)) 0 1000000) 'done)) (time (begin (insert-sort < ...) ...)) 5 collections 0.095537412s elapsed cpu time, including 0.045661468s collecting 0.096014903s elapsed real time, including 0.046088562s collecting 40003280 bytes allocated, including 19092608 bytes reclaimed done

From beginning to end, we reduced the time required to sort ten arrays of a million elements from 7.236 seconds to 5.099 seconds, a 30% improvement; since that timing includes other tasks than just sorting, the actual percentage improvement on sorting is even larger. But despite our improvements, the quadratic worst-case of quicksort remains. We’ll show a variant of quicksort with guaranteed O(*n* log *n*) worst-case time complexity in the next exercise.

You can run the program at http://ideone.com/z1m1K4.

Here’s a solution in C99. Compiler optimizations were disabled. Each improvement was added successively. I didn’t look at how each one improves performance in isolation.

For each experiment, I sorted 50,000 random arrays, each with 1,000 elements.

The unoptimized version took 6.506 seconds.

Adding Inlined swapping reduced the time to 5.392 seconds.

Using insertion sort for arrays with 21 or fewer items reduced the time to 4.703 seconds. I manually tuned the size of arrays for early cutoff.

Adding median-of-three for determining a pivot reduced the time to 4.653 seconds.

The following code (and corresponding output) includes all the improvements.

Output:

[…] two previous exercises we’ve been working toward a variant of quicksort that has guaranteed O(n log n) performance; […]

[…] sorting algorithm that we have been working up to in three previous exercises is introspective sort, or introsort, invented by David Musser in 1997 for the C++ […]