External Sorting
November 20, 2015
Our strategy is to split the file into several temporary files, sort each separately, then combine the temporary files using a heap containing one line per temporary file. Here’s the function that creates the temporary files:
(define (read-into-temps) (let loop ((file-num 1)) (let ((lines (read-lines max-lines))) (if (null? lines) (- file-num 1) (let ((temp-file-name (temp file-num))) (with-output-to-file temp-file-name (lambda () (do ((lines (sort string<? lines) (cdr lines))) ((null? lines)) (display (car lines)) (newline)))) (loop (+ file-num 1)))))))
We steal much of the file-reversal code from the previous exercise: max-lines
, read-lines
, and temp
are unchanged, and the read-into-temps
function shown above differs from the file-reversal version only in the call to the sort function; that’s why we wrote the file-reversal program first.
The write-output
function is considerably different than the file-reversal version. It creates a heap with one node per temporary file; each node contains an input port and the next line from the temporary file. The heap repeatedly pops the next string, then reads the next line from the temporary file that it popped and adds that line back to the heap:
(define (write-output num-files) (define (lt? x y) (string<? (car x) (car y))) (let loop ((num-files num-files) (pq pq-empty)) (if (positive? num-files) (loop (- num-files 1) (let ((p (open-input-file (temp num-files)))) (pq-insert lt? (cons (read-line p) p) pq))) (let loop ((pq pq)) (when (not (pq-empty? pq)) (let ((node (pq-first pq))) (display (car node)) (newline) (let ((next (read-line (cdr node)))) (if (eof-object? next) (loop (pq-rest lt? pq)) (loop (pq-insert lt? (cons next (cdr node)) pq))))))))))
We use pairing heaps from a previous exercise. Then we put the whole thing together like this:
(define (x-sort in-file out-file) (with-input-from-file in-file (lambda () (with-output-to-file out-file (lambda () (write-output (read-into-temps)))))))
As an example, here is the sorted Bible:
> (x-sort "bible.txt" "bbeil.txt") > (head&tail "bbeil.txt") 1Chronicles10:1 Now the Philistines were at war with Israel; the Israelites fled before the Philistines, and a number of them fell, slain on Mount Gilboa. Zephaniah3:9 For then I will change and purify the lips of the peoples, That they all may call upon the name of the LORD, to serve him with one accord;
You can see the entire code and run the program at http://ideone.com/UiHBxT.
Don’t have time to do actual implementation, but could use memory mapped file with suitable cache-oblivious sorting algorithm (https://en.wikipedia.org/wiki/Funnelsort seems to be state of the art, or at least, some kind of n-way mergesort).
See Knuth on replacement selection, where you notice if the incoming line happens to be greater than the last line output to the current temp file, and if so write it out too. This is a huge win on almost-sorted files, a very common case.