External Sorting

November 20, 2015

Our strategy is to split the file into several temporary files, sort each separately, then combine the temporary files using a heap containing one line per temporary file. Here’s the function that creates the temporary files:

(define (read-into-temps)
  (let loop ((file-num 1))
    (let ((lines (read-lines max-lines)))
      (if (null? lines) (- file-num 1)
        (let ((temp-file-name (temp file-num)))
          (with-output-to-file temp-file-name
            (lambda ()
              (do ((lines (sort string<? lines) (cdr lines)))
                  ((null? lines))
                (display (car lines)) (newline))))
          (loop (+ file-num 1)))))))

We steal much of the file-reversal code from the previous exercise: max-lines, read-lines, and temp are unchanged, and the read-into-temps function shown above differs from the file-reversal version only in the call to the sort function; that’s why we wrote the file-reversal program first.

The write-output function is considerably different than the file-reversal version. It creates a heap with one node per temporary file; each node contains an input port and the next line from the temporary file. The heap repeatedly pops the next string, then reads the next line from the temporary file that it popped and adds that line back to the heap:

(define (write-output num-files)
  (define (lt? x y) (string<? (car x) (car y)))
  (let loop ((num-files num-files) (pq pq-empty))
    (if (positive? num-files)
        (loop (- num-files 1)
              (let ((p (open-input-file (temp num-files))))
                (pq-insert lt? (cons (read-line p) p) pq)))
        (let loop ((pq pq))
          (when (not (pq-empty? pq))
            (let ((node (pq-first pq)))
              (display (car node)) (newline)
              (let ((next (read-line (cdr node))))
                (if (eof-object? next)
                    (loop (pq-rest lt? pq))
                    (loop (pq-insert
                            lt?
                            (cons next (cdr node))
                            pq))))))))))

We use pairing heaps from a previous exercise. Then we put the whole thing together like this:

(define (x-sort in-file out-file)
  (with-input-from-file in-file (lambda ()
    (with-output-to-file out-file (lambda ()
      (write-output (read-into-temps)))))))

As an example, here is the sorted Bible:

> (x-sort "bible.txt" "bbeil.txt")
> (head&tail "bbeil.txt")
1Chronicles10:1 Now the Philistines were at war with Israel; the Israelites fled before the Philistines, and a number of them fell, slain on Mount Gilboa.
Zephaniah3:9 For then I will change and purify the lips of the peoples, That they all may call upon the name of the LORD, to serve him with one accord;

You can see the entire code and run the program at http://ideone.com/UiHBxT.

Advertisements

Pages: 1 2

2 Responses to “External Sorting”

  1. matthew said

    Don’t have time to do actual implementation, but could use memory mapped file with suitable cache-oblivious sorting algorithm (https://en.wikipedia.org/wiki/Funnelsort seems to be state of the art, or at least, some kind of n-way mergesort).

  2. John Cowan said

    See Knuth on replacement selection, where you notice if the incoming line happens to be greater than the last line output to the current temp file, and if so write it out too. This is a huge win on almost-sorted files, a very common case.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: