File Reversal
November 17, 2015
Given a file containing lines of text that fits into memory, it is easy to write the file with lines in reverse order: read the lines of text into memory, then write them in reverse. It is harder to reverse the lines of a file if the file is too big to fit into memory.
Your task is to write a program that reverses the lines of a text file that is too big to fit in memory. When you are finished you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
I tried to memory-map a 5.3-gigabyte, 85-million-line file in a 32-bit computer. Memory-mapping turned out to have a size limit below 2^31 bytes (signed address with some administrative space). Another parameter allowed starting the mapping at an offset, so I gained instant access to the end of the file after all. Results agreed with tac. While tac, head and tail conspired to reverse-print a million lines in a second or two, my tac.jl (julia 0.4.0) took something like a minute. Not bad for such a naive program, I think. Dealing with the the line at the offset might not be hard, either, but let it be now :)
Memory mapping is a nice idea. Here’s another way of dealing with large files on 32-bit systems: files don’t have to be written sequentially, on Unix systems anyway, so we can just write the lines from input file in reverse order in the output file. If we compile for large files, we can use 64-bit offsets, even on 32-bit systems. Seems to work on my Raspberry Pi, though that took 54 minutes to reverse the first 2.2GB of a 5.5GB file before running out of disk (it’s got a 16 GB SD card, which can be written at about 10MB a second, so it’s not write speed that’s the problem, maybe just too many system calls). As usual, for brevity, error checks have been removed.
Here’s an improved version that buffers up lines before writing to the output file. Time on Pi for 2.2GB data is now about 8mins, just copying same data with cat takes about 6 mins. I’ve left the checks in for this version (I tend to use assert for runtime checks in small programs like this, don’t do this at work).
I created a custom std::streambuf to read a file in reverse. Since std::streambuf doesn’t seem to have provisions for reading through a buffer backwards, I simply reverse whatever gets read into the buffer. So, this is not an efficient solution.
While std::streambuf may not be great, deriving from it allows me to then use std::getline() to read lines from the file. (And gives me something reusable that will interop with code designed to work with standard streams.) These lines, however, will be in reverse order, so I have to reverse each line again before output.
Heh. Forgot to delete my comment about adding error checking. Tests showed, however, that if you give it a bogus file name the whole thing just becomes a no-op. Which is typical for iostreams. They won’t throw exceptions unless you ask them too (and then they throw too much), but they generally won’t crash if you don’t bother to explicitly check for errors.