File Reversal

November 17, 2015

Given a file containing lines of text that fits into memory, it is easy to write the file with lines in reverse order: read the lines of text into memory, then write them in reverse. It is harder to reverse the lines of a file if the file is too big to fit into memory.

Your task is to write a program that reverses the lines of a text file that is too big to fit in memory. When you are finished you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

5 Responses to “File Reversal”

  1. Jussi Piitulainen said

    I tried to memory-map a 5.3-gigabyte, 85-million-line file in a 32-bit computer. Memory-mapping turned out to have a size limit below 2^31 bytes (signed address with some administrative space). Another parameter allowed starting the mapping at an offset, so I gained instant access to the end of the file after all. Results agreed with tac. While tac, head and tail conspired to reverse-print a million lines in a second or two, my tac.jl (julia 0.4.0) took something like a minute. Not bad for such a naive program, I think. Dealing with the the line at the offset might not be hard, either, but let it be now :)

    """Displays some lines of a named file,
       which is over 2^32 bytes in size,
       from the end in reverse order.
       Lines are in UTF-8 and end in 0x0a."""
    
    function peek(name)
        s = Mmap.mmap(name, Array{UInt8,1},
                      filesize(name) - Int64(2)^32, # < 2^31 - some
                      Int64(2)^32)                  # offset
        e = endof(s)
        for k in 1:10
            b = findprev(s, 0x0a, e - 1) + 1
            print(utf8(s[b:e]))
            e = b - 1
        end
    end
    
    peek(ARGS[1])
    
  2. matthew said

    Memory mapping is a nice idea. Here’s another way of dealing with large files on 32-bit systems: files don’t have to be written sequentially, on Unix systems anyway, so we can just write the lines from input file in reverse order in the output file. If we compile for large files, we can use 64-bit offsets, even on 32-bit systems. Seems to work on my Raspberry Pi, though that took 54 minutes to reverse the first 2.2GB of a 5.5GB file before running out of disk (it’s got a 16 GB SD card, which can be written at about 10MB a second, so it’s not write speed that’s the problem, maybe just too many system calls). As usual, for brevity, error checks have been removed.

    #define _FILE_OFFSET_BITS 64
    #include <string.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/stat.h>
    
    int main(int argc, char *argv[])
    {
      char *infile = argv[1];
      char *outfile = argv[2];
      int in = open(infile,O_RDONLY);
      int out = creat(outfile, 0666);
      struct stat statbuf;
      fstat(in,&statbuf);
      off_t pos = statbuf.st_size;
      FILE *fin = fdopen(in, "r");
      char line[2048];
      while(fgets(line,2048,fin)) {
        size_t len = strlen(line);
        pos -= len;
        lseek(out, pos, SEEK_SET);
        write(out,line,len);
      }
      fclose(fin); close(out);
    }
    
  3. matthew said

    Here’s an improved version that buffers up lines before writing to the output file. Time on Pi for 2.2GB data is now about 8mins, just copying same data with cat takes about 6 mins. I’ve left the checks in for this version (I tend to use assert for runtime checks in small programs like this, don’t do this at work).

    #define _FILE_OFFSET_BITS 64
    #include <assert.h>
    #include <string.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/stat.h>
    
    int main(int argc, char *argv[])
    {
      char *infile = argv[1];
      char *outfile = argv[2];
      int in = open(infile,O_RDONLY);
      int out = creat(outfile, 0666);
      assert(in >= 0);
      assert(out >= 0);
      struct stat statbuf;
      assert(fstat(in,&statbuf) == 0);
      assert(sizeof(statbuf.st_size) == 8);
      off_t pos = statbuf.st_size;
      FILE *fin = fdopen(in, "r");
      size_t LSIZE = 1024;
      size_t BSIZE = 1024*1024;
      char line[LSIZE];
      char buff[BSIZE];
      size_t bp = BSIZE;
      while(fgets(line,LSIZE,fin)) {
        size_t len = strlen(line);
        bp -= len;
        memcpy(buff+bp,line,len);
        if (bp < LSIZE) {
          pos -= BSIZE-bp;
          assert(lseek(out, pos, SEEK_SET) != (off_t)-1);
          assert(write(out,buff+bp,BSIZE-bp) == (ssize_t)(BSIZE-bp));
          bp = BSIZE;
        }
      }
      assert((off_t)(BSIZE-bp) == pos);
      lseek(out,0,SEEK_SET);
      write(out,buff+bp,BSIZE-bp);
      fclose(fin); close(out);
    }
    
  4. fisherro said

    I created a custom std::streambuf to read a file in reverse. Since std::streambuf doesn’t seem to have provisions for reading through a buffer backwards, I simply reverse whatever gets read into the buffer. So, this is not an efficient solution.

    While std::streambuf may not be great, deriving from it allows me to then use std::getline() to read lines from the file. (And gives me something reusable that will interop with code designed to work with standard streams.) These lines, however, will be in reverse order, so I have to reverse each line again before output.

    #include <algorithm>
    #include <array>
    #include <cstdio>
    #include <fstream>
    #include <iostream>
    #include <streambuf>
    #include <string>
    #include <vector>
    
    //A streambuffer that reads a file in reverse.
    //I'd like to actually walk backwards through the buffer.
    //But to keep things easy, I'm going to reverse the data after I read it.
    class Reverse_file_buf: public std::streambuf {
    public:
        static size_t calc_pbs(size_t requested) {
            return std::max(1UL, std::min(1024UL/2UL, requested));
        }
    
        explicit Reverse_file_buf(std::string path, size_t put_back_size = 1)
            : pbs(calc_pbs(put_back_size)), file(path)
        {
            //Check that file is good?
            file.seekg(0, std::ios_base::end);
            pos = file.tellg();
            auto e = buffy.data() + buffy.size();
            setg(e, e, e);
        }
    
    private:
        virtual int_type underflow() override
        {
            if (gptr() >= egptr()) {
                if (0 == pos) return traits_type::eof();
                auto p = buffy.data() + pbs;
                size_t size = buffy.size() - pbs;
                auto to_read = std::min(size, pos);
                pos -= to_read;
                file.seekg(pos);
                file.read(p, to_read);
                auto n = file.gcount();
                if (n < 1) return traits_type::eof();
                std::reverse(p, p + n);
                setg(buffy.data(), p, p + n);
            }
            return traits_type::to_int_type(*gptr());
        }
    
        const size_t pbs;
        std::array<char, 1024> buffy;
        size_t pos;
        std::ifstream file;
    };
    
    int main(int argc, char** argv)
    {
        std::vector<std::string> args(argv + 1, argv + argc);
        if (args.empty()) args.push_back("praxis.cpp");
        for (auto arg: args) {
            Reverse_file_buf rfb(arg);
            std::istream in(&rfb);
            std::string line;
            while (std::getline(in, line)) {
                std::reverse(line.begin(), line.end());
                std::cout << line << '\n';
            }
        }
    }
    
  5. fisherro said

    Heh. Forgot to delete my comment about adding error checking. Tests showed, however, that if you give it a bogus file name the whole thing just becomes a no-op. Which is typical for iostreams. They won’t throw exceptions unless you ask them too (and then they throw too much), but they generally won’t crash if you don’t bother to explicitly check for errors.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: