File Reversal

November 17, 2015

We set the maximum number of lines at 10000, though you might prefer something else in your environment:

(define max-lines 10000)

Then we read the input into a series of temporary files, with lines in input order; the reversal will be done on output:

(define (read-into-temps)
  (let loop ((file-num 1))
    (let ((lines (read-lines max-lines)))
      (if (null? lines) (- file-num 1)
        (let ((temp-file-name (temp file-num)))
          (with-output-to-file temp-file-name
            (lambda ()
              (do ((lines lines (cdr lines)))
                  ((null? lines))
                (display (car lines)) (newline))))
          (loop (+ file-num 1)))))))

All of the fun is in creating the output; we read the temporary files in reverse order and output the lines from each temporary file in reverse order:

(define (write-output num-files)
  (let loop ((num num-files))
    (when (positive? num)
      (with-input-from-file (temp num)
        (lambda ()
          (do ((lines (reverse (read-lines -1)) (cdr lines)))
              ((null? lines))
            (display (car lines)) (newline))))
      (loop (- num 1)))))

A driver program connects the pieces:

(define (reverse-file in-file out-file)
  (with-input-from-file in-file (lambda ()
    (with-output-to-file out-file (lambda ()
      (write-output (read-into-temps)))))))

And that’s it. Our sample program reverses the Bible, stored as one line per verse with a prefix tag, and writes the first and last line with a function from a previous exercise:

> (reverse-file "bible.txt" "txt.elbib")
> (head&tail "txt.elbib")
Revelation22:21 The grace of the Lord Jesus be with all.
Genesis1:1 In the beginning, when God created the heavens and the earth,

You can see the complete program, including auxiliary functions, at http://ideone.com/74ODHo.

Advertisement

Pages: 1 2

5 Responses to “File Reversal”

  1. Jussi Piitulainen said

    I tried to memory-map a 5.3-gigabyte, 85-million-line file in a 32-bit computer. Memory-mapping turned out to have a size limit below 2^31 bytes (signed address with some administrative space). Another parameter allowed starting the mapping at an offset, so I gained instant access to the end of the file after all. Results agreed with tac. While tac, head and tail conspired to reverse-print a million lines in a second or two, my tac.jl (julia 0.4.0) took something like a minute. Not bad for such a naive program, I think. Dealing with the the line at the offset might not be hard, either, but let it be now :)

    """Displays some lines of a named file,
       which is over 2^32 bytes in size,
       from the end in reverse order.
       Lines are in UTF-8 and end in 0x0a."""
    
    function peek(name)
        s = Mmap.mmap(name, Array{UInt8,1},
                      filesize(name) - Int64(2)^32, # < 2^31 - some
                      Int64(2)^32)                  # offset
        e = endof(s)
        for k in 1:10
            b = findprev(s, 0x0a, e - 1) + 1
            print(utf8(s[b:e]))
            e = b - 1
        end
    end
    
    peek(ARGS[1])
    
  2. matthew said

    Memory mapping is a nice idea. Here’s another way of dealing with large files on 32-bit systems: files don’t have to be written sequentially, on Unix systems anyway, so we can just write the lines from input file in reverse order in the output file. If we compile for large files, we can use 64-bit offsets, even on 32-bit systems. Seems to work on my Raspberry Pi, though that took 54 minutes to reverse the first 2.2GB of a 5.5GB file before running out of disk (it’s got a 16 GB SD card, which can be written at about 10MB a second, so it’s not write speed that’s the problem, maybe just too many system calls). As usual, for brevity, error checks have been removed.

    #define _FILE_OFFSET_BITS 64
    #include <string.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/stat.h>
    
    int main(int argc, char *argv[])
    {
      char *infile = argv[1];
      char *outfile = argv[2];
      int in = open(infile,O_RDONLY);
      int out = creat(outfile, 0666);
      struct stat statbuf;
      fstat(in,&statbuf);
      off_t pos = statbuf.st_size;
      FILE *fin = fdopen(in, "r");
      char line[2048];
      while(fgets(line,2048,fin)) {
        size_t len = strlen(line);
        pos -= len;
        lseek(out, pos, SEEK_SET);
        write(out,line,len);
      }
      fclose(fin); close(out);
    }
    
  3. matthew said

    Here’s an improved version that buffers up lines before writing to the output file. Time on Pi for 2.2GB data is now about 8mins, just copying same data with cat takes about 6 mins. I’ve left the checks in for this version (I tend to use assert for runtime checks in small programs like this, don’t do this at work).

    #define _FILE_OFFSET_BITS 64
    #include <assert.h>
    #include <string.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/stat.h>
    
    int main(int argc, char *argv[])
    {
      char *infile = argv[1];
      char *outfile = argv[2];
      int in = open(infile,O_RDONLY);
      int out = creat(outfile, 0666);
      assert(in >= 0);
      assert(out >= 0);
      struct stat statbuf;
      assert(fstat(in,&statbuf) == 0);
      assert(sizeof(statbuf.st_size) == 8);
      off_t pos = statbuf.st_size;
      FILE *fin = fdopen(in, "r");
      size_t LSIZE = 1024;
      size_t BSIZE = 1024*1024;
      char line[LSIZE];
      char buff[BSIZE];
      size_t bp = BSIZE;
      while(fgets(line,LSIZE,fin)) {
        size_t len = strlen(line);
        bp -= len;
        memcpy(buff+bp,line,len);
        if (bp < LSIZE) {
          pos -= BSIZE-bp;
          assert(lseek(out, pos, SEEK_SET) != (off_t)-1);
          assert(write(out,buff+bp,BSIZE-bp) == (ssize_t)(BSIZE-bp));
          bp = BSIZE;
        }
      }
      assert((off_t)(BSIZE-bp) == pos);
      lseek(out,0,SEEK_SET);
      write(out,buff+bp,BSIZE-bp);
      fclose(fin); close(out);
    }
    
  4. fisherro said

    I created a custom std::streambuf to read a file in reverse. Since std::streambuf doesn’t seem to have provisions for reading through a buffer backwards, I simply reverse whatever gets read into the buffer. So, this is not an efficient solution.

    While std::streambuf may not be great, deriving from it allows me to then use std::getline() to read lines from the file. (And gives me something reusable that will interop with code designed to work with standard streams.) These lines, however, will be in reverse order, so I have to reverse each line again before output.

    #include <algorithm>
    #include <array>
    #include <cstdio>
    #include <fstream>
    #include <iostream>
    #include <streambuf>
    #include <string>
    #include <vector>
    
    //A streambuffer that reads a file in reverse.
    //I'd like to actually walk backwards through the buffer.
    //But to keep things easy, I'm going to reverse the data after I read it.
    class Reverse_file_buf: public std::streambuf {
    public:
        static size_t calc_pbs(size_t requested) {
            return std::max(1UL, std::min(1024UL/2UL, requested));
        }
    
        explicit Reverse_file_buf(std::string path, size_t put_back_size = 1)
            : pbs(calc_pbs(put_back_size)), file(path)
        {
            //Check that file is good?
            file.seekg(0, std::ios_base::end);
            pos = file.tellg();
            auto e = buffy.data() + buffy.size();
            setg(e, e, e);
        }
    
    private:
        virtual int_type underflow() override
        {
            if (gptr() >= egptr()) {
                if (0 == pos) return traits_type::eof();
                auto p = buffy.data() + pbs;
                size_t size = buffy.size() - pbs;
                auto to_read = std::min(size, pos);
                pos -= to_read;
                file.seekg(pos);
                file.read(p, to_read);
                auto n = file.gcount();
                if (n < 1) return traits_type::eof();
                std::reverse(p, p + n);
                setg(buffy.data(), p, p + n);
            }
            return traits_type::to_int_type(*gptr());
        }
    
        const size_t pbs;
        std::array<char, 1024> buffy;
        size_t pos;
        std::ifstream file;
    };
    
    int main(int argc, char** argv)
    {
        std::vector<std::string> args(argv + 1, argv + argc);
        if (args.empty()) args.push_back("praxis.cpp");
        for (auto arg: args) {
            Reverse_file_buf rfb(arg);
            std::istream in(&rfb);
            std::string line;
            while (std::getline(in, line)) {
                std::reverse(line.begin(), line.end());
                std::cout << line << '\n';
            }
        }
    }
    
  5. fisherro said

    Heh. Forgot to delete my comment about adding error checking. Tests showed, however, that if you give it a bogus file name the whole thing just becomes a no-op. Which is typical for iostreams. They won’t throw exceptions unless you ask them too (and then they throw too much), but they generally won’t crash if you don’t bother to explicitly check for errors.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: