Run Length Encoding

February 26, 2010

A book by Brian Kernighan and P. J. Plauger, Software Tools in Pascal, describes a pair of programs, called compress and expand, that provide run-length encoding of text files. The compress program takes a text file as input and writes a (hopefully smaller) version of the text file as output; the expand program inverts that operation. Compression is achieved by replacing runs of four or more of the same character with a three-character code consisting of a tilde, a letter A through Z indicating 1 through 26 repetitions, and the character to be repeated. Runs longer than 26 characters are replaced by multiple encodings, andany literal appearance of the tilde in the input is encoded as a run of length 1. For instance, the string ABBB~CDDDDDEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE is encoded as ABBB~A~C~ED~ZE~DE.

Your task is to write programs that compress and expand a text file. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Posted by programmingpraxis

Filed in Exercises

4 Comments »

4 Responses to “Run Length Encoding”

Programming Praxis – Run Length Encoding « Bonsai Code said
February 26, 2010 at 9:43 AM
[…] Praxis – Run Length Encoding By Remco Niemeijer In today’s Programming Praxis exercise we have to implement a run length encoding algorithm. The provided […]

Remco Niemeijer said

February 26, 2010 at 9:44 AM

My Haskell solution (see http://bonsaicode.wordpress.com/2010/02/26/programming-praxis-run-length-encoding/ for a version with comments):

import Data.List
import Data.List.Split

compress :: String -> String
compress s = f =<< chunk 26 =<< group s where
    f xs = if length xs < 4 && take 1 xs /= "~" then xs
           else '~' : toEnum (length xs + 64) : take 1 xs

expand :: String -> String
expand []           = []
expand ('~':r:c:xs) = replicate (fromEnum r - 64) c ++ expand xs
expand (c:xs)       = c : expand xs

domor said

February 26, 2010 at 9:43 PM

My take on compress. I interpreted “any literal appearance of the tilde in the input is encoded as a run of length 1” as being a special case, so the output is different from Niemeijer’s version when there are several tildes in a row.

import Control.Arrow
import Data.List

compress :: String -> String
compress xs = group xs >>= (head &&& length >>> encode)
    where encode (c, n) | c == '~'  = concat . replicate n $ "~A~"
                        | n < 4     = replicate n c
                        | n > 26    = encode (c, 26) ++ encode (c, n-26)
                        | otherwise = '~' : toEnum (fromEnum 'A' + n - 1 ) : [c]

Full code with a quickcheck test: http://codepad.org/VakQUriW

Mike said

April 21, 2010 at 11:00 PM

Python, done using regular expressions.

Like domor, I separately encoded each ~ in the input.

from string import ascii_uppercase
import re

def compress( s ):
    def sub( mo ):
        run = mo.group()
        return "~" + ascii_uppercase[len(run)-1] + run[0]
    
    return re.sub( r"(.)\1{3,25}|~", sub, s )

def expand( s ):
    def sub( mo ):
        code = mo.group()
        return code[2]*(ascii_uppercase.index(code[1]) +1)
    
    return re.sub(r"(~..)", sub, s)


def test():
    s = 'ABBB~CDDDDDEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE'
    t = 'ABBB~A~C~ED~ZE~DE'
    assert compress( s ) == t
    assert expand( t ) == s

S	M	T	W	T	F	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28

Programming Praxis