Entropy
January 22, 2016
The Shannon entropy of a file is a measure of the information content in the file; higher entropy implies more information. Shannon entropy is computed as H = -1 * sum(pi * log2(pi)) where pi is the frequency of each symbol i in the input (frequency is the percentage of the total number of symbols). Shannon entropy is just one facet of the study of information theory, which is fundamental to computer science.
Your task is to compute the Shannon entropy of a file. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Python has more entropy!
Other Scheme version over arbitrary sorted lists
(define (count-occurences l)
(define (count l last lastcount)
(cond
((null? l) ‘())
((equal? (car l) last) (count (cdr l) last (add1 lastcount)))
(else (cons lastcount (count (cdr l) (car l) 1)))))
(if (null? l) ‘() (count (cdr l) (car l) 1)))
(define (shannon-entropy l)
(let ((len (length l)))
(* -1
(fold
(lambda (occ last)
(let ((pi (/ occ len)))
(+ last (* pi
(/ (log pi) (log 2))))))
0
(count-occurences l)))))
(shannon-entropy (sort (string->list “Hello mine-turtle!”) char 3.29883055667735
(shannon-entropy ‘(red red blue black yellow purple))
;; -> 1.8208020839343
My chicken-scheme slightly modified version
(use vector-lib)
(define (log2 x) (/ (log x) (log 2)))
(define (filefreqs filename)
(with-input-from-file filename (lambda ()
(let ((freqs (make-vector 256 0)))
(let loop ((c (read-char))
(len 0))
(if (eof-object? c)
(vector-map (lambda (i x) (/ x len)) freqs)
(let ((i (char->integer c)))
(vector-set! freqs i (+ (vector-ref freqs i) 1))
(loop (read-char) (+ 1 len)))))))))
(define (entropy filename)
(- (vector-fold (lambda (i prev x)
(+ prev (* x (if (positive? x) (log2 x) 0))))
0
(filefreqs filename))))
(display (entropy "entropy.scm"))
Does formatting work?
Haskell, using the nice little foldl library:
Haskell has slightly less entropy than Python. Computed from my Python when the file contained just the source code, Jared’s Haskell, and Praxis’ Scheme:
Almost got by without using a variable: