Calculating Statistics

October 4, 2013

This looks like an assignment from a beginning programming class. I expect most beginning programmers would read the input into an array and compute the various statistics after the input is exhausted. Instead, we will read each line and build a frequency table of int/count pairs in an avl tree, then look up the various statistics in the tree:

(define (read-file file-name) (with-input-from-file file-name (lambda () (let ((t (make-dict <))) (do ((x (read) (read))) ((eof-object? x) t) (set! t (t 'update (lambda (k v) (+ v 1)) x 1)))))))

(define (count t) (t 'fold (lambda (k v b) (+ v b)) 0))

(define (sum t) (t 'fold (lambda (k v b) (+ (* k v) b)) 0))

(define (mean t) (/ (sum t) (count t)))

(define (median t) (let ((m (car (t 'nth 0))) (med (/ (count t) 2))) (t 'fold (lambda (k v b) (when (< (+ v b) med) (set! m k)) (+ v b)) 0) m))

(define (mode t) (let* ((x (t 'nth 0)) (m-key (car x)) (m-val (cdr x))) (t 'for-each (lambda (k v) (when (< m-val v) (set! m-key k) (set! m-val v)))) m-key))

(define (minimum t) (car (t 'nth 0)))

(define (maximum t) (car (t 'nth (- (t 'size) 1))))

Here we set up a test file and compute some statistics:

> (with-output-to-file "test-stats" (lambda () (do ((i 25 (- i 1))) ((zero? i)) (display (randint 10)) (newline)))) > (define stats (read-file "test-stats")) > (stats 'to-list) ((0 . 1) (1 . 3) (2 . 2) (3 . 3) (4 . 1) (5 . 2) (6 . 3) (7 . 1) (8 . 5) (9 . 4)) > (count stats) 25 > (sum stats) 131 > (mean stats) 131/25 > (median stats) 5 > (mode stats) 8 > (minimum stats) 0 > (maximum stats) 9

We used avl trees from the Standard Prelude, and randint for testing. You can see the program at http://programmingpraxis.codepad.org/8gUf5Ye7.

Posted by programmingpraxis

Filed in Exercises

4 Comments »

4 Responses to “Calculating Statistics”

axiac said
October 4, 2013 at 9:59 AM
The median for the list of numbers in the example is 6, not 5.

Paul said

October 4, 2013 at 10:27 AM

In Python.For the median I use: middle element if odd number of elements, else average of middle 2 elements.

from __future__ import division

from collections import Counter, namedtuple
Stats = namedtuple("Stats", ['n', 'sum', 'mean', 'median', 'mode', 'min', 
                             'max'])
def stats(fname=None, data=None):
    """Use filename, if provided, else use data"""
    if fname:
        data = [int(line) for line in open(fname)]
    data = sorted(data)
    if not data:
        return Stats(0, None,  None,  None,  None,  None,  None)
    nr_data = len(data)
    sum_data = sum(data)
    mean_data = sum_data / nr_data
    mode_data = Counter(data).most_common(1)[0][0]
    median_data = (data[nr_data // 2] if nr_data & 1 else 
        (data[nr_data // 2] + data[nr_data // 2 + 1]) / 2)
    return Stats(nr_data, sum_data, mean_data, median_data, mode_data, data[0],
            data[-1])

print stats(data=[1,2,3,4,5,7,8,2,3,6,7,1,2,3,4,9,8,6,7,3,4,6,7,4,5,4,3,2,2,5])
# Stats(n=30, sum=133, mean=4.433333333333334, median=4.0, mode=2, min=1, max=9)

blogthebuddha said

October 4, 2013 at 5:53 PM

Nothing special. Ruby.

#!/usr/bin/env ruby

if $PROGRAM_NAME == __FILE__
  frequency = Hash.new(0)
  median_sort = []
  File.open('whatever.txt') do |f|
    f.each do |line|
      number = line.to_i
      frequency[number] += 1
      median_sort << number
    end
  end

  # compute the number of integers
  size = median_sort.size

  # computer their sum
  sum = 0
  median_sort.each { |i| sum += i}

  # compute the mean
  mean = sum.to_f / size

  # compute the mode
  mode_pick = frequency.sort_by { |k,v| v }
  mode = mode_pick[-1][0]

  # compute the min and the max
  min_max = median_sort.sort
  min = min_max[0]
  max = min_max[-1]

  # compute the median
  half = (size.to_f - 1 )/ 2
  unless half%2 == 0
    half1 = half.floor
    half2 = half.ceil
  end

  median = min_max[half.to_i] if half % 2 == 0
  median = (min_max[half1] + min_max[half2]) / 2.0 unless half % 2 == 0

  puts "size: #{size}"
  puts "sum: #{sum}"
  puts "min: #{min}"
  puts "max: #{max}"
  puts "mean: #{mean}"
  puts "median: #{median}"
  puts "mode: #{mode}"
end

Brendan White said
October 10, 2013 at 4:30 AM
It’s tricky to determine the “best” approach without some knowledge of the expected data.

Only a few dozen values, and all values are less than a thousand? Just read them all into an array of regular old integers, and do the math afterwards. The code will be simple, easy to read, and easy to maintain.

Tens of millions of values? You might want to do something more clever, like keep a running total of count, total, min and max, and then do some hashtables for frequency (for the mode) and .. I don’t know, maybe some clever post-processing with the frequency tables for the median.

Values could be > 4,294,967,295? Then you need to do some careful thinking. :0)

But if you can be reasonably confident the values will be small enough (and few enough) that their sum fits comfortably in a regular old integer, and if the code is unlikely to be run more than a few dozen times on any given day, then I’d suggest the seasoned veteran programmer would take the simple approach and reduce the maintenance burden.

S	M	T	W	T	F	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Programming Praxis