Calculating Statistics
October 4, 2013
In today’s exercise we will do somebody’s homework:
Read a file containing integers, one integer per line. At the end of the file, write the number of integers in the file, their sum, and the mean, median, mode, minimum and maximum of the integers in the file.
Your task is to write the indicated program. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
The median for the list of numbers in the example is 6, not 5.
In Python. For the median I use the middle element if the number of elements is odd, else the average of the two middle elements.
    from __future__ import division
    from collections import Counter, namedtuple

    Stats = namedtuple("Stats", ['n', 'sum', 'mean', 'median', 'mode', 'min', 'max'])

    def stats(fname=None, data=None):
        """Use filename, if provided, else use data"""
        if fname:
            with open(fname) as f:
                data = [int(line) for line in f]
        data = sorted(data or [])
        if not data:
            return Stats(0, None, None, None, None, None, None)
        nr_data = len(data)
        sum_data = sum(data)
        mean_data = sum_data / nr_data
        mode_data = Counter(data).most_common(1)[0][0]
        # Odd count: the middle element. Even count: the mean of the two
        # middle elements, at indices nr_data // 2 - 1 and nr_data // 2.
        median_data = (data[nr_data // 2] if nr_data & 1 else
                       (data[nr_data // 2 - 1] + data[nr_data // 2]) / 2)
        return Stats(nr_data, sum_data, mean_data, median_data, mode_data,
                     data[0], data[-1])

    print(stats(data=[1,2,3,4,5,7,8,2,3,6,7,1,2,3,4,9,8,6,7,3,4,6,7,4,5,4,3,2,2,5]))
    # Stats(n=30, sum=133, mean=4.433333333333334, median=4.0, mode=2, min=1, max=9)

Nothing special. Ruby.
    #!/usr/bin/env ruby

    if $PROGRAM_NAME == __FILE__
      frequency = Hash.new(0)
      median_sort = []
      File.open('whatever.txt') do |f|
        f.each do |line|
          number = line.to_i
          frequency[number] += 1
          median_sort << number
        end
      end

      # compute the number of integers
      size = median_sort.size
      # compute their sum
      sum = 0
      median_sort.each { |i| sum += i }
      # compute the mean
      mean = sum.to_f / size
      # compute the mode
      mode_pick = frequency.sort_by { |k, v| v }
      mode = mode_pick[-1][0]
      # compute the min and the max
      min_max = median_sort.sort
      min = min_max[0]
      max = min_max[-1]
      # compute the median: the middle position is (size - 1) / 2; for an even
      # count it falls between two elements, so average its floor and ceiling
      half = (size - 1) / 2.0
      median = (min_max[half.floor] + min_max[half.ceil]) / 2.0

      puts "size: #{size}"
      puts "sum: #{sum}"
      puts "min: #{min}"
      puts "max: #{max}"
      puts "mean: #{mean}"
      puts "median: #{median}"
      puts "mode: #{mode}"
    end

It’s tricky to determine the “best” approach without some knowledge of the expected data.
Only a few dozen values, and all values are less than a thousand? Just read them all into an array of regular old integers, and do the math afterwards. The code will be simple, easy to read, and easy to maintain.
Tens of millions of values? You might want to do something more clever, like keep a running count, total, min and max, and then use a hash table of frequencies (for the mode) and ... I don’t know, maybe some clever post-processing of the frequency table for the median.
Values could be > 4,294,967,295? Then you need to do some careful thinking. :0)
But if you can be reasonably confident the values will be small enough (and few enough) that their sum fits comfortably in a regular old integer, and if the code is unlikely to be run more than a few dozen times on any given day, then I’d suggest the seasoned veteran programmer would take the simple approach and reduce the maintenance burden.
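For what it's worth, the "something more clever" idea for tens of millions of values could be sketched roughly like this. It is only a sketch under a few assumptions: Python 3, blank lines are skipped, ties for the mode go to the smallest value, and the name `streaming_stats` is made up for illustration. It keeps a running count and total plus a frequency table, then walks the table's cumulative counts to find the median. It only saves memory when the values repeat a lot, since the table holds one entry per distinct value.

```python
from collections import Counter


def streaming_stats(lines):
    """Summarise an iterable of integer strings in a single pass.

    Memory grows with the number of distinct values, not with the
    number of lines read.
    """
    freq = Counter()
    count = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        value = int(line)
        freq[value] += 1
        count += 1
        total += value
    if count == 0:
        return None

    values = sorted(freq)  # distinct values only; also gives min and max

    # Mode: the most frequent value; ties go to the smallest value.
    mode = max(values, key=lambda v: (freq[v], -v))

    # Median: walk the cumulative counts until the two middle ranks
    # (0-based) are covered. For an odd count both ranks are the same.
    lo_rank, hi_rank = (count - 1) // 2, count // 2
    lo = hi = None
    seen = 0
    for v in values:
        seen += freq[v]
        if lo is None and seen > lo_rank:
            lo = v
        if seen > hi_rank:
            hi = v
            break
    median = (lo + hi) / 2

    return {
        "n": count,
        "sum": total,
        "mean": total / count,
        "median": median,
        "mode": mode,
        "min": values[0],
        "max": values[-1],
    }
```

Since a file object yields its lines one at a time, passing an open file straight in (for example `streaming_stats(open(path))`, with `path` being whatever file you have) never holds the whole file in memory.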