## Statistics

### September 27, 2011

In today’s exercise we calculate some of the basic measures in statistics: mean, standard deviation, linear regression, and correlation. The only hard part is that different sources use different names for the same statistics. The formulas are shown below; all the summations are over $i$ from 1 to $n$, the number of items:

mean:

standard deviation:

linear regression:

slope:

intercept:

correlation:
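
For reference, one common way to write these statistics is sketched here. It assumes the population form of the standard deviation (dividing by $n$), which is what the comments below say the definition on page 1 uses. The symbols $\bar{x}$, $\sigma$, $m$, $b$ and $r$ are notation chosen for this sketch and may not match the original formulas:

$$
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum x_i \\
\sigma &= \sqrt{\frac{1}{n}\sum (x_i - \bar{x})^2} \\
y &= m x + b \\
m &= \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} \\
b &= \frac{\sum y_i - m\sum x_i}{n} \\
r &= \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n\sum y_i^2 - \left(\sum y_i\right)^2\right)}}
\end{aligned}
$$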

Your task is to write functions to compute these basic statistics. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
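
As a starting point, here is a minimal sketch in Python, not the suggested solution. The function names and the `sample` flag are assumptions made for illustration; the flag switches the standard deviation between dividing by n and by n − 1, since sources differ on which one they mean.

```python
import math


def mean(xs):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(xs) / len(xs)


def std_dev(xs, sample=False):
    """Standard deviation: divides by n, or by n - 1 when sample=True."""
    m = mean(xs)
    sum_sq = sum((x - m) ** 2 for x in xs)
    divisor = len(xs) - 1 if sample else len(xs)
    return math.sqrt(sum_sq / divisor)


def linear_regression(xs, ys):
    """Least-squares (slope, intercept) of the line y = slope * x + intercept."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx * sx) * (n * syy - sy * sy)
    )
```

For the perfectly linear data `xs = [1, 2, 3, 4, 5]` and `ys = [2, 4, 6, 8, 10]`, `linear_regression(xs, ys)` gives a slope of 2 and an intercept of 0, and `correlation(xs, ys)` gives 1. Written this way, the correlation does not depend on the choice of divisor, because the factor cancels between numerator and denominator.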

The implementation of your standard deviation (and thus correlation) is wrong, given the definitions on page 1. Your definition says to divide by N, but you divide by N – 1…

My implementation in Go:

I think I was taught to divide by n – 1 when the deviation from the (unknown) population mean is wanted but the (known) sample mean is used instead in the formula. The sample values are said to lose one “degree of freedom” because they cannot all deviate freely from their own mean.

See: http://en.wikipedia.org/wiki/Standard_deviation

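If you divide by n, the estimate of the population variance is biased. Dividing by n-1 gives an unbiased estimate of the variance (its square root, the sample standard deviation, is still slightly biased, but less so).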
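
The difference between the two divisors is easy to see numerically. Here is a small sketch using Python's standard `statistics` module, where `pstdev` divides by n and `stdev` divides by n − 1; the data values are made up for illustration.

```python
import statistics

# Eight made-up observations with mean 5 and squared deviations summing to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population form: sqrt(32 / 8)
population_sd = statistics.pstdev(data)

# Sample form: sqrt(32 / 7), always at least as large as the population form
sample_sd = statistics.stdev(data)

print(population_sd, sample_sd)
```

The gap between the two shrinks as n grows, so the choice matters most for small samples.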

Python solution

http://pastebin.com/vrV9J4vN

By way of conversation, here is an approach I find great fun. I lift constants to be vecs (indexed sequences) so that everything is uniform, and then I map binary or unary operations over these vecs. It is like the R language, but more rigid and in Scheme. The goal is a special language that allows one to explore descriptions like “the mean square deviation from the mean” in the code itself. Someone should write The Structure and Interpretation of Statistical, er, Something.

Ok, I get carried away. A variation on the theme anyway. I’ve included one of Anscombe’s cases.
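
The idea of lifting constants to vecs can be sketched roughly as follows. This is written in Python rather than Scheme and is not the commenter's code: the `Vec` class, its `lift` and `map2` helpers, and the function names are all hypothetical, and the data is the first of Anscombe's four sets as commonly tabulated.

```python
import math
import operator


class Vec:
    """An indexed sequence; plain numbers are lifted to constant vecs."""

    def __init__(self, items):
        self.items = list(items)

    @staticmethod
    def lift(value, n):
        # A scalar becomes a vec of n copies, so every operand is uniform.
        return value if isinstance(value, Vec) else Vec([value] * n)

    def map2(self, op, other):
        # Apply a binary operation elementwise over two vecs.
        other = Vec.lift(other, len(self.items))
        return Vec(map(op, self.items, other.items))

    def __sub__(self, other):
        return self.map2(operator.sub, other)

    def __mul__(self, other):
        return self.map2(operator.mul, other)

    def mean(self):
        return sum(self.items) / len(self.items)


def mean_square_deviation(v):
    """Literally 'the mean square deviation from the mean'."""
    deviation = v - v.mean()
    return (deviation * deviation).mean()


def correlation(x, y):
    """Mean product of deviations, scaled by both root mean square deviations."""
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).mean() / math.sqrt(
        mean_square_deviation(x) * mean_square_deviation(y)
    )


# Anscombe's first data set
x = Vec([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y = Vec([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

print(mean_square_deviation(x), correlation(x, y))
```

Because every operand is a vec, `v - v.mean()` reads almost like the prose description, and the correlation for this data set comes out close to 0.816.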

@Jussi Piitulainen, Paul Hofstra:

Yes, but that’s not how he defined standard deviation on page 1. Thus the confusion.

@DGel: Yes. It may be better to deviate from the definition on page 1, especially when even the model implementation does so.

How ugly :)