Statistics is Easy!


Statistics is the activity of inferring results about a population given a sample. Historically, statistics books assume an underlying distribution to the data (typically, the normal distribution) and derive results under that assumption. Unfortunately, in real life, one cannot normally be sure of the underlying distribution. For that reason, this book presents a distribution-independent approach to statistics based on a simple computational counting idea called resampling.

This book explains the basic concepts of resampling, then systematically presents the standard statistical measures along with programs (in the language Python) to calculate them using resampling, and finally illustrates the use of the measures and programs in a case study. The text uses junior high school algebra and many examples to explain the concepts. The ideal reader has mastered at least elementary mathematics, likes to think procedurally, and is comfortable with computers.

The Basic Idea

Suppose you want to know whether some coin is fair^{1}. You toss it 17 times and it comes up heads all but 2 times. How might you determine whether it is reasonable to believe the coin is fair? (A fair coin should come up heads with probability 1/2 and tails with probability 1/2.) You could compute the percentage of times you would get this result if the fairness assumption were true. Probability theory would suggest using the binomial distribution, but you may have forgotten the formula or its derivation. So you might look it up, or at least remember the name so you could get software to do it for you. The net effect is that you wouldn't understand much unless you were up on your probability theory.

The alternative is to run an experiment 10,000 times, where each experiment consists of tossing a coin that is known to be fair 17 times, and to ask in what percentage of those experiments you get 15 or more heads. When we ran this program, the percentage was consistently well under 5 (that is, under 5%, a threshold often used to denote "unlikely"), so it's unlikely the coin is in fact fair. Your hand might ache from tossing a coin 170,000 times, but your PC will do it in under a second.
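The simulation takes only a few lines of Python. The sketch below is one way to write it; the function and variable names are ours, not necessarily those used in the book's own code, and since the tosses are random the printed counts will vary slightly from run to run.

```python
import random

def run_coin_experiment(num_trials=10000, num_tosses=17, threshold=15):
    """Count how often a fair coin gives at least `threshold` heads
    in `num_tosses` tosses, over `num_trials` simulated experiments."""
    count = 0
    for _ in range(num_trials):
        # randint(0, 1) is a fair coin: 1 means heads, 0 means tails.
        heads = sum(random.randint(0, 1) for _ in range(num_tosses))
        if heads >= threshold:
            count += 1
    return count

count = run_coin_experiment()
print(count, "out of 10000 times we got at least 15 heads in 17 tosses.")
print("Probability that chance alone gave us at least 15 heads",
      "in 17 tosses is", count / 10000, ".")
```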

Here is an example run of this code:

```
9 out of 10000 times we got at least 15 heads in 17 tosses.
Probability that chance alone gave us at least 15 heads in 17 tosses is 0.0009 .
```

Here is a second example.

Imagine we have given some people a placebo and others a drug. The measured improvement (the more positive the better) is

Placebo: 54 51 58 44 55 52 42 47 58 46

Drug: 54 73 53 70 73 68 52 65 65

As you can see, the drug seems more effective on the average (the average measured improvement is 63.7 for the drug and 50.7 for the placebo). But is this difference in the average real? Formula-based statistics would use a t-test which entails certain assumptions about normality and variance, but we are going to look just at the samples themselves and shuffle the labels.

What this means can be illustrated as follows. We put all the people in a table having two columns, value and label (P for placebo and D for drug).

value | label |
---|---|
54 | P |
51 | P |
58 | P |
44 | P |
55 | P |
52 | P |
42 | P |
47 | P |
58 | P |
46 | P |
54 | D |
73 | D |
53 | D |
70 | D |
73 | D |
68 | D |
52 | D |
65 | D |
65 | D |
Shuffling the labels means that we will take the P's and D's and randomly distribute them among the patients. (Technically, we do a uniform random permutation of the label column.)

This might give:

value | label |
---|---|
54 | P |
51 | P |
58 | D |
44 | P |
55 | P |
52 | D |
42 | D |
47 | D |
58 | D |
46 | D |
54 | P |
73 | P |
53 | P |
70 | D |
73 | P |
68 | P |
52 | D |
65 | P |
65 | D |
We can then look at the difference between the average P value and the average D value. Here we get an average of 59.0 for P and 54.4 for D. We repeat this shuffle-then-measure procedure 10,000 times and ask in what fraction of the shuffles the difference between drug and placebo is greater than or equal to the measured difference of 63.7 - 50.7 = 13. The answer in this case is under 0.001, that is, less than 0.1%. So we conclude that the difference between the averages of the samples is real. This is what statisticians call significant.
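The whole shuffle-then-measure procedure can be sketched in a few lines of Python. Note one shortcut: instead of permuting the label column, the sketch shuffles the values and treats the first 10 as "placebo" and the rest as "drug", which is equivalent to a uniform random permutation of the labels. The names are ours, not necessarily those in the book's code.

```python
import random

placebo = [54, 51, 58, 44, 55, 52, 42, 47, 58, 46]
drug = [54, 73, 53, 70, 73, 68, 52, 65, 65]

def mean(xs):
    return sum(xs) / len(xs)

observed_diff = mean(drug) - mean(placebo)  # about 13

values = placebo + drug
num_shuffles = 10000
count = 0
for _ in range(num_shuffles):
    random.shuffle(values)
    # After shuffling, the first 10 values play the role of "placebo"
    # and the remaining 9 the role of "drug".
    new_placebo = values[:len(placebo)]
    new_drug = values[len(placebo):]
    if mean(new_drug) - mean(new_placebo) >= observed_diff:
        count += 1

print("Observed difference:", round(observed_diff, 1))
print("Fraction of shuffles with at least that difference:",
      count / num_shuffles)
```

Because the shuffles are random, the printed fraction varies a little between runs, but for this data it stays well under 0.001.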

Let's step back for a moment. What is the justification for shuffling the labels? The idea is simply this: if the drug had no real effect, then the placebo would often give more improvement than the drug. By shuffling the labels, we are simulating the situation in which some placebo measurements replace some drug measurements. If the observed average difference of 13 would be matched or even exceeded in many of these shufflings, then the drug might have no effect beyond the placebo. That is, the observed difference could have occurred by chance.