[FOM] 2 problems in the foundations of statistics

Fri Jul 29 22:19:16 EDT 2005

One of the fundamental problems in statistical inference is the
following:

A process with 2 possible outcomes (which we will label "success" and
"failure" although no asymmetry is intended) has been observed, in a+b
trials, to succeed a times and fail b times. What is the chance that it
will succeed c times and fail d times in the next (c+d) trials?

Symmetry between success and failure, and independence of trials from
each other, may be assumed.

One classical solution, due to Bayes, assumes a "uniform prior", and
comes out as

(a+c)! (b+d)! (c+d)! (a+b+1)!
-----------------------------
  a! b! c! d! (a+b+c+d+1)!

In the particular case a=b=c=d=1, we have seen something happen 1 of 2
times, and ask for the chance it will happen once in the next 2 tries,
and the formula gives an answer of 2/5. 
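
To keep things concrete, here is a minimal Python sketch that evaluates the formula above in exact rational arithmetic. The function name bayes_uniform is made up for illustration and is not taken from any library.

```python
from fractions import Fraction
from math import factorial as fact

def bayes_uniform(a, b, c, d):
    """Chance of c successes and d failures in the next c+d trials,
    given a successes and b failures so far, under a uniform prior."""
    num = fact(a + c) * fact(b + d) * fact(c + d) * fact(a + b + 1)
    den = fact(a) * fact(b) * fact(c) * fact(d) * fact(a + b + c + d + 1)
    return Fraction(num, den)

print(bayes_uniform(1, 1, 1, 1))  # 2/5

# Sanity check: the chances of 0, 1, or 2 successes in the next 2 trials sum to 1.
print(sum(bayes_uniform(1, 1, c, 2 - c) for c in range(3)))  # 1
```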

Note that the naive solution of estimating the probability of success
"p" at 0.5 based on the first 2 trials, and calculating from this the
chance for 1 success in the next 2 trials as 1/2, is unsatisfactory
because every other value for p gives a chance of less than 1/2 to this
outcome.
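
The naive calculation can be sketched the same way. The names plug_in and split_chance, and the grid of 101 values of p, are illustrative choices only; the scan just confirms that 2p(1-p) peaks at p = 1/2.

```python
from fractions import Fraction
from math import comb

def plug_in(a, b, c, d):
    """Naive estimate: fix p = a/(a+b), then apply the binomial formula."""
    p = Fraction(a, a + b)
    return comb(c + d, c) * p**c * (1 - p)**d

def split_chance(p):
    """Chance of exactly 1 success in 2 trials when the success chance is p."""
    return 2 * p * (1 - p)

print(plug_in(1, 1, 1, 1))  # 1/2

# Scan p = 0, 1/100, ..., 1 (an arbitrary illustrative grid).
grid = [Fraction(i, 100) for i in range(101)]
best = max(grid, key=split_chance)
print(best, split_chance(best))  # 1/2 1/2
```

Since p = 1/2 is the single most favorable value for a split, any uncertainty about p can only pull the answer below 1/2.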

But is 2/5 really the right answer? Is the uniform prior a valid
assumption? Could there be a better prior distribution, or a better
solution that does not correspond to assuming a prior distribution and
calculating a la Bayes? Other solutions have been proposed, and there is
no consensus.

Suppose you knew nothing about two basketball teams except that the
first two games they played were split. Is the chance the next two will
be split really two fifths? It would be nice to have an explanation for
this number that would satisfy the proverbial "man in the street".

Let's calculate one more case. Suppose a=2, b=1, c=0, d=2. We are
calculating the chance that, if one team has won 2 out of 3 from another
team, it will lose the next two (in other words, how likely it is that
the second team would have prevailed if the 3-game playoff series had
been extended to 5 games). 

The formula gives 1/5; how does this accord with common sense? The naive
calculation assumes p=2/3 and comes out with 1/9; the largest answer
consistent with common sense is 1/4 (you assume that two teams in a
playoff series are very evenly matched; SOME team must be ahead after 2
games but that needn't change the estimated chance the other team will
win the next 2 games very much).
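
One way to see how 1/5 and 1/4 relate is to replace the uniform prior by a symmetric Beta(k, k) prior. This is my own illustrative assumption, not one of the solutions named above: k = 1 is the uniform prior, and larger k encodes a stronger prior belief that the teams are evenly matched. The sketch below is restricted to integer k so that everything stays in exact fractions; the names beta_int and predictive are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial

def beta_int(x, y):
    """Beta function B(x, y) for positive integers x and y."""
    return Fraction(factorial(x - 1) * factorial(y - 1), factorial(x + y - 1))

def predictive(a, b, c, d, k=1):
    """Chance of c successes and d failures in the next c+d trials, given
    a successes and b failures so far, under a symmetric Beta(k, k) prior.
    k = 1 is the uniform prior; integer k only (an assumption of this sketch)."""
    return comb(c + d, c) * beta_int(a + c + k, b + d + k) / beta_int(a + k, b + k)

print(predictive(2, 1, 0, 2))  # 1/5, the uniform-prior answer

# Larger k means a stronger prior belief that the teams are evenly matched.
for k in (1, 2, 10, 100):
    chance = predictive(2, 1, 0, 2, k)
    print(k, chance, float(chance))
```

For this case the chance works out to (k+1)/(2(2k+3)), which rises from 1/5 at k = 1 toward 1/4 as k grows but never reaches it. The plug-in value 1/9 does not arise for any k in this family.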

The other big problem in the foundations of statistics involves
hypothesis testing rather than prediction. Suppose that under condition
I we have a successes and b failures; under condition II (which excludes
condition I) we have c successes and d failures. We wish to test the
hypothesis that it makes a difference which of the two conditions
obtains, against the "null hypothesis" that it makes no difference. What
kind of formula, in terms of a, b, c, and d, should we apply here? We
could apply the chi-square test to the 2x2 contingency table ((a b)(c
d)). Or, we could calculate Fisher's "likelihood ratio", which comes out
to 

 f(a+c) f(b+d) f(c+d) f(a+b)
------------------------------
f(a) f(b) f(c) f(d) f(a+b+c+d)

where f(x) = x^x (x to the xth power).
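
As a rough sketch of both candidates in terms of a, b, c, d: the code below computes the ratio above exactly, together with the usual Pearson chi-square statistic for a 2x2 table (without continuity correction). The function names are made up for illustration, and the convention 0^0 = 1 is assumed for empty cells. The quantity -2 ln(ratio) is included because, under the null hypothesis and for large samples, it is approximately chi-square distributed with 1 degree of freedom, which puts the two statistics on a comparable scale.

```python
from fractions import Fraction
from math import log

def f(x):
    """x to the xth power; Python evaluates 0**0 as 1, the convention assumed here."""
    return x ** x

def likelihood_ratio(a, b, c, d):
    """Ratio of the maximized likelihood under 'no difference' to the
    maximized likelihood when the two conditions may differ."""
    num = f(a + c) * f(b + d) * f(c + d) * f(a + b)
    den = f(a) * f(b) * f(c) * f(d) * f(a + b + c + d)
    return Fraction(num, den)

def chi_square(a, b, c, d):
    """Pearson chi-square statistic for the table ((a b)(c d)).
    Undefined (division by zero) if any row or column total is 0."""
    n = a + b + c + d
    return Fraction(n * (a * d - b * c) ** 2,
                    (a + b) * (c + d) * (a + c) * (b + d))

def g_statistic(a, b, c, d):
    """-2 ln(likelihood ratio), as a float."""
    return -2 * log(float(likelihood_ratio(a, b, c, d)))

print(likelihood_ratio(3, 1, 1, 3))  # 256/729
print(chi_square(3, 1, 1, 3))        # 2
print(g_statistic(3, 1, 1, 3))       # roughly 2.09
```

A ratio near 1 favors the null hypothesis and a ratio near 0 counts against it. On this small table the two statistics roughly agree, but they need not agree in general.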

There is no consensus on how to approach this fundamental problem,
either.

These issues can be discussed by anyone who understands basic arithmetic
-- no course in statistics is needed to understand the PROBLEMS, even
though some of the SOLUTIONS that have been offered are technical. I'd
like to see what mathematicians who are NOT statisticians think of these
basic questions. (I'd especially like to see replies which discuss
actual algorithms involving the parameters a,b,c,d to keep things
concrete!)

-- Joe Shipman