Hypothesis Testing
It is a good morning exercise for a research
scientist to discard a pet hypothesis every day before
breakfast. It keeps him young.
Konrad Lorenz
The topics of setting limits, signal significance, and
goodness-of-fit, presented in the previous sections, are special
cases of the more general topic of hypothesis testing. Briefly:
- setting limits:
- The goal is a statement similar to: "the
rate of process X is less than r at confidence
level alpha".
- signal significance:
- The goal is a statement similar to: "the
rate of process X is greater than 0 at confidence level
alpha".
- goodness-of-fit:
- We have only one hypothesis: the data
are described by a distribution having a specified form (often
with variable parameters that are determined from a fit to the
data). We want to test how consistent the actual data are with
that hypothesis.
- hypothesis testing:
- Having assigned the single hypothesis
case to the goodness-of-fit category, in this section we
concentrate on multiple-hypothesis cases that do not fit the
other categories.
In general, one has n hypotheses
(n>1) with respect to a specified data set, and one
wants to calculate the level of support provided by the data for
each hypothesis. To illustrate the concept we provide a few
examples of hypotheses to be tested:
- We are given a pair of (already
identified) muon tracks, and we wish to test the following 2
hypotheses:
- The muons were produced through the
decay J/psi -> µ+ µ-. (Signal)
- The muons are a random pairing; they
might not even have been produced at the same vertex.
(Background)
(We specify positively identified
muons to simplify this example; in real life one must
generally include the possibility that the tracks may have
been produced by some other particle.)
- Time-of-flight, drift chamber
dE/dx, calorimeter, muon chamber, and other
information is to be considered (if present) for a certain
long-lived charged track of momentum p, and the
(5) hypotheses to be tested are particle ID = electron, muon,
pion, kaon, and proton.
- An experimenter has a highly significant
signal in the decay
B** -> B pi, but it is unclear
by inspection whether the B** signal is one wide
resonance, or several narrower resonances separated in mass.
Motivated by theory, the (4) hypotheses that the experimenter
wishes to test are that the signal mass-spectrum is
characterized by 1, 2, 3, or 4 Breit-Wigner
functions.
- Given an event with an identified
neutral B-meson decay, and using all available
information in the event, we wish to test the (2) hypotheses
that the B (which oscillates between B° and
B°-bar as time passes) was a B° (vs a
B°-bar) at its production time t=0
(i.e. B-flavor tagging).
n.b. Examples 2-4 are not constructed to be
simple or amenable cases, but rather to be difficult or
problematical in some respect.
The Test Statistic
The most important (and often the most
challenging) step in hypothesis testing is selecting the test
statistic. The test statistic is some function of the data
t(x), where x is a
vector that represents the data in a less refined fashion, while
t is a vector (with fewer components than
x) that preserves the desired information but
removes (some of) the irrelevant content.
In Example 1, above, x could
be considered to be raw drift chamber hits, while t
might be the vector specified by these 5 components: (mass,
mass-uncertainty, vertex-chisquare, curvature-product,
curvature-product uncertainty). The function
t(x) then takes a 2×96=192
component vector into a 5 component vector (96 hits per COT
track). (Or alternatively, x might be considered to
be the vector formed by concatenating the 2 tracks' helix
parameters and error matrices, already a statistic with only
2(5+15)=40 components.)
Selecting an appropriate test statistic is
facilitated by an understanding of the physics involved.
Specifically, one wishes to preserve the information that helps
us to distinguish among the competing hypotheses, while
simultaneously eliminating information that has no power to
discriminate among the hypotheses. In our example
t(x) we selected:
- The µ+ µ- mass and its uncertainty, since muon pairs from
J/psi decay will have an invariant mass consistent with
3.097 GeV, while random pairs will give a "random" mass.
- The vertex-chisquare, since pairings of
muons from separate p-pbar interactions (but acquired as a
single event) will often give a large chisquare, while muons
from J/psi decay will have a chisquare consistent with 1 degree
of freedom.
- The product of the track curvatures (and
uncertainty), since this product will be consistent with being
negative for muons from J/psi decay, but will often be positive
for random pairings.
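As a rough illustration of the first and third components, the sketch below computes a dimuon invariant mass from two momentum vectors and checks the sign of the curvature product. The function names, the (px, py, pz) tuple layout in GeV, and the muon mass constant are assumptions made for this example; they are not taken from any experiment's software.

```python
import math

MUON_MASS_GEV = 0.1056584  # muon mass in GeV (assumed constant for this sketch)


def energy(p, mass=MUON_MASS_GEV):
    """Relativistic energy of a particle with momentum p = (px, py, pz)."""
    return math.sqrt(mass * mass + sum(c * c for c in p))


def dimuon_mass(p1, p2):
    """Invariant mass of two muons: m^2 = (E1 + E2)^2 - |p1 + p2|^2."""
    e_total = energy(p1) + energy(p2)
    p_total = [a + b for a, b in zip(p1, p2)]
    m_squared = e_total * e_total - sum(c * c for c in p_total)
    return math.sqrt(max(m_squared, 0.0))  # guard against rounding below zero


def curvature_product_is_negative(curvature1, curvature2):
    """True when the two tracks bend in opposite directions (opposite charge)."""
    return curvature1 * curvature2 < 0.0
```

The mass uncertainty and the vertex chisquare require the track error matrices and a vertex fit, so they are left out of this sketch.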
The physical considerations behind
selecting these components for t are undoubtedly
clear to most readers. However, it is possible that we have
eliminated too much information. For example, if we Lorentz boost
into the µ+ µ- rest frame there is an angle between the µ+
momentum in that frame and the direction of the boost. It is
quite possible that this angle also contains information that
could be used to distinguish between the 2 hypotheses. One would
check whether this angle could be usefully employed by performing
Monte-Carlo studies.
Conceptually then, the method of
constructing t(x) that we employed
above is to attempt to identify all the (possibly) relevant
variables using physical intuition, and to evaluate their
performance (to decide whether we should use them or not) using
Monte-Carlo tests. There is another approach which is often used
to help select t(x): the (artificial) neural network.
In the neural network approach, we feed the data x (or, more
likely, a statistic t(x) based on the
data) into an array of "neurons" that have been trained (using
Monte-Carlo data sets) to distinguish between the hypotheses.
After the requisite training phase, the neural net will become a
fixed function t_nn(x) (or t_nn(t(x)),
if we fed the statistic t(x) to the
neural network) mapping the data to the neural network statistic
t_nn. In this way, a neural net can help
us to further reduce the dimensionality of our
statistic.
Performing the Test
Labeling our hypotheses
H_0, H_1, H_2, ... H_{n-1}, we denote the probability
distribution for t in the case that the ith
hypothesis (H_i) is true as f(t|H_i).
Temporarily adopting a Bayesian approach in formulating the test
procedure, we denote the prior probabilities for our hypotheses
as a_0, a_1, a_2, ... a_{n-1} (which
sum to 1). The posterior probability p_i for the
hypothesis H_i is then given by

p_i = a_i f(t|H_i) / [a_0 f(t|H_0) + a_1 f(t|H_1) + ... + a_{n-1} f(t|H_{n-1})].

The result of the test (in the
Bayesian version) is the posterior probability for each of the
hypotheses.
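The posterior formula above can be sketched in a few lines of Python. The function name and the list-based interface are assumptions made for illustration; the likelihood values f(t|H_i) are taken as already evaluated at the observed t.

```python
def posterior_probabilities(priors, likelihoods):
    """Posterior p_i = a_i f_i / sum_j(a_j f_j) for each hypothesis.

    priors      -- prior probabilities a_i (expected to sum to 1)
    likelihoods -- values f(t|H_i) evaluated at the observed statistic t
    """
    if len(priors) != len(likelihoods):
        raise ValueError("need one likelihood per prior")
    weighted = [a * f for a, f in zip(priors, likelihoods)]
    total = sum(weighted)
    if total <= 0.0:
        raise ValueError("the data have zero probability under every hypothesis")
    return [w / total for w in weighted]
```

For example, equal priors of 0.5 with likelihoods 0.25 and 0.75 give posteriors of 0.25 and 0.75. With equal priors the posteriors are simply the normalized likelihoods.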
In the case n=2, the posterior
probability for H_0 is given by
p_0 = a_0 r / [a_0 r + a_1], where r is the likelihood
ratio, given by
r = f(t|H_0) / f(t|H_1). The posterior probability
p_0 increases monotonically with r, so
maximizing p_0 is equivalent to maximizing
r (even if we don't know what values to assign to
a_0 and a_1).
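For readers who want the intermediate step, the n=2 expression follows from the general posterior formula by dividing numerator and denominator by f(t|H_1), assuming f(t|H_1) > 0 and both priors strictly positive:

```latex
p_0 = \frac{a_0 f(t|H_0)}{a_0 f(t|H_0) + a_1 f(t|H_1)}
    = \frac{a_0 r}{a_0 r + a_1},
\qquad r = \frac{f(t|H_0)}{f(t|H_1)},
\qquad
\frac{dp_0}{dr} = \frac{a_0 a_1}{(a_0 r + a_1)^2} > 0 .
```

The positive derivative is what makes p_0 a monotonically increasing function of r, whatever the values of the priors.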
Often one wishes to define an acceptance
region in t-space for the hypothesis
H_0 that (among all regions with the same
acceptance for H_0) has the maximum power to
reject the hypothesis H_1; one can show that
this is accomplished by accepting any t that
satisfies r(t) > c for some cutoff
value c (this is the Neyman-Pearson lemma).
One would adjust the constant c to provide the desired
level of acceptance for H_0. In this form the
test is known as the likelihood ratio test. The result of
the test in frequentist language would be "the data pass (or
fail) the test at significance level alpha" (alpha being 1 minus
the selection efficiency for H_0).
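A minimal sketch of the likelihood ratio test follows, assuming a one-dimensional statistic t that is Gaussian with unit width under both hypotheses (mean 0 for H_0, mean 2 for H_1). These distributions, the function names, and the use of a simulated H_0 sample to tune the cutoff c are all assumptions chosen for illustration.

```python
import math
import random


def gaussian_pdf(t, mean, sigma):
    """Gaussian probability density evaluated at t."""
    z = (t - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def likelihood_ratio(t, f0, f1):
    """r(t) = f(t|H_0) / f(t|H_1)."""
    return f0(t) / f1(t)


def cutoff_for_acceptance(r_values_h0, efficiency):
    """Cutoff c such that roughly `efficiency` of the H_0 sample has r > c."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    ordered = sorted(r_values_h0)
    n_rejected = int(round((1.0 - efficiency) * len(ordered)))
    if n_rejected == 0:
        return 0.0  # r is positive, so r > 0 accepts every event
    return ordered[n_rejected - 1]


def demo(n_events=10000, efficiency=0.9, seed=1):
    """Tune c on simulated H_0 events, then measure the rejection of H_1."""
    rng = random.Random(seed)
    f0 = lambda t: gaussian_pdf(t, 0.0, 1.0)
    f1 = lambda t: gaussian_pdf(t, 2.0, 1.0)
    r_h0 = [likelihood_ratio(rng.gauss(0.0, 1.0), f0, f1) for _ in range(n_events)]
    r_h1 = [likelihood_ratio(rng.gauss(2.0, 1.0), f0, f1) for _ in range(n_events)]
    c = cutoff_for_acceptance(r_h0, efficiency)
    acceptance_h0 = sum(r > c for r in r_h0) / n_events
    rejection_h1 = sum(r <= c for r in r_h1) / n_events
    return c, acceptance_h0, rejection_h1
```

Calling demo() returns the cutoff, the fraction of H_0 events accepted (close to the requested 0.9, so alpha is about 0.1), and the fraction of H_1 events rejected. In a real analysis the simulated samples would come from the experiment's Monte Carlo rather than from random.gauss.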
Hypothesis Tests vs Goodness-of-fit Tests
- Sometimes we have only one hypothesis
H_0, and we wish to test whether it is
consistent with the data. This special case (n=1) is the
goodness-of-fit test, which we
covered in the previous section. The methods of this section
are not applicable to the n=1 case. Note that the
"likelihood ratio of the Poisson distribution" described in the
goodness-of-fit section is not an example of
r(t) as defined in this section: there is
no serious hypothesis H_1 corresponding to the
denominator. For this reason, the Neyman-Pearson lemma is not
applicable to the goodness-of-fit section.
- The test described in this section is
degenerate in the n=1 case. The Bayesian interpretation
of n=1 is that, if only one hypothesis is permitted, it
is by definition certainly true; no data are necessary.
Goodness-of-fit is therefore sometimes claimed to be outside
the realm of Bayesian thinking.
- The hypothesis test of this section does
not substitute for goodness-of-fit tests performed on the
individual hypotheses. For example, in the case n=2, if
the data were inconsistent with both hypotheses, the
goodness-of-fit tests would reveal this fact, while
r(t) as defined here would conceal it. In
general, one should apply both types of test. (It is also
possible that both hypotheses give an acceptable
goodness-of-fit; in this case our test based on
r(t) may still be able to distinguish
between the two since it is optimized for that purpose, while
the goodness-of-fit tests are not.)
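The third point can be illustrated with a toy case in which the likelihood ratio strongly prefers one hypothesis although the data are inconsistent with both. The Gaussian hypotheses, the equal priors, and the use of a one-sided tail probability as a stand-in for a goodness-of-fit test are assumptions made for this sketch.

```python
import math


def gaussian_pdf(t, mean, sigma=1.0):
    """Gaussian probability density evaluated at t."""
    z = (t - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def upper_tail_probability(t, mean, sigma=1.0):
    """P(T >= t) for a Gaussian; a crude stand-in for a goodness-of-fit p-value."""
    return 0.5 * math.erfc((t - mean) / (sigma * math.sqrt(2.0)))


def compare(t_obs, mean_h0, mean_h1):
    """Return (r, posterior for H_1 with equal priors, tail p-values for H_0 and H_1)."""
    r = gaussian_pdf(t_obs, mean_h0) / gaussian_pdf(t_obs, mean_h1)
    posterior_h1 = 1.0 / (1.0 + r)  # equals a_1 / (a_0 r + a_1) when a_0 = a_1
    return (r, posterior_h1,
            upper_tail_probability(t_obs, mean_h0),
            upper_tail_probability(t_obs, mean_h1))
```

With compare(10.0, 0.0, 3.0), the ratio overwhelmingly favors H_1, yet both tail probabilities are below 1e-11: the observation lies 7 standard deviations from the "preferred" hypothesis. Only the separate goodness-of-fit check reveals that neither hypothesis describes the data.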
Comments
- The form of the probability distribution
f(t|H_i) may be
obscure for some of the hypotheses H_i.
For example, in the case of example 1 above, the background
hypothesis H_1 does not immediately suggest
the proper form for
f(t|H_1). In such cases
Monte Carlo simulations can provide the information necessary
to construct this function empirically.
- If f(t|H_i) is
determined at least partly by the data x,
H_i is known as a composite
hypothesis. In example 3 above (B**), if we do not
specify in advance the widths of the Breit-Wigner
functions, but let them be determined from a fit to the data,
the hypotheses are all composite. In other words, composite
hypotheses have unknown parameters that need to be extracted
from the data.
- There are a host of special-purpose
test statistics. For example, Student's t can be
computed using two data sets known to have the same variance
(though of unknown value) to test whether they have the same
mean.
- The likelihood ratio test can be
justified from both the Bayesian and frequentist points of
view.
- Fisher's linear discriminant is a
classic test statistic which is the precursor to more
sophisticated methods.
- The alert reader will have noticed
significant overlap between this section and the selection section of these
recommendations: many of the tests that are commonly used to
test hypotheses (at the end of an analysis) can also be
employed (at the beginning) to select the data to be
analyzed.
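Fisher's linear discriminant, mentioned in the comments above, can be sketched for a two-component statistic as follows. It projects t onto the direction w proportional to S_W^{-1} (m_0 - m_1), where m_0 and m_1 are the sample means under the two hypotheses and S_W is the summed within-class scatter matrix. The function names and the restriction to two components (so that the matrix inverse can be written out by hand) are assumptions made for this example.

```python
def mean_vector(sample):
    """Component-wise mean of a list of 2-component points."""
    n = len(sample)
    return [sum(point[k] for point in sample) / n for k in range(2)]


def scatter_matrix(sample, mean):
    """2x2 scatter matrix: sum over points of (t - mean)(t - mean)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for point in sample:
        d = [point[0] - mean[0], point[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(sample0, sample1):
    """Direction w = S_W^{-1} (m_0 - m_1) from two training samples."""
    m0, m1 = mean_vector(sample0), mean_vector(sample1)
    s0, s1 = scatter_matrix(sample0, m0), scatter_matrix(sample1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0.0:
        raise ValueError("within-class scatter matrix is singular")
    dm = [m0[0] - m1[0], m0[1] - m1[1]]
    # Explicit 2x2 inverse applied to the difference of means.
    return [(sw[1][1] * dm[0] - sw[0][1] * dm[1]) / det,
            (-sw[1][0] * dm[0] + sw[0][0] * dm[1]) / det]


def fisher_statistic(w, t):
    """Scalar test statistic: projection of t onto the Fisher direction."""
    return w[0] * t[0] + w[1] * t[1]
```

The training samples would typically be Monte Carlo events generated under each hypothesis; larger values of the resulting scalar statistic favor the hypothesis that produced sample0.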
Joel Heinrich
Last modified: March 7, 2002