Hypothesis Testing


"It is a good morning exercise for a research scientist to discard a pet hypothesis every day before breakfast. It keeps him young."
Konrad Lorenz

The topics of setting limits, signal significance, and goodness-of-fit, presented in the previous sections, are special cases of the more general topic of hypothesis testing. Briefly:

setting limits:
The goal is a statement similar to: "the rate of process X is less than r at confidence level alpha".
signal significance:
The goal is a statement similar to: "the rate of process X is greater than 0 at confidence level alpha".
goodness-of-fit:
We have only one hypothesis: the data are described by a distribution having a specified form (often with variable parameters that are determined from a fit to the data). We want to test how consistent the actual data are with that hypothesis.
hypothesis testing:
Having assigned the single hypothesis case to the goodness-of-fit category, in this section we concentrate on multiple-hypothesis cases that do not fit the other categories.
In general, one has n hypotheses (n>1) with respect to a specified data set, and one wants to calculate the level of support provided by the data for each hypothesis. To illustrate the concept we provide a few examples of hypotheses to be tested:
  1. We are given a pair of (already identified) muon tracks, and we wish to test the following 2 hypotheses: (We specify positively identified muons to simplify this example; in real life one must generally include the possibility that the tracks may have been produced by some other particle.)
  2. Time-of-flight, drift chamber dE/dx, calorimeter, muon chamber, and other information is to be considered (if present) for a certain long-lived charged track of momentum p, and the (5) hypotheses to be tested are particle ID = electron, muon, pion, kaon, and proton.
  3. An experimenter has a highly significant signal in the decay B** -> B pi, but it is unclear by inspection whether the B** signal is one wide resonance, or several narrower resonances separated in mass. Motivated by theory, the (4) hypotheses that the experimenter wishes to test are that the signal mass-spectrum is characterized by 1, 2, 3, or 4 Breit-Wigner functions.
  4. Given an event with an identified neutral B-meson decay, and using all available information in the event, we wish to test the (2) hypotheses that the B (which oscillates between B° and B°-bar as time passes) was a B° (vs a B°-bar) at its production time t=0 (i.e. B-flavor tagging).

n.b. Examples 2-4 are not constructed to be simple or amenable cases, but rather to be difficult or problematical in some respect.


The Test Statistic

The most important (and often the most challenging) step in hypothesis testing is selecting the test statistic. The test statistic is some function of the data t(x), where x is a vector that represents the data in a less refined fashion, while t is a vector (with fewer components than x) that preserves the desired information but removes (some of) the irrelevant content.

In Example 1, above, x could be considered to be raw drift chamber hits, while t might be the vector specified by these 5 components: (mass, mass-uncertainty, vertex-chisquare, curvature-product, curvature-product uncertainty). The function t(x) then takes a 2×96=192 component vector into a 5 component vector (96 hits per COT track). (Or alternatively, x might be considered to be the vector formed by concatenating the 2 tracks' helix parameters and error matrices, already a statistic with only 2(5+15)=40 components.)
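As an illustration of one component of such a statistic, the sketch below computes the invariant mass of a muon pair from the two fitted momentum vectors. It is a minimal example under our own assumptions: the function names and the input format (lab-frame 3-momenta in GeV/c) are hypothetical, and the code is not the reconstruction actually used to build t(x).

```python
import math

MUON_MASS_GEV = 0.1056584  # muon rest mass in GeV/c^2


def muon_energy(p):
    """Energy (GeV) of a muon with 3-momentum p = (px, py, pz) in GeV/c."""
    return math.sqrt(MUON_MASS_GEV**2 + sum(c * c for c in p))


def invariant_mass(p_plus, p_minus):
    """Invariant mass (GeV/c^2) of a muon pair: m^2 = E_tot^2 - |p_tot|^2."""
    e_tot = muon_energy(p_plus) + muon_energy(p_minus)
    p_tot = [a + b for a, b in zip(p_plus, p_minus)]
    m2 = e_tot**2 - sum(c * c for c in p_tot)
    # Guard against a tiny negative value caused by floating-point rounding.
    return math.sqrt(max(m2, 0.0))


# Hypothetical input: two back-to-back muons of 1.5 GeV/c each.
mass = invariant_mass((1.5, 0.0, 0.0), (-1.5, 0.0, 0.0))
```

Here six momentum components are reduced to a single number; the other components of t would be computed in the same spirit from the fit results.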

Selecting an appropriate test statistic is facilitated by an understanding of the physics involved. Specifically, one wishes to preserve the information that helps us to distinguish among the competing hypotheses, while simultaneously eliminating information that has no power to discriminate among the hypotheses. In our example t(x) we selected:

The physical considerations behind selecting these components for t are undoubtedly clear to most readers. However, it is possible that we have eliminated too much information. For example, if we Lorentz boost into the µ+ µ- rest frame there is an angle between the µ+ momentum in that frame and the direction of the boost. It is quite possible that this angle also contains information that could be used to distinguish between the 2 hypotheses---one would check whether this angle could be usefully employed by performing Monte-Carlo studies.
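The angle just described can be computed directly from the two lab-frame momenta. The following is a sketch under our own assumptions (hypothetical function name, inputs given as 3-momenta in GeV/c); whether the resulting variable has any discriminating power is exactly what the Monte-Carlo study would have to establish.

```python
import math

MUON_MASS_GEV = 0.1056584  # muon rest mass in GeV/c^2


def cos_theta_star(p_plus, p_minus, mass=MUON_MASS_GEV):
    """Cosine of the angle between the mu+ momentum in the pair rest frame
    and the boost direction (the lab-frame momentum of the pair)."""
    e_plus = math.sqrt(mass**2 + sum(c * c for c in p_plus))
    e_minus = math.sqrt(mass**2 + sum(c * c for c in p_minus))
    e_tot = e_plus + e_minus
    p_tot = [a + b for a, b in zip(p_plus, p_minus)]
    p_tot_mag = math.sqrt(sum(c * c for c in p_tot))
    if p_tot_mag == 0.0:
        raise ValueError("pair is at rest in the lab: boost direction undefined")
    beta = p_tot_mag / e_tot
    gamma = 1.0 / math.sqrt(1.0 - beta**2)
    n = [c / p_tot_mag for c in p_tot]  # unit vector along the boost
    # Split the mu+ momentum into components along and transverse to the boost.
    p_par = sum(a * b for a, b in zip(p_plus, n))
    p_perp2 = max(sum(c * c for c in p_plus) - p_par**2, 0.0)
    # Only the longitudinal component changes under the boost.
    p_par_star = gamma * (p_par - beta * e_plus)
    p_star = math.sqrt(p_par_star**2 + p_perp2)
    if p_star == 0.0:
        raise ValueError("muon at rest in the pair rest frame: angle undefined")
    return p_par_star / p_star


# Hypothetical symmetric configuration: the mu+ is emitted at 90 degrees
# to the boost in the pair rest frame, so the cosine is 0 up to rounding.
c_sym = cos_theta_star((1.0, 0.0, 2.0), (-1.0, 0.0, 2.0))
```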

Conceptually then, the method of constructing t(x) that we employed above is to attempt to identify all the (possibly) relevant variables using physical intuition, and to evaluate their performance (to decide whether we should use them or not) using Monte-Carlo tests. There is another approach which is often used to help select t(x): the (artificial) neural network.

In the neural network approach, we feed the data x (or, more likely, a statistic t(x) based on the data) into an array of "neurons" that have been trained (using Monte-Carlo data sets) to distinguish between the hypotheses. After the requisite training phase, the neural net will become a fixed function t_nn(x) (or t_nn(t(x)), if we fed the statistic t(x) to the neural network) mapping the data to the neural network statistic t_nn. In this way, a neural net can help us to further reduce the dimensionality of our statistic.
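To make the idea concrete, the sketch below trains the smallest possible "network", a single sigmoid neuron, on toy Monte-Carlo samples for two hypotheses and then freezes it into a function t_nn. Everything in it is an assumption made for illustration: the two-component Gaussian toy statistic, the sample sizes, the learning rate, and all function names. A realistic application would use a larger network and a properly simulated t(x).

```python
import math
import random


def make_sample(rng, mean, n):
    """Toy Monte-Carlo: n two-component statistics t scattered around `mean`."""
    return [[rng.gauss(mean[0], 1.0), rng.gauss(mean[1], 1.0)] for _ in range(n)]


def neuron(weights, bias, t):
    """Single sigmoid neuron: maps the vector t to a number in (0, 1)."""
    z = bias + sum(w * c for w, c in zip(weights, t))
    return 1.0 / (1.0 + math.exp(-z))


def train(sample_h0, sample_h1, epochs=20, rate=0.05, seed=1):
    """Stochastic gradient descent on the cross-entropy loss.
    The target output is 1 for H0 events and 0 for H1 events."""
    rng = random.Random(seed)
    data = [(t, 1.0) for t in sample_h0] + [(t, 0.0) for t in sample_h1]
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for t, target in data:
            err = neuron(weights, bias, t) - target
            weights = [w - rate * err * c for w, c in zip(weights, t)]
            bias -= rate * err
    return weights, bias


rng = random.Random(42)
weights, bias = train(make_sample(rng, (1.0, 1.0), 500),
                      make_sample(rng, (-1.0, -1.0), 500))


def t_nn(t):
    """The trained neuron, now a fixed function from t to a single number."""
    return neuron(weights, bias, t)


# Evaluate on independent toy samples, not the ones used for training.
test_h0 = make_sample(rng, (1.0, 1.0), 500)
test_h1 = make_sample(rng, (-1.0, -1.0), 500)
mean_h0 = sum(t_nn(t) for t in test_h0) / len(test_h0)
mean_h1 = sum(t_nn(t) for t in test_h1) / len(test_h1)
n_correct = sum(t_nn(t) > 0.5 for t in test_h0) + sum(t_nn(t) <= 0.5 for t in test_h1)
accuracy = n_correct / (len(test_h0) + len(test_h1))
```

The two-component input has been reduced to the one-component statistic t_nn, whose distributions under the two hypotheses can then be used in the test described in the next section.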


Performing the Test

Labeling our hypotheses H_0, H_1, H_2, ..., H_{n-1}, we denote the probability distribution for t in the case that the i-th hypothesis (H_i) is true as f(t|H_i). Temporarily adopting a Bayesian approach in formulating the test procedure, we denote the prior probabilities for our hypotheses as a_0, a_1, a_2, ..., a_{n-1} (which sum to 1). The posterior probability p_i for the hypothesis H_i is then given by

  p_i = a_i f(t|H_i) / [a_0 f(t|H_0) + a_1 f(t|H_1) + ... + a_{n-1} f(t|H_{n-1})].

The result of the test (in the Bayesian version) is the posterior probability for each of the hypotheses.
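The posterior formula translates into a few lines of code. In this minimal sketch the function name and the numerical priors and likelihood values are invented purely for illustration; in practice the likelihoods f(t|H_i) would be evaluated at the observed t.

```python
def posteriors(priors, likelihoods):
    """Posterior probability of each hypothesis:
    p_i = a_i f(t|H_i) / sum_j a_j f(t|H_j)."""
    if len(priors) != len(likelihoods):
        raise ValueError("need one prior and one likelihood per hypothesis")
    weights = [a * f for a, f in zip(priors, likelihoods)]
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("the observed t has zero probability under every hypothesis")
    return [w / total for w in weights]


# Invented numbers for three hypotheses.
priors = [0.5, 0.3, 0.2]          # a_0, a_1, a_2 (sum to 1)
likelihoods = [0.10, 0.40, 0.05]  # f(t|H_0), f(t|H_1), f(t|H_2) at the observed t
p = posteriors(priors, likelihoods)
```

With these numbers the data shift the preference from H_0 (largest prior) to H_1 (largest product of prior and likelihood).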

In the case n=2, the posterior probability for H_0 is given by p_0 = a_0 r / [a_0 r + a_1], where r is the likelihood ratio, given by r = f(t|H_0) / f(t|H_1). The posterior probability p_0 monotonically increases with r, so maximizing p_0 is equivalent to maximizing r (even if we don't know what values to assign to a_0 and a_1).
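For completeness, the two-hypothesis expression and its monotonic dependence on r follow from the general posterior formula in two short steps, assuming both priors are strictly positive and f(t|H_1) is nonzero:

```latex
\begin{align}
p_0 &= \frac{a_0 f(t|H_0)}{a_0 f(t|H_0) + a_1 f(t|H_1)}
     = \frac{a_0 r}{a_0 r + a_1},
     \qquad r = \frac{f(t|H_0)}{f(t|H_1)}, \\
\frac{dp_0}{dr} &= \frac{a_0 (a_0 r + a_1) - a_0 r \cdot a_0}{(a_0 r + a_1)^2}
     = \frac{a_0 a_1}{(a_0 r + a_1)^2} > 0 .
\end{align}
```

The first line divides numerator and denominator by f(t|H_1); the second shows that the derivative is positive for any a_0, a_1 > 0, which is why the ordering of events by r does not depend on the choice of priors.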

Often one wishes to define an acceptance region in t-space for the hypothesis H_0 that (among all regions with the same acceptance for H_0) has the maximum power to reject the hypothesis H_1; one can show that this is accomplished by accepting any t that satisfies r(t) > c for some cutoff value c (this is the Neyman-Pearson lemma). One would adjust the constant c to provide the desired level of acceptance for H_0. In this form the test is known as the likelihood ratio test. The result of the test in frequentist language would be "the data pass (or fail) the test at significance level alpha" (alpha being 1 minus the selection efficiency for H_0).
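The procedure of tuning c can be sketched with toy Monte-Carlo samples. The example below assumes, purely for illustration, a one-component statistic t that is Gaussian with unit width and mean +1 under H_0 and mean -1 under H_1; the function names, sample sizes, and the 90% target acceptance are likewise our own choices.

```python
import math
import random


def gauss_pdf(t, mean, sigma=1.0):
    """Gaussian probability density, standing in for f(t|H_i)."""
    z = (t - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def likelihood_ratio(t, mean_h0=1.0, mean_h1=-1.0):
    """r(t) = f(t|H_0) / f(t|H_1) for the assumed toy densities."""
    return gauss_pdf(t, mean_h0) / gauss_pdf(t, mean_h1)


def cutoff_for_acceptance(r_values_h0, acceptance):
    """Cutoff c such that a fraction `acceptance` of the H_0 sample has r > c."""
    ordered = sorted(r_values_h0)
    n_reject = round((1.0 - acceptance) * len(ordered))
    # r is positive, so c = 0 accepts every event when none is to be rejected.
    return ordered[n_reject - 1] if n_reject > 0 else 0.0


def fraction_above(r_values, c):
    """Fraction of a sample that passes the selection r > c."""
    return sum(1 for r in r_values if r > c) / len(r_values)


rng = random.Random(7)
r_h0 = [likelihood_ratio(rng.gauss(1.0, 1.0)) for _ in range(20000)]
r_h1 = [likelihood_ratio(rng.gauss(-1.0, 1.0)) for _ in range(20000)]

c = cutoff_for_acceptance(r_h0, acceptance=0.90)
acceptance_h0 = fraction_above(r_h0, c)  # selection efficiency for H_0
alpha = 1.0 - acceptance_h0              # significance level of the test
power = 1.0 - fraction_above(r_h1, c)    # fraction of H_1 events rejected
```

An observed t then passes the test at significance level alpha if likelihood_ratio(t) > c. Raising the target acceptance lowers c and reduces the power to reject H_1, which is the trade-off the experimenter must settle when choosing c.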


Hypothesis Tests vs Goodness-of-fit Tests


Comments


References


Joel Heinrich
Last modified: March 7, 2002