Recommendations on Signal Significance
Significance, as
it is normally defined, is the frequentist probability of making an observation
that is at least as inconsistent with the null hypothesis as the observation
actually made. In the statistics literature, this is formally known as the
"p-value" of the observation. We
(and the Particle Data Group) recommend that this terminology be used. See Reference [1] for a more detailed
introduction to this concept.
Some Facts About P-Values
There are a number of useful "facts" about p-values that assist in
understanding how to use them:
- A p-value expresses the probability, for a given hypothesis, of obtaining data
at least as extreme as ours. For example, if the hypothesized distribution
is a Poisson of mean 2.9 and we have observed 10 events, the p-value is
p = Sum(n=10 to infinity) exp(-2.9) * 2.9^n / n!
Small p-values imply that the data are unlikely for the given model (and
the deviation is in the "interesting" direction).
- In ideal situations, and assuming the hypothesis is correct, p-values will
be uniformly distributed between 0 and 1. In contrast, when the data
are discrete rather than continuous (e.g. for a Poisson distribution, where
the data values are only integers), the possible p-values are also discrete,
are not equidistant in p, and do not have equal weights. The p-value distribution
then cannot be uniform in the sense of dn/dp being constant. However, it is "as
uniform as possible" for a discrete distribution, with Prob(observing p <= c)
= c, where c is any one of the attainable p-values.
- A p-value is a useful quantity. a) It measures the compatibility of the data
with the given hypothesis. b) It enables p-values from different
experiments to be combined (even though this procedure has some degree of
arbitrariness associated with it). The combined p-value determines how consistent
the collection of experiments is with the hypothesis. Assuming that the
p-value distributions are uniform, p-values may be combined by using the
formula given by Eq. (13) in Reference [1]. A slightly unfortunate feature of
this formula is that, when combining 3 p-values, the result depends on the
order of combination: it can differ according to whether all 3 are combined
directly; p_1 and p_2 are combined, and the result is then combined with p_3;
p_2 and p_3 are combined, and the result is then combined with p_1; etc.
c) See also point 4.
- Measures of significance are also used in Hypothesis Testing, where a p-value is
used to accept or reject a given hypothesis. One defines, before the measurement
is performed, a significance level alpha and then uses a test statistic
(like a measure of goodness of fit) to see whether the data are consistent
with the hypothesis at this level, by checking whether p <= alpha (in which
case the hypothesis is rejected). The expected
rate of "Errors of the First Kind" (i.e. how often the hypothesis is rejected
when it is in fact true) is then alpha, and not the p-value.
The p-value may be reported, but its actual value is not relevant
to the statistical conclusion.
- A p-value measures the probability of observing DATA at least as extreme or
unlikely as ours, assuming the hypothesis is true. It does NOT measure the
probability that the HYPOTHESIS IS TRUE, based on our data. (See point 10
for an example.) This is an example of the difference between the probability
of data, given a hypothesis, and the probability of the hypothesis, given
the data. In particular, the following inferences are both WRONG:
I) If p=3%, the probability of rejecting a true hypothesis is 3%. (That
probability is determined by alpha, not p.)
II) If p=7%, the probability that the hypothesis is in fact correct is 7%.
The p-value cannot say anything about the probability of the hypothesis
being correct (that is not even a frequentist concept!).
- P-values are often used to summarize measures of "Goodness of Fit," i.e., where we
are comparing data distributions to a given hypothesis. Such measures are
not to be regarded as a test of the null hypothesis. Similarly, a single p-value does
not provide a means of Hypothesis Testing, in which two hypotheses are compared.
Thus, a p-value can be used to see whether data are consistent with the Standard
Model. If the p-value is small, this in itself does not imply that the Standard
Model should be rejected. A useful procedure would be to compare the quality
of the fits of the data to the Standard Model and to an a priori credible
alternative. That still doesn't prove that the Standard Model is correct,
though.
- P-values are invariant with respect to monotonic transformations of the data variable.
They are not invariant with respect to the choice of statistic.
- A Composite Hypothesis is one which involves free parameters (in contrast to a Simple
Hypothesis, which is completely defined). To calculate the compatibility of data with a Composite
Hypothesis, choices must be made about what to do with the free parameter(s).
A simple case would involve fitting the parameters by minimizing a statistic
such as the weighted sum of squared deviations between data
and the hypothesis (chi-squared). The probability for observing this chi-squared value
or a larger value, corresponding to N-f degrees of freedom [N and f are
the numbers of data points and of free parameters], is a p-value for the
hypothesis. This is equivalent to using as p-value the largest one (i.e.
the best fit) as the parameter(s) are varied. In other cases, it is possible
to use one statistic for determining the best values of the parameters,
and another for measuring the discrepancy between data and prediction. In
determining the p-value, Monte Carlo simulation is likely to be very useful.
Because the parameters have been allowed to vary, this
p-value may be biased upwards.
- Nuisance parameters can cause complications. Possible ways of dealing with them are
discussed briefly in the Appendix below.
- Here is a simple example illustrating that p-values do NOT give the probability
of the hypothesis being wrong: Consider a particle identifier for pions,
using dE/dx or the Cherenkov ring angle. For the pion hypothesis, the p-value
distribution should be flat between 0 and 1. Now suppose that muons result
in a p-value density of 1 - 0.1*(p-0.5), i.e. not too different from
that for pions (because the pion and muon masses are similar), but slightly
more peaked at small p. For a sample of tracks with equal numbers of pions
and muons, those with p close to 0.1 for the pion hypothesis will have a
pion/muon ratio of 1/1.04. With a perhaps more realistic particle composition
of 100 times more pions than muons, the small p pion/muon ratio becomes
100/1.04. In neither case would the probability of wrongly rejecting the pion
hypothesis be anywhere near 10%: the pion fraction among such tracks is about
49% in the first case and about 99% in the second.
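
The Poisson example in the first point above can be checked numerically. The following is a minimal Python sketch, not part of the original recommendations; the function name poisson_tail_pvalue is a hypothetical helper introduced here for illustration. It evaluates the infinite tail sum as one minus a finite cumulative sum, and also prints the first few attainable p-values to show that they are discrete and not equidistant.

```python
import math


def poisson_tail_pvalue(n_obs: int, mean: float) -> float:
    """Return P(N >= n_obs) for N ~ Poisson(mean).

    Hypothetical helper for illustration: the infinite tail sum is
    evaluated as 1 - P(N <= n_obs - 1), which is a finite sum.
    """
    if n_obs <= 0:
        return 1.0
    term = math.exp(-mean)  # P(N = 0)
    cdf = term
    for n in range(1, n_obs):
        term *= mean / n  # P(N = n) obtained from P(N = n - 1)
        cdf += term
    return 1.0 - cdf


# The example from the text: Poisson mean 2.9, 10 events observed.
p_value = poisson_tail_pvalue(10, 2.9)
print(f"p-value for n >= 10 with mean 2.9: {p_value:.2e}")

# Attainable p-values for a discrete statistic: one per possible count.
for k in range(6):
    print(k, round(poisson_tail_pvalue(k, 2.9), 4))
```

For the numbers in the text this gives a p-value of roughly 8.6e-4. Because the result is formed as a difference from 1, this approach loses precision for extremely small p-values; summing the tail terms directly would be preferable in that regime.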
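
The order dependence mentioned in the point on combining p-values can be seen in a small numerical sketch. Eq. (13) of Reference [1] is not reproduced in this document, so the code below ASSUMES it is the usual product rule for n independent, uniformly distributed p-values, P * Sum(j=0 to n-1) (-ln P)^j / j!, where P is the product of the individual p-values. The function name combine_pvalues and the three input values are hypothetical and chosen only for illustration.

```python
import math


def combine_pvalues(pvalues):
    """Combine independent, uniformly distributed p-values.

    ASSUMPTION: uses the product rule, i.e. the probability that the
    product of n uniform p-values is <= the observed product P:
        P * sum_{j=0}^{n-1} (-ln P)^j / j!
    This may or may not be identical to Eq. (13) of Reference [1].
    """
    n = len(pvalues)
    product = math.prod(pvalues)
    if product == 0.0:
        return 0.0
    log_term = -math.log(product)
    return product * sum(log_term**j / math.factorial(j) for j in range(n))


# Illustrative inputs only (not taken from any measurement).
p1, p2, p3 = 0.01, 0.2, 0.5

direct = combine_pvalues([p1, p2, p3])
pairwise_12_then_3 = combine_pvalues([combine_pvalues([p1, p2]), p3])
pairwise_23_then_1 = combine_pvalues([combine_pvalues([p2, p3]), p1])

print(f"all three at once : {direct:.4f}")
print(f"(p1, p2) then p3  : {pairwise_12_then_3:.4f}")
print(f"(p2, p3) then p1  : {pairwise_23_then_1:.4f}")
```

Under this assumed rule the three results differ, which illustrates why the order of combination should be stated whenever a combined p-value is quoted.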
Recommendations for the Care and Feeding of P-Values
The following recommendations
should be considered when determining the p-value of an observation.
- To estimate a p-value, one must first define a statistic x that classifies all
possible observations according to their consistency with a specific null
hypothesis. For example, if one is looking for a signal for the production
of a certain class of events, the statistic x could be the number of candidate
events in each observation. In this case, a large number of candidate events above
the expected background rate would be increasingly inconsistent with the
null hypothesis (in general, the chosen statistic must be able to discriminate
between a specific null hypothesis and the other classes of hypotheses that
are of physics interest). The choice of x is not, however, unambiguous. For example, if one is comparing
a data histogram to one predicted by a Monte Carlo calculation, one could
use the chi-square statistic, or a binned Kolmogorov-Smirnov distance, or
any number of other measures. The
p-value will depend on the choice of statistic. See Reference [2] for a case study
of multiple significance measures.
- If one knows the frequentist probability density p(x) of the random variable
x assuming the null hypothesis, and then makes an observation x_0, then
the p-value would be the integral of p(x) from x_0 to infinity. This assumes that x is a one-sided
statistic, with smaller values implying better agreement with the null hypothesis.
- One often cannot analytically determine p(x). In that case, one can resort to a Monte Carlo (MC) calculation
where one estimates p(x) from the distribution of x in the MC experiments.
The Monte Carlo calculation should sample the complete ensemble of possible
experimental outcomes given the null hypothesis (this principle also should
be satisfied by p(x)). It should take into account uncertainties in the inputs
into the Monte Carlo calculation.
Given that significance is a frequentist concept without a Bayesian
counterpart [3], systematic uncertainties should be treated in a frequentist
manner. For example,
if one is looking for an excess of events over a background with a known
Gaussian uncertainty, the common procedure whereby one fluctuates the mean
of a Poisson random variable according to a Gaussian density is not correct
from a frequentist point of view. The correct procedure, and a further discussion of ensembles,
can be found in Reference [4]. For an example that violates this, see method (D) in the
Appendix.
- In the case where one makes several, possibly correlated, simultaneous observations
of random variables, one must first categorize the outcomes according to
some measure that determines their consistency with the null hypothesis.
This measure may be the joint probability of the observations assuming the
null hypothesis (this may not be the most sensitive or optimal measure),
or some other function of the random variables.
If the random variables are totally uncorrelated, then the combined
significance is given by Eq. (13) in Reference [1].
- In cases where one is seeking a signal in several different channels, a straightforward
way to estimate the p-value of the simultaneous observations is to combine
all channels together into a single measure of the signal rate [5].
This may not be optimal if the channels have very different background
rates.
- Although it is common to see p-values quoted in terms of the equivalent number of
standard deviations by which a measurement would lie from the expected mean of a
normal distribution, it is more straightforward to quote the actual p-value
(i.e., probability) and state explicitly the technique and assumptions used
to estimate it. If you do quote
equivalent standard deviations, remember that an upper limit should be converted
to a one-sided Gaussian p-value estimate.
- The design of an experiment usually involves estimating the sensitivity of a
particular approach. In cases
where one is observing a number of signal events S and one expects a number
of background events B, one often sees measurement techniques optimized
on the basis of the ratio S/sqrt(B), or S/sqrt(S+B) (see Reference [6] for
a thorough discussion). In both cases, one is in fact making the assumption that
S and B are normally distributed. These ratios may result in misleading "optimal"
strategies, especially in cases where S and/or B have non-Gaussian probability
densities (as is the case when they represent fewer than about
10 events).
- A posteriori decisions on the random variable used to measure a signal (such as the selection
criteria used to identify a candidate event sample) make it difficult, if
not impossible, to accurately calculate a p-value for a given observation
once the observation has been made. Blind analyses avoid this specific problem, and should
be considered when a search for new phenomena is undertaken. See Reference [7] for a description
of blind analyses.
- When one uses binned data to search for a possible signal and the location of
the expected signal is not known, the p-value will be larger than a simple
Poisson probability calculation would predict. See Reference [8] for more details on how to account for
this effect.
- Always completely document the technique used to determine the p-value for an observation.
Do not assume that it is too trivial or is well-known. In our experience, neither assumption is correct. One may always refer to an earlier
paper where a complete description of the technique has been provided.
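
The Monte Carlo approach described above, for cases where p(x) cannot be determined analytically, can be sketched as follows. This is a minimal illustration only: it ASSUMES a counting experiment whose null hypothesis is a Poisson distribution with an exactly known mean, so it does not address the treatment of systematic uncertainties discussed in Reference [4]. The helper names sample_poisson and mc_pvalue, the seed, and the number of pseudo-experiments are hypothetical choices made for this sketch.

```python
import math
import random


def sample_poisson(mean: float, rng: random.Random) -> int:
    """Draw one Poisson variate using Knuth's multiplication method
    (adequate for the small means used in this sketch)."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def mc_pvalue(n_obs: int, mean: float, n_trials: int, seed: int = 1) -> float:
    """Fraction of pseudo-experiments generated under the null hypothesis
    whose statistic (the event count) is at least as large as n_obs."""
    rng = random.Random(seed)
    n_extreme = sum(sample_poisson(mean, rng) >= n_obs for _ in range(n_trials))
    return n_extreme / n_trials


# Assumed example: null hypothesis is Poisson with mean 2.9, 10 events observed.
estimate = mc_pvalue(n_obs=10, mean=2.9, n_trials=200_000)
print(f"Monte Carlo p-value estimate: {estimate:.2e}")
```

The statistical precision of such an estimate is limited by the number of pseudo-experiments that fall in the tail, so small p-values require correspondingly large ensembles.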
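
The points above on quoting equivalent standard deviations and on optimizing with S/sqrt(B) can be illustrated together. The sketch below compares the Gaussian-motivated ratio S/sqrt(B) with the one-sided equivalent number of standard deviations obtained from an exact Poisson p-value. The background of 2.9 events and the observation of 10 events are ASSUMED values for illustration, and poisson_tail_pvalue and one_sided_sigma are hypothetical helper names introduced here.

```python
import math
from statistics import NormalDist


def poisson_tail_pvalue(n_obs: int, mean: float) -> float:
    """Return P(N >= n_obs) for N ~ Poisson(mean), as 1 - P(N <= n_obs - 1)."""
    if n_obs <= 0:
        return 1.0
    term = math.exp(-mean)  # P(N = 0)
    cdf = term
    for n in range(1, n_obs):
        term *= mean / n  # P(N = n) obtained from P(N = n - 1)
        cdf += term
    return 1.0 - cdf


def one_sided_sigma(p_value: float) -> float:
    """Gaussian z whose one-sided upper-tail probability equals p_value."""
    return -NormalDist().inv_cdf(p_value)


background = 2.9  # assumed, exactly known expected background
observed = 10     # assumed observed number of events
signal = observed - background

naive_sigma = signal / math.sqrt(background)
exact_p = poisson_tail_pvalue(observed, background)
exact_sigma = one_sided_sigma(exact_p)

print(f"S/sqrt(B)                 : {naive_sigma:.2f}")
print(f"exact Poisson p-value     : {exact_p:.2e}")
print(f"one-sided equivalent sigma: {exact_sigma:.2f}")
```

With these assumed numbers, S/sqrt(B) is about 4.2 while the exact Poisson p-value corresponds to only about 3.1 one-sided standard deviations, which shows how the Gaussian approximation can overstate significance for small event counts.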
References:
[1] P. Sinervo, "Signal Significance in Particle Physics," CDF Note 6031 and hep-ex/0208005 (July 2002).
[2] L. Demortier, "Assessing the significance of a deviation in the tail of a distribution," CDF Note 3419 (November 1995).
[3] The closest Bayesian concept is the "Bayes factor," which is the ratio of the probabilities of the data under two different hypotheses (equal to the ratio of their posterior probabilities when the prior probabilities are equal).
[4] L. Demortier, "Constructing Ensembles," CDF Note 6125 (September 2002).
[5] R. Hollebeek, H.H. Williams and P. Sinervo, "The evaluation of upper limits for top quark production using combined measurements," CDF Note 1109 (January 1990).
[6] G. Punzi, "Sensitivity of Searches for New Signals and Its Optimization," arXiv:physics/0308063 (August 2003).
[7] P. Harrison, "Blind Analyses," in Proceedings of the Conference on Advanced Statistical Techniques in Particle Physics, M. Whalley and L. Lyons (ed.), IPPP/02/39 (July 2002), page 278. See also J. Heinrich, "The Benefits of Blind Analysis Techniques," CDF Note 6576 (July 2003).
[8] P. Sinervo, in preparation.
APPENDIX: Methods of dealing with nuisance parameters
(A) The plug-in p-value
(B) The supremum p-value
(C) The similar p-value
(D) The prior predictive p-value
(E) The posterior predictive p-value