
Questions Submitted to the CDF Statistics Committee


The committee receives many questions from CDF members.  This page lists a small sampling of these questions, in some cases edited and/or generalized from the original, with (hopefully) clear and concise answers and links to relevant notes and references. Click on a question to jump to the answer.

Members of the CDF collaboration can email their questions to the committee.


Sample selection:
  1. Is it sensible to use estimates of an efficiency which are a little larger than unity, or should unphysical values be excluded?
  2. I have efficiency measurements that turn out to have values at or near 100% but with limited statistics. What should I use for the error bars on these points?
  3. If, in a series of measurements, some estimates of the parameter of interest (e.g. a cross-section) are significantly different from the average, is it sensible to exclude them?
  4. How do I calculate the error on the efficiency of a cut, when the numbers of events before and after the cut were obtained as the signal components of two separate fits to signal plus background?
Data modeling and fitting:
  1. Is there a way to calculate meaningful pulls for constrained-fit problems?
  2. Is there a proper way to calculate the error on the ratio of two Poisson random numbers?
  3. What is the optimal parametrization of the background shape in the vicinity of a mass bump?
  4. When doing a likelihood fit, is it sensible to evaluate the goodness of fit by comparing the negative log-likelihood (NLL) obtained from the data to a distribution of NLL values from toy Monte Carlo fits? This technique is suggested by the RooFit documentation.
  5. I have two measurements with known errors but unknown correlation. How can I combine them to extract the best value?
Systematic uncertainties:
  1. Should one include a contribution to the systematic uncertainty on a measurement for effects that result from varying the event selection criteria?
Significance calculations:
  1. How can we quantify the discrepancy between an observation of 453.2 ± 29.4 events in some data sample and an expectation of 316.5 ± 25.4 events from Monte Carlo simulation (both uncertainties combine statistical and systematic effects)?
  2. What is the proper procedure for reporting the significance of a signal in a simple counting experiment, when there is some uncertainty in the background underneath that signal?
  3. Suppose I observe a Gaussian signal peak on top of a polynomial background, and wish to determine the significance of the signal. One method is to do a binned fit of the data spectrum assuming it consists of background only, and then to see how the fit chi2 improves when the fit is repeated with a Gaussian added to model the signal. How can we interpret the chi2 difference between the two fits?
  4. I am searching for a new particle in several channels; how can I calculate a combined significance from all these searches?
  5. Can we use a Bayesian posterior distribution to calculate significances?
  6. I discovered an unexpected bump in some distribution I was looking at. How do I evaluate the significance of this observation?
Presentation of results:
  1. How should we round off measurement results? For measurements such as Q1 = 10.73 ± 0.21(stat) ± 0.14(sys) MeV and Q2 = 66.96 ± 0.39(stat) ± 0.14(sys) MeV, it is clear that the second digit after the decimal point doesn't have much meaning, and yet removing it would have a huge effect on the systematic error.

  1. Is it sensible to use estimates of an efficiency which are a little larger than unity, or should unphysical values be excluded?

    A preliminary question is how the efficiency was estimated: some procedures cannot yield unphysical values (e.g. binomial procedures, in which a sample is selected and the fraction satisfying the requirement is computed), while others can (e.g. independent Poisson estimates of the total number of events and of the number satisfying the requirement). We assume here that the procedure is of the latter kind.

    Deciding to reject efficiency estimates larger than unity while accepting those that are physical will clearly result in a bias if the true efficiency is close to unity. If the ultimate aim of the experiment is to measure the efficiency, there is an argument in favour of using a procedure that keeps the answer physical. However, if the efficiency is only an intermediate step in determining the quantity of interest, it is better to avoid the bias.

    So it is possible either (i) to use the estimate, even if it is larger than unity; or (ii) to constrain it to the physical region, but then to attempt to estimate the bias that this introduces. This bias, of course, depends on the true (unknown) efficiency. We conclude that (i) is the easier option.

    Back to top

  2. I have efficiency measurements that turn out to have values at or near 100%, but with limited statistics. What should I use for the error bars on these points?

    A Bayesian treatment of efficiency errors leads to a prescription for asymmetric errors for plotting, and a method for using such points in a maximum likelihood fit. See CDF note 5894 for details on these calculations.
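
    For illustration (this is a sketch, not necessarily the exact prescription of CDF note 5894), assume a flat prior on the efficiency, so that observing k passes out of n trials gives a Beta(k+1, n-k+1) posterior; asymmetric error bars can then be taken from the shortest interval containing 68.27% of the posterior:

      # Sketch: asymmetric error bars for an efficiency measured as k passes out of n
      # trials, assuming a flat prior so the posterior is Beta(k+1, n-k+1).
      # Illustration only; see CDF note 5894 for the committee's prescription.
      import numpy as np
      from scipy.stats import beta

      def efficiency_interval(k, n, cl=0.6827):
          """Posterior mode and shortest 68.27% interval for the efficiency."""
          post = beta(k + 1, n - k + 1)
          mode = k / n                        # mode of Beta(k+1, n-k+1) is k/n
          # Scan candidate lower edges and keep the shortest interval with content cl.
          lo_grid = np.linspace(0.0, 1.0 - cl, 2001)
          lows = post.ppf(lo_grid)
          highs = post.ppf(lo_grid + cl)
          i = np.argmin(highs - lows)
          return mode, lows[i], highs[i]

      if __name__ == "__main__":
          for k, n in [(10, 10), (19, 20), (95, 100)]:
              eff, lo, hi = efficiency_interval(k, n)
              print(f"k={k:3d} n={n:3d}  eff={eff:.3f}  interval=[{lo:.3f}, {hi:.3f}]")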

    Back to top

  3. If, in a series of measurements, some estimates of the parameter of interest (e.g. a cross-section) are significantly different from the average, is it sensible to exclude them?

    When an experiment makes several/many measurements xi ± ei of some quantity, the quantities (xi - xtrue)/ei should be normally distributed with mean zero and unit variance. Similarly yi = (xi - xmean)/ei should be normal with variance (N-1)/N (for the case of all ei equal). Thus large |yi| are unlikely, and it may well be a good idea to reject them. It is even more sensible to try to establish why these measurements gave anomalous values, because they may be the sign of a more important problem which is also affecting the other measurements. We recommend that this be done (even though it is an a posteriori procedure), with the realisation that it could result in the rejection of other measurements which gave "acceptable" yi.

    The rejection of outliers is also used for distributions which are known to have (genuine) long tails, where the measurements in the tails are not incorrect. For example, the rate of energy loss (dE/dx) of a charged particle as it passes through a detector follows the Landau distribution. To estimate dE/dx from a set of samples, it is better to use the truncated mean rather than the overall mean, i.e. large dE/dx values are discarded, even though they are part of the true distribution. However, it is important to remember that this is a procedure which can be optimised on Monte Carlo simulation before the data is available. This is very different from looking at the data, noting that some values look anomalous, and then choosing a particular cut.

    As we understand it, the statistic used was not yi, but rather zi=(xi - xmean)/e, where e is estimated from the spread of the xi. [The fact that estimating e results in a Student's t distribution rather than a Gaussian is irrelevant when the number of measurements is large.] Thus in principle an individual xi ± ei could have large |zi| but small |yi|, because its ei was very much larger than e. We think that this is not serious, because individual runs are required to have a certain minimum number of events, so none of the individual ei should be unusually large.

    Whether or not to use a cut depends on whether it is believed that the outliers are fluctuations, or whether they represent some problem. In the latter case, the estimated number of signal events may well tend to fluctuate downwards, because any problems (e.g. misalignment, parts of the detector not functioning, drifting calibrations, etc) are more likely to deplete the signal than to enhance it. With enough data, it may be possible to check whether the outliers are asymmetric in this way. There is probably not enough data to do so in the present case.

    A symmetric cut to exclude, say, |zi| > 5, decided in advance, is probably unbiassed, and might well be sensible. The problem with using it AFTER looking at the data is that (i) the value might be `tuned' to include (or exclude) specific observations; and (ii) asymmetries may arise in yi, even from correct data, if the estimate of ei depends on xi (e.g. ei is proportional to xi or to sqrt(xi)).

    In the actual case referred to in the e-mail, it looks as if the fraction of outliers is small enough that they do not create a serious problem. However, because of the a posteriori nature of the rejection, we recommend that a systematic error is assigned to allow for this decision. [This systematic comes from the uncertainty in the bias in the estimated mean that arises from the unknown nature of the distribution of the outliers. One way of estimating the bias is to consider "worst case" distributions of the outliers and estimate the effect this has on the mean after the removal of the outliers. Clearly, knowledge of why the outliers occur would allow one to restrict the possible "worst cases."]

    Back to top

  4. How do I calculate the error on the efficiency of a cut, when the numbers of events before and after the cut were obtained as the signal components of two separate fits to signal plus background?

    The difficulty of this problem is that background fluctuations introduce correlations between the signal samples before and after the cut. The solution is to produce a histogram of events that fail the cut, and to fit that to signal plus background. We can then use the fact that the number of signal events from a fit to the histogram of events that pass the cuts is uncorrelated with the number of signal events from a fit to the histogram of events that fail the cuts, since no event is present in both histograms. With some further assumptions, we don't even have to produce this histogram of events that fail the cut, because we can predict the outcome of its fit from the two fits that are given to us.

    Let N ± dN be the fit result for the number of signal events before the cut, n ± dn the fit result for the number of signal events that pass the cut, and m ± dm the fit result for the number of signal events that fail the cut. The efficiency we are interested in is then p = n/N, but for reasonable behaviour of all the fits involved we should have N = n + m, so that p = n/(n+m). Since n and m are uncorrelated, we can apply simple error propagation to obtain the error on p:

    dp^2 = [n·dm/(n+m)^2]^2 + [m·dn/(n+m)^2]^2

    Again assuming good behaviour of all the fits involved, we should have dN^2 = dn^2 + dm^2 in addition to N = n + m. These two relations can be used to eliminate the unknowns m and dm from the expression for dp^2. After some further algebraic simplifications, we obtain:

    dp^2 = (p·dN/N)^2 + (1 - 2p)·(dn/N)^2

    Note that for fits that perfectly separate signal from background we expect dn = sqrt(n) and dN = sqrt(N); in this case the above formula simplifies to dp^2 = p(1-p)/N, as expected from simple binomial statistics.
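
    As a numerical sanity check (a sketch added here, not part of the original derivation), one can code the formula directly and verify that the perfect-separation limit reproduces the binomial expression; the fit results used below are invented:

      # Numerical sketch of dp^2 = (p*dN/N)^2 + (1-2p)*(dn/N)^2, with a check that
      # perfectly separating fits (dn = sqrt(n), dN = sqrt(N)) give binomial errors.
      import math

      def cut_efficiency_error(n, dn, N, dN):
          """Efficiency p = n/N and its error, for uncorrelated pass/fail fit results."""
          p = n / N
          dp2 = (p * dN / N) ** 2 + (1.0 - 2.0 * p) * (dn / N) ** 2
          return p, math.sqrt(dp2)

      if __name__ == "__main__":
          # Hypothetical fit results: 400 +- 25 signal events before the cut,
          # 300 +- 20 after the cut (numbers chosen purely for illustration).
          p, dp = cut_efficiency_error(n=300.0, dn=20.0, N=400.0, dN=25.0)
          print(f"efficiency = {p:.4f} +- {dp:.4f}")

          # Perfect-separation limit: dn = sqrt(n), dN = sqrt(N)  ->  dp^2 = p(1-p)/N.
          n, N = 300.0, 400.0
          p, dp = cut_efficiency_error(n, math.sqrt(n), N, math.sqrt(N))
          print(f"binomial check: {dp:.6f} vs {math.sqrt(p*(1-p)/N):.6f}")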

    The above formula has been tested in some simple examples to give very accurate results. Nevertheless, the following caveats should be kept in mind. The formula assumes the validity of the Gaussian approximation to the binomial, i.e. p should not be too close to 0 or 1 (how close depends on the number of events involved). The assumption that the fit results for n and m are uncorrelated could break down if there are constraints on the background and/or signal components of the fits, or if there are other fit parameters. And finally, it is assumed that the total number of events, before cutting and fitting, is a Poisson variate (as it would be for a cross section measurement) and not a fixed number (as it would be for a branching ratio measurement).

    Back to top

  5. Is there a way to calculate meaningful pulls for constrained-fit problems?

    There sure is. Please check CDF note 5776.

    Back to top

  6. Is there a proper way to calculate the error on the ratio of two Poisson random numbers?

    As one might conclude from CDF note 6438, there is no "standard" method, even for just a single Poisson.

    For the ratio of two Poisson variates, one can use the likelihood method to obtain the uncertainty. See Joel's calculation: it gives explicit formulae for the parabolic error and describes how to calculate asymmetric "MINOS" errors. One caveat:

    If the observed Poisson variates are small numbers and/or proper coverage is required, a procedure with good properties is described in R.D.Cousins, "Improved central confidence intervals for the ratio of Poisson means", Nucl. Instrum. Methods Phys. Res. A 417, 391-399 (1998). This reference includes tables of 68.27%, 80%, 90% and 95% central confidence intervals for the ratio of Poisson means when the observed Poisson variates are less than or equal to 10.
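
    For illustration, here is a generic profile-likelihood sketch for the ratio of two Poisson means (not necessarily identical to the formulae in Joel's calculation); the observed counts are invented:

      # Sketch: profile-likelihood ("MINOS"-style) interval for the ratio rho = mu1/mu2
      # of two Poisson means, given observed counts n1 and n2.  Illustrative only; for
      # small counts or exact coverage see Cousins, Nucl. Instrum. Methods A 417 (1998) 391.
      import numpy as np
      from scipy.optimize import brentq

      def q_profile(rho, n1, n2):
          """-2 ln(profile likelihood ratio) as a function of the Poisson-mean ratio rho."""
          def nll(r):
              mu2 = (n1 + n2) / (1.0 + r)       # mu2 that maximizes L at fixed ratio r
              mu1 = r * mu2
              return (mu1 - n1 * np.log(mu1)) + (mu2 - n2 * np.log(mu2))
          rho_hat = n1 / n2                      # maximum-likelihood estimate of the ratio
          return 2.0 * (nll(rho) - nll(rho_hat))

      if __name__ == "__main__":
          n1, n2 = 24, 60                        # hypothetical observed counts
          rho_hat = n1 / n2
          # 68.27% interval: points where q crosses 1 on either side of the minimum.
          lo = brentq(lambda r: q_profile(r, n1, n2) - 1.0, 1e-6, rho_hat)
          hi = brentq(lambda r: q_profile(r, n1, n2) - 1.0, rho_hat, 100.0)
          # Simple error propagation for comparison: d(rho)/rho = sqrt(1/n1 + 1/n2).
          drho = rho_hat * np.sqrt(1.0 / n1 + 1.0 / n2)
          print(f"rho = {rho_hat:.3f}  MINOS-style: -{rho_hat - lo:.3f}/+{hi - rho_hat:.3f}  "
                f"propagation: +-{drho:.3f}")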

    Back to top

  7. What is the optimal parametrization of the background shape in the vicinity of a mass bump? Depending on how wide a mass window one takes, there can be conspicuous curvature. Several questions have been raised about fits in such cases.
    1. If the background shape looks approximately linear, how does one decide if a quadratic term is needed? My understanding is that the reduced chisquare should go down by one unit on average if the higher-order term is NOT significant. However, that change can clearly be masked by large statistical errors if we don't have enough events.

      Not really: fluctuations set the overall level of the chi-squared, but they have little effect on the change in chi-squared when parameters are varied or additional parameters are introduced. If the extra term is not significant, it is the chi-squared itself, not the reduced chi-squared, that drops by about one unit on average (see the toy-study sketch after this list).

    2. In the case of background evaluation, the fit and the chisquare to be tested must exclude the peak region in my opinion. Do you concur?

      Sounds sensible.

    3. How do we assess uncertainties in the background shape in citing width and yield errors for Gaussian fits to mass peaks? My understanding is that because a Gaussian fit is heavily influenced by the bins in the tails of the Gaussian, the background shape in the fitting function plays a strong role in determining the width parameter. Do we compare several parametrizations and mass ranges for the background function and use the variations in the Gaussian width to make a systematic error assessment?

      If a priori you have more than one well-motivated background parametrization, then it is sensible to try all of them to assess the corresponding systematic uncertainty. If that is not the case, and you simply modeled the background by a polynomial, adding higher-order terms as needed to reduce the chi-squared as described above, then there is essentially no proper way to assign a systematic uncertainty due to background parametrization.

    4. For the yield, if we are using the integral of the Gaussian fit to cite the yield, there is the obvious systematic of whether the background function is flat, convex or concave under the peak. Is this assessed in the same way as for the width?

      Yes.
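
    As an illustration of the answer to sub-question 1 above, the following toy study (a sketch with invented bin contents, using the usual Gaussian approximation for the bin errors) fits linear-background spectra with first- and second-order polynomials, and shows that the drop in chi-squared from the unnecessary quadratic term averages about one unit, independently of the overall chi-squared level:

      # Toy study: how much does the chi-squared drop when an unnecessary quadratic
      # term is added to a linear background fit?  (Sketch with invented parameters.)
      import numpy as np

      rng = np.random.default_rng(12345)
      x = np.linspace(0.0, 1.0, 40)              # bin centres
      mu = 200.0 - 80.0 * x                      # true expectation: purely linear

      def chi2_of_fit(y, sigma, degree):
          """Weighted least-squares polynomial fit; returns the chi-squared."""
          coeffs = np.polyfit(x, y, degree, w=1.0 / sigma)
          resid = (y - np.polyval(coeffs, x)) / sigma
          return np.sum(resid ** 2)

      dchi2 = []
      for _ in range(2000):
          y = rng.poisson(mu).astype(float)      # fluctuated spectrum
          sigma = np.sqrt(np.maximum(y, 1.0))    # approximate Gaussian bin errors
          dchi2.append(chi2_of_fit(y, sigma, 1) - chi2_of_fit(y, sigma, 2))

      dchi2 = np.array(dchi2)
      print(f"mean delta-chi2 = {dchi2.mean():.2f} (expect ~1)")
      print(f"fraction with delta-chi2 > 3.84 = {np.mean(dchi2 > 3.84):.3f} (expect ~0.05)")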

    Back to top

  8. When doing a likelihood fit, is it sensible to evaluate the goodness of fit by comparing the negative log-likelihood (NLL) obtained from the data to a distribution of NLL values from toy Monte Carlo fits? This technique is suggested by the RooFit documentation.

    Analyses at CDF are beginning to use the RooFit package. Unfortunately, RooFit has a dangerous feature that facilitates plotting the negative log-likelihood distribution from toy MC fits, supposedly to demonstrate goodness of fit. Quoting from The RooFit Toolkit for Data Modeling:

    "RooFit has been developed for the BaBar collaboration, ... and is primarily targeted to the high-energy physicists using the ROOT analysis environment, ..."

    In addition to fitting real data samples, RooFit can generate toy Monte Carlo samples and make plots of the distribution of the negative log-likelihood (NLL). In particular, a slide from the RooFit documentation shows a histogram of the NLL from toy MC, with an arrow indicating the NLL value from the fit to the real data, as an indication of goodness of fit. CDF note 5639 showed that, for unbinned maximum likelihood fits, this practice is deceptive and does not really measure goodness of fit.

    CDF note 6123 showed that this type of test is biased even when no fitting is involved.

    For binned fits, the only acceptable quantities for goodness of fit are chi2 and the log likelihood ratio of the Poisson distribution (LLR), described in CDF note 5718.
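
    For reference, a commonly used form of the binned Poisson likelihood ratio is the Baker-Cousins statistic, LLR = 2·Sum_i [ mu_i - n_i + n_i·ln(n_i/mu_i) ]; see CDF note 5718 for the committee's exact definition and for how to obtain its distribution. A minimal sketch with invented bin contents:

      # Sketch of a binned goodness-of-fit test using the Poisson log-likelihood ratio
      # (Baker-Cousins form); the reference distribution is obtained from toy experiments.
      # See CDF note 5718 for the recommended definition and usage.
      import numpy as np
      from scipy.special import xlogy

      rng = np.random.default_rng(7)

      def poisson_llr(n, mu):
          """2 * sum[ mu - n + n*ln(n/mu) ], with the n*ln(n/mu) term equal to 0 where n=0."""
          n, mu = np.asarray(n, float), np.asarray(mu, float)
          return 2.0 * np.sum(mu - n + xlogy(n, n / mu))

      if __name__ == "__main__":
          mu = np.array([25.0, 40.0, 60.0, 40.0, 25.0])   # fitted/predicted bin contents
          n_obs = np.array([30, 35, 70, 38, 20])          # hypothetical observed bin contents
          llr_obs = poisson_llr(n_obs, mu)
          # p-value from toys generated under the prediction (no refit, for simplicity).
          toys = np.array([poisson_llr(rng.poisson(mu), mu) for _ in range(20000)])
          pval = np.mean(toys >= llr_obs)
          print(f"LLR = {llr_obs:.2f}, p-value = {pval:.3f}")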

    Back to top

  9. I have two measurements with known errors but unknown correlation. How can I combine them to extract the best value?

    Unfortunately this is not possible. If you write the measurements as x1 ± e1 and x2 ± e2 with correlation coefficient r (so the error matrix for x1 and x2 has diagonal elements e1^2 and e2^2, and off-diagonal elements r·e1·e2), then the result and its error are very sensitive to the value of r. For r = 0, we get the standard result for uncorrelated measurements xbest = w1·x1 + w2·x2, where w1 = (1/e1^2) / (1/e1^2 + 1/e2^2), w2 is defined analogously, and the error e is given by 1/e^2 = 1/e1^2 + 1/e2^2.

    But, for example, for r = e1/e2 (where e1 is the smaller error), the x2 measurement is completely ignored, and xbest = x1 ± e1. For even larger r, the error on the best value decreases, and reaches zero when r=+1. Meanwhile the best value is no longer between x1 and x2, but goes further and further beyond x1. It is very sensitive to the precise value of r.

    For negative r, the best value is between x1 and x2, but the error again goes to zero as r approaches -1.

    All this makes the problem sound impossible to answer. In real life, the situation might not be quite as bad as that. Often the statistical errors for x1 and x2 will be uncorrelated, and so may some sources of systematics, leaving just a few for which we do not know the amount of correlation (and even for some of those, we might know that the correlation is not, for example, negative). Then the overall r would be restricted to some smallish range, and we could investigate how the answer varies as r is moved through this range (the overall error matrix is obtained by adding the separate error matrices for the statistical errors and for each source of systematics).
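
    The sensitivity to r can be made explicit with a short calculation (a sketch using the standard weighted-average formulae quoted above; the measurements are invented):

      # Sketch: best linear combination of two correlated measurements as a function of
      # the (unknown) correlation coefficient r.  Numbers below are invented.
      import numpy as np

      def combine(x1, e1, x2, e2, r):
          """Best linear unbiased combination of x1 +- e1 and x2 +- e2 with correlation r."""
          V = np.array([[e1 ** 2, r * e1 * e2],
                        [r * e1 * e2, e2 ** 2]])
          Vinv = np.linalg.inv(V)
          w = Vinv.sum(axis=1) / Vinv.sum()      # weights (they sum to one)
          xbest = w @ np.array([x1, x2])
          err = 1.0 / np.sqrt(Vinv.sum())
          return xbest, err

      if __name__ == "__main__":
          x1, e1 = 10.0, 1.0                      # hypothetical measurement 1 (smaller error)
          x2, e2 = 12.0, 2.0                      # hypothetical measurement 2
          for r in [-0.9, -0.5, 0.0, 0.25, e1 / e2, 0.75, 0.9]:
              xb, e = combine(x1, e1, x2, e2, r)
              print(f"r = {r:+.2f}:  xbest = {xb:6.3f} +- {e:.3f}")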

    Back to top

  10. Should one include a contribution to the systematic uncertainty on a measurement for effects that result from varying the event selection criteria?

    Roger Barlow's talk "Systematic errors: facts and fictions", given at the 2002 Durham Conference, offers an excellent treatment of this subject. Quoting from that document:

    "Suppose a check is done, and a discrepancy emerges as some number of standard deviations. You then have to decide whether it has passed or failed the test. Your decision will depend on the size of the discrepancy (less than 1 surely passes, more than 4 surely fails), the number of checks being done (if you do 20 checks you expect on average one 2-sigma deviation) and at some level on the basic plausibility and reasons that motivated the check (you might accept that data taken in the summer were different more readily than you would accept that data taken on Tuesdays were different.)"

    "If a check passes then the correct thing to do is nothing. Put a tick in the box and move on. Do not, as is practice in some areas, add the small discrepancy to the systematic error.

    1. It's an inconsistent action. You asked is there an effect and decided there wasn't. If there was no effect then you should not allow for it. Remember that this is a check and not an evaluation of an effect.
    2. It penalises diligence. The harder you work and more thorough you are, the bigger your systematic error gets. A less careful analysis will have a smaller quoted error and get the citations.
    3. Errors get inflated. Remember how the LEP experiments appear to agree with each other and the Standard Model far too well."

    To summarize Barlow's discussion, in general no systematic should be assigned, provided the deviation d isn't very much larger than the expected statistical error e on d. If d is large compared to e, this should be a cause for concern. However, it is difficult to assign a numerical value to d/e at which concern sets in. If many checks are performed, some experts suggest using d^2 - e^2 for the square of the contribution of each check to the systematic error (even if it is negative!)

    Back to top

  11. How can we quantify the discrepancy between an observation of 453.2 ± 29.4 events in some data sample and an expectation of 316.5 ± 25.4 events from Monte Carlo simulation (both uncertainties combine statistical and systematic effects)?

    This problem can be formulated as a hypothesis test. We have a measurement d1 ± e1 and a simulation result d2 ± e2. The null hypothesis is that the expectation values are equal:

    E(d1) = E(d2),
    or:
    E(d1-d2) = 0.

    Let us now assume that, under the null hypothesis, d1-d2 is normally distributed around 0 with variance e1^2 + e2^2 (assuming no correlation between the two uncertainties). The normality assumption is probably reasonable given the size of the numbers involved. Then the following is a chisquared with one degree of freedom:

    chi2 = [(d1-d2) - 0]^2 / (e1^2 + e2^2)

    Plugging in the numbers from the e-mail, we find chi2 = 12.379, corresponding to a probability of 0.0434%, i.e. 3.5 sigma in sloppy HEP language.
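
    This calculation can be reproduced in a few lines (a sketch using scipy to evaluate the chi-squared tail probability):

      # Sketch: chi-squared test of the data/Monte Carlo discrepancy quoted above.
      from math import sqrt
      from scipy.stats import chi2

      d1, e1 = 453.2, 29.4          # observed events (stat+syst uncertainty)
      d2, e2 = 316.5, 25.4          # expected events from Monte Carlo
      chisq = (d1 - d2) ** 2 / (e1 ** 2 + e2 ** 2)
      pval = chi2.sf(chisq, df=1)   # tail probability for one degree of freedom
      print(f"chi2 = {chisq:.3f}, p = {pval:.3e}, i.e. {sqrt(chisq):.1f} sigma (sloppy HEP usage)")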

    By using a chisquared to quantify the discrepancy between data and Monte Carlo, we have implicitly assumed that this discrepancy could go either way, i.e. data larger or smaller than Monte Carlo. This is quite different from the problem of determining the significance of a signal on top of an expected background, in which case the discrepancy of interest is an excess. See our Q&A #12 for further details.

    Back to top

  12. What is the proper procedure for reporting the significance of a signal in a simple counting experiment, when there is some uncertainty in the background underneath that signal?

    Since the analysis is a simple counting experiment, the significance calculation is straightforward. All one has to do is calculate the probability for detecting at least as many events as were actually observed, given the background estimate. In principle this tail probability, also known as p value, is a Poisson probability, but a complication arises due to the uncertainty on the background. Often the background calculation is quite complex; a simplifying approximation is to consider the final background number and its uncertainty as the mean and width of a truncated Gaussian prior density in the Bayesian sense ("truncated", because only positive background estimates are physically acceptable). One can then average the p value for the observation over all backgrounds, weighted by this truncated Gaussian prior. This is sometimes referred to as a "prior-predictive p value".

    It is always useful to check the sensitivity of this calculation to the choice of prior, by using for example a gamma or log-normal prior instead of the truncated Gaussian one.
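
    A minimal sketch of such a prior-predictive p-value (the observed count and background numbers are invented; the gamma prior is included as the robustness check mentioned above):

      # Sketch: prior-predictive p-value for a counting experiment with an uncertain
      # background, averaging the Poisson tail probability over a background prior.
      # Numbers are invented for illustration.
      import numpy as np
      from scipy.stats import poisson, truncnorm, gamma

      rng = np.random.default_rng(2024)
      n_obs = 25                    # observed events
      b0, sb = 12.0, 3.0            # background estimate and its uncertainty

      def prior_predictive_pvalue(bkg_samples):
          """Average over the prior of P(N >= n_obs | b), i.e. the Poisson upper tail."""
          return np.mean(poisson.sf(n_obs - 1, bkg_samples))

      # Truncated-Gaussian prior (truncated at zero).
      b_gauss = truncnorm.rvs(-b0 / sb, np.inf, loc=b0, scale=sb, size=200000, random_state=rng)
      # Gamma prior with the same mean and variance, as a robustness check.
      b_gamma = gamma.rvs(a=(b0 / sb) ** 2, scale=sb ** 2 / b0, size=200000, random_state=rng)

      print(f"p-value (truncated Gaussian prior): {prior_predictive_pvalue(b_gauss):.2e}")
      print(f"p-value (gamma prior)             : {prior_predictive_pvalue(b_gamma):.2e}")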

    Note that this question is quite different from the one considered in Q&A #11. In the latter we are asking whether a measurement we made in the data is consistent with a Monte Carlo calculation. Our null hypothesis is that they are, but if they are not, we believe that the difference could go either way (data larger than Monte Carlo or vice-versa). What we wish to know here however, is whether there is an excess in the data with respect to the background estimate. This question requires an asymmetric treatment of the two given numbers, which is what the above significance calculation does.

    Back to top

  13. Suppose I observe a Gaussian signal peak on top of a polynomial background, and wish to determine the significance of the signal. One method is to do a binned fit of the data spectrum assuming it consists of background only, and then to see how the fit chi2 improves when the fit is repeated with a Gaussian added to model the signal. How can we interpret the chi2 difference between the two fits?

    It depends on which of the Gaussian parameters you left floating in the second fit. If only the amplitude of the Gaussian was floating, and both the mean and width were fixed at some known values (from theory or from a previous experiment for example), then the chi2 difference can be referred to a chi2 distribution for one degree of freedom to determine the significance (more formally, the p value). On the other hand, if the mean and/or the width were also left floating, then you can't use a chi2 distribution to interpret the result. This is due to the fact that under the hypothesis you are testing - the data contains only background - the true value of the Gaussian amplitude is zero, but the mean and width are completely undefined. In this situation you have to run pseudo-experiments in order to determine the correct distribution of the chi2 difference under the background-only hypothesis. This can be very tricky, as fluctuations in the background spectrum can look like a signal, and when fitting you have to find the largest such fluctuation among many small ones...

    Back to top

  14. I am searching for a new particle in several channels; how can I calculate a combined significance from all these searches?

    One possibility is to combine the likelihood functions for all the channels, construct the corresponding likelihood ratio, and use pseudo-experiments to determine the probability of obtaining a likelihood ratio at least as large as the one observed.

    If for some reason it is not possible or practical to do such a calculation, and the individual significances (p-values) are independent, i.e. they are derived from test statistics whose joint probability factorizes, then another possibility is to combine the p-values directly. Unfortunately there is no unique way of doing this. The general idea is first to choose a rule S(p1,p2,p3,...) for combining individual p-values p1, p2, p3,..., and then to construct a combined p-value by calculating the tail probability corresponding to the observed value of S. Some plausible combination rules are:

    1. The product of p1, p2, p3,... (Fisher's rule);
    2. The smallest of p1, p2, p3,... (Tippett's rule);
    3. The average of p1, p2, p3,...;
    4. The largest of p1, p2, p3,...

    This list is by no means exhaustive. To narrow down the options, there are some properties of the combined p-value that one might consider desirable. For example:

    1. If there is strong evidence against the null hypothesis in at least one channel, then the combined p-value should reflect that, by being small.
    2. If none of the individual p-values shows any evidence against the null hypothesis, then the combined p value should not provide such evidence.
    3. Combining p-values should be associative: the combinations ((p1, p2), p3), ((p1, p3), p2), (p1, (p2, p3)), and (p1, p2, p3) should all give the same result.

    These criteria are of course debatable. Now, it turns out that property 1 eliminates rules 3 and 4 above. Property 2 is satisfied by all four p-values derived from the above rules. On the other hand, property 3, called evidential consistency by statisticians, is satisfied by none and is therefore not very helpful. This leaves Tippett's and Fisher's rules as reasonable candidates. Actually, it appears that Fisher's rule has somewhat more uniform sensitivity to alternative hypotheses of interest in most problems. So let's decide on Fisher's rule.

    To calculate the combined p-value corresponding to Fisher's rule, there is a neat trick one can use. First, note that the upper tail probability (survival function) of a chisquared variate with two degrees of freedom is exp(-x/2). So, since p-values are by definition uniform between 0 and 1, -2·ln(p), where p is a p-value, is distributed as a chisquared with two degrees of freedom. The next step is to remember that chisquared variates are additive: if you add k independent chisquared variates with two degrees of freedom each, you obtain a chisquared variate with 2·k degrees of freedom. Hence the trick: to combine k independent p-values by Fisher's method, take twice the negative logarithm of their product, and treat it as a chisquared for 2·k degrees of freedom.

    For example, to combine two p-values p1 and p2, you would refer -2·ln(p1·p2) to a chisquared distribution for four degrees of freedom. The density of such a chisquared is x·exp(-x/2)/4, and the upper tail probability is (1+x/2)·exp(-x/2). Now, set x = -2·ln(p1·p2) in the latter formula, and you find [1 - ln(p1·p2)]·p1·p2. The general formula for an arbitrary number of p-values is derived similarly. It becomes a little bit more complicated because the calculation of the tail probability of a chisquared with an arbitrary number of degrees of freedom involves repeated integrations by parts. The result is: P · Sum_{j=0}^{n-1} [-ln(P)]^j / j!, where P is the product of the n individual p-values and the sum runs from j = 0 to n-1.
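
    In code, Fisher's combination and the equivalent chi-squared shortcut look as follows (a sketch; the p-values are invented):

      # Sketch: combining independent p-values with Fisher's rule, either via the
      # closed-form sum P * sum_{j=0}^{n-1} (-ln P)^j / j!  or, equivalently, by
      # referring -2*ln(P) to a chi-squared with 2n degrees of freedom.
      import math
      from scipy.stats import chi2

      def fisher_combined_pvalue(pvalues):
          """Closed-form Fisher combination of independent, continuous p-values."""
          P = math.prod(pvalues)
          log_term = -math.log(P)
          return P * sum(log_term ** j / math.factorial(j) for j in range(len(pvalues)))

      if __name__ == "__main__":
          pvals = [0.03, 0.20, 0.45]           # hypothetical p-values from three channels
          p_formula = fisher_combined_pvalue(pvals)
          p_chi2 = chi2.sf(-2.0 * math.log(math.prod(pvals)), df=2 * len(pvals))
          print(f"combined p-value: formula = {p_formula:.4f}, chi2 shortcut = {p_chi2:.4f}")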

    The above result is only strictly valid if the individual p-values are all derived from continuous statistics. If one or more p-values are discrete, then the formula will give a combined p-value that is larger than the correct one, and will therefore be "conservative". One can of course always correct for this by using pseudo-experiments.

    Back to top

  15. Can we use a Bayesian posterior distribution to calculate significances?

    The term "significance" usually designates a p-value, which is a tail probability calculated from the distribution of a test statistic (i.e. a function of the data only) under many repetitions of the measurement and assuming that no signal is present. In contrast, a Bayesian posterior distribution is the distribution of a parameter.

    It is nevertheless possible to use a Bayesian posterior distribution to calculate quantities that can be interpreted as evidence against a given hypothesis. How to do this depends on the type of hypothesis one is testing.

    For example, if one is testing H0: theta<=theta0, then the posterior tail probability below theta0 is the Bayesian probability of H0. A low posterior hypothesis probability is evidence against that hypothesis.

    On the other hand, if one is testing H0: theta=theta0, then the probability of H0 is always zero if theta is a continuous parameter.  A more sensible method consists in calculating a highest posterior density (HPD) interval for theta, say with credibility alpha.  H0 can then be excluded at the alpha credibility level if theta0 is outside the interval.  It is also possible to define a Bayesian significance level in this case, as one minus the largest alpha for which theta0 is not inside the interval. If you do not like highest posterior density intervals because of their lack of invariance under reparametrization, use likelihood intervals instead.  These are intervals with a given credibility content and such that the likelihood of any point inside the interval is larger than the likelihood of any point outside it.  Likelihood intervals have been shown to be robust against variations in the prior, see L.A. Wasserman, "A robust Bayesian interpretation of likelihood regions," Ann. Statist. 17, 1387 (1989).
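
    A minimal sketch, assuming the posterior for theta is available as a set of samples (e.g. from a Markov chain, here replaced by an invented normal posterior): the one-sided case uses the posterior tail probability, while the point-null case checks whether theta0 falls inside a highest-posterior-density interval:

      # Sketch: using posterior samples to quantify evidence against a hypothesis.
      # The "posterior" below is invented (a normal distribution) purely for illustration.
      import numpy as np

      rng = np.random.default_rng(5)
      samples = rng.normal(loc=1.8, scale=0.7, size=100000)   # stand-in posterior for theta

      # (a) One-sided test, H0: theta <= theta0  ->  Bayesian probability of H0.
      theta0 = 0.0
      prob_H0 = np.mean(samples <= theta0)
      print(f"P(theta <= {theta0}) = {prob_H0:.4f}")

      # (b) Point null, H0: theta = theta0  ->  is theta0 inside the HPD interval?
      def hpd_interval(samples, cred=0.95):
          """Shortest interval containing the requested posterior probability."""
          s = np.sort(samples)
          k = int(np.floor(cred * len(s)))
          widths = s[k:] - s[:-k]
          i = np.argmin(widths)
          return s[i], s[i + k]

      lo, hi = hpd_interval(samples, 0.95)
      verdict = "excluded" if not lo <= theta0 <= hi else "not excluded"
      print(f"95% HPD interval: [{lo:.2f}, {hi:.2f}]; theta0 = {theta0} is {verdict}")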

    Back to top

  16. I discovered an unexpected bump in some distribution I was looking at. How do I evaluate the significance of this observation?

    A possible method, let's label it Method A, is to fit your spectrum both with and without a Gaussian (or Breit-Wigner) component to accommodate the bump, then take twice the difference in log-likelihoods between the two fits (call this quantity Q), and finally look up a table of chisquared deviates with the appropriate number of degrees of freedom to convert Q into a p-value. If there are not enough events in the data for the chisquared approximation to be valid, this conversion can be done by running a large number of Monte Carlo toy experiments, each of which contains only background events. Repeat the data analysis on each toy experiment, and calculate Q by looking for a bump at the same location and with the same width as those found in the data. Make a histogram of the Q values. The p-value is then the integral of this histogram to the right of the value of Q observed in the data, divided by the integral of the whole histogram. Of course, if the number of events in the data is sufficiently large, this Monte Carlo calculation will yield the same result as the one based on looking up tables of chisquared deviates.

    The problem with Method A, and it is a crucial one, is that it does not take into account the fact that the bump was *unexpected*. In principle the bump could have occurred anywhere in the spectrum, and it could have been wider or narrower. The reason this is important is that with the p-value method of assessing significance, evidence and discovery claims are justified via their associated error rates. A 5-sigma discovery claim, for example, seems pretty secure because if the background-only hypothesis is true, the probability of incorrectly claiming discovery is only 2.87E-07, or one in ~3.5 million claims. Now, if your discovery occurred after looking in a hundred different places (or even just planning to look in a hundred different places!), that means you gave Chance one hundred opportunities to fool you, and you should degrade the error probability accordingly: instead of 1 in 3.5M, the probability of error now becomes about 1 in 35000, which is only 4 sigma! The same kind of error rate degradation is at work when you scan a spectrum for a bump of unknown location and/or width; unfortunately the effect is now more difficult to quantify because location and width are (usually) continuous parameters. The safest way to proceed (let's call this Method B) is to run a large number of background-only toy Monte Carlo experiments, just as for Method A. Only this time don't assume that you know the location and width of the bump beforehand. Instead, scan the entire spectrum in each toy experiment to find the most significant bump, and compute the corresponding Q value. The rest of the p-value calculation is then identical to Method A.
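
    The arithmetic of this error-rate degradation is easy to reproduce (a sketch assuming 100 effectively independent places to look):

      # Sketch: how a one-sided 5-sigma local p-value degrades when the search
      # effectively had many independent places to fluctuate (here, 100).
      from scipy.stats import norm

      p_local = norm.sf(5.0)                        # one-sided 5-sigma tail: ~2.87e-7
      n_trials = 100
      p_global = 1.0 - (1.0 - p_local) ** n_trials  # ~ n_trials * p_local for small p
      print(f"local : p = {p_local:.2e}  ({norm.isf(p_local):.2f} sigma)")
      print(f"global: p = {p_global:.2e}  ({norm.isf(p_global):.2f} sigma), i.e. ~1 in {1/p_global:,.0f}")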

    In some searches one is looking specifically for narrow resonances, in which case the width of an interesting bump will be determined by the measurement resolution and will be known beforehand. One can then fix the bump width accordingly when doing fits to toy experiments in Method B, and the resulting error probability degradation will be less severe. In general, any information known before looking at the data can be used to constrain the toy experiment fits.

    The final significance quoted in a CDF publication should always be whatever comes out of Method B. However, to illustrate the soundness of our statistical methods it may be useful to mention the result of Method A somewhere in the discussion section of the publication, as a way to quantify the magnitude of the look-elsewhere effect that Method B incorporates.

    Back to top

  17. How should we round off measurement results? For measurements such as Q1 = 10.73 ± 0.21(stat) ± 0.14(sys) MeV and Q2 = 66.96 ± 0.39(stat) ± 0.14(sys) MeV, it is clear that the second digit after the decimal point doesn't have much meaning, and yet removing it would have a huge effect on the systematic error.

    The Particle Data Group provides some sensible rules in section 5.3 of the Introduction to their Review of Particle Physics. These rules state:

    "5.3. Rounding : While the results shown in the Particle Listings are usually exactly those published by the experiments, the numbers that appear in the Summary Tables (means, averages and limits) are subject to a set of rounding rules. The basic rule states that if the three highest order digits of the error lie between 100 and 354, we round to two significant digits. If they lie between 355 and 949, we round to one significant digit. Finally, if they lie between 950 and 999, we round up to 1000 and keep two significant digits. In all cases, the central value is given with a precision that matches that of the error. So, for example, the result (coming from an average) 0.827±0.119 would appear as 0.83±0.12, while 0.827±0.367 would turn into 0.8±0.4."

    "Rounding is not performed if a result in a Summary Table comes from a single measurement, without any averaging. In that case, the number of digits published in the original paper is kept, unless we feel it inappropriate. Note that, even for a single measurement, when we combine statistical and systematic errors in quadrature, rounding rules apply to the result of the combination. It should be noted also that most of the limits in the Summary Tables come from a single source (the best limit) and, therefore, are not subject to rounding."

    "Finally, we should point out that in several instances, when a group of results come from a single fit to a set of data, we have chosen to keep two significant digits for all the results. This happens, for instance, for several properties of the W and Z bosons and the lepton."

    Following these rules, the results Q1 = 10.73 ± 0.21(stat) ± 0.14(sys) MeV and Q2 = 66.96 ± 0.39(stat) ± 0.14(sys) MeV would certainly be presented as such in the case of Q1 (even with the errors combined in quadrature), and in the case of Q2 probably as well (via either the "group of results" discussion or the systematic being less than the 0.355 cutoff).
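
    For convenience, the rounding rule quoted above can be written as a small function (a sketch; it handles a single value-and-error pair and does not attempt the single-measurement or "group of results" exceptions):

      # Sketch: the PDG "100-354 / 355-949 / 950-999" rounding rule for one value +- error pair.
      import math

      def pdg_round(value, error):
          """Round value and error using the PDG three-highest-digits rule."""
          exponent = math.floor(math.log10(error))
          d3 = int(round(error / 10 ** exponent * 100))   # three highest-order digits, 100-999
          if d3 >= 950:                                   # round the error up to 1000...
              error, exponent, sig = 10 ** (exponent + 1), exponent + 1, 2   # ...keep 2 digits
          elif d3 >= 355:
              sig = 1                                     # keep one significant digit
          else:
              sig = 2                                     # keep two significant digits
          decimals = sig - 1 - exponent                   # digits after the decimal point
          return round(value, decimals), round(error, decimals)

      if __name__ == "__main__":
          for v, e in [(0.827, 0.119), (0.827, 0.367), (10.73, 0.25), (66.96, 0.39)]:
              print((v, e), "->", pdg_round(v, e))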

    Back to top



Luc Demortier
Last modified: Thu Dec 11 18:00:57 EST 2008