# Data Modelling and Fitting

## Probability (Frequency) Distributions

In high energy physics, the most frequently encountered standard distributions governing frequencies (e.g. for events) are the Poisson distribution, the Gaussian distribution, and the binomial distribution.  The statistical literature is replete with excellent discussions of these and other distributions, and a full explication will not be given here.   (See the statistics books and statistics reference section of this site.)  However, here a few rules of thumb and guidelines are in order:

• Don't confuse the Gaussian with the Poisson case.  For large numbers of expected events (mu) the Gaussian distribution is a good approximation of the Poisson distribution, but how large is large enough?  The diagram on this page illustrates the deviation between the two distributions as a function of the number of events.  Even at mu=40 there are still deviations of up to 0.5% in the probabilities; in relative terms, the ratio of Gaussian to Poisson probability in fact diverges to infinity on the tails!
• The Poisson distribution is an approximation of the binomial distribution in the limit where p (the probability of success) is small and N (the number of trials) is large.  As in the Poisson-Gaussian case, however, it is important to be aware of the magnitude of deviations if you employ such an approximation.
• If one is interested in estimating p (the efficiency, or success rate) from n and N, then it is important to note that one cannot simply use sqrt(n)/N as the uncertainty: the variance on the estimate of p is p(1-p)/N, which is approximated by n(N-n)/N^3 in the range where n is not near 0 or N.  When n is near 0 or N, one should instead consider using asymmetric errors calculated from (for example) a Bayesian posterior density.  CDF note 5894 presents a prescription for such a method.
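As a concrete illustration of the last point, here is a minimal sketch (with invented values n=80, N=100) comparing the naive sqrt(n)/N error with the correct binomial result:

```python
import math

n, N = 80, 100                       # invented example: 80 successes in 100 trials
p = n / N                            # estimated efficiency: 0.8

naive = math.sqrt(n) / N             # naive "Poisson" error: ~0.089 (too large here)
binom = math.sqrt(p * (1 - p) / N)   # binomial error, = sqrt(n*(N-n)/N**3): 0.04
```

The two differ by more than a factor of two in this example; the naive form also gives the nonsensical value zero when n = N (perfect efficiency), which is one symptom of the problem described above.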

## Chi Square or Likelihood?

One of the most frequently occurring problems in high-energy physics is to compare an observed distribution with a prediction, for example from a simulation.  Indeed, an analysis might be designed to extract some physical parameter from the simulation or prediction which best fits the data. (See also the page on goodness of fit tests.) For data points with Gaussian errors, one can write a chi square function

    chi2 = sum_i [ (y_i - y~_i)^2 / sigma_i^2 ]

Here the y_i are the observed data, and the y~_i are the predictions of the model at each point. The chi square expresses the deviation of the observed data from the fit, weighted inversely by the uncertainties in the individual points.  The chi square can either be used to test how well a particular model describes the data or, if the prediction is a function of some parameters a_k, the optimal values of the parameters can be found by minimizing the chi square.
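The chi square can be evaluated in a few lines; the following sketch (with invented data points) assumes numpy:

```python
import numpy as np

def chi_square(y, y_pred, sigma):
    """Deviation of data from prediction, weighted inversely by the errors."""
    y, y_pred, sigma = map(np.asarray, (y, y_pred, sigma))
    return float(np.sum(((y - y_pred) / sigma) ** 2))

# Two invented points, each one standard deviation from the prediction:
chi2 = chi_square([10.0, 12.0], [11.0, 11.0], [1.0, 1.0])   # -> 2.0
```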

Indeed, the main attractiveness of the chi square in high energy physics lies in extracting such parameters, particularly in the case where the function y(a1, a2, ...,am) is linear in the ak. In this case, one can find the parameters, and their Gaussian uncertainties, by inverting an m-by-m matrix to find the covariance matrix. This well-known technique is described in many statistical references.

The main pitfall here is that the purely Gaussian case is in fact rather rare, usually because the data points come from Poisson-distributed numbers of events which are not well approximated by Gaussian distributions. Using a standard chi square approach in such cases leads to biased estimates of both the parameters and their uncertainties. So pervasive is this problem that

> The CDF Statistics Committee recommends that likelihood-based methods be employed rather than chi square minimization in all numerically tractable cases where there is any doubt about the applicability of Gaussian statistics, particularly in describing low-statistics distributions of event quantities.

A likelihood function simply expresses how likely the observed distribution is, given some model.  For the Gaussian case one can in fact write

    -2 ln L = chi2 + constant

and thus minimizing the chi square is equivalent to maximizing the (log) likelihood.  (This is the reason that people usually take a change in the log likelihood of one half unit to be indicative of a "one sigma" change...) Even in the non-Gaussian case the usual approach is, in fact, to write a likelihood function based on the appropriate probability distributions, maximize the likelihood (that is, minimize -ln L) with respect to the parameters, and infer the uncertainties on the resulting parameters from the change in -ln L as the parameters vary.
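As a sketch of this likelihood approach for Poisson-distributed bins, the following (with invented bin contents, using scipy's minimizer) fits a single common rate mu by minimizing -ln L, dropping the ln(n!) terms since they do not depend on mu:

```python
import numpy as np
from scipy.optimize import minimize_scalar

counts = np.array([2, 0, 3, 1, 2])   # invented low-statistics bin contents

def nll(mu):
    # Poisson -ln L summed over bins, up to the mu-independent ln(n!) terms
    return np.sum(mu - counts * np.log(mu))

res = minimize_scalar(nll, bounds=(1e-6, 20.0), method="bounded")
mu_hat = res.x   # maximum-likelihood rate; analytically this is the sample mean
```

For this simple model the minimum can be checked by hand (d(-ln L)/dmu = 0 gives mu = mean of the counts), which makes it a useful sanity check before writing a more realistic likelihood.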

## Writing Likelihood Functions

In most analyses, the data samples are unique enough that no pre-packaged program will be able to do the final analysis, and the high-energy physicist has no alternative but to write a special program.  At the core of this program will be some function which calculates the likelihood L of one's data xi given some parameters ai, and will in most cases also depend on some additional auxiliary (or "nuisance") parameters and/or uncertainties.

Most likelihood functions will express the product of the probabilities for observing the data based on the Poisson probability in each data bin (or each event in the unbinned case). In writing likelihoods incorporating systematic uncertainties it is important to keep in mind the nature of each uncertainty: whether it affects signal, background, or both, and whether it results in uncorrelated or correlated effects, bin-to-bin or source-to-source. These latter effects can be tricky and in some cases intractable. See the recommendations section on systematic uncertainties for a more detailed discussion of this issue.
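A minimal sketch of such a likelihood for a single counting experiment, with an invented observed count and a Gaussian-constrained background nuisance parameter b (all numbers illustrative):

```python
import numpy as np
from scipy.optimize import minimize

n_obs = 25                 # invented observed event count
b0, sigma_b = 10.0, 2.0    # invented background estimate and its uncertainty

def nll(params):
    s, b = params
    mu = s + b
    # Poisson term for the count plus a Gaussian constraint on the nuisance b
    return (mu - n_obs * np.log(mu)) + 0.5 * ((b - b0) / sigma_b) ** 2

res = minimize(nll, x0=[5.0, b0], bounds=[(0.0, None), (1e-6, None)])
s_hat, b_hat = res.x       # analytically s_hat = n_obs - b0 = 15, b_hat = b0
```

In a real analysis there would be one Poisson term per bin, and correlated uncertainties would enter through shared nuisance parameters rather than one constraint term per bin; this is where the correlated, source-to-source effects mentioned above become tricky.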

## Function Minimization

The most commonly employed program in high energy physics for function minimization is MINUIT, by Fred James of CERN.  MINUIT is embedded in ROOT and PAW, but can also be called from stand-alone programs.  The documentation is quite extensive and useful; see also Joel Heinrich's tips and tricks for using MINUIT.

For hard-core types faced with functions of many parameters (> 100), the FUMILI package may prove useful: a now-ancient but extremely fast gradient-descent program written by Soviet missile scientists in the early 1960's.  FUMILI has been successfully used on problems where MINUIT failed to converge at all, or converged too slowly.

For least-squares minimization of functions which are linear in the unknown parameters, there are several packages to determine the parameters, errors, and covariance matrices given the input data.  It is straightforward (and rewarding), however, to write one's own software to perform these procedures.
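As a sketch of writing one's own linear least-squares fit (with invented straight-line data), the parameters, covariance matrix, and errors follow directly from inverting the m-by-m matrix described earlier:

```python
import numpy as np

# Invented straight-line data, model y = a0 + a1*x with equal errors
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.0])
sigma = np.full_like(y, 0.1)

A = np.column_stack([np.ones_like(x), x])   # design matrix (n points x m params)
W = np.diag(1.0 / sigma**2)                 # inverse-variance weights
cov = np.linalg.inv(A.T @ W @ A)            # m-by-m covariance matrix
a_hat = cov @ (A.T @ W @ y)                 # best-fit parameters [a0, a1]
errors = np.sqrt(np.diag(cov))              # Gaussian parameter uncertainties
```

Since the model is linear in the parameters, this solution is exact; no iteration is required, which is the attraction of the linear case noted above.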

John Conway