Summary of CAF Specs
In the following we summarize the numbers provided in the CAF specs
as detailed in
CDF Note 4072 and
CDF Note 4100,
as well as our present best guess at the datasets as presented in
CDF Note 4718 (trigger proposal, see p.107 for summary table) and
CDF Note 5565 (datasets and streams).
For the latter, we updated the information to be consistent with the
present summary table in CDF Note 4718.
CAF specs
The intention of the CAF specs from 1997
seems to have been to define central analysis facilities that would
provide a level of "user satisfaction" similar to Run I.
We'll need to get some info on what level of satisfaction there
was in Run I, preferably something quantitative like
"prior to summer conferences dddd it took X hours for a user
analysis skim on dataset Y".
Summary of CDF 4072 and 4100:
-----------------------------
1200 MIPS sec for full reconstruction of an average event
56,000 MIPS CPU power for reconstruction
(that's the PC farm, I suppose, and is thus
irrelevant for us here.)
90,000 MIPS CPU power for analysis (that's central facilities, I suppose)
This was obtained by scaling Run I capability by a factor 25.
For comparison, the luminosity ratio Run II/Run I is 20.
160TB of PAD data. For kicks, you can arrive at a very similar
estimate as follows:
    75Hz L3 output @ 1e32 cm^-2 s^-1 = 750nb
    750nb * 2fb-1 = 1.5e9 events
    1.5e9 events @ 100kB/event = 150TB of PAD output.
The actual estimate was done in a much less
naive fashion.
18% of PAD = 28TB desired to be disk resident @ central facility
This was chosen as it is the same fraction as Run I.
4TB data skimming/serving per day on central facility,
which translates into a 50MB/s I/O bandwidth requirement.
This number was arrived at as follows:
    5% of events are part of Run I official data sets;
    assume we want to be able to run through 1/2 of that in one
    day => 5% * 0.5 * 160TB = 4TB.
Apart from these numbers, CDF/DOC/COMP_UPG/PUBLIC/4100 also has a
variety of more or less detailed usage cases for actual analyses
that were done in Run I.
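
The back-of-envelope numbers above can be re-derived in a few lines. This is only a sketch for cross-checking, not the method used in the CDF notes. The function names are made up here, and it assumes decimal units (1 kB = 1e3 bytes, 1 TB = 1e12 bytes) and that "1e32" means an instantaneous luminosity of 1e32 cm^-2 s^-1.

```python
# Back-of-envelope check of the PAD volume and skim bandwidth quoted above.
# Function names and the decimal kB/TB convention are assumptions.

NB_IN_CM2 = 1e-33         # 1 nanobarn expressed in cm^2
NB_PER_FB = 1e6           # 1 fb^-1 = 1e6 nb^-1
SECONDS_PER_DAY = 86_400


def cross_section_nb(rate_hz, lumi_cm2_s):
    """Effective cross section (nb) that gives rate_hz at this luminosity."""
    return rate_hz / lumi_cm2_s / NB_IN_CM2


def n_events(xsec_nb, int_lumi_fb):
    """Number of events for a cross section (nb) and integrated lumi (fb^-1)."""
    return xsec_nb * int_lumi_fb * NB_PER_FB


def volume_tb(events, kb_per_event):
    """Total data volume in TB for a given event size in kB."""
    return events * kb_per_event * 1e3 / 1e12


def daily_skim_tb(pad_tb, dataset_fraction, pass_fraction):
    """TB to serve per day: fraction in datasets times fraction read per day."""
    return pad_tb * dataset_fraction * pass_fraction


def bandwidth_mb_s(tb_per_day):
    """Sustained bandwidth in MB/s needed to move tb_per_day in 24 hours."""
    return tb_per_day * 1e12 / 1e6 / SECONDS_PER_DAY


xsec = cross_section_nb(75.0, 1e32)      # about 750 nb
events = n_events(xsec, 2.0)             # about 1.5e9 events
pad = volume_tb(events, 100.0)           # about 150 TB
skim = daily_skim_tb(160.0, 0.05, 0.5)   # 4 TB/day
print(f"{xsec:.0f} nb, {events:.2e} events, {pad:.0f} TB PAD")
print(f"{skim:.1f} TB/day -> {bandwidth_mb_s(skim):.0f} MB/s sustained")
```

Spread evenly over 24 hours, 4TB/day comes to about 46MB/s, which is consistent with the rounded 50MB/s figure. Since daily_skim_tb is linear in dataset_fraction, changing the 5% shows directly how the bandwidth requirement would scale.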
fkw notes:
----------
Some of us should take a look at the use cases they
report in 4100 in some detail, and figure out if the 4TB/day is
sufficiently well justified. E.g. one might naively expect that in
Run II the fraction of useful data is much larger than 5% given
the change in Level 3 trigger !?!
CDF 4718 and 5565 summary
CDF 5565 describes 9 streams:
-----------------------------
Stream  Brief description                     Cross section out of L3 [nb]
---------------------------------------------------------------------
1       express: J/psi, W, Z, zero-bias       ~20
2       High-Et leptons et al.                ~60
3       photon triggers (mostly High-Et)      ~50
4       Di-tau and the like (mostly High-Et)  ~70
5       zero-bias & diffraction               ~33
6       missing Et, no leptons                ~80
7       QCD jets                              ~50
8a      hadronic 2-track                      ~110
8b      lepton+track b-trigger                ~50
9       J/psi and other di-lepton             ~40
Note: fkw split 8 into 8a and 8b because 5565 is missing the 100nb
hadronic 2-track triggers that are included in the latest version of 4718.
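
As a rough cross-check, the per-stream cross sections can be summed and converted into an L3 output rate. The sketch below takes the "~" values at face value and assumes a luminosity of 1e32 cm^-2 s^-1; the names are invented for illustration. A straight sum ignores any overlap between streams, so the total is indicative only.

```python
# Sum the approximate per-stream L3 cross sections and convert to a rate.
# The 1e32 cm^-2 s^-1 luminosity and all names here are assumptions.

NB_IN_CM2 = 1e-33  # 1 nanobarn expressed in cm^2

# Stream label -> approximate cross section out of L3 in nb
STREAM_XSEC_NB = {
    "1": 20, "2": 60, "3": 50, "4": 70, "5": 33,
    "6": 80, "7": 50, "8a": 110, "8b": 50, "9": 40,
}


def rate_hz(xsec_nb, lumi_cm2_s=1e32):
    """Event rate in Hz for a cross section in nb at the given luminosity."""
    return xsec_nb * NB_IN_CM2 * lumi_cm2_s


total_nb = sum(STREAM_XSEC_NB.values())
print(f"total: {total_nb} nb -> {rate_hz(total_nb):.1f} Hz at 1e32")
for name, xsec in STREAM_XSEC_NB.items():
    print(f"stream {name:>2}: {xsec:4d} nb = {xsec / total_nb:5.1%} of total")
```

The streams add up to 563nb, i.e. roughly 56Hz at 1e32 under these assumptions.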
Miscellaneous other numbers and comments
- The raw data format comes to roughly 250kB per event
- The as-yet nonexistent PAD format is expected to come to
50-100kB per event.
- A typical ntuple format like stntuple is shooting for 5-10kB per event.
- It is expected that all physics groups will use the same PAD
format. The PAD format allows the reconstruction of the full event to be redone.
- It is unlikely that physics groups will agree on one and only
one common ntuple format.
- It is unlikely but not impossible that some physics groups will
decide to have an ntuple format rather than PAD as their main
physics data format as early as summer 2002.
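
To get a feel for what these per-event sizes mean in total volume, the sketch below multiplies them by the 1.5e9 events from the naive 2fb-1 estimate earlier in this summary. The event count, the decimal units (1 kB = 1e3 bytes, 1 TB = 1e12 bytes), and the assumption that every event is kept in every format are illustrative only.

```python
# Rough total volume per data format, assuming every event is kept in each.
# N_EVENTS is the naive 2 fb^-1 estimate; names and units are assumptions.

N_EVENTS = 1.5e9

# Format -> (low, high) size per event in kB, as quoted in the list above
FORMAT_KB = {
    "raw": (250, 250),
    "PAD": (50, 100),
    "ntuple": (5, 10),
}


def total_tb(n_events, kb_per_event):
    """Total volume in TB for n_events at kb_per_event each."""
    return n_events * kb_per_event * 1e3 / 1e12


for name, (low, high) in FORMAT_KB.items():
    low_tb = total_tb(N_EVENTS, low)
    high_tb = total_tb(N_EVENTS, high)
    if low == high:
        print(f"{name:>6}: ~{low_tb:.0f} TB")
    else:
        print(f"{name:>6}: ~{low_tb:.1f} to {high_tb:.1f} TB")
```

Under these assumptions an ntuple copy of the full sample is about a factor 10 smaller than PAD, which is the main reason an ntuple-based workflow is attractive for disk-resident analysis.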
Modified: Mon Aug 20 12:07:52 CDT 2001
Frank Würthwein