Some numbers
As usual with my exercises, and with no pretense of having the truth,
here is a collection of useful numbers for analysis strategy planning.
Beware of the average numbers: since I integrated over all Italians, I
assumed things even out, but small groups with a particular interest
may differ by significant factors.
- CDF data in Run 2 = 1 PByte = 1000 TByte (PADs = 200 TB).
This assumes a 30 Hz average DAQ rate, limited by tape cost.
- One data set size (PAD): 10^7 events = 1 TB.
This is an average; high-Pt and low-Pt data sets may vary by up to a factor of 5.
See CDF5565 and notice that 5 nb x 2 fb^-1 = 10^7 events.
- One ntuple = 1 GB.
This is just by definition. One ntuple file is something for interactive
use: one minute max to go through it.
At ~10 MByte/sec we have ~1 GB/minute.
- 50 = PAD->Ntuple data reduction factor.
Order of magnitude: assume 2 KB/event in the final user ntuple, against
100 KB/event in the PADs. It does not matter whether the starting point is
PAD or StNtuple: the final ntuple will not have more than a few hundred
variables if it is to be usable, so 1-2 KB/event.
- One hour to make one Ntuple (?).
How long does it take to produce one ntuple? Using the above
factor 50 in data size from PADs to Ntuple, a 1 GB ntuple requires reading
50 GBytes of PADs, which at 10 MBytes/sec (optimistic) takes about one hour.
- 10 minutes to transfer one ntuple = 10 Mbit/sec.
Data transfer: I assume that if you need a data set that takes
time X to go through, you will be willing to wait up to 10 times X
to have it on your desk. Likely you will go through it >> 10 times.
- 10 Mbit/sec/user = network need from FCC to the trailers,
if people make ntuples at FCC and copy them to the trailers.
- 10 Mbit/sec/user = network need to offsite,
for sites that make ntuples at Fermilab, copy them home, and want (and are
able) to work with the same performance as in the trailers.
- Note: when I assumed processing PADs at 10 MB/s,
I made no claim that this is possible, or appropriate given how many
times one will need to go through the PADs. It is only to say that
if: one processes PADs at 10 MBytes/sec
then: 10 Mbit/sec is good enough to copy the result to
the desktop.
- 1 Mbit/sec/user = network need to offsite,
for sites that make ntuples at Fermilab, copy them home, and are (or have
to be) satisfied with spending about as much clock time copying the ntuple
home as creating it from the PADs (read PADs at 10 MByte/sec, write an
output reduced in size by 100, and copy data at 1 Mbit/sec).
- 100 GB = total ntuple need for one user.
One data set of 1 TB gives 20 GB; add a few versions and MC. Likely
an underestimate.
- 100 : the number of times to go through all the Ntuple files
of one sample.
- 1 hour: the time it takes to do the above once at 50 MBytes/sec.
This is a lower limit, assuming Root/Paw is as fast as SCSI.
- 10 : the number of times to go through the PADs of a full data set
to make a new ntuple.
You can take this as just a round number: no study
supports it, but some people remember having done this "a few" to
"several" times for Run 1.
- 1 year : the time it takes to make one analysis
- 10^6 events: the sample to use to tune the analysis.
This is the daily bread. Assume you go through 10% of the
events to decide whether your code is doing the proper thing (I mean
physics-wise; bugs should be found faster). Then process the full
sample only the 10 times mentioned above.
- 1 day: the time it takes to study a new selection,
assuming you can go through the 10^6 events at 10 Hz.
If those are 100 KB/event PADs, it means 1 MByte/sec.
- 1 hour: the time it takes to study a new selection,
assuming you can go through the 10^6 events at 100 Hz.
Using 100 KB/event PADs you must read at 10 MBytes/sec
(well within the disk bandwidth, but maybe challenging for
object-oriented data I/O, like AC++ in spring 2001),
or you need 10 KB/event for a more comfortable 1 MB/sec.
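
The last two entries can be cross-checked with a few lines of arithmetic. This is only a sketch using the round numbers from the list (10^6 events, 100 KB/event PADs, decimal units); the function names are made up for illustration.

```python
# Rough cross-check of the "1 day" and "1 hour" entries above.
# Inputs are the round numbers from the list; names are illustrative only.

N_EVENTS = 10**6        # tuning sample, 10% of a 10^7-event data set
PAD_EVENT_KB = 100      # average PAD event size: 1 TB / 10^7 events


def pass_time_hours(n_events, rate_hz):
    """Wall-clock hours to go once through n_events at rate_hz events/sec."""
    return n_events / rate_hz / 3600.0


def read_bandwidth_mb_per_s(rate_hz, event_kb):
    """Read rate in MByte/sec needed to sustain rate_hz (1 MB = 1000 KB)."""
    return rate_hz * event_kb / 1000.0


for rate_hz in (10, 100):
    hours = pass_time_hours(N_EVENTS, rate_hz)
    mb_s = read_bandwidth_mb_per_s(rate_hz, PAD_EVENT_KB)
    print(f"{rate_hz:4d} Hz: {hours:5.1f} hours per pass, "
          f"{mb_s:4.1f} MByte/sec from PADs")
```

With these inputs one pass takes about 28 hours at 10 Hz and about 3 hours at 100 Hz, which the list quotes as one day and one hour in the order-of-magnitude sense.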
Notice that in the end, there is no way to process events significantly
faster than 100 Hz without some very tough reduction in data size.
Also, the 100 Hz range calls for a very efficient way to transfer data from
the disk into the structures used by the analysis code.
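
The size and network entries in the list chain together, and the chain can be checked in a short script. This is a minimal sketch assuming decimal units (1 TB = 10^12 bytes, 1 Mbit = 10^6 bits) and the round inputs quoted above; the helper names are hypothetical.

```python
# Cross-check of the size and network numbers in the list above.
# Decimal units are assumed throughout (1 TB = 1e12 bytes, 1 Mbit = 1e6 bits).


def events_in_dataset(cross_section_nb, luminosity_fb_inv):
    """Events = cross section x integrated luminosity (1 nb = 1e6 fb)."""
    return cross_section_nb * 1e6 * luminosity_fb_inv


def seconds_to_read(n_bytes, bytes_per_s):
    """Time to read n_bytes from disk at bytes_per_s."""
    return n_bytes / bytes_per_s


def seconds_to_transfer(n_bytes, bits_per_s):
    """Time to copy n_bytes over a link of bits_per_s (8 bits per byte)."""
    return n_bytes * 8 / bits_per_s


def output_mbit_per_s(read_mbyte_per_s, reduction):
    """Rate at which reduced output is produced while reading PADs."""
    return read_mbyte_per_s * 8 / reduction


events = events_in_dataset(5.0, 2.0)          # 5 nb x 2 fb^-1
pad_event_bytes = 1e12 / events               # one 1 TB data set
ntuple_event_bytes = pad_event_bytes / 50     # reduction factor 50
make_s = seconds_to_read(50 * 1e9, 10e6)      # 50 GB of PADs at 10 MByte/sec
copy_s = seconds_to_transfer(1e9, 10e6)       # 1 GB ntuple at 10 Mbit/sec

print(f"events per data set : {events:.0e}")
print(f"PAD event size      : {pad_event_bytes / 1e3:.0f} KB")
print(f"ntuple event size   : {ntuple_event_bytes / 1e3:.0f} KB")
print(f"make one ntuple     : {make_s / 60:.0f} minutes")
print(f"copy one ntuple     : {copy_s / 60:.0f} minutes")
print(f"output rate, /100   : {output_mbit_per_s(10, 100):.1f} Mbit/sec")
```

With these round inputs, making one ntuple comes out near 83 minutes and copying it near 13 minutes, which the list quotes as one hour and 10 minutes. Reading PADs at 10 MByte/sec with a factor 100 reduction produces output at 0.8 Mbit/sec, consistent with the 1 Mbit/sec entry.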
Stefano Belforte
Last modified: Mon Aug 13 20:08:30 CDT 2001