Benchmarking Details
Readers unfamiliar with some of the CDF-specific terms below should check
out the Glossary.
------------------------------------------------------------------------
Minutes of the initial meeting of the
benchmarking sub-group, CDF computing
review committee, held 8th August 2001
Corrections from Art Kreymer, Frank Wuerthwein, Pasha Murat
and Bill Ashmanskas implemented.
Present:
Art Kreymer (AK),
Bill Ashmanskas (BA),
David Waters (DW),
Richard Hughes (RH),
Armin Reichold (AR),
Frank Wuerthwein (FW, by video),
Pasha Murat (PM, joined approx. 50 min late due to another meeting),
Paul Keener (by phone)
FW stated that results from this group should be completed by the end of
August.
He will contact the physics groups and ask them how much computing, and of
what kind (typical tasks to be benchmarked by us), they will need to
complete their Run II work.
From later meetings and from discussions with the physics groups, we will
learn what the physics needs are, i.e. how many of these basic operations
per week are required for CDF. We can then take a dot product of these
counts with the benchmarked cost per operation to compare the various
platforms/compilers/optimizations/flavors.
The goal is a bottom-up (from physics analyses) calculation of resource
needs.
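The dot product mentioned above can be sketched as follows. This is only an
illustration: the task names, the per-operation CPU costs and the weekly
operation counts are invented placeholders, not measurements or
physics-group estimates.

```python
# Sketch of the bottom-up estimate: the weekly CPU need of a configuration is
# the dot product of (operations per week) with (CPU seconds per operation).
# All names and numbers below are hypothetical placeholders.

# Hypothetical weekly demand from the physics groups: operations per week.
ops_per_week = {"simulate_event": 1.0e6, "skim_event": 5.0e7, "ntuple_event": 2.0e6}

# Hypothetical benchmark results: CPU seconds per operation, per configuration.
cost_per_op = {
    "linux/max_opt": {"simulate_event": 10.0, "skim_event": 0.01, "ntuple_event": 0.5},
    "sgi/no_opt": {"simulate_event": 25.0, "skim_event": 0.03, "ntuple_event": 1.2},
}


def weekly_cpu_seconds(demand, cost):
    """Dot product of demand and cost over the tasks listed in demand."""
    return sum(count * cost[task] for task, count in demand.items())


def cpus_needed(demand, cost):
    """Number of CPUs that would be kept busy full time by this demand."""
    seconds_per_week = 7 * 24 * 3600
    return weekly_cpu_seconds(demand, cost) / seconds_per_week


for config, cost in cost_per_op.items():
    total = weekly_cpu_seconds(ops_per_week, cost)
    print(f"{config}: {total:.3g} CPU s/week, "
          f"{cpus_needed(ops_per_week, cost):.1f} CPUs busy full time")
```

Keeping demand and cost in separate tables means a new platform or
optimisation level only adds one row of benchmark results; the demand from
the physics groups does not need to be restated.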
Optimisation:
--------------
In a long discussion about the ins and outs of optimisation we concluded
that three optimisation levels will be benchmarked:
1. no_opt
This is what we have now, with all debug symbols present and absolutely
no optimisation done. This is available on all official build
platforms for all frozen releases.
2. min_opt
Only C++ inlining is allowed, with no backend optimisation; the expected
performance increase is 20%. AK explained in detail why many C++-wise
people consider this a must. AK reported that he has fixed the initial
problems with this option behaving differently on different platforms.
3. max_opt
Optimise the hell out of it. AR pointed out that (as Rob Kennedy
explained to him) some code cannot be fully optimised (e.g. two pointers
into the same array passed as function arguments). Large performance
increases, up to factors of 2, are expected. This also leads to a much
smaller executable size on disk (80% of the current size is debug
symbols), though this will not do much for the size in memory. max_opt is
regularly built on all official platforms but does not always compile.
Not all max_opt executables actually run.
We decided not to look into the issue of whether all options give the
same results. This was felt to be unnecessary for the task of benchmarking
and should be left to the developers when the time comes.
What tasks to benchmark:
-------------------------
There was unanimous consensus that all "typical" activities that analysis
and simulation on central platforms may require should be benchmarked at all
optimisation levels on all platforms. We stressed the usefulness of creating
a set of benchmarks that could be rerun on a new platform now or in six
months' time. We are working to provide scripts for doing the benchmarking,
not just a single set of numbers.
All tasks that require data will be performed first using simulated data and,
if time permits, sparse cross-checks with real data will be made. Simulated
data are used first to ensure that all detector components are present and
that we know the physics content of the events.
RH agreed that he will provide a ttbar sample of approximately 1 GB in size
for these purposes.
The tasks to benchmark fall into three main categories:
A) CPU benchmarking
B) IO benchmarking
C) code development cycle
(the people who signed up for each task are named at the end of the task):
A) CPU benchmarking
1. cdfsim, ttbar including generator RH
2. production on the ttbar RH
3. user skim that reduces events by 10E-3 (FW will give PM a module that
reconstructs open charm (find D0 or D+) and does a vertex fit).
The emphasis here will be to measure the time that the user module
takes and compare it to the time the input module takes. This will
provide a lower limit as to how many events can be pumped through a
user analysis per second and thus will be indicative of future CPU needs
and how these may evolve with improvements in the input modules and
streamers.
There was a discussion of skimming vs ntuple making, in that the output
in one case is a smaller EDM file and in the other case is an ntuple.
Another difference is that a skim may only look at a few words in an
event, while an ntuple-maker looks at a large part of the event,
once it is selected. We wanted to make sure we covered both kinds of
usage.
4. stntuple making from the ttbar sample PM
5. run a compiled macro across an stntuple that extracts some quantities
from all branches and computes for example an invariant mass and then
plots that into a histogram. PM
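For the skim benchmark in item 3, the comparison between user-module time
and input-module time might be reduced to a few numbers as sketched below.
This assumes that per-event times for the two modules have already been
measured; the function name and the example times are hypothetical.

```python
def skim_timing_summary(t_input_s, t_user_s):
    """Summarise per-event timings (in seconds) of the input and user modules.

    Returns the event rate with both modules running, the rate that the
    input module alone would allow (what remains if the user module cost
    nothing), and the fraction of per-event time spent in the user module.
    """
    if t_input_s <= 0 or t_user_s < 0:
        raise ValueError("t_input_s must be positive and t_user_s non-negative")
    total = t_input_s + t_user_s
    return {
        "events_per_s": 1.0 / total,
        "input_only_events_per_s": 1.0 / t_input_s,
        "user_fraction": t_user_s / total,
    }


# Hypothetical example: 8 ms per event in the input module, 2 ms in the user module.
summary = skim_timing_summary(0.008, 0.002)
for key, value in summary.items():
    print(f"{key}: {value:.3g}")
```

Reporting the input-only rate next to the combined rate makes it easy to
see how much a faster input module or streamer could gain.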
B) IO benchmarking
1. Edm utilities for reading and writing files with and without puffing
DW
2. DH_Input module and the puff module only in one job (equivalent to
AC++Dump without any output) BA
3. COTQ bank (compressed cotd bank) reading as an example of an attempt
to create a speedy and well-written streamer BA
4. StNtuple reading with a job that just tries to read stntuples as fast
as possible PM
5. Bonnie benchmark suite DW
6. cp, hdparm (where available), dd. BA
C) code development cycle
1. build, remove, tar a complete release AK
2. check out, touch a file and gmake the stntuple libraries and executable AR
Platforms:
----------
The platforms on which the above tests are to be performed are:
1. Linux: cdfpca, 8-way server
2. SGI: fcdfsgi2. Special after-downtime access should be requested when a
free machine is needed to get comparable results.
3. SUN: fcdfsun2 will only become available with some luck at the end of
this month, so tools should be checked on fcdfsun1. fcdfsun1 was
otherwise considered unusable (due to limited memory and the extremely
long compilation times for memory-intensive builds, although it has lots
of disk space).
Software Release and tools:
----------------------------
It was decided to use only software from release 3.18.0 for the benchmarking
programs. AK has made available a subdirectory called benchmark in the
validation package. This is where all the scripts that run benchmarking jobs
will go. All committee members have write access to this.
DW will send around some script fragments that will write a line containing
timing information into a systematically named file to ease later automatic
extraction of data for collation and presentation purposes.
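A script fragment of the kind described here might look like the sketch
below. It is not DW's script: the file-naming scheme, the layout of the
timing line and the function names are assumptions made for illustration.

```python
import os
import subprocess
import sys
import tempfile
import time


def result_file_name(task, platform, opt_level, directory="."):
    """Systematic file name, e.g. bench_cdfsim_linux_max_opt.txt (assumed scheme)."""
    return os.path.join(directory, f"bench_{task}_{platform}_{opt_level}.txt")


def run_and_record(command, task, platform, opt_level, directory="."):
    """Run command (a list of strings) and append one line of timing data."""
    before = os.times()
    start = time.time()
    status = subprocess.call(command)
    wall = time.time() - start
    after = os.times()
    # CPU time used by child processes that have been waited for.
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    line = (f"{task} {platform} {opt_level} wall={wall:.2f} user={user:.2f} "
            f"sys={system:.2f} status={status}\n")
    with open(result_file_name(task, platform, opt_level, directory), "a") as out:
        out.write(line)
    return line


# Placeholder command; a real benchmark would run the executable under test.
demo_dir = tempfile.mkdtemp()
demo_command = [sys.executable, "-c", "sum(range(10**6))"]
print(run_and_record(demo_command, "demo", "linux", "no_opt", demo_dir), end="")
```

Appending one line per run to a file whose name encodes task, platform and
optimisation level lets a later collation script find the results by file
name pattern alone.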
Scaling:
---------
DW pointed out, and it was agreed, that all testing should be
performed as a function of the number of jobs running in parallel.
This is of course limited on fcdfsgi2, and special precautions have to be
taken not to impact normal users too much.
AK noted that scaling on SMPs is a tricky thing and mentioned that the
system time overhead for a rebuild on sgi2 increased from 50% of user time
to 200% of user time when moving from a single rebuild to a 12-fold parallel
rebuild. He also noted that this behaviour has gone away since SGI
implemented some fixes to their memory locking strategy.
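The scaling measurement could be scripted along the lines below: run the
same command as N parallel copies and report system time as a percentage of
user time, the quantity AK quotes. This is a sketch only; the function name
is hypothetical and the placeholder command stands in for a real benchmark
job.

```python
import os
import subprocess
import sys
import time


def run_parallel(command, n_jobs):
    """Start n_jobs copies of command at once and wait for all of them.

    Returns the wall-clock time plus the user and system CPU time summed
    over the child processes, as reported by os.times() after they exit.
    """
    before = os.times()
    start = time.time()
    jobs = [subprocess.Popen(command) for _ in range(n_jobs)]
    for job in jobs:
        job.wait()
    wall = time.time() - start
    after = os.times()
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    # System time as a percentage of user time; undefined if user time is zero.
    overhead = 100.0 * system / user if user > 0 else None
    return {"n_jobs": n_jobs, "wall_s": wall, "user_s": user,
            "system_s": system, "sys_over_user_pct": overhead}


# Placeholder CPU-bound job; replace with the benchmark under study.
command = [sys.executable, "-c", "sum(range(2 * 10**6))"]
for n in (1, 2, 4):
    print(run_parallel(command, n))
```

On a shared machine the list of job counts should be kept short, in line
with the precautions mentioned above.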
AK also pointed out that I/O-intensive operations such as the codegen phase
of building (500 million single-character reads for the codegen phase of a
normal package) produce wildly varying performance.
For a given codegen phase (which one was not recorded), he quoted:
Linux: 10 min
Sun  : 1 hour
SGI  : 4 hours
AK mentioned a utility called showproc that can trace the actions of a given
process continuously (so that we know, for example, if it starts swapping).
This utility is functional on Sun and SGI and can be found in the cdfsoft
account on these machines. RH agreed to port this utility to Linux and make
it available to the committee members.
Edited by Armin Reichold
Modified: Mon Aug 20 10:37:12 CDT 2001
Frank Würthwein