Benchmarking Details


For people unfamiliar with some of the CDF-specific terms below, check out the Glossary.
------------------------------------------------------------------------
        Minutes of the initial meeting of the
        benchmarking sub-group, CDF computing
        review committee, held 8th August 2001

 Corrections from Art Kreymer, Frank Wuerthwein, Pasha Murat
 and Bill Ashmanskas implemented.

 Present:
 Art Kreymer (AK),
 Bill Ashmanskas (BA),
 David Waters (DW),
 Richard Hughes (RH),
 Armin Reichold (AR),
 Frank Wuerthwein (FW, by video),
 Pasha Murat (PM, joined approx. 50 min late due to another meeting),
 Paul Keener (by phone)

 FW stated that results from this group should be completed by the end of
 August.
 He will contact the physics groups and ask them how much computing, and of
 what kind (typical tasks to be benchmarked by us), they will need to
 complete their Run II work.
 From later meetings and from discussions with the physics groups, we will
 learn what the physics needs are, i.e. how many of these basic operations
 per week for CDF. Then we can form a dot product of those needs with the
 measured cost per operation, as a comparison of the various
 platforms/compilers/optimizations/flavors.
 The goal is a bottom-up (from physics analyses) calculation of resource
 needs.
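
 The dot product described above can be sketched as follows. This is only
 an illustration: the task names, the weekly counts and the per-operation
 costs are invented placeholders, not CDF measurements, and the function
 name is hypothetical.

```python
# Hypothetical weekly demand: how often each basic operation must run.
# Task names and counts are illustrative placeholders, not CDF figures.
weekly_operations = {"simulate_event": 1_000_000, "skim_event": 50_000_000}

# Hypothetical measured cost in CPU seconds per operation, one entry per
# platform/optimisation flavour. These numbers are made up.
seconds_per_operation = {
    "linux/no_opt": {"simulate_event": 10.0, "skim_event": 0.02},
    "linux/max_opt": {"simulate_event": 5.0, "skim_event": 0.01},
}


def weekly_cpu_seconds(demand, cost):
    """Dot product of the demand vector with one flavour's cost vector."""
    return sum(count * cost[task] for task, count in demand.items())


totals = {
    flavour: weekly_cpu_seconds(weekly_operations, cost)
    for flavour, cost in seconds_per_operation.items()
}

for flavour, seconds in sorted(totals.items()):
    print(f"{flavour}: {seconds / 3600:.0f} CPU hours per week")
```

 Each flavour yields one number, so the flavours can be compared directly
 once real demand and timing figures are available.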


 Optimisation:
 --------------
 In a long discussion about the ins and outs of optimisation we concluded
 that three optimisation levels will be benchmarked:
 1. no_opt
    This is what we have now, with all debug symbols present and absolutely
    no optimisation done. It is available on all official build
    platforms for all frozen releases.
 2. min_opt
    Only C++ inlining is allowed, with no backend optimisation. The expected
    performance increase is 20%. AK explained in detail why many C++-wise
    people consider this a must. AK reported that he has fixed the initial
    problems with this option behaving differently on different platforms.
 3. max_opt
    Optimise the hell out of it. AR pointed out that (as Rob Kennedy
    explained to him) some code cannot be fully optimised (two pointers
    into the same array as function arguments). Expect large performance
    increases, up to factors of 2. This also leads to a much smaller
    executable size on disk (80% of the current size is debug symbols), though
    this will not do much for the size in memory. Max_opt is regularly built
    on all official platforms but does not always compile. Not all
    max_opt executables actually run.

 We decided to not look into the issue of whether or not all options give the
 same results. This was felt unnecessary for the task of benchmarking and
 should be left to the developers when the time comes.

 What tasks to benchmark:
 -------------------------

 There was unanimous consensus that all "typical" activities that analysis
 and simulation on central platforms may require should be benchmarked at all
 optimisation levels on all platforms. We stressed the usefulness of creating
 a set of benchmarks that could be rerun on a new platform now or in six
 months' time. We are working to provide scripts for doing the benchmarking,
 not just a single set of numbers.

 All tasks that require data will be performed first using simulated data
 and, if time permits, sparse cross-checks with real data will be made.
 Simulated data are used first to ensure that all detector components are
 present and that we know the physics content of the events.

 RH agreed that he will provide a ttbar sample of approximately 1GB size for
 these purposes.

 The tasks to benchmark fall into three main categories:
 A) CPU benchmarking
 B) IO benchmarking
 C) code development cycle
 (The people who signed up for each task are named at the end of that task.)

 A) CPU benchmarking
 1. cdfsim, ttbar including generator RH
 2. production on the ttbar sample RH
 3. user skim that reduces the number of events by 10E-3 (FW will give PM a
    module that reconstructs open charm (find D0 or D+) and does a vertex
    fit).
    The emphasis here will be to measure the time that the user module
    takes and compare it to the time the input module takes. This will
    provide a lower limit as to how many events can be pumped through a
    user analysis per second and thus will be indicative of future CPU needs
    and how these may evolve with future improvements in the input modules
    and streamers.
    There was a discussion of skimming vs ntuple making, in that the output
    in one case is a smaller EDM file and in the other case is an ntuple.
    Another difference is that a skim may only look at a few words in an
    event, while an ntuple-maker looks at a large part of the event,
    once it is selected. We wanted to make sure we covered both kinds of
    usage.
 4. stntuple making from the ttbar sample PM
 5. run a compiled macro across an stntuple that extracts some quantities
    from all branches and computes for example an invariant mass and then
    plots that into a histogram. PM
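
 The comparison of input-module time with user-module time in task A3 can
 be turned into an event rate as sketched below. The per-event times are
 invented for illustration and the function names are hypothetical. The
 sketch assumes the two modules run one after the other for every event.

```python
def events_per_second(input_seconds_per_event, user_seconds_per_event):
    """Event rate when input and user module run in sequence per event."""
    return 1.0 / (input_seconds_per_event + user_seconds_per_event)


def input_fraction(input_seconds_per_event, user_seconds_per_event):
    """Fraction of the per-event time that is spent in the input module."""
    total = input_seconds_per_event + user_seconds_per_event
    return input_seconds_per_event / total


# Made-up example: 4 ms per event in the input module, 1 ms in the user
# module. A user module costing nothing would leave only the input time.
rate = events_per_second(0.004, 0.001)
input_only_rate = events_per_second(0.004, 0.0)
share = input_fraction(0.004, 0.001)

print(f"rate with user module: {rate:.0f} events/s")
print(f"rate with input only:  {input_only_rate:.0f} events/s")
print(f"input module share:    {share:.0%}")
```

 Rerunning the same calculation after an improvement to the input module or
 streamers shows how much of the gain reaches the user analysis.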

 B) IO benchmarking
 1. Edm utilities for reading and writing files with and without puffing
    DW
 2. DH_Input module and the puff module only in one job (equivalent to
    AC++Dump without any output) BA
 3. COTQ bank (compressed cotd bank) reading as an example of an attempt
    to create a speedy and well written streamer BA
 4. StNtuple reading with a job that just tries to read stntuples as fast
    as possible PM
 5. Bonnie benchmark suite DW
 6. cp, hdparm (where available), dd. BA
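
 A dd-style sequential read test, as in item B6, can be approximated by the
 sketch below. It is an illustration only, not one of the committee's
 scripts; the function name and the default block size are assumptions.

```python
import time


def read_throughput(path, block_size=1024 * 1024):
    """Read `path` sequentially; return (bytes_read, megabytes_per_second)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:  # an empty read marks the end of the file
                break
            total += len(block)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very small files or coarse clocks.
    rate = total / (1024 * 1024) / elapsed if elapsed > 0 else float("inf")
    return total, rate
```

 A file that was read or written recently may be served from the operating
 system's cache, in which case the figure reflects memory rather than disk
 speed. A test file larger than the machine's memory avoids this.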

 C) development cycle
 1. build, remove, tar a complete release AK
 2. checkout, touch file and gmake stntuple libraries and executable AR


 Platforms:
 ----------
 The platforms on which the above tests are to be performed are:
 1. Linux: cdfpca, 8way server
 2. SGI: fcdfsgi2. Special after-downtime access should be requested when a
         free machine is needed to get comparable results.
 3. SUN: fcdfsun2 will only become available, with some luck, at the end of
         this month, so tools should be checked on fcdfsun1. (fcdfsun1 was
         considered unusable due to limited memory and the extremely long
         compilation times for memory-intensive builds, but sun1 has lots
         of disk space.)

 Software Release and tools:
 ----------------------------
 It was decided to use software from 3.18.0 only for the benchmarking
 programs. AK has made available a subdirectory called benchmark in the
 validation package. This is where all the scripts that run benchmarking jobs
 will go. All committee members have write access to this.
 DW will send around some script fragments that will write a line containing
 timing information into a systematically named file to ease later automatic
 extraction of data for collation and presentation purposes.
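
 A timing fragment of the kind described above might look like the sketch
 below. It is not DW's script: the file naming scheme, the field order and
 both function names are assumptions made for illustration. The child CPU
 times come from os.times() and are reported as zero on platforms that do
 not provide them.

```python
import os
import subprocess
import time


def result_filename(task, platform, opt_level, n_jobs):
    """Build a systematic file name (hypothetical naming scheme)."""
    return f"bench_{task}_{platform}_{opt_level}_{n_jobs}jobs.txt"


def run_and_record(command, task, platform, opt_level, n_jobs, directory="."):
    """Run `command`, then append one line of timing data to a result file.

    Assumed line format (fields must not contain whitespace):
    task platform opt_level n_jobs wall user system exit_status
    """
    before = os.times()
    start = time.time()
    status = subprocess.call(command)  # waits for the command to finish
    wall = time.time() - start
    after = os.times()
    # CPU time used by finished child processes during this call.
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    line = (f"{task} {platform} {opt_level} {n_jobs} "
            f"{wall:.2f} {user:.2f} {system:.2f} {status}\n")
    path = os.path.join(directory,
                        result_filename(task, platform, opt_level, n_jobs))
    with open(path, "a") as handle:
        handle.write(line)
    return path
```

 Because every line has the same whitespace-separated fields, the files can
 later be collected and split mechanically for collation and presentation.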


 Scaling:
 ---------
 DW pointed out, and it was agreed, that all testing should be
 performed as a function of the number of jobs running in parallel.
 This is of course limited on fcdfsgi2, and special precautions have to be
 taken to not impact normal users too much.

 AK noted that scaling on SMPs is a tricky thing and mentioned that the
 system time overhead for a rebuild on sgi2 increased from 50% (of user time)
 to 200% of user time when moving from a single rebuild to a 12-fold parallel
 rebuild. He also noted that this behaviour has gone away since SGI
 implemented some fixes to their memory locking strategy.
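
 Measuring as a function of the number of parallel jobs, and expressing
 system time as a percentage of user time, can be sketched as follows. The
 function names are hypothetical and the sketch is not one of the
 committee's scripts. It relies on the child CPU times from os.times(),
 which are reported as zero on platforms that do not provide them.

```python
import os
import subprocess
import time


def system_overhead_percent(user_seconds, system_seconds):
    """System time expressed as a percentage of user time."""
    return 100.0 * system_seconds / user_seconds


def run_in_parallel(command, n_jobs):
    """Start n_jobs copies of `command` at once and time the whole batch."""
    before = os.times()
    start = time.time()
    processes = [subprocess.Popen(command) for _ in range(n_jobs)]
    statuses = [process.wait() for process in processes]
    wall = time.time() - start
    after = os.times()
    return {
        "n_jobs": n_jobs,
        "wall": wall,
        # CPU time summed over all child processes that have finished.
        "user": after.children_user - before.children_user,
        "system": after.children_system - before.children_system,
        "statuses": statuses,
    }


def scan(command, job_counts=(1, 2, 4, 8, 12)):
    """Repeat the measurement for each job count (counts are examples)."""
    return [run_in_parallel(command, n_jobs) for n_jobs in job_counts]
```

 With the figures quoted above, a rebuild using 100 s of user time would
 carry 50 s of system time alone and 200 s in the 12-fold case. The 100 s
 is an arbitrary example value.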

 AK also pointed out that I/O-intensive operations such as the codegen phase
 of building (500 million single-character reads for the codegen phase of a
 normal package) produce wildly varying performance.

 He quoted, for a given codegen phase (forgot which one it was):
         Linux: 10 min
         sun  :  1 hour
         sgi  :  4 hours

 AK mentioned a utility called showproc that can trace the actions of a given
 process continuously (so we know, for example, if it starts swapping).
 This utility is functional on sun and sgi and can be found in the cdfsoft
 account on these machines. RH agreed to port this utility to Linux and make
 it available to the committee members.

 Edited by Armin Reichold

Modified: Mon Aug 20 10:37:12 CDT 2001 Frank Würthwein