Testing ProductionExe
After each new production .pre release, ProductionExe is run over a sample of approximately 10,000 events to spot major problems such as unreadable output data, core dumps, or ERLOG-e messages.

Notes before beginning

Steps:
  1. build new executable
  2. run 10K test
  3. check for the crashes
  4. validate the results
  5. concatenate test


STEP 1. Building the new executable

The following steps need to be followed only once to build a new version of the production executable.


STEP 2. Running the 10K test

We assume that the current directory is the one where the new release was built. There should be a subdirectory called pass1 containing a file called 'files_to_process.txt'. If not, go back to the last two instructions of Step 1 (building the executable).

      > cd /cdf/scratch/cdfopr/testRel/4.8.4
      > emacs pass1/files_to_process.txt

The 'files_to_process.txt' file lists the files to be processed during the 10K test.
Make sure that it points to data from recently taken runs in the fcdflnx3:/cdf/scratch/cdfopr/10Kevents directory.
If necessary, copy new data files into that directory from the look area on fcdfsgi2:/cdf/data08 .
Each file should contain approximately 3,000 events, so we will be running the 10K test over 3 files at a time (remember that FCDFLNX3 is an 8-processor machine, so it is not recommended to run more than 3 jobs at a time there!).
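
Before launching the test, it can help to sanity-check the list. The following is a minimal sketch, not part of FarmTools: check_file_list is a hypothetical helper, and it assumes that 'files_to_process.txt' holds one data-file path per line, with blank lines and lines starting with '#' ignored. The actual file format is not described on this page, so adjust as needed.

```python
import os

MAX_JOBS = 3  # the text above recommends at most 3 jobs at a time on FCDFLNX3


def check_file_list(list_path, max_jobs=MAX_JOBS):
    """Return a list of problems found in a files_to_process.txt-style list.

    Assumed format (not confirmed by this page): one data-file path per
    line; blank lines and lines starting with '#' are ignored.
    """
    with open(list_path) as handle:
        entries = [line.strip() for line in handle]
    entries = [e for e in entries if e and not e.startswith("#")]

    problems = []
    if not entries:
        problems.append("no input files listed")
    if len(entries) > max_jobs:
        problems.append("%d files listed, more than %d" % (len(entries), max_jobs))
    for entry in entries:
        if not os.path.isfile(entry):
            problems.append("missing file: " + entry)
    return problems
```

Calling check_file_list("pass1/files_to_process.txt") from the release directory returns an empty list when nothing looks wrong under these assumptions.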

Invoke the run10KTest script by passing it a pass number (1) as a parameter:

      > run10KTest -p 1

If the run10KTest script does not exist, then someone simply forgot to run the following during the build (Step 1):

      > gmake FarmTools.all

Do it now and you are all set to go. An xterm window will open for each file that you specified in 'files_to_process.txt', plus a window with the top process. From these windows you can check on the progress of your test.
Make a record in the offline elog.
The job will be running on the batch queue, so you can even log off and leave for a while. It will take approximately 3 to 4 hours to finish, if all goes well.


STEP 3. Testing for crashes of the new executable

The output of the 10K test resides in subdirectories of pass1, each named after its data file. Here is an example of what you will find there:

  /cdf/scratch/cdfopr/testRel/4.8.4/pass1/ar02419d.0001phys-ana:
  total 957975
  drwxr-sr-x    2 cdfopr   cdf           232 Jul 12 09:59 .
  drwxr-sr-x   17 cdfopr   cdf           728 Jul 14 08:48 ..
  lrwxrwxrwx    1 cdfopr   cdf            48 Jul 12 09:57 Production -> /cdf/scratch/cdfopr/testRel/4.8.4/Production
  -rw-r--r--    1 cdfopr   cdf            19 Jul 12 14:27 ProductionExe.29325
  -rw-r--r--    1 cdfopr   cdf        269535 Jul 12 14:27 ar02419d.0001phys.log
  -rw-r--r--    1 cdfopr   cdf      966541585 Jul 12 14:27 ar02419d.0001phys.out
  -rw-r--r--    1 cdfopr   cdf      13185619 Jul 12 14:27 prodHists.root
where the .log file contains all of the errors and warnings reported by the Production job, the .out file contains the data, and the .root file contains a huge number of validation plots.
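
A quick survey of all output subdirectories can show at a glance which jobs left a core dump or are missing an output file. The sketch below is only an illustration: survey_outputs is a hypothetical helper, the '-ana' suffix is taken from the example listing above, and the 'core*' name pattern for core-dump files is an assumption that may not match what the system actually writes.

```python
import glob
import os


def survey_outputs(pass_dir):
    """Summarise each '*-ana' subdirectory of a 10K-test pass directory.

    Returns {subdir_name: {"log": bool, "out": bool, "hists": bool,
    "cores": [file names]}}. The 'core*' pattern is an assumption.
    """
    summary = {}
    for subdir in sorted(glob.glob(os.path.join(pass_dir, "*-ana"))):
        if not os.path.isdir(subdir):
            continue
        names = os.listdir(subdir)
        summary[os.path.basename(subdir)] = {
            "log": any(n.endswith(".log") for n in names),
            "out": any(n.endswith(".out") for n in names),
            "hists": "prodHists.root" in names,
            "cores": sorted(n for n in names if n.startswith("core")),
        }
    return summary
```

A subdirectory whose "cores" list is non-empty, or whose "out" or "hists" entry is False, is a candidate for the debugging procedure described next.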

If the job crashed on a specific event and a core dump is found, rerun the Production executable on that event from the main release directory by:

  1. emacs Production/setup_input.tcl
  2. include the problematic event by adding the following line to the DHInput talk-to:
             selectEvents set run=123456 event=123145
       
  3. resubmit the job under the debugger with the commands (the CDFSOFT2 environment has to be set up)
             > setup gdb v5_0b_external
             > gdb bin/$BFARCH/ProductionExe
              (gdb) set args  Production/ProductionExe.tcl -i input_data_file -o junk
              (gdb) run
        
    The debugger will stop at the crash point. Type
              (gdb) where
    
    and the debugger will report the location of the crash. Check the list of librarians and report the crash to the responsible librarian, with a CC to cdf_code_librarians@fnal.gov.
  4. if the crash is not reproducible: remove the "selectEvents" line from Production/setup_input.tcl and rerun the job under the debugger on the whole input file, as described above, in order to reproduce the exact crash conditions.
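
When several crashing events have to be rerun, typing the same lines by hand is error-prone. The sketch below only formats the text already shown in the steps above; select_events_line and gdb_commands are hypothetical helper names. The resulting gdb lines could be saved to a file and passed to gdb with its -x option, assuming the installed gdb version supports it.

```python
def select_events_line(run, event):
    """Tcl line for the DHInput talk-to, in the form quoted in step 2 above."""
    return "selectEvents set run=%d event=%d" % (run, event)


def gdb_commands(tcl_file, input_file, output_file="junk"):
    """gdb commands from step 3, as a list of lines (prompt not included)."""
    return [
        "set args %s -i %s -o %s" % (tcl_file, input_file, output_file),
        "run",
        "where",
    ]
```

For example, gdb_commands("Production/ProductionExe.tcl", "input_data_file") reproduces the three debugger commands of step 3.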

Some other remarks:

  1. When debugging crashes of new production patches, we may need to get the file from tape first in order to reproduce the errors.

    a) Identify the fileset name.
       Go to the CDF DB browser at http://cdfdbb.fnal.gov:8520/cdfr2/databases?type=dc&cm=n , select the report "DC Files", and enter the file name (e.g. g8021cb9.020cjet0) in the "Specify file name, optional text" field. Submit, and you will get the fileset (e.g. PR6419.0).

    b) Copy the file to disk.
       On fcdfsgi2, do the following:
             > setup lsf
             > stage launch input PR6419.0
       Use "bjobs -q tape -u all" to check whether your job is running.

    c) After the file is copied to fcdfsgi2,
             > echo /cdf/data*/PR6419.0
       will show where the file is stored. Then rcp the file you need to fcdflnx, into /cdf/scratch/cdfopr/10kevent/ .

  2. Some tips for running gdb:

    a) To run gdb in emacs: type M-x (M means Alt) gdb, and press return to enter gdb mode.
    b) To load the executable and arguments:
             (gdb) file production.exe
             (gdb) set args test.tcl
    c) To set a breakpoint, use b:
             (gdb) b  AppFramework::BeginRun()
    d) To run, use r.
    e) To stop a running process: C-c twice (C means Ctrl).


STEP 4. Validation of the results

Once the run10K job finishes successfully for all of the files in 'files_to_process.txt', we will still need to validate the results using the following steps:

  1. Go to one of the output subdirectories of the 10K test, for instance
         > cd pass1/ar02419d.0001phys-ana 
    
  2. Make sure that you can read the output file.
    One can use either AC++Dump or Edm_ObjectLister to read the data. AC++Dump gives all the information that Edm_ObjectLister does, plus useful statistics.

    Copy the file FarmTools/scripts/testing.tcl and change the 'include file' command to point to the .out file of the subdirectory we are in. Also comment out the 'setInput launch' command in the DHInput talk-to and the 'source disable_calib' line.

    Run the program:

         > AC++Dump testing.tcl
    

    Almost any ERLOG message should be noted. Messages similar to:

    %ERLOG-e ROOT-SiHitSet: object of class SiHitSet read too few bytes
             Edm_ObjectLister TBuffer::CheckByteCount() 26-Aug-2001 12:04:08
    *** SiHitSet::Streamer() not in sync with data on file, fix Streamer()
    
    means that the SiHitSet output by ProductionExe cannot be read back and that the release needs to be patched to get rid of this problem.


  3. Look for errors in the .log files: grep the .log files for ERLOG messages and report them to cdf_code_librarians@fnal.gov.
  4. Check the histograms by first concatenating all of the prodHists.root files into a single file with the command
         > cd /cdf/scratch/cdfopr/testRel/4.8.4 
         > concatenate_histograms pass1/subdir1/prodHists.root 
                                  pass1/subdir2/prodHists.root 
                                          .......
                                  pass1/subdirN/prodHists.root
                                     -o results/prodHist.root 
    
  5. Here is an example of how to use a tool which automatically compares each histogram in two input files and shows you which histograms are different:
         > ssh -l cdfopr fcdflnx3
         > cd /cdf/scratch/cdfopr/val
         > source .source_me
         > root.exe
    Then, inside root, run
         compare_prod_hist("/cdf/scratch/cdfopr/testRel/4.10.0pre2/pass1/ar026347.0001phys-ana/prodHists.root",
            "/cdf/scratch/cdfopr/testRel/4.8.4a/pass3/ar026347.0001phys-ana/prodHists.root",0.99)
    Click on root, then on STNTUPLE_RESULTS, then on the root file.
    You will see folders corresponding to the validation modules which have histograms that are different in the two input files.
    To plot a histogram, right click and select "Draw EP".
    The shaded histogram is from the second input file; the dots are from the first input file.
    
  6. Rob Snihur has a nice histogram comparison tool located on ncdf75. One presently needs to supply him with the concatenated histogram file from the 10k test (the individual histogram files can also be used).
         As user cdfopr on fcdflnx3:
         > cd ~cdfopr
         > ticket
         > ssh -l cdf_val ncdf75
         > mozilla &
    
    Navigate to the "validation" bookmark near the top of the browser. One can then browse a comparison of histograms from the recent 10k test to those from a larger statistics test using 4.8.4. This tool is evolving. The offline CO should consult with Rob S. to find out how to contribute to updating this tool as needed.
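
For item 3 of the list above, a plain grep works, but a per-message tally makes the report to the librarians easier to write. The sketch below is an assumption-laden illustration: count_erlog is a hypothetical helper, and the line layout it expects ("%ERLOG-" followed by a severity letter, then a message id ending in a colon) is inferred from the single example message quoted above.

```python
import re
from collections import Counter

# Matches lines such as "%ERLOG-e ROOT-SiHitSet: object of class ...".
# The layout (severity letter, then "id:") is inferred from one example only.
ERLOG_RE = re.compile(r"^\s*%ERLOG-(\w)\s+([^:\s]+)")


def count_erlog(lines):
    """Count ERLOG messages by (severity, id) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ERLOG_RE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Passing an open .log file object to count_erlog gives a Counter whose most_common() method lists the most frequent messages first.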


STEP 5. Concatenate Test

We need to do the following two-step concatenation test.

  1. 10-event test to make sure the filename is correct.
    One can use AC++Dump concatenate_test.tcl (a sample tcl is shown below) to emulate the farm concatenation process.
    The input data are the output files from the 10K test. The file size is set to a non-default value, 1200, as a side check.
    You should expect output files with standard dataset names, for example w20261c3.0000ewk0.

    
    talk DHInput
      include file ./pass2/ar0261c3.0001phys-ana/ar0261c3.0001phys.out
      include file ./pass2/ar0261d4.0503phys-ana/ar0261d4.0503phys.out
      include file ./pass2/ar026347.0001phys-ana/ar026347.0001phys.out
      include file ./pass2/ar02734f.0001phys-ana/ar02734f.0001phys.out
      include file ./pass2/ar02734f.0001phys-ana/ar02734f.0001phys.out_1
    exit
    
    path create NULL
    path enable NULL
    
    talk FileOutput
      dhCache set KAHUNA
      output create AA test01
      output path   AA NULL
      AA 
        writeSingleBranch set f
        dataSetId set wewk02
        writeSingleBranch set false
        fileSize set 1200
      exit
      output list
    exit
    
    begin -nev 10
    show all
    exit
    
    

  2. Full-scale concatenation test
    If the output filename is correct, go through a full run by removing the "-nev 10" in the tcl file.
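
The DHInput talk-to in the sample tcl above has to list every .out file from the 10K test, which is tedious to type for each release. The following sketch builds that block from a list of paths. dhinput_block is a hypothetical helper that simply mirrors the layout of the sample, so check the result against a working tcl file before use.

```python
def dhinput_block(out_files):
    """Build the DHInput talk-to for a list of 10K-test .out files.

    Mirrors the sample tcl above: one 'include file' line per input,
    wrapped in 'talk DHInput' ... 'exit'.
    """
    lines = ["talk DHInput"]
    lines.extend("  include file " + path for path in out_files)
    lines.append("exit")
    return "\n".join(lines)
```

The returned text can be pasted into concatenate_test.tcl in place of the hand-written DHInput section.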


STEP 6. Checking for crashes of the new executable on the farms

The farms may run "100K tests" with the new executable. In that case, look for results in /cdf/scratch/cdfprod0/* . Any crashes will produce core files, which should be examined. Report what you learn to the relevant people; if necessary, ask who they are. Currently (June 2002) the "100K tests" run on ~600K events. Remember that one event can occur in two or more streams, so, for example, "4 events crashing" may mean "1 event crashing 4 times".
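
To avoid over-counting when the same event crashes in several streams, crash records can be grouped by run and event number. This is a minimal sketch: unique_crashing_events is a hypothetical helper, and how the (run, event, stream) records are extracted from the core files or logs is left open here.

```python
def unique_crashing_events(crashes):
    """Group crash records by (run, event).

    'crashes' is an iterable of (run, event, stream) tuples.
    Returns {(run, event): sorted list of streams in which it crashed}.
    """
    grouped = {}
    for run, event, stream in crashes:
        grouped.setdefault((run, event), set()).add(stream)
    return {key: sorted(streams) for key, streams in grouped.items()}
```

The length of the returned dictionary is the number of distinct crashing events, which may be smaller than the number of crashes reported.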


Last changes: by Pasha Murat