| Testing ProductionExe |
| STEP 1. Building the new executable |
> ssh -l cdfopr fcdflnx3.fnal.gov
> setup cdfsoft2 4.8.4
> cd /cdf/scratch/cdfopr/testRel
> newrel -t 4.8.4 4.8.4
> cd 4.8.4
> addpkg Production
> gmake Production.bin
> addpkg -h FarmTools
> gmake FarmTools.all
> mkdir pass1
> cp FarmTools/scripts/files_to_process.txt pass1
| STEP 2. Running the 10K test |
We assume that the current directory is the one where the new release was built. There should be a subdirectory called pass1 containing a file called 'files_to_process.txt'. If not, go back to the last two instructions of Step 1.
> cd /cdf/scratch/cdfopr/testRel/4.8.4
> emacs pass1/files_to_process.txt
The 'files_to_process.txt' file passes the list of files
to be processed during the 10K test.
Make sure that it points to data from recently taken runs in
the fcdflnx3:/cdf/scratch/cdfopr/10Kevents directory.
If necessary, copy new data files into that directory from the look
area on fcdfsgi2:/cdf/data08 .
Each file should contain approximately 3,000 events, so we will be
running the 10K test over 3 files at a time (remember that FCDFLNX3
is an 8-processor machine, so it is not recommended to run more than
3 jobs at a time there!).
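Before launching the test, it can help to sanity-check the file list. The following is a minimal Python sketch, not part of FarmTools. It assumes that 'files_to_process.txt' holds one data file path per line (the real format may differ), and the function name check_file_list is hypothetical.

```python
from pathlib import Path

MAX_FILES = 3  # the text recommends at most 3 jobs at a time on fcdflnx3


def check_file_list(list_path, max_files=MAX_FILES):
    """Return (entries, problems) for a files_to_process.txt-style list.

    Assumed format: one data file path per line; blank lines and lines
    starting with '#' are ignored.
    """
    entries = []
    for line in Path(list_path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.append(line)

    problems = []
    if not entries:
        problems.append("no input files listed")
    if len(entries) > max_files:
        problems.append(f"{len(entries)} files listed, more than {max_files}")
    for entry in entries:
        if not Path(entry).is_file():
            problems.append(f"missing file: {entry}")
    return entries, problems


if __name__ == "__main__":
    files, issues = check_file_list("pass1/files_to_process.txt")
    print(f"{len(files)} input file(s) listed")
    for issue in issues:
        print("WARNING:", issue)
```

An empty list of problems means every listed file exists and the list respects the 3-job limit.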
Invoke the run10KTest script by passing it a pass number (1) as a parameter:
run10KTest -p 1
run10KTest -p 2 -o maxopt
run10KTest -p 3 -q GCC_3_1
run10KTest -p 4 -n 10
If the run10KTest script does not exist, then someone simply forgot to run this in Step 1:
> gmake FarmTools.all
Do it now and you are all set to go. An X-term window will open for
each file that you specified in 'files_to_process.txt', plus a
window with the top process. From these windows you can check on the
progress of your test.
Make a record in the offline elog.
The job will be running on the batch queue,
so you can even log off and leave for a while.
It will take approximately 3 to 4 hours to finish, if all goes well.
| STEP 3. Testing for crashes of the new executable |
The output of the 10K test resides in subdirectories of /pass1, named after the name of each data file. Here is an example of what you will find there:
/cdf/scratch/cdfopr/testRel/4.8.4/pass1/ar02419d.0001phys-ana:
total 957975
drwxr-sr-x   2 cdfopr cdf        232 Jul 12 09:59 .
drwxr-sr-x  17 cdfopr cdf        728 Jul 14 08:48 ..
lrwxrwxrwx   1 cdfopr cdf         48 Jul 12 09:57 Production -> /cdf/scratch/cdfopr/testRel/4.8.4/Production
-rw-r--r--   1 cdfopr cdf         19 Jul 12 14:27 ProductionExe.29325
-rw-r--r--   1 cdfopr cdf     269535 Jul 12 14:27 ar02419d.0001phys.log
-rw-r--r--   1 cdfopr cdf  966541585 Jul 12 14:27 ar02419d.0001phys.out
-rw-r--r--   1 cdfopr cdf   13185619 Jul 12 14:27 prodHists.root
where the .log file contains all of the errors and warnings reported by the Production job, the .out file contains the data and the .root file contains a huge number of validation plots.
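A quick way to see which jobs need attention is to summarize each output subdirectory. The sketch below is illustrative only. It assumes that job subdirectories end in "-ana" (as in the example listing) and that core dumps are named "core" or "core.<pid>"; neither convention is confirmed here, and summarize_pass_dir is a hypothetical helper.

```python
from pathlib import Path


def summarize_pass_dir(pass_dir):
    """Map each '*-ana' job subdirectory to the outputs found in it."""
    summary = {}
    for sub in sorted(Path(pass_dir).glob("*-ana")):
        if not sub.is_dir():
            continue
        names = [p.name for p in sub.iterdir()]
        summary[sub.name] = {
            "log": any(n.endswith(".log") for n in names),
            "out": any(n.endswith(".out") for n in names),
            "hists": "prodHists.root" in names,
            # Assumption: core dumps are named 'core' or 'core.<pid>'.
            "cores": sorted(
                n for n in names if n == "core" or n.startswith("core.")
            ),
        }
    return summary


if __name__ == "__main__":
    for job, found in summarize_pass_dir("pass1").items():
        status = "CORE DUMP" if found["cores"] else "no core file"
        print(f"{job}: {status}, outputs present: "
              f"log={found['log']} out={found['out']} hists={found['hists']}")
```

A job with a core file, or without its .out or prodHists.root file, is a candidate for the debugging steps that follow.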
If the job crashed on a specific event and a core dump is found, rerun the Production executable on that event from the main release directory by:
selectEvents set run=123456 event=123145
> setup gdb v5_0b_external
> gdb bin/$BFARCH/ProductionExe
(gdb) set args Production/ProductionExe.tcl -i input_data_file -o junk
(gdb) run
The debugger will stop at a crash point. Type
(gdb) where
and the debugger will report the location of the crash point.
Check the list of librarians and report the crash to the
responsible librarian, with CC: cdf_code_librarians@fnal.gov.
Some other remarks:
1) When debugging crashes of new production patches, we may need to
reproduce the errors, which means getting the file from tape first.
a) Identify the fileset name.
Go to the CDF DB browser http://cdfdbb.fnal.gov:8520/cdfr2/databases?type=dc&cm=n
and select the report "DC Files", then enter the file name (e.g.
g8021cb9.020cjet0) in the "Specify file name, optional text" field. Submit
and you will get the fileset (e.g. PR6419.0).
b) Copy the file to disk.
On fcdfsgi2, do the following:
setup lsf
stage launch input PR6419.0
Use bjobs -q tape -u all to check whether your job is running.
c) After the file is copied to fcdfsgi2,
echo /cdf/data*/PR6419.0 will show where the
file is stored.
rcp the file you need to fcdflnx
/cdf/scratch/cdfopr/10kevent/
2) Some tips for running gdb
a) Run gdb in emacs:
in emacs, type M-x gdb (M means Alt) and press Return to enter gdb mode
b) To load the executable and arguments:
file production.exe
set args test.tcl
c) To set a breakpoint, use b:
b AppFramework::BeginRun()
d) To run, use r
e) To stop a running process:
C-c twice (C means Ctrl)
| STEP 4. Validation of the results |
Once the run10K job finishes successfully for all of the files in 'files_to_process.txt', we will still need to validate the results using the following steps:
> cd pass1/ar02419d.0001phys-ana
Copy the file FarmTools/scripts/testing.tcl and change the 'include file' command to point to the .out file of the subdirectory we are in. Also comment out the 'setInput launch' command in the DHInput talk-to and the 'source disable_calib' line.
Run the program:
> AC++Dump testing.tcl
Almost any ERLOG message should be noted. Messages similar to:
%ERLOG-e ROOT-SiHitSet: object of class SiHitSet read too few bytes
Edm_ObjectLister TBuffer::CheckByteCount() 26-Aug-2001 12:04:08
*** SiHitSet::Streamer() not in sync with data on file, fix Streamer()
means that the SiHitSet output by ProductionExe cannot be read back and
that the release needs to be patched to get rid of this problem.
25 ROOT-Cot2SvtMatchCol -i GlobalLibraryLog TClass::Load 8090* 8090
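Since almost any ERLOG message should be noted, it can be convenient to pull them out of a saved AC++Dump log automatically. The following Python sketch is only illustrative. It assumes that messages start with "%ERLOG-" followed by a one-letter severity code, as in the example above; the full set of severity codes is not given here, and the file name dump.log is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "%ERLOG-e ROOT-SiHitSet: object of class ..."
ERLOG_RE = re.compile(r"^%ERLOG-(\w)\s+(.*)$")


def scan_erlog(lines):
    """Return (counts_by_severity, messages) for the %ERLOG lines found."""
    counts = Counter()
    messages = []
    for line in lines:
        match = ERLOG_RE.match(line.strip())
        if match:
            severity, text = match.groups()
            counts[severity] += 1
            messages.append((severity, text))
    return counts, messages


if __name__ == "__main__":
    # Hypothetical log file, e.g. produced with: AC++Dump testing.tcl > dump.log
    with open("dump.log") as handle:
        totals, found = scan_erlog(handle)
    for severity, text in found:
        print(f"[{severity}] {text}")
    print("totals by severity:", dict(totals))
```

The script only collects the first line of each message; the follow-up lines (such as the Streamer() hint in the example) still have to be read in the log itself.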
> cd /cdf/scratch/cdfopr/testRel/4.8.4
> concatenate_histograms pass1/subdir1/prodHist.root
pass1/subdir2/prodHist.root
.......
pass1/subdirN/prodHist.root
-o results/prodHist.root
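Typing out every subdirectory by hand is error-prone, so the argument list can be assembled programmatically. This is a minimal Python sketch, assuming that concatenate_histograms takes the input files followed by -o and the output file, as in the command above. The histogram file name is a parameter because the directory listing in Step 3 shows prodHists.root while the command above uses prodHist.root; build_concat_command is a hypothetical helper.

```python
from pathlib import Path


def build_concat_command(pass_dir, output, hist_name="prodHists.root"):
    """Build the concatenate_histograms argument list for one pass directory."""
    inputs = sorted(str(p) for p in Path(pass_dir).glob(f"*/{hist_name}"))
    if not inputs:
        raise FileNotFoundError(f"no {hist_name} files found under {pass_dir}")
    return ["concatenate_histograms", *inputs, "-o", str(output)]


if __name__ == "__main__":
    command = build_concat_command("pass1", "results/prodHist.root")
    # Print the command for inspection; it could also be handed to subprocess.run.
    print(" ".join(command))
```

Printing the command first makes it easy to confirm that every job subdirectory contributed a histogram file before anything is run.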
ssh -l cdfopr fcdflnx3
cd /cdf/scratch/cdfopr/val
source .source_me
root.exe
compare_prod_hist("/cdf/scratch/cdfopr/testRel/4.10.0pre2/pass1/ar026347.0001phys-ana/prodHists.root",
"/cdf/scratch/cdfopr/testRel/4.8.4a/pass3/ar026347.0001phys-ana/prodHists.root",0.99)
click on root
click on STNTUPLE_RESULTS
click on the root file
You will see folders corresponding to validation modules which have histograms
that are different in the two input files.
To plot a histogram: right click and select "Draw EP"
The shaded histogram is from the second input file; the dots are from the first input file.
As user cdfopr on fcdflnx3:
> cd ~cdfopr
> ticket
> ssh -l cdf_val ncdf75
> mozilla &
Navigate to the "validation" bookmark near the top of the browser. One can
then browse a comparison of histograms from the recent 10k test to those from
a larger statistics test using 4.8.4.
This tool is evolving. The offline CO should consult with Rob S. to find out how to
contribute to updating this tool as needed.
| STEP 5. Concatenate Test |
We need to do the following two-step concatenation test.
talk DHInput
include file ./pass2/ar0261c3.0001phys-ana/ar0261c3.0001phys.out
include file ./pass2/ar0261d4.0503phys-ana/ar0261d4.0503phys.out
include file ./pass2/ar026347.0001phys-ana/ar026347.0001phys.out
include file ./pass2/ar02734f.0001phys-ana/ar02734f.0001phys.out
include file ./pass2/ar02734f.0001phys-ana/ar02734f.0001phys.out_1
exit
path create NULL
path enable NULL
talk FileOutput
dhCache set KAHUNA
output create AA test01
output path AA NULL
AA
writeSingleBranch set f
dataSetId set wewk02
writeSingleBranch set false
fileSize set 1200
exit
output list
exit
begin -nev 10
show all
exit
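The 'include file' lines in the DHInput block above have to match the output files of the pass being tested. As a rough sketch, they could be generated from the pass directory with a few lines of Python. This assumes that output files sit in "*-ana" subdirectories and are named "*.out" or "*.out_N", as in the examples above; dhinput_block is a hypothetical helper, not part of FarmTools.

```python
from pathlib import Path


def dhinput_block(pass_dir):
    """Return a DHInput talk-to listing every .out file under pass_dir."""
    lines = ["talk DHInput"]
    # '*.out*' also picks up continuation files such as '...phys.out_1'.
    for out_file in sorted(Path(pass_dir).glob("*-ana/*.out*")):
        lines.append(f"include file {out_file}")
    lines.append("exit")
    return "\n".join(lines)


if __name__ == "__main__":
    print(dhinput_block("./pass2"))
```

The generated text can be pasted into the Tcl file in place of the hand-written DHInput block.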
The farms may do "100K tests" with the new executable. In that case, look for results in /cdf/scratch/cdfprod0/* . Any crashes will give core files that should be looked at. Report what you learn to the relevant people; if necessary, ask who they are. Currently (June 2002) the "100K tests" run on ~600K events. Remember that one event can occur in two or more streams; so, for example, "4 events crashing" may mean "1 event crashing 4 times".
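To tell "4 events crashing" apart from "1 event crashing 4 times", crash records can be grouped by run and event number. The sketch below assumes the crashes have already been collected by hand as (stream, run, event) tuples; the stream names and numbers in the example are made up, and group_crashes is a hypothetical helper.

```python
from collections import defaultdict


def group_crashes(crashes):
    """Group (stream, run, event) crash records by (run, event)."""
    by_event = defaultdict(list)
    for stream, run, event in crashes:
        by_event[(run, event)].append(stream)
    return dict(by_event)


if __name__ == "__main__":
    # Made-up records: the same event crashing in four different streams.
    records = [
        ("streamA", 123456, 123145),
        ("streamB", 123456, 123145),
        ("streamC", 123456, 123145),
        ("streamD", 123456, 123145),
    ]
    grouped = group_crashes(records)
    print(f"{len(records)} crashes, {len(grouped)} distinct event(s)")
    for (run, event), streams in grouped.items():
        print(f"run {run} event {event}: crashed in {', '.join(streams)}")
```

The number of keys in the result is the number of distinct crashing events, which is the figure worth reporting.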