Introduction

One of the tasks of the offline shifter is to analyse crashes that occurred during production. This is of great help to the offline team, and helps improve the code and, indirectly, the physics output of the experiment.

Creating a home directory

Before starting the analysis, it is advisable to create a directory with your name in /cdf/opr/cdfopr/testRel/5.1.1/crash_analysis. It will contain the .tcl files you need, and a file called crash.txt with your comments and findings on the crashes.

Look for crashed jobs

The next step is to check the processed jobs and look for crashes. Go to the home page of the production farm, http://fnpcc.fnal.gov, and click Progress Status in the left bar. There, enter the production version number (currently 5.1.1g_maxopt) and click Submit. You will get a list of datasets, with the number of crashes for each dataset. Click on the number for the dataset you are interested in, and look for a recent crash. Beware: the tables are not ordered, so you really need to go through all of them to find a suitable file to analyse! Copy the run and trigger number of the event to the bottom of the crash.txt file.

Location of production output

Production output for crashed files is in

    /cdf/scratch/cdfprod0/vnumber/stream/

where vnumber is the production version number (currently 5.1.1g_maxopt_oc) and stream is the stream name (e.g. Stream_H). The relevant directories here are crashed, which contains the stripped files; logs, with the log files; and core, with the core files.

Log file analysis

First, look for the log file (for instance, ls logs | grep filename) and open it. The beginning of this log file contains the .tcl that was used for the production. Cut'n paste it to a file in your area in crash_analysis, called for instance reprod.tcl; you will use it later to reprocess the event.
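
The lookup in the crashed, logs and core directories can be done in one go. The following is only a sketch: the function name list_crash_files is hypothetical, and it assumes that the stripped file, the log file and the core file all carry the same file name fragment, which may not hold for every stream.

```shell
# list_crash_files STREAM_DIR NAME   (hypothetical helper, not part of the release)
# STREAM_DIR is the per-stream output directory, i.e.
#   /cdf/scratch/cdfprod0/vnumber/stream
# NAME is the file name fragment you would otherwise pass to grep.
list_crash_files() {
    base=$1
    name=$2
    for sub in crashed logs core; do
        # Skip a subdirectory that does not exist instead of failing.
        [ -d "$base/$sub" ] || continue
        # Quoting "$name" makes it match literally inside the glob.
        for f in "$base/$sub"/*"$name"*; do
            # An unmatched glob stays as a literal pattern; ignore it.
            if [ -e "$f" ]; then
                printf '%s\n' "$f"
            fi
        done
    done
}
```

Called as list_crash_files /cdf/scratch/cdfprod0/vnumber/stream filename, it prints the full path of every matching entry, one per line, so the core file path can be pasted directly into the debugger command later on.
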
Continue scrolling through the log file until you reach the place where the error occurred (look for a line starting with %ERLOG-e). If you do not find one, write "no explicit error message found" in crash.txt; otherwise cut'n paste the error message there.

Debugging the core file

The next step is the debugger analysis of the core file. Go to the home directory of the release (for 5.1.1 simply type go511) and run

    gdb bin/Linux2-KCC_4_0-maxopt/ProductionExe_5.1.1g -core corename

where corename is the full path to the core file in the /cdf/scratch/cdfprod0 area. From inside the debugger, the command where (an alias for backtrace) will lead you to the place where the crash occurred. Copy the first ~10 lines of the stack into the crash.txt file.

Rerunning production

If you suspect the error is not reproducible, you can try rerunning production on this single event. To do that, some modifications to the reprod.tcl file you just created are needed. Give full paths to the source setup_* calls you see at the beginning of the .tcl (e.g. source setup_input.tcl becomes source /cdf/opr/cdfopr/testRel/5.1.1/Production/setup_input.tcl); then comment out the source ProductionSplitter* call; and finally change the include file to the stripped file name for the event under study. An example of what this .tcl has to look like can be found in /cdf/opr/cdfopr/testRel/5.1.1/crash_analysis/campanel/reprod.tcl

Now you are ready to run again on this event. Go to the home directory of your release and run the production executable. The command you issue should look like

    bin/Linux2-KCC_4_0-maxopt/ProductionExe_5.1.1g /cdf/opr/cdfopr/testRel/5.1.1/crash_analysis/yourname/reprod.tcl

Production will now run interactively on that event; this can take up to a few minutes. If the rerun ends with no errors, write in the crash file

    production finished with no errors upon rerun

otherwise write

    ----error message upon rerun----

and cut'n paste the error message.
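
The first two edits to reprod.tcl (full paths for the source setup_* calls, and commenting out the source ProductionSplitter* call) are mechanical and can be sketched with sed. This is an illustration under assumptions, not a tool from the release: the function name fix_reprod_tcl is hypothetical, and it assumes each source call sits at the start of its own line and that the setup directory contains no | or & characters.

```shell
# fix_reprod_tcl IN OUT SETUP_DIR   (hypothetical helper)
# IN        .tcl pasted from the top of the log file
# OUT       edited copy to use for the rerun
# SETUP_DIR directory holding the setup_* files, e.g.
#           /cdf/opr/cdfopr/testRel/5.1.1/Production
fix_reprod_tcl() {
    in=$1
    out=$2
    setup_dir=$3
    # 1st expression: prefix relative "source setup_*" calls with SETUP_DIR.
    #   Calls that already use a full path are left alone.
    # 2nd expression: comment out "source ProductionSplitter*" with a Tcl "#".
    sed -e "s|^\([[:space:]]*source[[:space:]][[:space:]]*\)\(setup_[^/[:space:]]*\)|\1${setup_dir}/\2|" \
        -e 's|^\([[:space:]]*\)\(source[[:space:]][[:space:]]*ProductionSplitter\)|\1# \2|' \
        "$in" > "$out"
}
```

The third edit, pointing the include file at the stripped file for the event, is left to be done by hand, and it is worth comparing the result with the example reprod.tcl mentioned above before running.
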
Valgrind analysis

If no error occurs, you may want to check for memory leaks and other strange behaviour. To do that, rerun the event with valgrind enabled. Go again to the home directory of your release. You will find a subdirectory called results, and inside it many others called ProductionExe.xxxx. These directories contain information about previous valgrind analyses. Since you are starting a new one, increase the biggest number by one and type

    ./cdfopr/scripts/run.sh -J yyyy -e bin/Linux2-KCC_4_0/ProductionExe -I /cdf/opr/cdfopr/testRel/5.1.1/crash_analysis/username/reprod.tcl -n 1:v

where yyyy = xxxx + 1, and username is the name you used for your working directory. The -n 1:v option enables valgrind. A directory results/ProductionExe.yyyy will now be created; in it, look for a file called ProductionExe.yyyy.p.log. This is a normal log file from production, plus valgrind analyses and error messages, i.e. lines starting with ==pid==. Go through the log file and look for these messages. Some of them are always present and are not a problem; you can find a list of known problems in the file valgrind.normal.txt. If your log file contains an error not present in this list, make an entry "Valgrind Analysis:" in the crash.txt file and cut'n paste the messages there. Otherwise note "Valgrind Analysis did not show specific problems".
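
Two of the steps above lend themselves to small helpers: finding yyyy, and pulling the ==pid== lines out of the log. The sketch below uses hypothetical function names (next_run_number, valgrind_messages) and makes two assumptions: the suffix of each ProductionExe.xxxx directory is a plain decimal number, and any zero padding does not need to be preserved in yyyy.

```shell
# next_run_number RESULTS_DIR   (hypothetical helper)
# Prints the highest numeric suffix among RESULTS_DIR/ProductionExe.* plus one.
next_run_number() {
    results=$1
    max=0
    for d in "$results"/ProductionExe.*; do
        n=${d##*.}
        case $n in
            ''|*[!0-9]*) continue ;;   # ignore non-numeric suffixes
        esac
        # Drop leading zeros so the shell does not read the value as octal.
        n=$(printf '%s' "$n" | sed 's/^0*//')
        [ -n "$n" ] || n=0
        if [ "$n" -gt "$max" ]; then
            max=$n
        fi
    done
    echo $((max + 1))
}

# valgrind_messages LOGFILE   (hypothetical helper)
# Prints the non-empty valgrind lines of LOGFILE, in order, with the
# ==pid== prefix removed so that runs with different pids can be compared.
valgrind_messages() {
    sed -n 's/^==[0-9][0-9]*== *\([^ ].*\)$/\1/p' "$1"
}
```

For example, next_run_number results gives the value to pass to -J, and valgrind_messages results/ProductionExe.yyyy/ProductionExe.yyyy.p.log gives the text to check against valgrind.normal.txt. The comparison itself is left manual here, since the layout of that file is not described.
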