Recommendations on Additional Requirements for Run 2 Production System

Ping Yeh, Paoti Chang
Academia Sinica
Dec 14, 1998

As the Run 1 production error list shows, many types of errors occurred
during data production in Run 1. Most of them could have been avoided if
more thought had been put into error prevention at the design stage of
the Run 1 production system. The conclusion we draw from the Run 1 error
list is that we should seriously consider adding requirements to the
design of the Run 2 production system. We have come up with 3 conceptual
requirements and a number of technical requirements as a reflection on
the error list.

This list is intended to complement CDF note 4810, so we assume the
reader is familiar with the concepts elaborated there.

Our ``additional'' requirement list as of 14 December 1998 follows:

-------------------------------
Concepts
-------------------------------

1. Error definition / prevention / identification / handling.
   A list of errors that could happen should be drawn up at the
   beginning, so that people can work out how to
   . prevent them in the design stage,
   . identify them at run time, and then
   . handle them.
   The next 2 concepts come from the philosophy of error prevention.

2. Module decoupling.
   The couplings among tasks should be minimized. For example,
   reconstruction and tape reading should not be in the same job.
   This seems to be the consensus right now.

3. Assertion.
   Each task or module should have an assertion statement that is
   checked before it executes. For example, a typical assertion could
   be "there are files for me to process, and all resources needed for
   this process are available (including database constants, CPU,
   output disk, ... etc)." If the assertion fails, the process should
   not start.
   ==> This leads to allocation of resources like output disks, which
   was already implemented in Run 1 but inside the job script. So jobs
   failed *AFTER* they were submitted and started, which led to manual
   cleanups.
   Jobs should not be submitted at all when the assertion fails.

-------------------------------
Some more detailed requirements
-------------------------------

3. Job priorities.
   Reprocessing failed jobs must have higher priority than processing
   new jobs because PAD merging requires files from a complete run.
   Priorities are also needed in other cases. We have to make sure FBS
   supports priorities. To be specific, even if a high-priority job is
   submitted later than a low-priority job, it should start first when
   resources become available.

4. Correct bookkeeping record of online data logger.
   In many cases it was found that incorrect bookkeeping of raw data
   led to reprocessing. We need the bookkeeping record to be correct,
   in that it
   . has one record for each file that is written to tape,
   . has correct attributes for each file,
   to prevent errors like "incomplete run due to accelerator shutdown
   or data logger crash", "bad runs", ... etc.

5. Up-to-date database constants.

6. Ability to reprocess from DST in case the RAW data tape is
   unreadable.
   This is already written in CDF note 4810.

7. Knowledge about limits imposed by hardware or operating system.
   The limits should be understood as much as possible in advance to
   prevent errors like "stuck rsh to local" or "can't write to file
   systems more than 96% full".

8. Up-to-date database constants.
   The reconstruction job should be able to find out when the databases
   were last updated, so that it can use this in its assertion.

9. A notification scheme when a job finishes.
   The control program must know whether a job succeeded or failed and,
   if it failed, why.

10. Flexibility in output.
    For example, in some cases it is desirable to switch off PAD output.

11. Easy to change reconstruction/split UIC.
    In Run 1 the UIC files were made by a Fortran program, which made it
    time-consuming to change the UIC to accommodate new triggers or new
    output banks, or to change the UIC as a whole for the 630 GeV run.

12. Able to reprocess DST events with the newest run constants.
    For example, we may only need to redo SVX tracking since it requires
    more accurate alignment constants.

13. We have to communicate closely with the data handling group to
    understand the information kept about raw data files on tapes and
    the staging requirements for the DSTs/PADs. The bookkeeping files
    should contain enough information and be easy to access and modify.
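
As an illustration of the assertion concept and the job priority
requirement above, the following is a minimal sketch in Python, not a
proposed implementation. All names in it (Job, assertion_failures,
SubmissionQueue) are hypothetical and do not correspond to FBS or to
any Run 1 code. It assumes that a smaller priority number means a more
urgent job, and it takes its default disk limit from the "96% full"
error quoted above. The assertion is evaluated before submission, so a
job whose assertion fails is never handed to the batch system.

```python
import heapq
import itertools
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Job:
    # Hypothetical job description, for illustration only.
    name: str
    input_files: list          # files this job must read
    output_dir: str            # disk area the job writes to
    output_bytes_needed: int   # estimated size of the output
    priority: int = 10         # assumed convention: smaller = more urgent


def assertion_failures(job, max_fill=0.96):
    """Return the reasons the job must not start (empty list = may start)."""
    reasons = []
    if not job.input_files:
        reasons.append("no input files to process")
    for f in job.input_files:
        if not Path(f).is_file():
            reasons.append(f"missing input file: {f}")
    try:
        usage = shutil.disk_usage(job.output_dir)
    except OSError:
        reasons.append(f"output area not accessible: {job.output_dir}")
        return reasons
    # Refuse to start if the output would push the disk past the limit.
    if usage.used + job.output_bytes_needed > max_fill * usage.total:
        reasons.append("output disk would exceed fill limit")
    return reasons


class SubmissionQueue:
    """Hands out jobs in priority order, skipping those whose assertion fails."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # keeps submission order within a priority

    def add(self, job):
        heapq.heappush(self._heap, (job.priority, next(self._counter), job))

    def next_ready(self):
        """Return (job, rejected): the first job that may start, plus the
        (job, reasons) pairs for jobs rejected on the way."""
        rejected = []
        while self._heap:
            _, _, job = heapq.heappop(self._heap)
            reasons = assertion_failures(job)
            if not reasons:
                return job, rejected
            rejected.append((job, reasons))
        return None, rejected
```

In this sketch a reprocessing job added with priority=1 is returned by
next_ready() before a new job added earlier with priority=10, and the
list of reasons for each rejected job gives the control program
something concrete to report. A real system would also have to check
database constants and CPU availability, which are left out here.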