Analysis_Control Data Management and its Pitfalls

Liz Sexton-Kennedy

1. Introduction

In the original design of Analysis_Control, data management was strictly the responsibility of the input and output modules. However, these modules had to obey certain guidelines (1) so that user analysis modules could be written which would run with any I/O module. For the most part this has been successful; the only modules that need a specific I/O module are the mixing routines, which need to read from two files.

With the exception of the mixing modules, event data read in from YBOS sequential files are read into one array, IW. Data that is meant to be added to an event must be added to the IW array before the output module is called. It is the responsibility of the input module to delete the old event before reading in the next one. If this is not done, YBOS will overwrite the contents of memory with the contents of the next event for banks that exist in both with the same name and number, while banks without name conflicts will stay in the array. This would mix the data between the two events and be a disaster.

The above describes the main data handling model. There are other special purpose I/O functions needed by CDF analysis software. They are: 1. history of processing information, 2. user parameter definitions, 3. database information and 4. Analysis_Control generated information. Each is discussed in the next four sections.

2. $ Banks Used to Hold History of Processing Information

There are banks which are read in with the begin record (the begin record may be either a begin_run record or a begin_file record) that are expected to remain in the IW array for the duration of the run processing. These are the "$" banks, and for this purpose they usually contain some history of processing information, either cut values that the data set was made with or calibration information that was used.

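
The data handling model described so far (event banks must be deleted before the next read, while "$" banks stay resident) can be illustrated with a toy model. This is only a sketch in Python, assuming IW behaves like a table keyed by bank name and number; the helper functions and the bank names $CUT, TRKS and JETS are made up for illustration and are not real YBOS or Analysis_Control routines or banks.

```python
# Toy model only: IW is represented as a dict keyed by (bank name, bank number).
# read_record and delete_event are hypothetical helpers, not YBOS calls.

def read_record(iw, record):
    """Copy a record's banks into iw; a bank with the same name and number is overwritten."""
    iw.update(record)

def delete_event(iw):
    """Drop the event banks but keep the "$" banks resident."""
    for key in [k for k in iw if not k[0].startswith("$")]:
        del iw[key]

# Hypothetical records: one begin record and two events.
begin_run = {("$CUT", 1): "cuts for run 100"}
event_1 = {("TRKS", 1): "tracks of event 1", ("JETS", 1): "jets of event 1"}
event_2 = {("TRKS", 1): "tracks of event 2"}

# Input module that forgets to delete the old event.
iw = {}
read_record(iw, begin_run)
read_record(iw, event_1)
read_record(iw, event_2)
mixed = dict(iw)  # TRKS comes from event 2, JETS is left over from event 1

# Input module that deletes the old event before reading the next one.
iw = {}
read_record(iw, begin_run)
read_record(iw, event_1)
delete_event(iw)
read_record(iw, event_2)
clean = dict(iw)  # only the "$" bank and the banks of event 2 remain
```

In the first case the JETS bank of event 1 survives alongside the TRKS bank of event 2, which is the mixing described in the Introduction. In the second case only the "$" bank outlives the event.
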
When the next begin record is read in, the "$" bank on the file should overwrite the one in memory, because it may be a new run and therefore need the new information. This, however, is not the default YBOS behavior and leads to one of the design pitfalls I'll discuss later. The user must issue an Analysis_Control command to get this behavior (2). Note also that A_C does not actively delete "$" banks when the run number changes, so there is potential for a problem if the input file happens to have missing begin records.

I believe the need for this sort of information will still exist in Run II. It can either be stored in a run keyed database or in the begin records on the data stream itself. For small sized data the overhead of reading from and writing to a database may be burdensome. The advantage of keeping it in the data stream is that the data set becomes self-contained. This is an important feature for the pad or micro-dst level data set, since there may be a desire to analyse these at remote institutions. We will probably need remote access to the central database; however, the amount of information needed from it should be minimized.

3. $ Banks Used to Hold User Parameter Definitions

Another use of "$" banks is to store Talk_To information. In this case the banks are created in IW before a file is even opened, and it is expected that they remain in memory for the entire job. For this purpose the "$" bank in memory should not be overwritten by a matching bank on the input file. This leads to a conflict in desired behavior between this use and the one described in the previous section. YBOS handles all "$" banks the same way: it doesn't have the capacity to favor the bank on the data file for one bank while at the same time favoring the one in memory for a second bank.

The only reason for having banks in IW is to make those banks part of the data stream.
If the banks are not meant to be part of the data stream, as is often the case for Talk_To "$" banks, then they should be placed in a secondary array. Of course this has never been done, but if it had it would have avoided the above conflict, because you could always prefer the "$" banks on the data file. In Run II, if we use any language besides FORTRAN77, this use of "$" banks should go away. If the bank is not meant to be written out, then it can be any type of object in memory, and a user module may dynamically allocate as many as are needed for each parameter set.

4. Database Information

In addition to event data, calibration data is needed by the analysis modules. This data comes from many different files and is generally read into the KALI secondary array. Analysis_Control does not actively manage the KALI array. It initializes it, but does not flush or read into it. Reads are done when user code makes a database request, and the data can be read into any array the user chooses. However, if IW is chosen a copy will have to be made. Since there is no central management of database accesses, the user code that makes these requests must be in the Begin_Run entry point. For level3 and production this was a strict requirement.

Since this data comes from different files, it is natural to have it stored in a different data structure than the main event in memory. For Run II this should become a strict requirement. In this way the analysis driver could more actively maintain the data by making sure it changes at run transitions, and even allow caching of requests.

5. Analysis_Control Generated Information

User modules can modify the data stream by modifying the banks in IW. In addition, Analysis_Control itself can modify it if directed to do so with run-time user commands. Some of these are TAGC generation on input and TPID and PTHB on output. A common pitfall of TAGC generation is that if there is a run transition in a file without a corresponding begin record, the TAGC generation will fail (3).

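
A file can be screened for this condition (a run transition with no begin record) before it is processed. The following is a minimal sketch in Python, assuming a hypothetical record stream of (kind, run number) pairs; the record layout and the function name are illustrative only and are not part of Analysis_Control or YBOS.

```python
# Hypothetical record layout: (kind, run_number), where kind is "begin" or "event".

def runs_missing_begin(records):
    """Return the run numbers whose events appear without a preceding begin record."""
    missing = []
    current_run = None  # run announced by the most recent begin record
    for kind, run in records:
        if kind == "begin":
            current_run = run
        elif run != current_run and run not in missing:
            missing.append(run)
    return missing

stream = [
    ("begin", 100), ("event", 100), ("event", 100),
    ("event", 101),                  # run transition with no begin record
    ("begin", 102), ("event", 102),
]
print(runs_missing_begin(stream))    # prints [101]
```

The same condition is the one that leaves stale "$" banks in memory, since nothing replaces them when the run number changes without a begin record.
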
The reason users wanted to drop begin records was that for particularly sparse data sets you may have only one good event (or fewer) per run, so that if you kept the originally large begin record you would incur a large overhead. The compromise here was to keep the begin records but minimize their size to just the essential information. There is also a BANK_DUMP utility called DROPREC which will drop begin records for runs in which there are no events.

6. Summary

Most of the above problems occurred because users wanted I/O features that were not originally designed into the system, so solutions were hacked into the design. Also, since FORTRAN77 does not allow run time creation of chunks of memory, the available solutions were limited. In my opinion the whole issue of run by run vs. event by event data management should be rethought for Run II.

----------------------------------------------------------------------------

(1) These guidelines were never written down, so most I/O modules are modified versions of the default input module READ_FILE. While this approach works, it has some deficiencies. When new features are added to READ_FILE (like new A_C INPUT commands or bug fixes), the copied modules are not modified to reflect the change. Of course, if the modules were related through inheritance this would not be a problem.

(2) An example is the $SVD bank used in some SVX analyses.

(3) Brian Winer's 1A top data set had this problem, which made it hard to do trigger studies from it.