CDF Data Handling Usage Guidelines

26 March 2004

WORK IN PROGRESS


Guidelines

  1. DH has finite capabilities, and is used by all. Plan and prepare first.
  2. Know how much data you are processing.
  3. Determine whether your dataset is golden or not.
  4. Determine whether your dataset is in cache or not.
  5. Raw Datasets: special instructions.
  6. B Hadronic Datasets: special instructions.
  7. Deprecated Datasets: special instructions.
  8. If using more than 0.5 TB, 50 filesets, or 500 files not in cache, then request prestaging.
  9. Do not access a data file with more than one CAF segment or concurrent process in your job.
  10. ROOT scripts: Be sure to close datafiles as soon as you are done with them. Do not use TChain::NEvents() to loop over events.
  11. Use the log rule to scale up in developing analyses.
  12. Break up analyses into manageable chunks. Do not re-run an entire analysis due to an error in one chunk.
  13. Test a "new" CAF job on a fileset FIRST.
  14. Be patient, but not too patient.

1. DH has finite capabilities, and is used by all. Plan and prepare first.

        While the CDF Data Handling system has scaled up to meet the challenges of Run II, delivering 10 to 50 TB of data a day to clients, it has its limitations. It does make all CDF data conveniently available on any CDF computing platform, but it is presently limited in its ability to moderate all possible user demands. Any one user could in theory initiate 1600 CAF processes that request tens of data files each. Clearly, DH cannot deliver this demand for 32 TB of data from tape to cache to client instantly, nor can dCache queue tens of thousands of client requests before its control components begin to suffer memory exhaustion. DH services are shared by all, largely without discrimination, but this also couples everyone's experience to others' activity. If one user requests that 10 TB of data be read from tape to the general cache in dCache, then all general cache requests that follow will have to wait in queue behind these requests.

        While development is ongoing to address this, to coordinate data availability and compute job queueing, and to decouple user activity where possible, experience has shown that a few simple guidelines will help analyses complete sooner and in a more predictable fashion.

       Please subscribe to cdfdh-announce@fnal.gov. This is a low volume e-mail list (open subscription, closed submission) where DH downtimes, service problems, and significant new features are announced.

       Please read through the remaining guidelines. This is a minimal list for a novice to avoid the most costly mistakes and/or frustrating experiences. A little planning now will pay off! To move beyond these guidelines, to find solutions or optimizations for less generic analyses, one can read the CDF dCache documentation, send e-mail to cdfdh_oper@fnal.gov, or chat with the CDF DH, Computing, or Offline experts. We have many features available for the big "problems" that can be used now, and not all are generally used.



2. Know how much data you are processing.

It is important that you know not only the number of events you want to analyze, but also the number of files and the total data volume you are going to process. Data Handling has overheads and limitations based on both the number of file requests (each file requires some specific work to be done, independent of file size) and the data volume (TB) that must be moved. For instance, it takes dCache a finite amount of time to handle 1000 file requests: each filename must be translated via PNFS to a unique id, then a message is sent to each of the 100+ pools to ask if they have the file, then dCache waits a reasonable time for responses, and finally it reacts to the response content: in cache or not, and so on. If those 1000 files contain 1 event each, then that data analysis will be very slow (calendar time per event) indeed. All of the guidelines below expect that you know these parameters.
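To make the per-file overhead concrete, here is a rough back-of-the-envelope sketch in Python. The 1000 file requests and the pool count of 100 come from the paragraph above (which says "100+", so this is a lower bound). The function name is purely illustrative and is not part of any DH tool.

```python
# Rough estimate of the per-file bookkeeping dCache performs,
# independent of how many events or bytes each file holds.

def pool_queries(n_files, n_pools=100):
    """Number of 'do you have this file?' messages: one per file per pool."""
    return n_files * n_pools

n_files = 1000
queries = pool_queries(n_files)
print(f"{n_files} file requests -> at least {queries} pool queries")

# With 1 event per file, every single event costs a full round of pool queries.
events = n_files * 1
queries_per_event = queries / events
print(f"pool queries per event: {queries_per_event:.0f}")
```

The same number of events packed into fewer, larger files would need proportionally fewer queries, which is why the file count matters as much as the data volume.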

The number of files in a dataset: you can use the CDF Database Browser. (Best way?)

The data volume of a dataset: you can use the CDF Database Browser. Request the "DC: Datasets with long statistics" report and select the book name. FILECATALOG is the "main" book, often indicated by a blank name in utilities. Then select "Submit request". Select a datasetid and any other restrictions you want, and "submit request". The total size is in one of the fields in the result table, and is in GB.



3. Determine whether your dataset is golden or not.

Golden Datasets are preloaded into dCache and periodically checked for integrity. Preloading of datasets into cache is for user convenience; all CDF data is accessible in dCache from all CDF computing platforms. That convenience can be crucial however, since you will still be able to access golden datasets even when the Enstore tape restore requests are backlogged.

To see the list of datasets in golden service, go to the main CDF DH web page. The link dCache Golden Datasets Status points to the most recently updated list of datasets, the number of files and data volume of each, as well as the amount of space left in the golden sub-cache.

All requests for a dataset to be put into golden service are handled by the Physics Groups and collected by the Physics Coordinator. CDF DH only adds datasets to golden service on the recommendation of the Physics Coordinator or the CDF Computing management.

If your dataset is golden, then you do not need to contact CDF DH ops before opening large volumes of it. However, the other guidelines listed here still apply.



4. Determine whether your dataset is in cache or not.

Use the utility DH_DCacheCheck to check what fraction of your dataset is in cache. This will help with a guideline below: if most of it is in cache, then you can just submit your CAF job; otherwise you should coordinate with CDF DH to avoid tape request backlogs, long delays in data access, and service problems.

% DH_DCacheCheck -h
will list the command switches that can be used with this utility to check some or all of a dataset, fileset, or file, or to copy the same to a local disk. For example,

% DH_DCacheCheck -d rtop09 -b cdfptop -n 0

will check all filesets (-n 0) of datasetid rtop09 in book cdfptop. The name "filecatalog" or an omitted -b switch will access the "central" FILECATALOG book.



5. Raw Datasets: special instructions.

OLD GUIDELINE TEXT: Do not perform large (>= 20 filesets) raw dataset skimming jobs on the CAF without detailed consultation with the associated Physics Group and the Data Handling Group beforehand. There are many good reasons not to do this anyway, and consulting with experts should help determine a more efficient way to accomplish the same task. dCache was intended to serve production output and secondary datasets. There is only limited space and bandwidth in dCache dedicated to serving raw datasets. Trying to use large numbers of CAF sections to read raw data will simply cause contention for these pools and slow your processing overall. Also, reading large amounts of raw data can be made more efficient with subtle tweaks to input module configuration, something that is not required for production output and secondary datasets. Raw data skimming jobs have at times effectively denied access to analysis jobs reading secondary datasets in dCache, so this is not a theoretical concern. If it has to be done, let us work together to find a way to do so efficiently.



6. B Hadronic Datasets: special instructions.

OLD GUIDELINE TEXT: If you plan to access one of the B Hadronic datasets, hbhd08, hbhd09, or hbhd0c, then please contact cdfdh_oper@fnal.gov and Marco Rescigno (rescigno@fnal.gov). Marco is coordinating cache access to these very large datasets. DETAILS: These very large datasets have their own volatile sub-cache defined. Marco and CDF DH are trying to coordinate what portion of these datasets is in cache for a period of time to optimize re-use of the files in cache. The working plan is to have about 1/2 of one of these datasets in cache for a week or two, then move on to another 1/2 of a dataset. Note that xbhd0c is golden. Use xbhd0c if you can, instead of hbhd0c.



7. Deprecated Datasets: special instructions.

We will eventually accumulate a list of deprecated datasets, whose use will be restricted by assigning only a small number of pools in dCache to serve them. The only deprecated dataset at this time is gjet08. Use gqcd0g, gqcd1g, gqcd2g, gqcd3g, and/or gqcd4g instead of gjet08. These datasets are split solely on trigger bits and thus contain all the objects and events of gjet08.



8. If using more than 0.5 TB, 50 filesets, or 500 files not in cache, then request prestaging.

OLD GUIDELINE TEXT: If you plan to access a larger non-golden dataset (>= 50 filesets) via dCache for the first time, please contact cdfdh_oper@fnal.gov first. dCache treats all stage requests in a FIFO-like restore queue, whether a request is part of 2000 files requested by one user or part of 2 files requested by another. CDF DH would like to spare short jobs from severe waits by intelligently pre-staging datasets for long jobs before the long jobs are run on the CAF. It is in your interest to warn cdfdh_oper rather than risk exceeding the time limits in the CAF, which would result in losing all your partially completed work. cdfdh_oper offers to pre-stage what you need so that you neither waste your time nor unnecessarily delay other jobs. This sort of tape-access resource management will be provided by SAM in the future, so there has been no effort to graft it onto the dCache service.
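The thresholds in this guideline's title can be written down as a simple check. The sketch below is only an illustration in Python: the three limits come from the title (read as "more than", whereas the old text above says ">= 50 filesets"), while the function name and the example numbers are made up. The inputs would be the portion of your request that is not in cache, for instance as reported by DH_DCacheCheck.

```python
# Thresholds taken from the guideline title; everything else is illustrative.
MAX_TB = 0.5
MAX_FILESETS = 50
MAX_FILES = 500

def needs_prestaging(uncached_tb, uncached_filesets, uncached_files):
    """True if the data NOT yet in cache exceeds any one of the thresholds."""
    return (uncached_tb > MAX_TB
            or uncached_filesets > MAX_FILESETS
            or uncached_files > MAX_FILES)

# Hypothetical requests: a small one, and one with many uncached files.
print(needs_prestaging(0.2, 10, 120))
print(needs_prestaging(0.2, 10, 800))
```

If the check comes back true, the action is the one described above: contact cdfdh_oper@fnal.gov before submitting the job.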



9. Do not access a data file with more than one CAF segment or concurrent process in your job.

Examples of multiple processes opening the same file at the same time:
  1. A user submits a CAF job with poor input specifications, and all his/her segments attempt to access the same files. The user should kill the job and fix the input specs.
  2. A user organizes a set of CAF jobs to run on the same data with slightly different module parameters used in each job. The user should re-organize the job to use module cloning or some such mechanism to read the same data only once.
  3. A user organizes a set of CAF jobs to treat individual runs separately, one job per run. This is much less frequent. The user should contact CDF DH to consider a more efficient means of achieving the same goal, given the characteristics of the DH and CAF systems.

OLD GUIDELINE TEXT: Do not access a data file with more than one section or concurrent process in your job. DETAILS: In order to reduce the calendar time it takes to process a single data file, some users have constructed jobs that divide a data file into run ranges and then have many sections each treat a separate piece of the data file. While this is technically feasible in the CDF Framework, the services CDF DH uses for distributed data caching are not compatible with this model. For instance, dCache only allows a certain number of clients to access files in any one pool in order to manage bandwidth in a file server. This model assumes that clients are processing data in a particular way, dividing up datasets into files and having each section treat one or more whole files. Jobs that try to read individual data files with more than one process will start to queue up once the limit is reached, but worse, they will block all other client access to the 600+ data files in the affected pool. Also, since the caching system assumes each section intends to read the entire file, it detects this pattern as a spike in demand for that file and replicates the file N times to other pools. This needlessly wastes disk space in cache, causing other users' analyses to suffer. These "blockages" have caused a host of other problems as well, such as stalled database connections that are sniped after one hour. Data Handling is done in file units and any data processing done on a finer scale is not supported.
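The supported model, in which each section treats one or more whole files, can be sketched as follows. This is an illustrative Python sketch only: the file names are hypothetical, and real CAF jobs specify their input through the job's own input configuration rather than through a helper like this.

```python
def assign_files(files, n_segments):
    """Deal whole files out to segments round-robin; no file is ever shared."""
    segments = [[] for _ in range(n_segments)]
    for i, name in enumerate(files):
        segments[i % n_segments].append(name)
    return segments

files = [f"file_{i:03d}" for i in range(10)]   # hypothetical file names
segments = assign_files(files, 3)
for number, chunk in enumerate(segments, start=1):
    print(f"segment {number}: {chunk}")

# Sanity check: every file appears in exactly one segment.
flat = [name for chunk in segments for name in chunk]
assert sorted(flat) == sorted(files)
```

A quick check like the final assertion, run against your own input lists before submission, would catch the first failure mode listed above, where all segments end up pointing at the same files.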



10. ROOT scripts: Be sure to close datafiles as soon as you are done with them. Do not use TChain::NEvents() to loop over events.

DCache can serve ntuple files as well as CDF EDM datafiles, and is used in that mode by many. However, there are some caveats specific to using ROOT scripts with datafiles in dCache.

Be sure to close datafiles as soon as you are done with them: We have seen instances where a single ROOT script opened the same file 20-30 times without closing it. While this is harmless when done on a local file, it amounts to a denial-of-service attack on a dCache pool. We limit the number of files that can be open in a dCache pool to prevent the hardware from being overloaded: 32 is typically the limit for large general cache pools and 64 the limit for golden cache pools. In this case, the script consumed the entire open-file quota, blocking access to 1.6 TB of datafiles in that pool by any other CDF user. User open requests were queued internally by dCache until an open-file slot became available (which was not going to happen; the script had to be killed). The users simply saw their jobs stall in a file open(), even though that file checked as in cache!

Do not use TChain::NEvents() to loop over events: This may seem harmless when used on a small scale with local files. On dCache, however, it can affect all CDF DH users for a time. This method causes every file in the chain to be opened, the number of events queried, then closed again. This is a tremendous, near instantaneous load on dCache just for the sake of driving an event loop. The alternative is to write your event loop using iterators, which then open each file only once to read event data.

(EXAMPLE OF THIS KIND OF LOOP - to be inserted here)
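Until a ROOT example is available, the access pattern itself can be illustrated in Python. This is a sketch of the idea only, not ROOT code: FakeFile is a purely hypothetical stand-in for an ntuple reader that merely counts opens, and the catalog contents are made up. The point is that each file is opened once, read, and closed before the next one is touched, with no preliminary pass to count events.

```python
# Sketch of a file-at-a-time event loop. FakeFile stands in for a real
# ntuple reader and is purely hypothetical; it only counts opens.
from contextlib import closing

class FakeFile:
    opens = 0       # total number of opens across all files
    open_now = 0    # files currently open
    max_open = 0    # highest number of files open at the same time

    def __init__(self, name, events):
        self.name, self.events = name, events
        FakeFile.opens += 1
        FakeFile.open_now += 1
        FakeFile.max_open = max(FakeFile.max_open, FakeFile.open_now)

    def __iter__(self):
        return iter(self.events)

    def close(self):
        FakeFile.open_now -= 1

def iter_events(catalog):
    """Yield events file by file; each file is opened once and closed promptly."""
    for name, events in catalog.items():
        with closing(FakeFile(name, events)) as f:
            yield from f

catalog = {"a.root": [1, 2, 3], "b.root": [4, 5], "c.root": [6]}  # made-up data
processed = [event for event in iter_events(catalog)]
print(processed, FakeFile.opens, FakeFile.max_open)
```

Because the loop never asks for the total number of events up front, the number of opens equals the number of files, and at most one file is open at any moment.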



11. Use the log rule to scale up in developing analyses.

OLD GUIDELINE TEXT: Please use the "log rule" when developing a physics analysis program. Be sure it runs correctly with a single input file first before running on larger chunks of data. Be sure it runs on a fileset correctly before running it on 10 filesets. And be very sure it works correctly -and- produces all the ntuples and plots you will need for a while before running it on larger chunks of data like 100 filesets and/or a full dataset. The log rule was once an oft-quoted guideline for bringing up analyses, to ensure that precious resources (tape access, CPU cycles) are not needlessly wasted on immature analyses. The economy of some of these resources may have changed nowadays, but this still makes sense overall. As more and more physics analyses mature and machine luminosity increases, once plentiful resources will again be in short supply.
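The progression described above (one file, one fileset, 10 filesets, 100 filesets, full dataset) can be generated mechanically. The Python sketch below is only an illustration; the function name and the dataset size of 350 filesets are made up.

```python
def log_rule_stages(total_filesets):
    """Input sizes in filesets: 1, 10, 100, ... ending with the full dataset."""
    stages = []
    size = 1
    while size < total_filesets:
        stages.append(size)
        size *= 10
    stages.append(total_filesets)
    return stages

print("first: one input file")
for n in log_rule_stages(350):   # 350 filesets is a made-up dataset size
    print(f"then: {n} fileset(s)")
```

Each stage should run cleanly, and produce the ntuples and plots you need, before the next one is submitted.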



12. Break up analyses into manageable chunks. Do not re-run an entire analysis due to an error in one chunk.

OLD GUIDELINE TEXT: Try to break up large processing jobs reading whole datasets (or many tens of filesets) into separate jobs which together span the desired input sample, and submit these jobs separately. If there is a problem or error in one section or chunk of the dataset processed, ONLY re-run on that portion of the data sample. How you divide up the dataset is up to you, but dividing it into large blocks of consecutive runs used to be common practice. Our current Framework-DH system has no automated means of identifying failed sections for re-processing, nor of suspending sections across a computer system downtime. This is something that will be provided in part by SAM in the future. Users must do this manually for now, and it is of course preferred that this be done in a manner that does not re-run already successfully processed sections.
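Since this bookkeeping is manual for now, a small script can help. The following Python sketch is one possible approach, not an existing DH or CAF tool: it splits a run range into blocks of consecutive runs and lists only the blocks that have not yet succeeded. The run numbers, the block size, and the set of succeeded chunks are all made up.

```python
def make_chunks(first_run, last_run, runs_per_chunk):
    """Split an inclusive run range into consecutive (start, end) blocks."""
    chunks = []
    start = first_run
    while start <= last_run:
        end = min(start + runs_per_chunk - 1, last_run)
        chunks.append((start, end))
        start = end + 1
    return chunks

def chunks_to_rerun(chunks, succeeded):
    """Only the chunks not yet recorded as successfully processed."""
    return [c for c in chunks if c not in succeeded]

chunks = make_chunks(1000, 1999, 250)                    # made-up run numbers
succeeded = {(1000, 1249), (1250, 1499), (1750, 1999)}   # kept by the user
print(chunks_to_rerun(chunks, succeeded))
```

Keeping the record of succeeded chunks somewhere persistent (a text file is enough) lets you resume after a downtime without re-reading data that was already processed.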



13. Test a "new" CAF job on a fileset FIRST.

OLD GUIDELINE TEXT: When moving a job to the CAF from another environment (fcdfsgi2, your desktop, old CAF to new CAF, etc.), always run it on a fileset first to be sure there are no environmental problems (bad link libraries, typos in tcl scripts) that will lead to widespread job crashes. DETAILS: Each environment has its own idiosyncrasies, and a little caution up front improves efficiency overall. Currently, there are at least 3 different versions of Linux and one of IRIX in the CDF computing environment, and even a "correct" program can crash badly if run against incompatible system libraries.



14. Be patient, but not too patient.

OLD GUIDELINE TEXT: Your executable may stall while trying to open a file in dCache for a number of reasons, some of which are completely normal. If the file must be restored from tape, then tape contention (busy Enstore) may delay that restore for minutes or even hours. Nearly infinite stalls can occur when a user tries to restore a file from tape where that tape is marked NO ACCESS by Enstore (tape is being recovered by hand, can take days). If the Enstore queue is not too long and your job has been waiting on a file open for more than an hour, then it may be prudent to send e-mail to cdfdh_oper@fnal.gov with the job id and if possible the filename being opened. To help us understand the problem, one can set "setenv DCACHE_DEBUG 2" for one's job to produce a paper trail of the dCache client-server messaging. And if it is possible to do so, one should leave a troublesome section running so that dCache operators and developers can investigate the precise state of the server when the client has data delivery problems.



R. D. Kennedy 26 March 2004