CDF Data Handling Usage Guidelines
26 March 2004
WORK IN PROGRESS
Guidelines
- DH has finite capabilities, and is used by all. Plan and prepare first.
- Know how much data you are processing.
- Determine whether your dataset is golden or not.
- Determine whether your dataset is in cache or not.
- Raw Datasets: special instructions.
- B Hadronic Datasets: special instructions.
- Deprecated Datasets: special instructions.
- If using more than 0.5 TB, 50 filesets, or 500 files not in cache, then request prestaging.
- Do not access a data file with more than one CAF segment or concurrent process in your job.
- ROOT scripts: Be sure to close datafiles as soon as you are done with them. Do not use TChain::NEvents() to loop over events.
- Use the log rule to scale up in developing analyses.
- Break up analyses into manageable chunks. Do not re-run an entire analysis due to an error in one chunk.
- Test a "new" CAF job on a fileset FIRST.
- Be patient, but not too patient.
1. DH has finite capabilities, and is used by all. Plan and prepare first.
While the CDF Data Handling system has scaled up to meet the challenges
of Run II, delivering 10 to 50 TB of data a day to clients, it has its
limitations. It does make all CDF data conveniently available on any
CDF computing platform, but it is presently limited in its ability to
moderate all possible user demands. Any one user could in theory
initiate 1600 CAF processes that request tens of data files each.
Clearly, DH cannot deliver this demand for 32 TB of data from tape to
cache to client instantly, nor can dCache queue tens of thousands of
client requests before its control components begin to suffer memory
exhaustion. DH services are shared by all, largely without
discrimination, which also couples everyone's experience to others'
activity. If one user requests that 10 TB of data be read from tape to
the general cache in dCache, then all general cache requests that
follow will have to wait in queue behind these requests.
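The 32 TB figure above follows from simple arithmetic. The sketch below
reproduces it under two illustrative assumptions that are not stated in
this guideline: 20 files per process (for "tens of data files each")
and 1 GB per file.

```python
# Back-of-the-envelope demand estimate for one user's worst-case job.
# ASSUMPTIONS (not from the guideline): 20 files per process, 1 GB per file.
CAF_PROCESSES = 1600        # from the text above
FILES_PER_PROCESS = 20      # assumed value for "tens of data files each"
FILE_SIZE_GB = 1.0          # assumed typical file size


def demand(processes, files_per_process, file_size_gb):
    """Return (number of file requests, total volume in TB)."""
    n_files = processes * files_per_process
    volume_tb = n_files * file_size_gb / 1000.0
    return n_files, volume_tb


n_files, volume_tb = demand(CAF_PROCESSES, FILES_PER_PROCESS, FILE_SIZE_GB)
print(f"{n_files} file requests, {volume_tb:.0f} TB")
```

With these assumed inputs a single user generates tens of thousands of
file requests, which is the queue depth the paragraph above warns about.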
While development is ongoing to address this, to coordinate data
availability and compute job queueing, and to decouple user activity
where possible, experience has shown that a few simple guidelines will
help analyses complete sooner and in a more predictable fashion.
Please subscribe to
cdfdh-announce@fnal.gov. This is a low volume e-mail list (open
subscription, closed submission) where DH downtimes, service problems,
and significant new features are announced.
Please read through
the remaining guidelines. This is a minimal list for a novice to
avoid the most costly mistakes and/or frustrating experiences. A little
planning now will pay off! To move beyond these guidelines, to find
solutions or optimizations for less generic analyses, one can read the
CDF dCache documentation, send e-mail to cdfdh_oper@fnal.gov, or chat
with the CDF DH, Computing, or Offline experts. We have many features
available for the big "problems" that can be used now, and not all are
generally used.
2. Know how much data you are processing.
It is important that you know not only the number of events you want to
analyze, but also the number of files and the total data volume you are
going to process. Data Handling has overheads and limitations based on
both the number of file requests (each file requires some specific
work to be done, independent of file size) and the data volume (TB)
that must be moved. For instance, it takes dCache a finite amount of
time to handle 1000 file requests: each filename must be translated
via PNFS to a unique id, then a message is sent to each of the 100+
pools to ask whether they have the file, then dCache waits a reasonable
time for responses and reacts to their content (in cache or not, and so
on). If those 1000 files contain 1 event each, then that data analysis
will be very slow (calendar time per event) indeed. All of the
guidelines below expect that you know these parameters.
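The trade-off between file count and data volume can be sketched with a
toy cost model. The per-file overhead and transfer rate below are
placeholder values chosen for illustration, not measured dCache
numbers; only the "100 pools" figure comes from the text above.

```python
# Toy cost model: total time = per-file overhead + data transfer time.
# ASSUMPTIONS: overhead_s_per_file and seconds_per_gb are placeholders,
# not measured dCache numbers.

def pool_queries(n_files, n_pools=100):
    """Each file request is sent to every pool (100+ per the text)."""
    return n_files * n_pools


def estimated_seconds(n_files, volume_gb, overhead_s_per_file=1.0,
                      seconds_per_gb=20.0):
    """Rough wall-clock estimate for serving n_files totalling volume_gb."""
    return n_files * overhead_s_per_file + volume_gb * seconds_per_gb


# Same data volume (10 GB), packaged as 10 large files or 1000 tiny files.
few_large = estimated_seconds(n_files=10, volume_gb=10.0)
many_small = estimated_seconds(n_files=1000, volume_gb=10.0)
print(pool_queries(1000))       # pool lookups for 1000 file requests
print(few_large, many_small)    # the per-file term dominates for tiny files
```

Whatever the real constants are, the per-file term grows with the number
of files even when the data volume stays fixed, which is why both
parameters need to be known.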
The number of files in a dataset: you can use the CDF Database Browser.
(Best way?)
The data volume of a dataset: you can use the CDF Database Browser.
Request the "DC: Datasets with long statistics" report and select the
book name. FILECATALOG is the "main" book, often indicated by a blank
name in utilities. Then select "Submit request". Select a datasetid and
any other restrictions you want, and "submit request". The total size
is in one of the fields in the result table, and is in GB.
3. Determine whether your dataset is golden or not.
Golden Datasets are preloaded into dCache and periodically checked for
integrity. Preloading of datasets into cache is for user convenience;
all CDF data is accessible in dCache from all CDF computing platforms.
That convenience can be crucial however, since you
will still be able to access golden datasets even when the Enstore tape
restore requests are backlogged.
To see the list of datasets in golden service, go to the main CDF DH
web page. The link "dCache Golden Datasets Status" points to the most
recently updated list of datasets, the number of files and data volume
of each, as well as the amount of space left in the golden sub-cache.
All requests for a dataset to be put into golden service are handled by
the Physics Groups, and collected by the Physics Coordinator. CDF DH
only adds datasets to golden service on the recommendation of the
Physics Coordinator or the CDF Computing management.
If your dataset is golden, then you do not need to contact CDF DH ops
before opening large volumes of it. However, the other guidelines
listed here still apply.
4. Determine whether your dataset is in cache or not.
Use the utility DH_DCacheCheck to check what fraction of your dataset
is in cache. This will help with a guideline below: if most of it is in
cache, then you can just submit your CAF job; otherwise you should
coordinate with CDF DH to avoid tape request backlogs, long delays in
data access, and service problems.
% DH_DCacheCheck -h
will list the command switches that can be used with this utility to
check some or all of a dataset, fileset, or file, or to copy the same
to a local disk. For example,
% DH_DCacheCheck -d rtop09 -b cdfptop -n 0
will check all filesets (-n 0) of datasetid rtop09 in book cdfptop. The
name "filecatalog" or an omitted -b switch will access the "central"
FILECATALOG book.
5. Raw Datasets: special instructions.
OLD GUIDELINE TEXT:
Do not perform large (>= 20 filesets) raw dataset skimming jobs on the
CAF without detailed consultation with the associated Physics Group and
the Data Handling Group beforehand. There are many good reasons not to
do this anyway, and consulting with experts should help determine a
more efficient way to accomplish the same task. dCache was intended to
serve production output and secondary datasets. There is only limited
space and bandwidth in dCache dedicated to serving raw datasets. Trying
to use large numbers of CAF sections to read raw data will simply cause
contention for these pools and slow your processing overall. Also,
reading large amounts of raw data can be made more efficient with
subtle tweaks to input module configuration, something that is not
required for production output and secondary datasets. Raw data
skimming jobs have at times effectively denied access to analysis jobs
reading secondary datasets in dCache, so this is not a theoretical
concern. If it has to be done, let us work together to find a way to do
so efficiently.
6. B Hadronic Datasets: special instructions.
OLD GUIDELINE TEXT:
If you plan to access one of the B Hadronic datasets, hbhd08, hbhd09,
or hbhd0c, then please contact cdfdh_oper@fnal.gov and Marco Rescigno
(rescigno@fnal.gov). Marco is coordinating cache access to these very
large datasets.
DETAILS: These very large datasets have their own volatile sub-cache
defined. Marco and CDF DH are trying to coordinate what portion of
these datasets is in cache for a period of time to optimize re-use of
the files in cache. The working plan is to have about 1/2 of one of
these datasets in cache for a week or two, then move on to another 1/2
of a dataset.
Note that xbhd0c is golden. Use xbhd0c if you can, instead of hbhd0c.
7. Deprecated Datasets: special instructions.
We will eventually accumulate a list of deprecated datasets, whose use
will be restricted by assigning only a small number of pools in dCache
to serve them. The only deprecated dataset at this time is gjet08. Use
gqcd0g, gqcd1g, gqcd2g, gqcd3g, and/or gqcd4g instead of gjet08. These
datasets are split solely on trigger bits and thus contain all the
objects and events of gjet08.
8. If using more than 0.5 TB, 50 filesets, or 500 files not in cache, then request prestaging.
OLD GUIDELINE TEXT:
If you plan to access a larger non-golden dataset (>= 50 filesets) via
dCache for the first time, please contact cdfdh_oper@fnal.gov first.
dCache treats all stage requests in a FIFO-like restore queue, whether
a request is part of 2000 files requested by one user or part of 2
files requested by another. CDF DH would like to spare short jobs from
severe waits by intelligently pre-staging datasets for long jobs before
the long jobs are run on the CAF. It is in your interest to warn
cdfdh_oper rather than risk exceeding the time limits in the CAF, which
would result in losing all your partially completed work. cdfdh_oper
offers to pre-stage what you need so that you neither waste your time
nor unnecessarily delay other jobs. This sort of tape-access resource
management will be provided by SAM in the future, so there has been no
effort to graft such onto the dCache service.
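The thresholds in this guideline's heading can be written as a simple
check. This is a minimal sketch, assuming that exceeding any one of the
three limits is enough to warrant a prestaging request and that all
three amounts refer to data not yet in cache; the function name is
hypothetical and is not part of any DH tool.

```python
# Thresholds from the guideline heading; amounts refer to data NOT in cache.
MAX_TB = 0.5
MAX_FILESETS = 50
MAX_FILES = 500


def needs_prestaging(tb_not_cached, filesets_not_cached, files_not_cached):
    """True if any one threshold is exceeded (assumed 'any of' reading)."""
    return (tb_not_cached > MAX_TB
            or filesets_not_cached > MAX_FILESETS
            or files_not_cached > MAX_FILES)


# Example: 0.2 TB spread over 12 filesets but 640 uncached files.
print(needs_prestaging(0.2, 12, 640))   # the file count alone triggers it
```

The inputs would come from whatever tally of uncached data you already
have for your dataset; the sketch does not query dCache itself.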
9. Do not access a data file with more than one segment or concurrent process in your job.
Examples of multiple processes opening the same file at the same time:
- A user submits a CAF job with poor input specifications, and all
his/her segments attempt to access the same files. The user should kill
the job and fix the input specs.
- A user organizes a set of CAF jobs to run on the same data with
slightly different module parameters used in each job. The user should
re-organize the job to use module cloning or some such mechanism to
read the same data only once.
- A user organizes a set of CAF jobs to treat individual runs
separately, one job per run. This is much less frequent. The user
should contact CDF DH to consider a more efficient means of achieving
the same goal, given the characteristics of the DH and CAF systems.
OLD GUIDELINE TEXT:
Do not access a data file with more than one section or concurrent
process in your job.
DETAILS: In order to reduce the calendar time it takes to process a
single data file, some users have constructed jobs that divide a data
file into run ranges and then have many sections each treat a separate
piece of the data file. While this is technically feasible in the CDF
Framework, the services CDF DH uses for distributed data caching are
not compatible with this model. For instance, dCache only allows a
certain number of clients to access files in any one pool in order to
manage bandwidth in a file server. This model assumes that clients are
processing data in a particular way, dividing up datasets into files
and having each section treat one or more whole files. Jobs that try to
read individual data files with more than one process will start to
queue up once the limit is reached, but worse, they will block all
other client access to the 600+ data files in the affected pool. Also,
since the caching system assumes each section intends to read the
entire file, it detects this pattern as a spike in demand for that file
and replicates the file N times to other pools. This needlessly wastes
disk space in cache, causing other users' analyses to suffer. These
"blockages" have caused a host of other problems as well, such as
stalled database connections that are sniped after one hour. Data
Handling is done in file units, and any data processing done on a finer
scale is not supported.
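One way to respect the whole-file rule is to build the input lists for
all sections in one place and verify that no file is assigned twice.
The sketch below is illustrative only: the function names and the
round-robin assignment are assumptions, not how the CAF or the CDF
Framework distributes input.

```python
def assign_files(files, n_sections):
    """Deal whole files out to sections round-robin; no file is shared."""
    unique = list(dict.fromkeys(files))   # drop duplicate names, keep order
    sections = [[] for _ in range(n_sections)]
    for i, name in enumerate(unique):
        sections[i % n_sections].append(name)
    return sections


def shared_files(sections):
    """Return files that appear in more than one section (should be empty)."""
    seen, shared = set(), set()
    for section in sections:
        for name in set(section):
            if name in seen:
                shared.add(name)
            seen.add(name)
    return shared


# Hypothetical file names; "a" is listed twice by mistake.
sections = assign_files(["a", "b", "c", "d", "e", "a"], n_sections=2)
print(sections)
print(shared_files(sections))   # empty set: safe to submit
```

Running shared_files on hand-written input specifications before
submitting catches the first failure mode listed above, where all
segments end up pointing at the same files.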
10. ROOT scripts: Be sure to close datafiles as soon as you are done with them. Do not use TChain::NEvents() to loop over events.
DCache can serve ntuple files as well as CDF EDM datafiles, and is used
in that mode by many. However, there are some caveats specific to using
ROOT scripts with datafiles in dCache.
Be sure to close datafiles as soon as you are done with them: We have
seen instances where a single ROOT script opened the same file 20-30
times without closing it. While this is harmless when done on a local
file, it amounts to a denial-of-service attack on a dCache pool. We
limit the number of files that can be open in a dCache pool to prevent
the hardware from being overloaded: 32 is typically the limit for large
general cache pools and 64 the limit for golden cache pools. In this
case, the script consumed the entire open-file quota, blocking access
to 1.6 TB of datafiles in that pool by any other CDF user. User open
requests were queued internally by dCache until an open-file slot
became available (which was not going to happen; the script had to be
killed). The users simply saw their jobs stall in a file open(), even
though that file checked as in cache!
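The effect of the open-file limit can be illustrated with a toy model.
The sketch below is plain Python, not ROOT or dCache client code:
ToyPool is a hypothetical stand-in for a pool with 32 slots, and it
raises an error where a real pool would queue the request.

```python
from contextlib import contextmanager


class ToyPool:
    """Stand-in for a dCache pool with a fixed number of open-file slots."""

    def __init__(self, max_open=32):   # 32: general-pool limit quoted above
        self.max_open = max_open
        self.open_count = 0

    def open(self, name):
        if self.open_count >= self.max_open:
            # A real pool would queue the request; the toy raises instead.
            raise RuntimeError("no free slot: request would queue")
        self.open_count += 1
        return name

    def close(self, handle):
        self.open_count -= 1


@contextmanager
def opened(pool, name):
    """Open a file and guarantee the slot is released, even on error."""
    handle = pool.open(name)
    try:
        yield handle
    finally:
        pool.close(handle)


pool = ToyPool()
for _ in range(100):                   # 100 passes over the same file
    with opened(pool, "ntuple.root") as handle:
        pass                           # read what is needed here
print(pool.open_count)                 # slots still held after the loop
```

Because every open is paired with a close, 100 passes never hold more
than one slot. The same loop without the close would exhaust the toy
pool on the 33rd open, which mirrors the incident described above.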
Do not use TChain::NEvents() to loop over events: This may seem
harmless when used on a small scale with local files. However, it can
affect all CDF DH users for a time. This method causes every file in
the chain to be opened, the number of events queried, and the file
closed again. This is a tremendous, near instantaneous load on dCache
just for the sake of driving an event loop. The alternative is to write
your event loop using iterators, which then open each file only once to
read event data.
(EXAMPLE OF THIS KIND OF LOOP - to be inserted here)
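The difference in file opens between the two styles of loop can be
illustrated with a toy model. The sketch below is plain Python, not
ROOT: ToyChain and its methods are hypothetical stand-ins that only
count how many times a file is opened.

```python
class ToyChain:
    """Stand-in for a chain of files; counts how often files are opened."""

    def __init__(self, events_per_file):
        self.events_per_file = events_per_file   # {filename: n_events}
        self.opens = 0

    def _open(self, name):
        self.opens += 1
        return range(self.events_per_file[name])

    def n_events(self):
        """Count-first: opens every file just to ask for its event count."""
        return sum(len(self._open(name)) for name in self.events_per_file)

    def __iter__(self):
        """Lazy: open each file once, only when its events are needed."""
        for name in self.events_per_file:
            for event in self._open(name):
                yield name, event


files = {f"file{i}.root": 3 for i in range(100)}   # hypothetical chain

chain = ToyChain(files)
total = chain.n_events()             # burst of 100 opens just for counting
seen = sum(1 for _ in chain)         # 100 more opens to read the events
counted_opens = chain.opens

chain = ToyChain(files)
seen_lazy = sum(1 for _ in chain)    # single pass, one open per file
lazy_opens = chain.opens
print(counted_opens, lazy_opens)
```

Both loops visit the same events, but the count-first version doubles
the number of opens and issues the first half of them all at once.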
11. Use the log rule to scale up developing analyses.
OLD GUIDELINE TEXT:
Please use the "log rule" when developing a physics analysis program.
Be sure it runs correctly with a single input file first before running
on larger chunks of data. Be sure it runs on a fileset correctly before
running it on 10 filesets. And be very sure it works correctly -and-
produces all the ntuples and plots you will need for a while before
running it on larger chunks of data like 100 filesets and/or a full
dataset.
The log rule was once an oft-quoted guideline for bringing up analyses
to ensure precious resources (tape access, CPU cycles) are not
needlessly wasted on immature analyses. The economy of some of these
resources may have changed nowadays, but this still makes sense
overall. As more and more physics analyses mature and machine
luminosity increases, once plentiful resources will again be in short
supply.
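The scale-up schedule implied by the log rule can be sketched as a list
of test sizes in filesets. The function name is hypothetical, and the
very first step, a single input file, comes before this schedule.

```python
def log_rule_steps(total_filesets):
    """Return test sizes in filesets: 1, 10, 100, ... then the full dataset."""
    steps = []
    size = 1
    while size < total_filesets:
        steps.append(size)
        size *= 10
    steps.append(total_filesets)
    return steps


print(log_rule_steps(350))   # sizes to try, in order, for 350 filesets
```

Each step costs roughly ten times the previous one, so a mistake caught
at any step wastes at most about a tenth of what the next step would
have cost.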
12. Break up analyses into manageable chunks. Do not re-run an entire analysis due to an error in one chunk.
OLD GUIDELINE TEXT:
Try to break up large processing jobs reading whole datasets (or many
tens of filesets) into separate jobs which together span the desired
input sample, and submit these jobs separately. If there is a problem
or error in one section or chunk of the dataset processed, ONLY re-run
on that portion of the data sample. How you divide up the dataset is up
to you, but dividing by large blocks of consecutive runs used to be a
common practice.
Our current Framework-DH system has no automated means of identifying
failed sections for re-processing, nor for suspending sections across a
computer system downtime. This is something that will be provided in
part by SAM in the future. Users must do this manually for now, and it
is preferred of course that this be done in a manner that does not
re-run already successfully processed sections.
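Since this bookkeeping is manual, even a very small script helps. The
sketch below is one possible approach under stated assumptions: the run
numbers, the chunk size, and the "done"/"failed" status labels are all
hypothetical, and the status table would be filled in by hand or from
your own job logs.

```python
def make_chunks(runs, chunk_size):
    """Split a list of run numbers into blocks of consecutive runs."""
    runs = sorted(runs)
    return [runs[i:i + chunk_size] for i in range(0, len(runs), chunk_size)]


def chunks_to_rerun(chunks, status):
    """Return only chunks not marked 'done' in status (chunk index -> state)."""
    return [c for i, c in enumerate(chunks) if status.get(i) != "done"]


runs = list(range(150000, 150010))           # hypothetical run numbers
chunks = make_chunks(runs, chunk_size=4)     # 3 chunks: 4 + 4 + 2 runs
status = {0: "done", 1: "failed", 2: "done"}
print(chunks_to_rerun(chunks, status))       # only the failed block of runs
```

A chunk with no recorded status is treated as not done, so a job lost
to a system downtime is picked up for re-processing as well.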
13. Test a "new" CAF job on a fileset FIRST.
OLD GUIDELINE TEXT:
When moving a job to the CAF from another environment (fcdfsgi2, your
desktop, old CAF to new CAF, etc.), always run it on a fileset first to
be sure there are no environmental problems (bad link libraries, typo
in tcl scripts) that will lead to widespread job crashes.
DETAILS: Each environment has its own idiosyncrasies, and a little
caution up front improves efficiency overall. Currently, there are at
least 3 different versions of Linux and one of IRIX in the CDF
computing environment, and even a "correct" program can crash badly if
run against incompatible system libraries.
14. Be patient, but not too patient.
OLD GUIDELINE TEXT:
Your executable may stall while trying to open a file in dCache for a
number of reasons, some of which are completely normal. If the file
must be restored from tape, then tape contention (busy Enstore) may
delay that restore for minutes or even hours. Nearly infinite stalls
can occur when a user tries to restore a file from a tape that is
marked NO ACCESS by Enstore (the tape is being recovered by hand, which
can take days). If the Enstore queue is not too long and your job has
been waiting on a file open for more than an hour, then it may be
prudent to send e-mail to cdfdh_oper@fnal.gov with the job id and, if
possible, the filename being opened.
To help us understand the problem, one can set "setenv DCACHE_DEBUG 2"
for one's job to produce a paper trail of the dCache client-server
messaging. And if it is possible to do so, one should leave a
troublesome section running so that dCache operators and developers can
investigate the precise state of the server when the client has data
delivery problems.
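The rule of thumb above can be written down as a small check. This is a
sketch only: the function names are hypothetical, and whether the
Enstore queue counts as "long" is left as a judgment the user supplies,
since this guideline does not define a number for it.

```python
STALL_LIMIT_S = 3600      # "more than an hour", from the text above


def should_report(wait_s, enstore_queue_long):
    """True when a stalled file open is worth an e-mail to cdfdh_oper."""
    return wait_s > STALL_LIMIT_S and not enstore_queue_long


def report_text(job_id, filename=None):
    """Compose the details the guideline asks for: job id and filename."""
    lines = [f"job id: {job_id}"]
    if filename:
        lines.append(f"file being opened: {filename}")
    return "\n".join(lines)


# Hypothetical job id and filename, stalled for 90 minutes.
if should_report(wait_s=5400, enstore_queue_long=False):
    print(report_text("12345", "example_file.root"))
```

A long wait behind a busy Enstore queue is treated as normal here, which
matches the "be patient" half of the guideline.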
R. D. Kennedy 26 March 2004