CDF DH At A Glance Notes
The CDF Data Handling At A Glance page provides a status summary of the four main parts of the CDF data handling system: raw data logging, SAM, dCache and Enstore. All information on the page is generated by processes running under user "sam" on fcdfsam4.fnal.gov. The page itself is updated once every 5 minutes by a cron job, which uses
make_rawdata.py,
make_sam.py, make_dcache.py
and make_enstore.py
to generate the corresponding parts for the subsystems.
The make_rawdata.py script fetches the
status summary of the CSL system and
displays "OK", "Warning" or "Error" accordingly.
It also checks encp_history.txt as generated by
make_enstore.py (see
Enstore section)
and shows the data rate to Enstore during the previous 5 minutes.
The "(Details)" text is a link to the detailed page of CSL to Enstore Status.
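As a rough illustration of the 5-minute data rate mentioned above, the following Python sketch sums the bytes transferred inside the window and divides by the window length. The record layout (pairs of finish time and size in bytes) and the function name rate_mb_per_s are assumptions made for this example; the actual format of encp_history.txt is not described in these notes.

```python
from datetime import datetime, timedelta


def rate_mb_per_s(transfers, now, window=timedelta(minutes=5)):
    """Average rate in MB/s of transfers that finished within `window` before `now`.

    `transfers` is an iterable of (finish_time, size_in_bytes) pairs. This
    layout is hypothetical and is not the real encp_history.txt format.
    """
    start = now - window
    total_bytes = sum(size for finished, size in transfers
                      if start < finished <= now)
    return total_bytes / window.total_seconds() / 1e6
```

Transfers that finished before the window opened are ignored, so an idle period shows up as a rate of zero rather than as a stale value.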
This response time is measured once every 5 minutes by a cron job:
This script invokes script probe_db.ksh
which uses the DB_RESPONSE Oracle account to make database queries
via sql.
Every 5 minutes, the script makes four queries to the prd database and records the total time it takes. The four queries were designed by Randolf Herber based on queries used in the database browser. They are:
ar031aa8.0001phys" by joining the "data_files" and "crc_types" tables;
full_path for "ar031aa8.0001phys" by joining the "data_files", "data_file_locations" and "data_storage_locations" tables;
ar031aa8.0001phys" by joining the "data_files", "data_files_runs", "data_file_locations", "data_storage_locations", "file_content_statuses", "data_tiers", "application_families", "physical_datastreams", "logical_datastreams", "process_types" and "processes" tables.
The results are plotted at CDF Production Database Response Time.
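The measurement described above, running a fixed set of queries and recording the total time, can be sketched in Python as follows. This is only an illustration: the real probe is the probe_db.ksh script, and the run_query callable here is a hypothetical stand-in for whatever executes one SQL statement against the database.

```python
import time


def total_response_time(run_query, queries):
    """Run each query in turn and return the total elapsed wall-clock seconds.

    `run_query` is a hypothetical callable that executes a single SQL string;
    it is an assumption of this sketch, not part of the original scripts.
    """
    start = time.perf_counter()
    for sql in queries:
        run_query(sql)
    return time.perf_counter() - start
```

Timing the whole batch rather than each query keeps one number per cycle, which is what a response-time plot needs.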
The users of the database are recorded every 10 minutes by a cron job:
The information collected is useful in spotting queries that have been running for a very long time.
This script invokes script db_usage.py
which uses the DB_RESPONSE Oracle account to make database queries
via sql.
The result is written to the text page
"Text dump of the CDF Offline Production Oracle database connections",
and the duration of the longest active query is plotted at
"Longest active query to the CDF Offline Production Oracle database".
The status of the SAM database servers is recorded every 15 minutes by a cron job:
where the "server_list*" files contain a list of SAM
DB servers to be monitored, and
"5" is the time for the test, in minutes.
For each DB server to be monitored, a fake file needs to be
declared to SAM. An example file, containing instructions in
comments, can be found at
stager_tools/DHAtAGlance/bin/sam_dbs_info/fake_file_for_dbs_checking.py.
The sam_dbs_info.py script creates a separate child
thread for each DbServer, and uses a message queue to coordinate
between the main thread
and the child threads. Each child thread waits on the message
queue for a trigger, then executes a "sam locate ..."
command, executes a "sam get dbserver
connection count ..." command,
and records the number of DbServer connections and
the total time spent. The main thread checks
whether any of the previous commands got stuck, kills any that did,
and then sends a trigger message to each of the child threads.
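The thread-and-queue arrangement described above can be sketched roughly as follows. This is a simplified illustration, not the sam_dbs_info.py code: each child runs a single placeholder command instead of the two "sam ..." commands, it records only the elapsed time, and the names ServerProbe and run_cycle are invented for the example.

```python
import queue
import subprocess
import threading
import time


class ServerProbe(threading.Thread):
    """Child thread that runs one probe command each time it is triggered."""

    def __init__(self, server, command):
        super().__init__(daemon=True)
        self.server = server
        self.command = command         # argv list standing in for the "sam ..." commands
        self.triggers = queue.Queue()  # the main thread puts trigger messages here
        self.process = None            # probe currently running, if any
        self.results = []              # elapsed seconds, or None if the probe failed or was killed

    def run(self):
        while True:
            message = self.triggers.get()  # block until the main thread sends a message
            if message == "stop":
                break
            start = time.monotonic()
            self.process = subprocess.Popen(self.command)
            returncode = self.process.wait()
            elapsed = time.monotonic() - start
            self.results.append(elapsed if returncode == 0 else None)
            self.process = None


def run_cycle(probes):
    """Main-thread step: kill probes left over from the last cycle, then trigger all."""
    for probe in probes:
        process = probe.process  # copy the reference; the child may clear it at any time
        if process is not None and process.poll() is None:
            process.kill()       # the previous command got stuck
        probe.triggers.put("go")
```

Keeping one thread per server means a single unresponsive server delays only its own measurement, while the main thread stays free to enforce the time limit.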
This is a poor man's monitor of the CAF station's activities. The information is collected by a cron job:
log_cdf_prd.
This is a poor man's monitor of the farm station's activities. The information is collected by a cron job:
log_cdf_prd.
This is the total writing rate from the farm to Enstore (see below).
This is a table showing the status of the SAM upload system,
currently running on fcdfdata321/322. Via scp, the following
text files are fetched from /project/sam/work on
the upload servers: heartbeat of and errors from the upload
process, and the hourly, daily, weekly, monthly and yearly
statistics.
The failure rates of SAM projects are measured once every 10 minutes by cron jobs:
These scripts invoke subscripts getProjectStatistics.sh
and getTestProjectStatistics.sh, respectively,
which use the SAM_PROJECT_STATISTICS Oracle account to query the
database via sqlplus.
Inconsistencies in file size and Enstore CRC values in SAM are checked every weekday by cron jobs running under user "sam" on cdfsamweb.fnal.gov:
This script uses the SAMREAD Oracle account to query the database via sqlplus.
The make_dcache.py process builds up the dCache block:
The number of cells Inactive (missing) is obtained by comparing the list of cells in service against the list of registered cells.
The number of cells OFFLINE is the number of cells marked "OFFLINE" on the cells in service page.
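The two cell counts above amount to a set comparison and a simple tally. The Python sketch below assumes that "missing" means registered but absent from the cells in service list, which is an interpretation rather than something stated explicitly; the function names and the shape of the inputs are likewise hypothetical.

```python
def inactive_cells(registered, in_service):
    """Registered cells that do not appear in the cells-in-service list."""
    return sorted(set(registered) - set(in_service))


def offline_count(service_status):
    """Number of cells whose status on the in-service page is "OFFLINE".

    `service_status` maps a cell name to its status string (assumed layout).
    """
    return sum(1 for status in service_status.values() if status == "OFFLINE")
```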
The status of the dCache doors is recorded every 10 minutes by a cron job:
where "door_list.txt" is a list of dCache doors
to be monitored, and "5" is
the time interval between tests, in minutes.
This job is a clone of the SAM-DB-Servers-Not-Responding monitoring. The probe is now the response time of checking the availability of a particular raw data file using:
The status of the diskpool doors is recorded every 15 minutes by a cron job:
where "door_list.txt" is a list of diskpool doors
to be monitored, and "5" is
the time interval between tests, in minutes.
This is another clone of the SAM-DB-Servers-Not-Responding monitoring. The probe is to check the availability of a test file using:
Distributions of the logins as listed on the Door Logins page, grouped in various ways.
Statistics of the file restore requests as listed on the Restore Monitor page compared to the list on the Pool Request Queues page.
Statistics of the file restore requests as listed on the Restore Monitor page.
This is a summary of the information as shown on the Disk Space Usage page, grouped according to the listings under the Registered Pool Groups page.
Statistics of the Read requests as listed on the Pool Request Queues page, grouped by pool group.
Statistics of the Restore requests as listed on the Pool Request Queues page, grouped by pool group.
The make_enstore.py process builds up the Enstore
block, which has two tables:
This table is built from the Enstore Server Status page, the Movers Page, the Noaccess page and the Volume Quotas page.
Each library's Status is taken directly from the Enstore Server Status page. Bad movers are identified from the Movers Page and by comparing what each mover is doing, as reported on the libraries' full listings at CDF-9940B, CDF-LTO3, CDF-LTO4F1 and CDF-LTO4G1, with what was reported in the previous cycle. The busiest families entry lists the top three file families with the most requests (read and write, active and pending) as reported on the full listings.
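Picking the busiest file families is a counting problem. The minimal Python sketch below assumes the requests have already been extracted from the full listings as a flat sequence of file family names, one per request; that input shape, the function name and the family names in any example are hypothetical.

```python
from collections import Counter


def busiest_families(request_families, top=3):
    """Return the `top` file families with the most requests, as (family, count) pairs.

    `request_families` holds one family name per request, regardless of whether
    the request is a read or a write, active or pending (assumed input shape).
    """
    return Counter(request_families).most_common(top)
```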
This is a summary of the read and write requests to Enstore, grouped
by subsystem, together with the most recent total read and write
rates. The requests are taken from the libraries' full listings; the
rate information is taken from the
Encp History page. The mapping of
nodes to subsystems is determined by the node lists
Rawdata_Write_Nodes.list,
Farm_Write_Nodes.list,
Ntuple_Write_Nodes.list,
SAM_Write_Nodes.list,
General_Read_Nodes.list, and
BPhys_Read_Nodes.list in
${HOME}/stager_tools/DHAtAGlance/config,
and by the dCache pool listings at
Raw/Farm Read,
General Read,
Stage Read and
B Phys Read.
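The node-list lookup described above can be sketched as a dictionary built from the list files. The layout of those files is not given in these notes, so this Python example assumes one node name per line, with blank lines and "#" comments skipped; the function name and the subsystem labels passed in are also assumptions.

```python
from pathlib import Path


def load_node_map(config_dir, list_files):
    """Map each node name to the subsystem whose list file names it.

    `list_files` maps a subsystem label to a file name inside `config_dir`.
    Assumes one node name per line; blank lines and '#' comments are skipped.
    """
    node_map = {}
    for subsystem, filename in list_files.items():
        for line in Path(config_dir, filename).read_text().splitlines():
            node = line.split("#", 1)[0].strip()
            if node:
                node_map[node] = subsystem
    return node_map
```

A request from a node can then be attributed with node_map.get(node), leaving nodes that appear in no list to be resolved some other way, for example through the pool listings.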