CDF DH At A Glance Notes

The CDF Data Handling At A Glance page provides a status summary of the four main parts of the CDF data handling system: raw data logging, SAM, dCache and Enstore. All information on the page is generated by processes running under user "sam" on fcdfsam4.fnal.gov. The page itself is updated once every 5 minutes by a cron job.

The cron job invokes four scripts (make_rawdata.py, make_sam.py, make_dcache.py and make_enstore.py), each of which generates the part of the page for its subsystem.

  1. Raw Data Logging

    The make_rawdata.py script fetches the status summary of the CSL system and displays "OK", "Warning" or "Error" accordingly.

    It also checks encp_history.txt as generated by make_enstore.py (see Enstore section) and shows the data rate to Enstore during the previous 5 minutes.
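
    As a rough illustration of the rate calculation, the sketch below sums the bytes transferred inside the last 5-minute window and converts the result to MB/s. The record layout (timestamp in seconds, bytes transferred) and the function name are assumptions; the actual format of encp_history.txt is not described in these notes.

```python
def rate_mb_per_s(transfers, now, window_seconds=300):
    """Average transfer rate in MB/s over the last `window_seconds`.

    `transfers` is a list of (timestamp_seconds, nbytes) pairs; this
    layout is hypothetical and chosen only for the illustration.
    """
    recent = sum(nbytes for stamp, nbytes in transfers
                 if now - window_seconds < stamp <= now)
    return recent / window_seconds / 1e6


# Made-up transfers: the first one falls just outside the 5-minute window.
demo_transfers = [(100, 300_000_000), (350, 600_000_000), (400, 900_000_000)]
print(rate_mb_per_s(demo_transfers, now=400))
```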

    The "(Details)" text is a link to the detailed page of CSL to Enstore Status.

  2. SAM

    1. Database Response Time

      This response time is measured once every 5 minutes by a cron job:

      • 2-57/5 * * * * ${HOME}/stager_tools/DHAtAGlance/bin/db_response/probe_db.bash

      This script invokes probe_db.ksh, which uses the DB_RESPONSE Oracle account to make database queries via sql.
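
      As a reading aid for the crontab entries in these notes, the sketch below expands a minute field of the form start-end/step into the minutes of the hour at which the job runs. The helper name expand_minutes is hypothetical, and it handles only this one range-with-step form, not the full crontab syntax.

```python
def expand_minutes(field):
    """Expand a crontab minute field of the form 'start-end/step'.

    Hypothetical helper for illustration; it does not handle lists,
    '*', or plain numbers.
    """
    span, step = field.split("/")
    start, end = (int(x) for x in span.split("-"))
    return list(range(start, end + 1, int(step)))


# The probe_db.bash entry uses '2-57/5': twelve runs per hour, 5 minutes apart.
print(expand_minutes("2-57/5"))
```

      Giving each job a different starting minute, as the entries in these notes do, keeps the jobs from all starting at the same moment.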

      Every 5 minutes, the script makes four queries to the prd database and records the total time it takes. The four queries were designed by Randolf Herber based on queries used in the database browser. They are:

      1. find the size and CRC information about "ar031aa8.0001phys" by joining the "data_files" and "crc_types" tables;
      2. check the full_path for "ar031aa8.0001phys" by joining the "data_files", "data_file_locations" and "data_storage_locations" tables;
      3. find the total number of files in dataset "hphysr" and take a snapshot by joining the "data_files", "project_definitions", "project_snapshots" and "project_files" tables;
      4. find the detailed metadata about file "ar031aa8.0001phys" by joining the "data_files", "data_files_runs", "data_file_locations", "data_storage_locations", "file_content_statuses", "data_tiers", "application_families", "physical_datastreams", "logical_datastreams", "process_types" and "processes" tables.

      The results are plotted at CDF Production Database Response Time.
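
      The measurement itself amounts to timing a fixed batch of queries. The sketch below shows that idea only; the real probe is a ksh script, and run_query here is a hypothetical callable standing in for whatever actually sends the SQL to the database.

```python
import time


def total_response_time(queries, run_query):
    """Run each query in order and return the total elapsed seconds.

    `run_query` is a hypothetical callable standing in for the real
    database client.
    """
    start = time.monotonic()
    for sql in queries:
        run_query(sql)
    return time.monotonic() - start


# Stand-in queries and executor, for illustration only.
demo_queries = ["query 1", "query 2", "query 3", "query 4"]
executed = []
elapsed = total_response_time(demo_queries, executed.append)
print(f"{len(executed)} queries in {elapsed:.6f} s")
```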

    2. Database usage (linked under CDF Production Database Response Time page)

      The users of the database are recorded every 10 minutes by a cron job:

      • 4-54/10 * * * * ${HOME}/stager_tools/DHAtAGlance/bin/db_usage/db_usage.bash

      The information collected is useful in spotting queries that have been running for a very long time.

      This script invokes db_usage.py, which uses the DB_RESPONSE Oracle account to make database queries via sql.

      The result is written to the text page "Text dump of the CDF Offline Production Oracle database connections",
      and the duration of the longest active query is recorded in plots at "Longest active query to the CDF Offline Production Oracle database".
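
      Picking out the longest active query from a list of connections could look like the sketch below. The keys 'user', 'status' and 'seconds' and the status value "ACTIVE" are assumptions made for the example; the real column names used by db_usage.py are not given in these notes.

```python
def longest_active(sessions):
    """Return the ACTIVE session that has been running longest, or None.

    Each session is a dict with hypothetical keys 'user', 'status'
    and 'seconds'.
    """
    active = [s for s in sessions if s["status"] == "ACTIVE"]
    return max(active, key=lambda s: s["seconds"], default=None)


# Made-up connection records, for illustration only.
demo_sessions = [
    {"user": "alice", "status": "ACTIVE", "seconds": 42},
    {"user": "bob", "status": "INACTIVE", "seconds": 9000},
    {"user": "carol", "status": "ACTIVE", "seconds": 3600},
]
print(longest_active(demo_sessions))
```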

    3. SAM Database Servers Not Responding

      The status of the SAM database servers is recorded every 15 minutes by a cron job:

      • 4-49/15 * * * * ${HOME}/stager_tools/DHAtAGlance/bin/sam_dbs_info/make_sam_dbs_info.bash

      where the "server_list*" files contain a list of SAM DB servers to be monitored, and "5" is the time for the test, in minutes.

      For each DB server to be monitored, a fake file needs to be declared to SAM. An example file, containing instructions in comments, can be found at stager_tools/DHAtAGlance/bin/sam_dbs_info/fake_file_for_dbs_checking.py .

      The sam_dbs_info.py script creates a separate child thread for each DbServer, and uses a message queue to coordinate between the main thread and the child threads. Each child thread waits on the message queue for a trigger, then executes a "sam locate ..." command, executes a "sam get dbserver connection count ..." command, and records the number of DbServer connections and the total time spent. At the start of each cycle, the main thread checks whether any of the previous commands got stuck, kills those that did, and then sends a trigger message to each of the child threads.
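
      The thread-per-server pattern described above can be sketched as follows. This is not the code of sam_dbs_info.py: the class name Prober, the use of one trigger queue per child thread, and the stand-in probe commands (small Python one-liners in place of the sam commands) are all assumptions made to keep the example self-contained.

```python
import queue
import subprocess
import sys
import threading
import time


class Prober(threading.Thread):
    """Child thread that runs one probe command each time it is triggered."""

    def __init__(self, name, command):
        super().__init__(name=name, daemon=True)
        self.command = command          # argv list of the probe command
        self.triggers = queue.Queue()   # messages from the main thread
        self.results = []               # (returncode, seconds) per probe
        self.process = None             # probe currently running, if any

    def run(self):
        # Any message other than "stop" triggers one probe.
        while self.triggers.get() != "stop":
            start = time.monotonic()
            self.process = subprocess.Popen(self.command)
            returncode = self.process.wait()
            self.results.append((returncode, time.monotonic() - start))
            self.process = None

    def kill_if_stuck(self):
        """Called by the main thread before it sends the next trigger."""
        proc = self.process
        if proc is not None and proc.poll() is None:
            proc.kill()
            return True
        return False


# Stand-in probes: one returns at once, one hangs for a minute.
fast = Prober("fast", [sys.executable, "-c", "pass"])
slow = Prober("slow", [sys.executable, "-c", "import time; time.sleep(60)"])
for prober in (fast, slow):
    prober.start()
    prober.triggers.put("go")

time.sleep(3)  # one monitoring cycle, shortened for the demo
killed = [p.name for p in (fast, slow) if p.kill_if_stuck()]

for prober in (fast, slow):
    prober.triggers.put("stop")
    prober.join(timeout=5)
print("killed:", killed)
```

      Running each probe as a separate process is what makes the kill step possible: a stuck child process can be terminated from the main thread, whereas a stuck Python thread cannot.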

    4. SAM Station for CAF To Open / Opened / Closed

      This is a poor man's monitor of the CAF station's activities. The information is collected by a cron job:

      • 3-58/5 * * * * ${HOME}/stager_tools/DHAtAGlance/bin/cafstation_rates/cafstation_rates.sh
      which parses the log files of the sam logger process log_cdf_prd.

    5. SAM Station for Farm To Open / Opened / Closed

      This is a poor man's monitor of the farm station's activities. The information is collected by a cron job:

      • 4-59/5 * * * * ${HOME}/stager_tools/DHAtAGlance/bin/farmstation_rates/farmstation_rates.sh
      which parses the log files of the sam logger process log_cdf_prd.

    6. Farm Nodes To Enstore

      This is the total writing rate from the farm to Enstore (see below).

    7. Upload Server

      This is a table showing the status of the SAM upload system, currently running on fcdfdata321/322. Via scp, the following text files are fetched from /project/sam/work on the upload servers: heartbeat of and errors from the upload process, and the hourly, daily, weekly, monthly and yearly statistics.

    8. SAM project statistics

      The failure rates of SAM projects are measured once every 10 minutes by cron jobs:

      • 2-52/10 * * * * ${HOME}/stager_tools/sam_statistics/bin/project_statistics/samProjectStatistics.bash
      • 3-53/10 * * * * ${HOME}/stager_tools/sam_statistics/bin/test_project_statistics/samTestProjectStatistics.bash

      These scripts invoke subscripts getProjectStatistics.sh and getTestProjectStatistics.sh, respectively, which use the SAM_PROJECT_STATISTICS Oracle account to query the database via sqlplus.

    9. Check for inconsistencies in SAM

      Inconsistencies in file size and Enstore CRC values in SAM are checked each weekday by cron jobs running under user "sam" on cdfsamweb.fnal.gov:

      • 15 6 * * 1-5 ${HOME}/bin/sam_file_size/check_file_size_and_CRC.ksh

      This script uses the SAMREAD Oracle account to query the database via sqlplus.

    The make_sam.py process builds up the SAM block from the information gathered above.

  3. dCache

    The make_dcache.py process builds up the dCache block:

    1. No. of cells Inactive / OFFLINE

      The number of cells Inactive (missing) is obtained by comparing the list of cells in service against the list of registered cells.

      The number of cells OFFLINE is the number of cells marked "OFFLINE" on the cells in service page.
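
      Both counts reduce to simple set operations, as in the sketch below. The function names, the cell names and the "ONLINE" state are made up for the illustration; only the "OFFLINE" marker comes from these notes.

```python
def missing_cells(registered, in_service):
    """Cells that are registered but absent from the in-service list."""
    return sorted(set(registered) - set(in_service))


def offline_cells(status_by_cell):
    """Cells in service that are marked OFFLINE."""
    return sorted(cell for cell, state in status_by_cell.items()
                  if state == "OFFLINE")


# Made-up cell names and states, for illustration only.
registered = ["pool-a", "pool-b", "pool-c", "pool-d"]
status_by_cell = {"pool-a": "ONLINE", "pool-b": "OFFLINE", "pool-d": "ONLINE"}

print(len(missing_cells(registered, status_by_cell)), "Inactive")
print(len(offline_cells(status_by_cell)), "OFFLINE")
```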

    2. No. of doors NOT responding

      The status of the dCache doors is recorded every 10 minutes by a cron job:

      • 2-52/10 * * * * ${HOME}/stager_tools/DHAtAGlance/bin/dcache/make_ping_dcache_doors.bash

      where "door_list.txt" is a list of dCache doors to be monitored, and "5" is the time interval between tests, in minutes.

      This job is a clone of the SAM-DB-Servers-Not-Responding monitoring. The probe is now the response time of checking the availability of a particular raw data file using:

      • dccp -P dcap://cdfdca?.fnal.gov:251??/pnfs/fnal.gov/usr/cdfen/filesets/CA/CA30/CA3000/CA3000.0/ar017983.0001phys
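
      A probe of this kind boils down to running an external command under a time limit and recording how long it took. The sketch below shows one way to do that with the Python standard library; the function name timed_probe is hypothetical, and small Python one-liners stand in for the dccp command so that the example runs anywhere.

```python
import subprocess
import sys
import time


def timed_probe(command, limit_seconds):
    """Run `command`; return (ok, seconds).

    ok is False if the command fails or exceeds the time limit, in
    which case subprocess.run kills it before raising TimeoutExpired.
    """
    start = time.monotonic()
    try:
        done = subprocess.run(command, timeout=limit_seconds,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
        ok = done.returncode == 0
    except subprocess.TimeoutExpired:
        ok = False
    return ok, time.monotonic() - start


# Stand-in commands; a real probe would pass the dccp argument list instead.
quick = timed_probe([sys.executable, "-c", "pass"], limit_seconds=30)
hung = timed_probe([sys.executable, "-c", "import time; time.sleep(30)"],
                   limit_seconds=1)
print(quick, hung)
```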

    3. Diskpool doors NOT responding

      The status of the diskpool doors is recorded every 15 minutes by a cron job:

      • 3-48/15 * * * * ${HOME}/stager_tools/DHAtAGlance/bin/diskpool/make_ping_diskpool_doors.bash

      where "door_list.txt" is a list of diskpool doors to be monitored, and "5" is the time interval between tests, in minutes.

      This is another clone of the SAM-DB-Servers-Not-Responding monitoring. The probe is to check the availability of a test file using:

      • dc_check dcap://fcdfrdc3.fnal.gov:221??/pnfs/diskpool/test/ping

    4. Logins Total / Active / GridFTP / Killed

      Distributions of the logins as listed on the Door Logins page, grouped in various ways.

    5. Waiting for Door / Pool / Pnfs / Unknown

      Distributions of the logins as listed on the Door Logins page, grouped in various ways.

    6. Door with the oldest last-login

      Distributions of the logins as listed on the Door Logins page, grouped in various ways.

    7. Door with the most logins

      Distributions of the logins as listed on the Door Logins page, grouped in various ways.

    8. Datasets with the most logins

      Distributions of the logins as listed on the Door Logins page, grouped in various ways.

    9. Users with the most logins

      Distributions of the logins as listed on the Door Logins page, grouped in various ways.

    10. Top dataset:user pairs

      Distributions of the logins as listed on the Door Logins page, grouped in various ways.
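
      The groupings in items 7 to 10 are all counts of the same login records keyed by different fields. A minimal sketch using collections.Counter follows; the record layout and the function name top are assumptions, since the columns of the Door Logins page are not listed in these notes.

```python
from collections import Counter


def top(logins, fields, n=1):
    """Count logins grouped by the given field(s); return the n largest."""
    counts = Counter(tuple(login[f] for f in fields) for login in logins)
    return counts.most_common(n)


# Made-up login records, for illustration only.
logins = [
    {"door": "door1", "dataset": "dsA", "user": "alice"},
    {"door": "door1", "dataset": "dsA", "user": "alice"},
    {"door": "door2", "dataset": "dsB", "user": "alice"},
    {"door": "door1", "dataset": "dsB", "user": "bob"},
]
print(top(logins, ["door"]))             # door with the most logins
print(top(logins, ["dataset", "user"]))  # top dataset:user pair
```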

    11. Total number of files to restore

      Statistics of the file restore requests as listed on the Restore Monitor page compared to the list on the Pool Request Queues page.

    12. Pool with the most file restores

      Statistics of the file restore requests as listed on the Restore Monitor page.

    13. Pool with the oldest file restore

      Statistics of the file restore requests as listed on the Restore Monitor page.

    14. Space Free / Total (GB)

      This is a summary of the information as shown on the Disk Space Usage page, grouped according to the listings under the Registered Pool Groups page.
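
      Grouping per-pool space figures by pool group can be sketched as below. The data structures and names are assumptions for the illustration, not the layout used by make_dcache.py.

```python
def space_by_group(pool_space, pool_groups):
    """Sum (free, total) GB over the pools of each pool group.

    pool_space:  {pool: (free_gb, total_gb)}
    pool_groups: {group: [pool, ...]}
    Pools listed in a group but absent from pool_space are skipped.
    """
    summary = {}
    for group, pools in pool_groups.items():
        known = [pool_space[p] for p in pools if p in pool_space]
        summary[group] = (sum(free for free, _ in known),
                          sum(total for _, total in known))
    return summary


# Made-up pools, groups and sizes, for illustration only.
pool_space = {"p1": (100, 500), "p2": (50, 500), "p3": (300, 1000)}
pool_groups = {"groupA": ["p1", "p2"], "groupB": ["p3", "p4"]}
print(space_by_group(pool_space, pool_groups))
```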

    15. Read Active / Queued

      Statistics of the Read requests as listed on the Pool Request Queues page, grouped by the pool groups.

    16. Restore Active / Queued

      Statistics of the Restore requests as listed on the Pool Request Queues page, grouped by the pool groups.

  4. Enstore

    The make_enstore.py process builds up the Enstore block, which has two tables:

    1. Status of The Libraries

      This table is built based on the Enstore Server Status page, the Movers Page, the Noaccess page and the Volume Quotas page.

      Each library's Status is taken directly from the Enstore Server Status page. Bad movers are identified from the Movers Page and by comparing what each mover is doing, as reported on the libraries' full listings at CDF-9940B, CDF-LTO3, CDF-LTO4F1, and CDF-LTO4G1, with what was reported in the previous cycle. The busiest families entry lists the three file families with the most requests (read and write, active and pending) as reported on the full listings.
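
      The cycle-to-cycle comparison can be pictured with the sketch below. The notes do not spell out the actual criterion, so the rule used here (a mover that is not idle and reports exactly the same activity in two consecutive cycles is flagged) is an assumption, as are the function name, the mover names and the activity strings.

```python
def suspect_movers(previous, current, idle="IDLE"):
    """Movers whose reported activity is unchanged since the last cycle.

    Assumption: "not idle and unchanged" is treated as the warning
    sign; the real criterion is not given in these notes.
    """
    return sorted(mover for mover, activity in current.items()
                  if activity != idle and previous.get(mover) == activity)


# Made-up mover names and activity strings, for illustration only.
previous = {"mover1": "reading VOL001 at 10%",
            "mover2": "IDLE",
            "mover3": "writing VOL002 at 40%"}
current = {"mover1": "reading VOL001 at 10%",
           "mover2": "IDLE",
           "mover3": "writing VOL002 at 75%"}
print(suspect_movers(previous, current))
```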

    2. Mover Utilizations

      This is a summary of the read and write requests to Enstore, grouped by subsystem, together with the most recent total read and write rates. The requests are taken from the libraries' full listings; the rate information is taken from the Encp History page. The mapping of nodes to subsystems is determined by the node lists Rawdata_Write_Nodes.list, Farm_Write_Nodes.list, Ntuple_Write_Nodes.list, SAM_Write_Nodes.list, General_Read_Nodes.list, and BPhys_Read_Nodes.list in ${HOME}/stager_tools/DHAtAGlance/config, and by the dCache pool listings at Raw/Farm Read, General Read, Stage Read and B Phys Read.
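
      The node-to-subsystem grouping can be sketched as a lookup table built from the node lists. In the example below the lists are given inline rather than read from the .list files, the node names are made up, and placing unknown nodes under "Other" is an assumption.

```python
from collections import Counter


def build_node_map(node_lists):
    """Invert {subsystem: [nodes]} into {node: subsystem}."""
    return {node: subsystem
            for subsystem, nodes in node_lists.items()
            for node in nodes}


def requests_by_subsystem(request_nodes, node_map):
    """Count requests per subsystem; unknown nodes go under 'Other'."""
    return Counter(node_map.get(node, "Other") for node in request_nodes)


# Made-up node names; the subsystem labels echo the node list file names.
node_lists = {"Rawdata_Write": ["nodeA"], "Farm_Write": ["nodeB", "nodeC"]}
node_map = build_node_map(node_lists)
demo_requests = ["nodeB", "nodeA", "nodeC", "nodeX"]
print(requests_by_subsystem(demo_requests, node_map))
```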