From marilyn@bastet Fri Mar 12 17:17:25 1999
Date: Mon, 11 Jan 1999 20:47:41 -0600 (CST)
From: Marilyn Schweitzer <marilyn@bastet>
To: ruth@fnal.gov, mari@fnal.gov, dane@fnal.gov
Cc: run2farms@fnal.gov
Subject: Run II Production Management status

As requested for the Von Ruden Review, enclosed is the status as I see
it.  Comments from you or anyone on the run2farms mailing list are welcome.


Marilyn

======================== Status in form of WBS =============================
    
1    Production Management

1.1    Meetings and Management
       Meeting is now every other Friday from 10:30 to 12:00 in FCC2B

1.2    Provide input to Hardware Procurements - such that procurement is
       appropriate to the products of Production Management
1.2.1        Input for Farms
             Status: Will be done after the Run II prototype is evaluated
                     (mid-February) and before the start of the 1st phase of
                     Run II Farms Hardware procurement (latest date for
                     input is May 1). The CDF/CD/D0 results using the
                     prototype system are vital for being able to provide input
                     to the procurement. Issues of Data Flow are the
                     biggest concern 'cause they impact local disk on worker
                     nodes, network access to Mass Storage, and how the
                     batch system will be used. By the end of January, CDF
                     and D0 will provide updated CPU requirements and new
                     scheduling information for their production ramp up.
                     If PCs are used, industrial shelves rather than racks
                     look quite cost effective and promising. While Linux
                     still is very promising, it has not been easy to get
                     PCs into a production state. Getting e871
                     efficiency on PC farms up above 80%, having accurate Linux
                     performance tools, and understanding the NFS client/server
                     issues would (in my own mind) clinch PC/Linux viability
                     for Run II Farms.
			
1.2.2        Input for Operations Central Server (E.g. home for centralized
                 accounting, central systems monitoring, central control
                 for backups, and central operations access to other
                 systems. Existing farmx would go away.)
             Status: We (DCS Dept, CSS Group and FCS Group) have discussed
		     this and believe that we are close to starting the
		     deployment in February. The functions served by this host are:
        		a) Central collection point for Fermi Unix Accounting
                        b) Central "launch" point for operator scripts such as
                           requesting tape drive cleaning, starting xoper using
                           current OCS, FNALU LSF display
                        c) Xfalive monitoring
                        d) Home for OCS database server that handles system
                           backup hosts (dcdsv0 currently handles this)
        		e) Central backups of OCS database backups (hppc
			   department server currently handles this)
        		f) Central patrol collection point for system
			   monitoring (new function)
        		g) Central point for the "new" OCS operator screen
			   (e.g. potentially Oracle client of fncdug)
        		h) Boot host for operator X-terminals
        		i) Backup boot host for other Fermilab X-terminals
			j) Central syslog server (swatch)
                        k) Potentially central server for software such as
                           escrow that holds key passwords (e.g. root)
		      The physical attributes would be:
        		a) fnsf224, a Challenge S, would be quite suitable.
        		b) rename it to fncops
        		c) Upgrade the system disk to IRIX 6.5.2
                        d) Add two 9 GB external disk drives
                        e) Add an Eliant tape drive for backups
                        f) Locate it in FCC1 on house power
                        g) Give it 24x7 support (CSS would handle the core
                           system; FCS would handle special software and
                           coordination with DCS)
			h) Limit access strictly to fnal
           
1.3    Evaluate Software Components for Data Delivery and Characterize them
       Status:  So far, there is no strong interest from CDF/D0 in working
                on this.
1.3.1        Nile User Interface
1.3.2        RFIO

1.4    FARMS
1.4.1        Farms Batch System (FBS - extension of LSF for Reconstruction and
                     MonteCarlo Farms)
             Status: Prototype (minus scratch disk space allocation) is ready and
                     in use by E871 on PCs running Linux. E871's maximum efficiency
                     is only around 50% when they are actively running. We are
                     working with them to get this up to an acceptable level.
                     We suspect/know that there are some FBS deficiencies for CDF/D0
                     as well, but want CDF/D0 to start using the prototype in
                     earnest before we re-design it on our own. Options are:
                        a) Enhance current FBS architecture
                        b) Use LSF exclusively if we can get the LSF license cost
                           down and verify LSF's performance on a mock-up of a
                           300 node cluster
                        c) Strip LSF out of the FBS architecture and write our own
                           scheduler.
                     Expect CDF/D0 evaluation to be done by February 19th.
                     Scratch disk allocation in particular is an area that we need
                     to architect once we understand what CDF/D0 need. We would
                     like this allocation scheme to replace what we currently
                     use on the FNALU batch system as well.
                     Note, monitoring software under Linux apparently requires the
                     Linux 2.2 kernel to work properly.
1.4.1.1            Provide (extended) batch system for scheduling and controlling jobs
1.4.1.2            Provide Scratch Disk Allocation
1.4.1.3            Provide Processor Allocation
1.4.1.4            Provide ability to track job history & system usage
1.4.1.5            Provide software for monitoring of system
1.4.2        Provide development & test system
             Status: 14 worker node prototype was delivered to CDF and D0 in November.
                     It was about 5 weeks later than desired:
                        a) 4 weeks due to SCSI vs EIDE disk, panel lights and
                           cooling problems with the 18 systems delivered.
                        b) 1 week due to system administration problems and staff
                           availability.
1.4.2.1            Procurement and Delivery
1.4.2.2            Installation of OS and Products
1.4.2.3            Ongoing Operation and Support during development

1.5    Batch system 
1.5.1        Purchase, Maintain and Support LSF
	     Status: Some current key points are:
		     a) We have been evaluating LSF 3.2, which is the first
			LSF release that supports Linux.  So far, there are
			no known problems, though the licensing changed which
			is always a nuisance.
                     b) We held a meeting on December 15th with Platform Computing.
                        Overall, the meeting was quite a productive exchange of
                        ideas.  Platform Computing seems willing to negotiate the
                        cost to some degree, but don't expect anything like what
                        CERN got 'cause Platform claims to have not made any
                        profit on it.
                     c) Currently, there is no known reason that LSF will not
                        be the commercial batch package of choice for Run II.
                     d) In December, a Run II Batch Software Working Group was
                        formed to document requirements/features for farms and
                        analysis systems alike.
                     e) By mid-March (when the Run II Batch Software Working Group
                        report is complete) we should have a good idea as to how
                        many licenses would be required for the farms as well as
                        CDF/D0 analysis systems.

1.6    Construct Extensions for Data Center Services
1.6.1        Operators Interfaces to TapeDrives
	     Status: This includes totally revamping OCS so that it not only serves
                     Run II, but also any other systems currently using OCS in
                     FCC (FT'97, FT'99, FNALU, ACPMAPS, System backups, etc.) The
                     plan is to:
                     a) Replace existing DBM database with ORACLE
                        now that we have a site-wide ORACLE license.
                        (Note, this means that other institutions that
                        want/need OCS to use higher level CDF/D0 software
                        would have to deal with the ORACLE issues)
                     b) Support Linux
                     c) Concept of using tape drives in a networked
                        fashion would go away except for viewing statistics
                        and other reports. This would allow us to make the
                        installation and operation of the software more
                        robust
                     d) Existing user features would be quite similar except for
                        the concept of tape drive groups, which is flexible to
                        the point that almost no one understands it.
                     e) Direct interface to the drive (e.g. to gather statistics)
                        would go through FTT. Thus, OCS would be functionally
                        closely coupled to FTT in this regard.
                     f) Statistics reporting of tape drive use could be gathered
                        into a central database.
                     g) Statistics for Run II tapes managed by enstore and other
                        software could be provided. Enstore and such other
                        software would use the COMMON interface that OCS
                        provides.
                     h) Better integration between the existing tapes database
                        and OCS.  (OCS will NOT replace the existing tapes
                        database)
                     i) Better centralized control for operations (e.g. right
                        now they need to have ~10 little independent screens
                        where one would be far better.)
                     j) Interface to deal with robotics from the operator's
                        perspective. E.g. load a stacker, once-a-day Central
                        MSS loading/removal of tapes.
                     Hope to start on this in mid-February.
	             
1.6.2        Centralized Accounting
             Status:  Special software needed to provide centralized accounting
                      for Linux is complete and integrated in with SGI, AIX,
                      SunOS and OSF1 reports. Note, this means that Linux clients
                      need a portion of this special software and we are working
                      to get it incorporated into the Fermilab Redhat release.
                      Plan to revamp existing central accounting tools to be
                      more easily managed, to be a FUE product, and to have
                      centralized/graphic reports.
1.6.3        Management Reporting Tools
	     Status: (I believe this to be redundant with item 1.6.4)
1.6.4        Centralized System Management Tools
             Status: We are working with the patrol product (from DESY/SLAC?)
                   to be used for system status and some automatic system
	           recovery. We want to combine this with some of the xfalive
                   features which would be web based.  We have a local
                   sysmon product (similar to xcpsmon) that will run under Linux
                   and is packaged with the farms batch software. This is tk
                   based rather than Motif. As mentioned before, it needs
                   the Linux 2.2 kernel for proper results.
1.6.4.1            Central Reporting Screen
1.6.4.2            Recommended Compliance Interface for subsystems

1.7    Documentation, Operational Delivery
       Status: No status beyond that the documentation for the Run II
               Prototype Farms Batch System has been progressing quite
               well.  (I estimate it is about 75% done, but keep in
               mind this is for the prototype and may not carry through
               to the real Run II.)
1.7.1        D0 Mock Data Challenge 1
1.7.2        D0 Mock Data Challenge 2
1.7.3        CDF Mock Data Challenge 1999
1.7.4        Preparation for Operations - acceptance testing

================ Current and Projected Effort from FCS Group ===============

Enclosed is the CURRENT EFFORT my group has been putting towards
the CD categories.  They are a little different than what one sees
reported in the division reports 'cause folks have been putting
most of the time they have spent working with e871 using the Run II
batch prototype under Run II rather than Fixed Target.  I've asked them
in future reports to put ANY e871 related work under Fixed Target.

I have also enclosed the ESTIMATED FUTURE EFFORT my group will be putting
towards CD categories. I believe these estimates are rather optimistic
about what we can spend on Run II.

-------------------------------------------------------------------
|                       |         Current Percent Effort          |
|  Effort Category      |-----------------------------------------|
|-----------------------| TJ | GS | MS | JF | TL | IM | MB |  FTE |
|ACPMAPS Mnt & Opr      | 15 |  5 | 10 |    | 20 | 20 |    |  .70 | 
|ACPMAPS Dev            |    |    |  5 |    |    | 20 |    |  .25 |
|Fixed Target Mnt & Opr | 30 | 55 | 15 | 20 | 10 |    |    | 1.30 |
|Fixed Target Dev       | 20 |    | 10 | 30 | 20 | 20 | 35 | 1.35 |
|Run II Dev             | 20 | 25 | 25 | 35 | 35 | 25 | 50 | 2.15 |
|Dept Adm & Mgt         |    |    | 20 |    |    |    |    |  .20 |
|General                | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 1.05 |
-------------------------------------------------------------------

-------------------------------------------------------------------
|                       |     Estimated Future Percent Effort     |
|  Effort Category      |-----------------------------------------|
|-----------------------| TJ | GS | MS | JF | TL | IM | MB |  FTE |
|ACPMAPS Mnt & Opr      | 15 | 15 |  5 |    | 15 | 15 |    |  .65 | 
|ACPMAPS Dev            |    |    |  5 | 10 | 10 | 20 |    |  .45 |
|Fixed Target Mnt & Opr | 30 | 50 | 10 |  5 | 15 |    | 15 | 1.25 |
|Run II Dev             | 40 | 20 | 45 | 70 | 45 | 50 | 70 | 3.40 |
|Dept Adm & Mgt         |    |    | 20 |    |    |    |    |  .20 |
|General                | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 1.05 |
-------------------------------------------------------------------
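
For anyone who wants to re-check the FTE columns, here is a minimal
sketch of the arithmetic, assuming each FTE is simply the sum of the
per-person percents in a row divided by 100 and that blank cells mean 0.
The names PEOPLE, CURRENT, FUTURE, fte and person_totals are made up for
this illustration; the numbers are copied from the two tables above.

```python
# Percent effort per person, copied from the two tables above.
# Column order matches the table headers; blank cells are entered as 0.
PEOPLE = ["TJ", "GS", "MS", "JF", "TL", "IM", "MB"]

CURRENT = {
    "ACPMAPS Mnt & Opr":      [15,  5, 10,  0, 20, 20,  0],
    "ACPMAPS Dev":            [ 0,  0,  5,  0,  0, 20,  0],
    "Fixed Target Mnt & Opr": [30, 55, 15, 20, 10,  0,  0],
    "Fixed Target Dev":       [20,  0, 10, 30, 20, 20, 35],
    "Run II Dev":             [20, 25, 25, 35, 35, 25, 50],
    "Dept Adm & Mgt":         [ 0,  0, 20,  0,  0,  0,  0],
    "General":                [15, 15, 15, 15, 15, 15, 15],
}

FUTURE = {
    "ACPMAPS Mnt & Opr":      [15, 15,  5,  0, 15, 15,  0],
    "ACPMAPS Dev":            [ 0,  0,  5, 10, 10, 20,  0],
    "Fixed Target Mnt & Opr": [30, 50, 10,  5, 15,  0, 15],
    "Run II Dev":             [40, 20, 45, 70, 45, 50, 70],
    "Dept Adm & Mgt":         [ 0,  0, 20,  0,  0,  0,  0],
    "General":                [15, 15, 15, 15, 15, 15, 15],
}


def fte(percents):
    """FTE for one category: sum of the per-person percents over 100."""
    return sum(percents) / 100.0


def person_totals(table):
    """Total percent effort per person; each should come to 100."""
    columns = zip(*table.values())
    return dict(zip(PEOPLE, (sum(col) for col in columns)))


if __name__ == "__main__":
    for name, table in (("Current", CURRENT), ("Future", FUTURE)):
        print(name)
        for category, percents in table.items():
            print(f"  {category:<24}{fte(percents):5.2f}")
        print("  per-person totals:", person_totals(table))
```

Under those assumptions the Run II Dev row works out to 2.15 FTE now
and 3.40 FTE in the estimate, and every person's column adds to 100 in
both tables.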


