Risk Analysis for Level 2 upgrade
23 Sept 2002

We summarize the risks of the Level 2 upgrade project and the plans to minimize them. Most of the risk discussion centers on the commissioning and testing of the Pulsar board, which is the main component of the system.

The system relies heavily on commercial hardware: the only custom hardware is the Pulsar board, two simple mezzanine cards and a simple AUX card. The use of commercial hardware minimizes the risk associated with development of custom boards. The Pulsar board is in the prototype stage and will be thoroughly tested in the coming months (WBS lines 1.3.2.2 and 1.3.2.3). The board has been designed and built, and is currently being tested as the central component of a Level 2 trigger test stand which will be used for the remainder of Run 2a.

Hardware problems with the Pulsar board should be minimized by the extensive board-level simulation done before production. The board design was optimized based on actual core firmware, and the board schematics were validated with careful multi-board-level simulation to verify every connection and interface to and from the board as well as the core functionality. The board layout has been verified with careful trace and cross-talk analysis. The firmware for this board is highly modular, which will expedite writing and debugging. In addition, a Pulsar board can be used to test another Pulsar board, as the board can source and sink all types of Level 2 data. Since a few Pulsar boards can be used to drive the entire Level 2 system, the testing and commissioning stage should be considerably shorter than that of the Run 2a system.

The commercial components of the system have a low risk associated with them. The S-link technology used in the L2 upgrade system was developed at CERN and is used by other experiments such as ATLAS. Knowledge of these boards is easily transferable to and from the LHC community.
Finally, this project is not on the critical path and does not require access to the detector, so the risk of schedule slippage does not impact the overall project schedule. Since none of the subsystems upstream of the Level 2 Decision crate need to be modified for the upgrade, the risk involved in switching to the upgrade system is minimized -- it is always possible to temporarily fall back to the Run 2a system if problems arise with the upgrade L2 system.

The risk factors are calculated as described in the document "Run IIb CDF Detector Project : Project Management Plan", version 2.3, Rev date: 27 June 2002. We analyze the tasks which are most likely to be cost/schedule drivers and/or tasks believed to be the most technically challenging. The most critical tasks will be the commissioning of the prototype Pulsar board and associated mezzanine and data link boards. The production-level testing will not have a high risk associated with it, as the risk will be accounted for in the prototype testing stage.

We evaluate the cost, schedule, and technical risks. We do not calculate scope risks, as we are discussing the baseline upgrade and scope reductions would only be imposed to mitigate catastrophic overruns of cost or schedule.

The impact factors are calculated based on a total cost of $215K and a duration of 551 days (110 weeks, or 26 months). The Project Management Plan gives guidelines for calculating the impact factor according to the estimated cost and time, and for the Level 2 project the table below gives the impact factors for cost and schedule; technical impact is as given in the document.

Project      Very Low   Low        Moderate     High          Very High
Objective    0.05       0.1        0.2          0.4           0.8
-------------------------------------------------------------------------
Cost         insig.     <$11K      $11-22K      $22-44K       >$44K
Schedule     insig.     <28 days   28-56 days   56-112 days   >112 days

Next we evaluate the critical tasks, in the order that they appear in the schedule.
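The impact-factor table for the Level 2 project can be read as a simple lookup from an overrun to a factor. The sketch below is only an illustration of that reading: the function names are hypothetical, an "insignificant" overrun is modeled as zero or less, and a value falling exactly on a shared band edge ($22K, 56 days) is assumed to belong to the lower band, since the table does not say which way such ties go.

```python
# Impact bands transcribed from the Level 2 impact-factor table.
# Assumptions (not stated in the source): "insig." is modeled as an overrun
# of zero or less, and a value on a shared edge goes to the lower band.

def impact_factor(overrun, low_edge, moderate_edge, high_edge):
    """Return (level, factor) for an overrun, given the three band edges."""
    if overrun <= 0:
        return ("Very Low", 0.05)
    if overrun < low_edge:
        return ("Low", 0.1)
    if overrun <= moderate_edge:
        return ("Moderate", 0.2)
    if overrun <= high_edge:
        return ("High", 0.4)
    return ("Very High", 0.8)

def cost_impact(overrun_kusd):
    """Cost overrun in thousands of dollars."""
    return impact_factor(overrun_kusd, 11, 22, 44)

def schedule_impact(overrun_days):
    """Schedule overrun in days."""
    return impact_factor(overrun_days, 28, 56, 112)

print(schedule_impact(40))  # a 40-day slip
print(cost_impact(30))      # a $30K overrun
```

With these edges a 40-day slip is classified as Moderate (0.2) and a $30K overrun as High (0.4).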
1.3.2.2 Testing and Software work existing L2 Pulsar test stand:

The cost is covered in an existing project, so it is not considered here. However, if this effort takes longer than the scheduled four months, it will impact the overall schedule. The biggest risk here is not having the needed manpower, which could cause schedule delay. Problems with the hardware could also necessitate more debugging and engineering time, although as explained in the beginning, much has been done to minimize this risk. We also note that the schedule and manpower estimates are based on experience with the Run 2a system, so there is a high probability that the estimates are accurate.

We calculate the Schedule Risk by assigning a 50% overrun due to either of these causes (a Moderate impact factor of 0.2) and a probability of 20%, giving a total risk factor of 0.2 * 0.2 = 0.04. Mitigation can be achieved by adding manpower early in the project if problems start to arise, both to bolster debugging efforts and to ensure overlap of manpower.

The Technical Risk can be due to problems uncovered in debugging that degrade the performance of the Pulsar system. The probability of these problems occurring is small due to the extensive simulation already done and experience with the current Level 2 system interfaces. We assign a probability of 10% and an Impact factor of 0.8 (Very High, if the system is effectively useless for mission), giving a risk factor of 0.08. Mitigation of this is built into the schedule as engineering time for the Pre-Production L2 System (WBS 1.3.2.4.2).

1.3.2.3 Commission L2 Pulsar for each data path:

There are nine data paths which communicate with the L2 system, and since these are existing paths, the probability that they cannot eventually be made to work is small. However, schedule slippage could occur due to extra time needed to debug or to add resources to the project.
We calculate the Schedule Risk by assigning a 50% overrun due to either overcoming problems with the interfaces or adding additional resources (leading to a 3.5 month delay, or a High impact factor of 0.4). The probability we assign is 30%, due to the many data paths involved. The total Schedule risk factor is 0.3 * 0.4 = 0.12. Mitigation can be achieved by adding manpower early in the project if problems arise, both to bolster debugging efforts and to ensure overlap of manpower. Additionally, it is planned to test the interfaces in the first phase of commissioning the Pulsar test stand, so that problems can be caught as soon as possible.

The Technical Risk can be due to problems uncovered in debugging that make a data path essentially unworkable. The probability of these problems occurring is small due to experience with the current Level 2 system interfaces. The hardware interfaces for each data path are well understood, and much of the functionality is done in firmware. The engineers working on the Pulsar board have experience with some of the data paths. We assign a probability of 10% and an Impact factor of 0.8 (Very High, if the system is effectively useless for mission), giving a risk factor of 0.08. Mitigation of this is built into the schedule as engineering time for the Pre-Production L2 System (WBS 1.3.2.4.2). Additionally, there is a backup solution where data from a specific path can be taken from a Run 2a interface board by a transition board and converted into a more general format for input to the Pulsar. This solution could be used as a temporary method for moving forward with commissioning, and would become part of the permanent solution only in an extreme case.

Most of the rest of the entries in the WBS dictionary would have low risk probabilities assigned to them because potential problems should have been found in the test stand phase.
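The risk factors quoted for the two critical tasks all follow the same rule: probability multiplied by impact factor. The short sketch below reproduces that arithmetic for 1.3.2.2 and 1.3.2.3 using the probabilities and impact factors given in the text. The function name `risk_factor` and the task labels are illustrative only, and `Decimal` is used simply to avoid floating-point rounding noise in the printed products.

```python
from decimal import Decimal

def risk_factor(probability, impact):
    """Risk factor = probability * impact factor (both given as decimal strings)."""
    return Decimal(probability) * Decimal(impact)

# (label, probability, impact factor) as quoted for the two critical tasks
critical_tasks = [
    ("1.3.2.2 schedule",  "0.2", "0.2"),
    ("1.3.2.2 technical", "0.1", "0.8"),
    ("1.3.2.3 schedule",  "0.3", "0.4"),
    ("1.3.2.3 technical", "0.1", "0.8"),
]

for label, probability, impact in critical_tasks:
    print(f"{label:18s} {probability} * {impact} = {risk_factor(probability, impact)}")
```

The products are 0.04, 0.08, 0.12 and 0.08, matching the values quoted for these tasks.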
Preproduction Run of Pulsar L2 System:

1.3.2.4.2 Engineering on preproduction L2 system:
No cost is associated with this task, so no Cost Risk is assigned. Schedule slippage could occur if many problems found in the debugging phase of the Pulsar test stand take longer than expected to fix. We assume a 25% schedule overrun, leading to two weeks of slippage. We assign a probability of 30% to match the probability assigned for schedule risk in 1.3.2.3, giving a Schedule risk factor of 0.075.

1.3.2.4.3 Motherboards fabrication:
Cost and schedule overruns could be incurred if there is a manufacturing problem with the boards. Assuming the boards have to be completely remade gives a Moderate Cost Impact (0.2) and a Moderate Schedule Impact (0.2). The probability of this is low as we will use proven manufacturing and assembly houses, so we assign a probability of 10%. The Cost and Schedule Risk are both 0.02.

1.3.2.4.4 Mezzanine board fabrication:
Cost and schedule risks are small as these are simple boards and manufacturing problems are unlikely, so we assign a probability of 5% and a Moderate Cost and Schedule Impact (0.2), which assumes the boards have to be completely rebuilt. This leads to Cost and Schedule Risk factors of 0.01.

1.3.2.4.5 S-link Auxiliary boards:
Cost and schedule risks are small as these custom boards are very simple and the design will be based on existing boards. We assign a probability of 5%, a Moderate Schedule Impact (0.2) and a Low Cost Impact (0.1), which assumes the boards have to be completely rebuilt. This leads to a Cost Risk factor of 0.005 and a Schedule Risk factor of 0.01.

1.3.2.4.6 LSC/LDL + fiber boards:
Negligible cost, schedule, technical risk -- easily obtainable commercial product.

1.3.2.4.7 PCI -> S-link boards:
Negligible cost, schedule, technical risk -- easily obtainable commercial product.

1.3.2.4.8 S-link -> PCI boards:
Negligible cost, schedule, technical risk -- easily obtainable commercial product.
1.3.2.4.9 L2 Decision processor:
Negligible cost, schedule, technical risk -- easily obtainable commercial product.

1.3.2.5 Vertical Slice Test:
Schedule slippage could occur if unexpectedly difficult problems are found. We assign a probability Assuming that previous testing is successful, and given the number of commercial products used in the system, we feel the technical risk is small.

Production Run of Pulsar L2 System:

1.3.2.6.3 Motherboards fabrication:
Cost and schedule overruns could be incurred if there is a manufacturing problem with the boards. Assuming the boards have to be completely remade gives a Very High Cost Impact (0.8) and a Moderate Schedule Impact (0.2). The probability of this is lower than in pre-production, assuming few changes to the PC board, as we will use proven manufacturing and assembly houses, but we conservatively assign a probability of 10%. The Cost Risk is 0.08 and the Schedule Risk is 0.02.

1.3.2.6.4 Mezzanine board fabrication:
Cost and schedule risks are small as these are simple boards and manufacturing problems are unlikely, so we assign a probability of 5%, a High Cost Impact (0.4) and a Moderate Schedule Impact (0.2), which assumes the boards have to be completely rebuilt. This leads to a Cost Risk factor of 0.02 and a Schedule Risk factor of 0.01.

1.3.2.6.5 S-link Auxiliary boards:
Cost and schedule risks are small as these custom boards are very simple and the design will be based on existing boards. We assign a probability of 5%, a Moderate Schedule Impact (0.2) and a Low Cost Impact (0.1), which assumes the boards have to be completely rebuilt. This leads to a Cost Risk factor of 0.005 and a Schedule Risk factor of 0.01.

1.3.2.6.6 LSC/LDL + fiber boards:
Negligible cost, schedule, technical risk -- easily obtainable commercial product.

1.3.2.6.7 PCI -> S-link boards:
Negligible cost, schedule, technical risk -- easily obtainable commercial product.
1.3.2.6.8 S-link -> PCI boards:
Negligible cost, schedule, technical risk -- easily obtainable commercial product.

1.3.2.6.9 L2 Decision processor:
Negligible cost, schedule, technical risk -- easily obtainable commercial product.

1.3.2.7 System Integration standalone w/teststand:
If unexpectedly difficult problems occur, there is a risk of schedule slippage. Assuming a 50% overrun on the scheduled time of 3 months gives a Moderate Schedule impact (0.2). The system is complex, so there is a non-negligible chance of schedule overrun, but there is a great deal of experience available in commissioning the system. We assign a probability of 30% for schedule overrun. This gives a Schedule Risk factor of 0.06.
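As a reading aid, the risk factors quoted throughout this document can be collected and sorted to see which tasks carry the most risk. The sketch below is only a tabulation of the numbers stated in the text, not an additional analysis. The list layout and variable names are illustrative, and tasks described as negligible or given without a numerical risk factor are left out.

```python
# (WBS item, risk type, risk factor) as quoted in the text.
# Tasks described as negligible or without a stated factor are omitted.
quoted_risks = [
    ("1.3.2.2",   "schedule",  0.04),
    ("1.3.2.2",   "technical", 0.08),
    ("1.3.2.3",   "schedule",  0.12),
    ("1.3.2.3",   "technical", 0.08),
    ("1.3.2.4.2", "schedule",  0.075),
    ("1.3.2.4.3", "cost",      0.02),
    ("1.3.2.4.3", "schedule",  0.02),
    ("1.3.2.4.4", "cost",      0.01),
    ("1.3.2.4.4", "schedule",  0.01),
    ("1.3.2.4.5", "cost",      0.005),
    ("1.3.2.4.5", "schedule",  0.01),
    ("1.3.2.6.3", "cost",      0.08),
    ("1.3.2.6.3", "schedule",  0.02),
    ("1.3.2.6.4", "cost",      0.02),
    ("1.3.2.6.4", "schedule",  0.01),
    ("1.3.2.6.5", "cost",      0.005),
    ("1.3.2.6.5", "schedule",  0.01),
    ("1.3.2.7",   "schedule",  0.06),
]

# Sort from largest to smallest risk factor; sorted() is stable, so
# entries with equal factors keep their document order.
ranked = sorted(quoted_risks, key=lambda entry: entry[2], reverse=True)

for wbs, kind, factor in ranked[:6]:
    print(f"{wbs:10s} {kind:10s} {factor}")
```

Sorted this way, the largest quoted factor is the schedule risk for commissioning the data paths (1.3.2.3, 0.12), followed by the three entries at 0.08.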