I just got paged, what now?
Check the wiki for the beginning of this page.
Advanced troubleshooting
This section is mainly intended for the backup pager carrier trying to
diagnose problems. Work through the steps in the order given, and
stop once you have fixed the problem.
- If there has been recent work on a crate that shows input or
output errors, check that the connections have not been knocked
loose by the work.
- If there are input errors on the finder or internal errors on an
axial finder, mask the appropriate
input channels. This is most likely a broken XTC that should be
swapped on the next access.
- If there are internal errors in a stereo finder (or L2 output
errors) try masking on one XTC at a
time. If the problem goes away it is probably an XTC timing
problem, which should probably be remedied by swapping the XTC.
For L2 problems, make sure you test all XTCs connected to the
Finder, not just the ones connected to the chip in question.
- If the errors are Finder or Linker/SLAM related, test the board
by swapping it with a neighbour or, even better, a board in a
different crate. For SLAM boards, take the A/B designs into
account. If the problem moves with the board, replace the board
with a spare (or try reloading the firmware first).
- If the error is an I/O error try swapping cables/fibers with
spares or neighbours to see if the problem moves or goes away. If
it does, use a spare fiber/cable.
- Ok, you now are stuck with an error that is intrinsic to the
slot, not the board, but not a cable problem. Let's hope it is
some sort of timing problem. For the axial path, mask on all XTCs
leading up to the place in question (including neighbours); this
should make the errors disappear, since only ones then flow through
the system. For the stereo path do the same, but mask on only one
XTC at a time to avoid creating problems by masking an entire
finder chip.
- Ok, you now are stuck with an error that is intrinsic to the
slot, not the board, but not a cable or XTC timing problem. Good
Luck! Just kidding, you might now really try to get a hold of me.
It sounds interesting.
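The swap tests above amount to a small decision procedure. The following is only a sketch of that reasoning in Python, not a tool that exists in the XFT software: the function name, its arguments, and the returned labels are all made up for illustration.

```python
def localize_fault(moves_with_board, moves_with_cable, clears_with_xtcs_masked_on):
    """Classify a fault from the outcome of the swap and masking tests.

    All three arguments are booleans that you determine by hand.
    The labels returned here are hypothetical, chosen for this sketch.
    """
    if moves_with_board:
        # The error followed the board to its new slot.
        return "board: reload the firmware or replace with a spare"
    if moves_with_cable:
        # The error followed (or went away with) the cable or fiber.
        return "cable: use a spare fiber/cable"
    if clears_with_xtcs_masked_on:
        # The error is tied to the slot and vanishes when XTCs are masked on.
        return "xtc timing: swap the XTC"
    # Tied to the slot, and not a cable or XTC timing problem.
    return "slot: call the expert"


print(localize_fault(False, True, False))
```

The order of the checks mirrors the order of the bullets: board first, then cable, then XTC timing, with the slot itself as the last resort.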
SLAM troubleshooting
You're carrying the SLAM backup and you've been paged.
The following is a list of information and steps to help you
resolve the problem.
- Current Status: Before checking anything, learn about recent
Stereo Finder maintenance. An unseated
cable can manifest itself in strange ways, and knowing if
recent operations could have unseated cables
provides important context for diagnosing communication
errors.
- Stereo Finder to SLAM:
Check to see if there are errors communicating
between the Stereo Finders and the SLAM
by looking at the TrigMon plots:
XSLMonitorXSLD_XSFD_PixelDataErrorFinderSlotCanvas
and XSLMonitorXSLD_XSFD_PixelDataErrorSlamSLCanvas.
- SLAM to Tracklist:
Check to see if there are errors communicating
between the SLAM and the tracklist board by looking
at the TrigMon plots:
TracklistMonitorTracklistMon_XSLD_TP2D_InvalidTrackWordDataErrorCanvas
and
TracklistMonitorTracklistMon_XSLD_TP2D_TrackWordDataErrorCanvas
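As a quick reference, the plots above can be grouped by the link they monitor. The sketch below is illustrative only: the plot names are copied from the list above, but the dictionary keys and the helper function are hypothetical and not part of TrigMon. The upstream link is listed first because, as the warning later in this section explains, Stereo Finder input problems can also show up as tracklist errors.

```python
# Plot names are copied from the list above.
# The keys ("finder_to_slam", "slam_to_tracklist") are labels invented for this sketch.
TRIGMON_PLOTS = {
    "finder_to_slam": (
        "XSLMonitorXSLD_XSFD_PixelDataErrorFinderSlotCanvas",
        "XSLMonitorXSLD_XSFD_PixelDataErrorSlamSLCanvas",
    ),
    "slam_to_tracklist": (
        "TracklistMonitorTracklistMon_XSLD_TP2D_InvalidTrackWordDataErrorCanvas",
        "TracklistMonitorTracklistMon_XSLD_TP2D_TrackWordDataErrorCanvas",
    ),
}


def links_to_check(plots_with_errors):
    """Return the links that have a plot showing errors, upstream link first."""
    bad = set(plots_with_errors)
    # Dictionaries keep insertion order, so finder_to_slam is always reported first.
    return [link for link, plots in TRIGMON_PLOTS.items() if bad.intersection(plots)]


print(links_to_check([
    "TracklistMonitorTracklistMon_XSLD_TP2D_TrackWordDataErrorCanvas",
    "XSLMonitorXSLD_XSFD_PixelDataErrorSlamSLCanvas",
]))
```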
Consider the following tools for addressing problems:
- Power cycle: If the board is in a funny state,
power cycling the crate might resolve the issue in a
way that an HRR/new run will not. This
is a low-overhead option to resolve problems.
- Reseat cables:
If there is an error in a single finder board
in the
XSLMonitorXSLD_XSFD_PixelDataErrorFinderSlotCanvas,
then the problem is with the connection between the boards or
with the Stereo Finder board itself.
Try reseating the fibers on the associated stereo finder
or reseating the board itself.
- SLAM fiber input/output is correlated!
WARNING: The output to the tracklist board uses the same signals as the
input from the Stereo Finder boards. In other words, a problem with
input from the Stereo Finders can cause errors in transmission to
the tracklist board.
If you see any errors in the XSLD_XSFD
TrigMon plots, go check the associated finder boards. Resolving the
finder-SLAM communication problem will likely resolve the
SLAM-tracklist communication problem.
- Swap boards: If problems persist despite adjustments
to the board/cable seating, try swapping the board with
an equivalent neighbor. Keep in mind that there is a SLAM
even/odd design, so you need to swap not with the adjacent
SLAM board, but with the next-nearest one. If the problem
stays with the slot in the crate, then it is likely a stereo
finder communication problem.
- Replacing a board with a spare: Spare boards
are located in racks on the second floor of B0, in the
first "test stand" room. If you need to replace a
SLAM with a spare, you will need to check that
the SLAM has the most up-to-date firmware
version stored in its flash memory.
- Log into
b0gateway as cdf_xft.
This account is the only one set up to
run xftdaq. If you try to run xftdaq using
your own account, it will probably segfault.
- Start xftdaq.
- Check the firmware version stored in the
flash. Use the GUI to navigate to the board
that you are interested in, then click the
button to check the version.
Version # 21XX means an ODD design, XX is the version
Version # 20XX means an EVEN design, XX is the version
Version # 3416 means pass-thru design
- Check the firmware version in the adjacent
SLAM board. The adjacent board will have a
design that is either even or odd with a
specific version number. The new board
needs to have the complementary design
with the same version number.
- Upload the appropriate firmware to the
target board. The firmware is located in:
/cdf/onln/data/cdf_xft/XFTbin/slam/
or
/cdf/onln/home/cdf_xft/firmware
For convenience, symlinks have been
set up as SLAM_odd_2023_3416.pof
and SLAM_even_2023_3416.pof
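The version-number convention and the "complementary design, same version" rule can be written down as a short Python sketch. This is not part of xftdaq: both function names are hypothetical, and the sketch assumes the version shown in the GUI can be read as a plain four-digit integer.

```python
def decode_slam_version(version):
    """Return (design, revision) for a SLAM flash version number.

    Assumes a four-digit integer: 21XX is odd, 20XX is even,
    and 3416 is the pass-thru design (which has no XX revision).
    """
    if version == 3416:
        return ("pass-thru", None)
    prefix, revision = divmod(version, 100)
    if prefix == 21:
        return ("odd", revision)
    if prefix == 20:
        return ("even", revision)
    raise ValueError("unrecognized SLAM firmware version: %d" % version)


def required_version(adjacent_version):
    """Version the new board needs, given the version of the adjacent SLAM."""
    design, revision = decode_slam_version(adjacent_version)
    if design == "odd":
        return 2000 + revision  # neighbour is odd, so the new board must be even
    if design == "even":
        return 2100 + revision  # neighbour is even, so the new board must be odd
    raise ValueError("adjacent board holds the pass-thru design")


print(decode_slam_version(2105))
print(required_version(2105))
```

For example, if the adjacent board reports 2105 (odd design, revision 5), the new board should end up reporting 2005 after the upload.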