The problem:
At CalDet this summer, we found an intermittent
problem: a particular timing card in a particular crate was correctly
serving buffer swaps to the VARCs in the crate, but was NOT correctly
serving VME interrupts to the ROP. This resulted in a completely
corrupted data stream: as soon as a swap was lost in this way, the
data stream associated hits (each carrying a TDC time stamp) with the
wrong second (the second is tacked on by the ROP).
This problem would eventually make the ROP fall over... but not until
many minutes had passed.
The scary thing:
The timing card in question was a refitted card
that had come straight from Soudan a few weeks before.
The really scary thing:
We saw this problem at CalDet because a
lot of attention was being paid to end-to-end timing in order to sort
out near/far timing issues. No such attention was being paid to the
data at Soudan.
The extremely scary thing:
This failure mode is _worst_ if it
happens rarely. If a single crate misses a buffer at the start of a
run, that crate will from then on be serving out corrupt data. One
lost buffer means 1/20th of the data coming out has the wrong
timestamp and will therefore not be correctly associated with
triggers... this data is forever lost. The remaining 95% of the data
would look just fine, however... so analysis of data would not
necessarily see that anything was wrong. We'd just have dead parts
of the detector at certain times. The ROP would be one buffer out at
the end, but no one would know.
How to look for this:
There are several ways to identify this problem:
(1) Take untriggered data. Look for time frames where the timestamps from a crate make a large jump, or where the ordering of the hits in the timeframe (assuming no TP sorting) is not aligned with the timeframe. This is how we first diagnosed the problem at CalDet.
(2) Perform similar tests at the Trigger Processor as an in-situ data integrity check. This would require some smart logic in the TP, and is probably not feasible; the TP is in general not responsible for "detector wellness" as I understand it. I leave this in only for completeness.
(3) Take a long pedestal run, where executes pulse all crates for a known number of pulses, then turn off, then start again. Count all the pulses and look for any discrepancies.
I used this last method on a one-hour run that Tass and Jeff Hartnell took for me on October 16 (run 9458). Crate counts looked like this:
tf    cr0   cr1   cr2  ...
---------------------------------------------------------------------
 1      0     0     0     0     0     0     0     0     0     0     0
 2   6534  6534  6534  6534  8712  8712  6534  6534  6534  6534  2178
 3     66    66    66    66    88    88    66    66    66    66    22
 4      0     0     0     0     0     0     0     0     0     0     0
 5   6534  6534  6534  6534  8712  8712  6534  6534  6534  6534  2178
 6     66    66    66    66    88    88    66    66    66    66    22
This continued in this repeating pattern through 3600 seconds. This means that each crate had ~20000 buffers full of pedestals.
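The comparison behind method (3) can be sketched in a few lines of Python. This is only an illustration, not the code actually used for run 9458: the function name, the list-of-rows layout, and the choice of the first three timeframes as the reference cycle are all assumptions. The counts themselves are copied from the six timeframes shown above.

```python
# Counts copied from the table above: one row per timeframe, one column
# per crate. The variable and function names are illustrative only.
ONE_CYCLE = [
    [0] * 11,
    [6534] * 4 + [8712] * 2 + [6534] * 4 + [2178],
    [66] * 4 + [88] * 2 + [66] * 4 + [22],
]
counts = ONE_CYCLE + ONE_CYCLE  # timeframes 1 to 6


def find_discrepancies(rows, period):
    """List (timeframe, crate column, expected, observed) for each mismatch.

    Assumes the first `period` rows are a good reference cycle.
    """
    reference = rows[:period]
    mismatches = []
    for index, row in enumerate(rows):
        expected_row = reference[index % period]
        for crate, (expected, observed) in enumerate(zip(expected_row, row)):
            if observed != expected:
                mismatches.append((index + 1, crate, expected, observed))
    return mismatches


print(find_discrepancies(counts, period=3))  # prints [] : no pulses missing
```

A crate that dropped a buffer would show up as a row whose count falls short of the reference cycle, so the returned list points straight at the timeframe and crate column involved.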
Not one pedestal event was missed by any crate, meaning that the mean failure rate is somewhere less than one buffer in 20000.
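To put a rough number on "less than", one can ask what per-buffer failure probability would still be compatible with seeing zero failures. The sketch below assumes each buffer fails independently with the same probability (a simple binomial model, which is my assumption and not something the run itself establishes) and solves (1 - p)^N = 1 - CL for p. The function name is hypothetical.

```python
def zero_failure_upper_limit(n_trials, confidence=0.95):
    """Upper limit on the per-trial failure probability after observing
    zero failures in n_trials independent trials (binomial assumption)."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


# ~20000 pedestal buffers per crate, none missed
limit = zero_failure_upper_limit(20000)
print(f"95% CL upper limit per buffer: {limit:.2e}")
print(f"roughly one buffer in {1.0 / limit:.0f}")
```

Under these assumptions the 95% limit comes out near 1.5e-4 per buffer, or roughly one buffer in 6,700 per crate, which is close to the familiar "3/N" rule of thumb. A longer run would tighten the limit in proportion to the number of buffers counted.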
Conclusion:
The problem seen at CalDet is not present in the
current configuration at Soudan. There is little evidence to show
that it was NEVER present. I suspect that the observed failure was
simply a mis-seated board and is unlikely to be seen again, but I am
unable as yet to prove this is so. If we wish to use data taken in the
last year or so for publication, I believe we should make some effort
to prove this problem did not exist when that data was taken.
----Nathaniel Tagg