This is a short email in the 'no news is good news' camp. In short: I've looked for a particular failure mode in the far detector and found no evidence of it. We need to decide whether it warrants regular checks. Experts should read on.

The problem:
At CalDet this summer, we found an intermittent problem: a particular timing card in a particular crate was correctly serving buffer swaps to the VARCs in the crate, but was NOT correctly serving VME interrupts to the ROP. This resulted in a completely corrupted data stream: as soon as a swap was lost in this way, the data stream began associating hits (which carry a TDC time stamp) with the wrong second (the second is tacked on by the ROP). This problem would eventually make the ROP fall over... but not until many minutes had passed.

The scary thing:
The timing card in question was a refitted card that had come straight from Soudan a few weeks before.

The really scary thing:
We saw this problem at CalDet because a lot of attention was being paid to end-to-end timing in order to sort out near/far timing issues. No such attention was being paid to the data at Soudan.

The extremely scary thing:
This failure mode is _worst_ if it happens rarely. If a single crate misses a buffer at the start of a run, that crate will from then on be serving out corrupt data. One lost buffer means 1/20th of the data coming out has the wrong timestamp and will therefore not be correctly associated with triggers... this data is forever lost. The remaining 95% of the data would look just fine, however... so analysis of the data would not necessarily show that anything was wrong. We'd just have dead parts of the detector at certain times. The ROP would be one buffer out at the end, but no one would know.

How to look for this:
There are several ways to identify this problem:

(1) Take untriggered data. Look for time frames where the timestamps from a crate make a large jump, or where the ordering of the hits in the timeframe (assuming no TP sorting) is not aligned with the timeframe. This is how we first diagnosed the problem at CalDet.

(2) Perform similar tests at the Trigger Processor as a data integrity check in situ. This would require some smart logic in the TP, and is probably not feasible; the TP is in general not responsible for "detector wellness" as I understand it. I leave this in only for completeness.

(3) Take a long pedestal run, where executes are pulsing all crates for a known number of pulses, then turn off, then start again. Count all the pulses... look for any discrepancies.
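The timestamp check in method (1) can be sketched roughly as follows. This is a minimal illustration only, not the code used at CalDet: the hit layout `(timeframe, crate, timestamp)`, the function name `suspicious_timeframes`, and the `max_gap` threshold are all assumptions made for the example.

```python
def suspicious_timeframes(hits, max_gap):
    """Flag (crate, timeframe) pairs whose timestamps look misaligned.

    hits    -- iterable of (timeframe, crate, timestamp) in readout order
               (hypothetical layout; timestamp in arbitrary ticks)
    max_gap -- largest forward step between consecutive hits from one
               crate that is still considered normal (assumed threshold)
    """
    last_seen = {}   # crate -> timestamp of the previous hit from that crate
    flagged = set()
    for timeframe, crate, timestamp in hits:
        previous = last_seen.get(crate)
        if previous is not None:
            step = timestamp - previous
            # A backwards step means the hits are out of order;
            # a very large forward step is the "large jump" case.
            if step < 0 or step > max_gap:
                flagged.add((crate, timeframe))
        last_seen[crate] = timestamp
    return sorted(flagged)


# Toy data: crate 0 jumps forward in timeframe 2, crate 1 goes backwards.
example_hits = [
    (1, 0, 10), (1, 0, 20), (2, 0, 30), (2, 0, 500), (3, 0, 510),
    (1, 1, 10), (2, 1, 5),
]
print(suspicious_timeframes(example_hits, max_gap=100))
```

A sensible `max_gap` depends on the hit rate of the crate, so it would have to be tuned on real untriggered data.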

I've used this last method in a one hour run that Tass and Jeff Hartnell took for me on October 16 (run 9458). Crate counts looked like this:

tf   cr0   cr1   cr2  ...
---------------------------------------------------------------------
1      0     0     0     0     0     0     0     0     0     0     0   
2   6534  6534  6534  6534  8712  8712  6534  6534  6534  6534  2178
3     66    66    66    66    88    88    66    66    66    66    22  
4      0     0     0     0     0     0     0     0     0     0     0   
5   6534  6534  6534  6534  8712  8712  6534  6534  6534  6534  2178
6     66    66    66    66    88    88    66    66    66    66    22  

This continued in this repeating pattern through 3600 seconds. This means that each crate had ~20000 buffers full of pedestals.
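The counting check can be sketched as a comparison of every timeframe against the first cycle of the pattern. This is an illustrative sketch, not the script actually used on run 9458: the function name `count_discrepancies` is hypothetical, crates are identified only by column index, and the first cycle is assumed to be good.

```python
def count_discrepancies(rows, period):
    """Compare per-crate counts against the first cycle of a repeating pattern.

    rows   -- list of per-timeframe lists of crate counts
    period -- number of timeframes in one cycle of the pattern
    Returns (timeframe, crate_column, got, expected) for every mismatch,
    with timeframes numbered from 1 as in the table.
    """
    reference = rows[:period]   # assumption: the first cycle is correct
    mismatches = []
    for i, row in enumerate(rows):
        expected_row = reference[i % period]
        for crate, (got, expected) in enumerate(zip(row, expected_row)):
            if got != expected:
                mismatches.append((i + 1, crate, got, expected))
    return mismatches


# One cycle of the counts shown in the table (11 columns).
PATTERN = [
    [0] * 11,
    [6534, 6534, 6534, 6534, 8712, 8712, 6534, 6534, 6534, 6534, 2178],
    [66, 66, 66, 66, 88, 88, 66, 66, 66, 66, 22],
]
rows = PATTERN * 2   # the six timeframes listed above

print(count_discrepancies(rows, period=3))   # prints [] for the table data
```

Because the reference is taken from the run itself, a crate that was already wrong in the first cycle would go unnoticed; comparing against the known number of pulses would close that gap.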

Not one pedestal event was missed by any crate, meaning that the mean failure rate is somewhere less than one buffer in 20000.
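Zero failures in ~20000 buffers can be turned into a rough statistical bound. The sketch below assumes that every buffer fails independently with the same probability p (an assumption; an intermittent hardware fault need not behave this way). Seeing no failures in N buffers then has probability (1-p)^N, and setting that equal to 1-CL gives the upper limit on p. The function name and the 95% confidence level are choices made for the example.

```python
import math


def zero_failure_upper_limit(n_trials, confidence=0.95):
    """Upper limit on the per-trial failure probability when zero failures
    were seen in n_trials independent trials: solve (1 - p)**n = 1 - CL."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


n_buffers = 20000   # approximate number of buffers per crate in the run
limit = zero_failure_upper_limit(n_buffers)

# For large N this is close to -ln(1 - CL) / N, i.e. roughly 3 / N at 95%.
approx = -math.log(1.0 - 0.95) / n_buffers

print(f"95% upper limit: {limit:.3e} per buffer")
print(f"approximation:   {approx:.3e} per buffer")
```

Under these assumptions the 95% upper limit is about 1.5e-4 per buffer, or roughly 3 buffers in 20000; a longer run would tighten it in proportion.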

Conclusion:
The problem seen at CalDet is not present in the current configuration at Soudan. There is little evidence to show that it was NEVER present. I suspect that the observed failure was simply a mis-seated board and is unlikely to be seen again, but I am unable as yet to prove this is so. If we wish to use data taken in the last year or so for publication, I believe we should make some effort to prove this problem does not exist.

----Nathaniel Tagg


tagg
Last modified: Thu Oct 17 16:30:07 BST 2002