Error Correlation Prediction in Lockstep Processors for Safety-Critical Systems

2018 
This paper presents a new phenomenon called error correlation prediction for lockstep processors. Lockstep processors run the same copy of a program, and their outputs are compared at every cycle to detect divergence, and have been popular in safety-critical systems. When the lockstep error checker detects an error, it alerts the safety-critical system by putting the lockstep processor in a safe state in order to prevent hazards. This is done by running the online diagnostics to identify the cause of the error because the lockstep processor has no knowledge of whether the error is caused by a transient or permanent fault. The online diagnostics can be avoided if the error is caused by a transient fault, and the lockstep processor can recover from it. If, however, it is caused by a permanent fault, having prior knowledge about error's likely location(s) within the CPU speeds up the diagnostics process. We discover that the error's type and likely location(s) inside CPUs from which the fault may have originated can be predicted by analyzing the output signals of the CPU(s) when the error is detected. We design a simple static predictor exploiting this phenomenon and show that system availability can be increased by 42-64% with an overhead of less than 2% in silicon area and power.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    32
    References
    2
    Citations
    NaN
    KQI
    []