G-SEPM: building an accurate and efficient soft error prediction model for GPGPUs

2021 
As general-purpose GPUs (GPGPUs) become ubiquitous in large-scale HPC systems, ensuring the reliable execution of such systems in the presence of soft errors is increasingly essential. To understand how resilient GPU programs are to soft errors, researchers typically rely on random Fault Injection (FI) to evaluate the error tolerance of programs. However, obtaining a statistically significant resilience profile with FI is expensive, and random FI is not suited to identifying all the error-critical fault sites of a GPU program. To address these challenges, we build a GPGPU-based Soft Error Prediction Model (G-SEPM) that can replace FI to estimate the resilience characteristics of individual fault sites accurately and efficiently. We observe that instruction type, bit position, bit-flip direction, and error propagation information can characterize fault site resiliency. Leveraging these heuristic features, G-SEPM uses a machine learning model to reveal the hidden interactions between fault site resiliency and the observed features. Experimental results demonstrate that G-SEPM achieves high accuracy for fault site error estimation and critical fault site identification while introducing negligible overhead. In addition, G-SEPM can provide essential insight for programmers and architects to design more cost-effective soft error mitigation solutions.
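To make the feature list concrete, the following is a minimal sketch of how a single-bit flip and three of the named features (instruction type, bit position, bit-flip direction) might be represented. It is an illustration only, not the paper's implementation: the `FaultSite` record, the function names, the 32-bit register width, and the instruction-type labels are all assumptions, and error propagation information is left out because the abstract does not say how it is measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultSite:
    """Hypothetical record for one fault site (field names are illustrative)."""
    instruction_type: str  # e.g. "add" or "load"; labels are assumptions
    bit_position: int      # 0 is the least significant bit
    flip_direction: str    # "0->1" or "1->0"


def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Return `value` with one bit inverted, modeling a single-bit soft error."""
    if not 0 <= bit < width:
        raise ValueError("bit position out of range")
    mask = (1 << width) - 1  # keep the result within the assumed register width
    return (value ^ (1 << bit)) & mask


def flip_direction(value: int, bit: int) -> str:
    """Report whether flipping `bit` in `value` turns a 1 into a 0 or a 0 into a 1."""
    return "1->0" if (value >> bit) & 1 else "0->1"


def describe_site(instruction_type: str, value: int, bit: int) -> FaultSite:
    """Bundle the three features for one (instruction, value, bit) combination."""
    return FaultSite(instruction_type, bit, flip_direction(value, bit))
```

A record such as `describe_site("add", 0b1010, 1)` could then serve as one input row for a learned model, with the observed outcome of the corrupted run as its label.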