Insights into Application-level Solutions towards Resilient MPI Applications

2018 
Current petascale systems, formed by hundreds of thousands of cores, are highly dynamic, and as a result their hardware failure rates are relatively high. Failure data collected from two large high-performance computing sites were analysed in [1], showing failure rates from 20 to more than 1,000 failures per year, depending mostly on system size. At 1,000 failures per year, this translates into one failure roughly every 8.7 hours. Future exascale systems, formed by several millions of cores, will be hit by errors and faults even more frequently due to their scale and complexity [2]. Thus, long-running applications on these systems will need to use fault tolerance techniques to ensure that their execution completes successfully.