Complex scientific applications made fault-tolerant with the sparse grid combination technique
2016
Ultra-large–scale simulations via solving partial differential equations (PDEs) require very large computational systems for their timely solution. Studies shown the rate of failure grows with the system size, and these trends are likely to worsen in future machines. Thus, as systems, and the problems solved on them, continue to grow, the ability to survive failures is becoming a critical aspect of algorithm development. The sparse grid combination technique (SGCT) which is a cost-effective method for solving higher dimensional PDEs can be easily modified to provide algorithm-based fault tolerance.In this article, we describe how the SGCT can produce fault-tolerant versions of the Gyrokinetic Electromagnetic Numerical Experiment plasma application, Taxila Lattice Boltzmann Method application, and Solid Fuel Ignition application. We use an alternate component grid combination formula by adding some redundancy on the SGCT to recover data from lost processes. User-level failure mitigation (ULFM) message pass...
Keywords:
- Correction
- Source
- Cite
- Save
- Machine Reading By IdeaReader
28
References
15
Citations
NaN
KQI