Predicting and mitigating single-event upsets in DRAM using HOTH

2021 
Abstract There is a growing demand for commodity memory and storage solutions that make commercial aerospace ventures economically feasible. Existing radiation-hardened computer systems cannot meet this need alone. These hardened systems provide sufficient protection against the harsh environment of the upper atmosphere and low-Earth orbit, but they do so at dramatically increased cost and rely on commercially outdated architectures and fabrication technologies. If new aerospace systems can take advantage of the latest commodity memories, they can leverage advanced fabrication processes and economies of scale to control costs. Such systems would, however, require new strategies to maintain appropriate tolerance and/or resilience to faults caused by the harsh environment. In this work, we observe that single-event effects (SEEs) in recent-generation DRAM are not entirely random, and in fact are often highly predictable under neutron bombardment. We demonstrate the existence of a small number of weak cells responsible for the vast majority of single-bit SEEs. Based on this observation, we present a memory fault mapping and tolerance approach called HOTH to mitigate these predictable fault modes in conjunction with more random, unpredictable SEEs in DDR3 memory. In HOTH, both single- and multi-bit effects can be mitigated individually at runtime using a combination of existing error-correcting code techniques (Chipkill ECC) and a fault map framework. The HOTH fault map is stored in the same DRAM that is subject to SEEs, and it uses a fault-tolerance approach to mitigate SEEs that might appear in that part of the storage. Using data from different memory DIMMs, form factors, and radiation incidence angles, we show that HOTH can reduce the uncorrectable fault rate by at least ten orders of magnitude and increase mean time to failure to thousands of years, allowing extended service times in harsh environments.
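The observation that a few weak cells account for most single-bit SEEs can be illustrated with a minimal sketch. This is not the HOTH implementation: the function name `build_fault_map`, the cell addresses, and the upset log are all hypothetical and synthetic, chosen only to show how a small fault map might be derived from logged upsets.

```python
from collections import Counter

def build_fault_map(upset_log, coverage=0.9):
    """Pick the most frequently upset cells until they account for at
    least `coverage` of all logged single-bit upsets.

    Hypothetical helper for illustration; not taken from the paper.
    Returns (list of cell addresses, fraction of upsets covered).
    """
    counts = Counter(upset_log)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    fault_map, covered = [], 0
    for addr, n in counts.most_common():
        if covered / total >= coverage:
            break
        fault_map.append(addr)
        covered += n
    return fault_map, covered / total

# Synthetic log (assumption): two weak cells dominate, plus nine one-off upsets.
log = [0x1A2B] * 61 + [0x3C4D] * 30 + list(range(1, 10))
weak_cells, fraction = build_fault_map(log)
print([hex(a) for a in weak_cells], fraction)
```

In this synthetic example, two of the eleven affected cells cover 91% of the logged upsets, so a map of only those cells would handle the predictable faults and leave the remaining, more random upsets to ECC.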