WARM: Workload-Aware Reliability Management in Linux/Android

2017 
With CMOS scaling beyond 14 nm, reliability is a major concern for IC manufacturers. Reliability-aware design has a non-negligible overhead and cannot account for user experience in mobile devices. An alternative is dynamic reliability management (DRM), which counteracts degradation by adapting the operating conditions at runtime. In this paper, we formulate DRM, for the first time, as an optimization problem that accounts for reliability, temperature, and performance. We develop an optimal policy for multicores using convex optimization, and show that it is not feasible to implement on real systems. For this reason, we propose workload-aware reliability management (WARM), a fast DRM technique that adapts to diverse workload requirements to trade off reliability and user experience. WARM is implemented and tested on a real Android device. WARM approximates the solution of the convex solver within 5% on average, while executing more than $400\times$ faster. WARM integrates a thermal controller that allocates tasks to meet thermal constraints; this is required because degradation strongly depends on temperature. We show that WARM meets temperature constraints within 5% in 87.5% more cases than the state of the art. We show that WARM task allocation achieves up to one year of lifetime improvement for a multicore platform. It can achieve up to 100% performance improvement on cluster architectures, such as big.LITTLE, while still guaranteeing the reliability target. Finally, we show that it achieves performance within 4% of the maximum for a broad range of applications, while meeting the reliability constraints.