Coalescing and Deduplicating Incremental Checkpoint Files for Restore-Express Multi-Level Checkpointing

2018 
In multicore systems, much of the checkpointing time overhead can be hidden from the execution critical path by a dedicated checkpointing thread that runs concurrently with the regular execution threads and compresses checkpoint files. Restore time, by contrast, lies on the critical path and cannot be hidden, so accelerating execution restore after failures matters most. This work pursues a restore-express (REX) strategy for multi-level checkpointing (MLC), applicable to any incremental checkpointing (IC) scheme. Oblivious to application code, REX employs adaptive IC (AIC) for local (L1) checkpointing and follows our runtime control for second-level (L2) checkpointing, aiming at express restore from failures while holding down the overall execution time. It takes advantage of two insights for overhead reduction: (1) the modified pages of an incremental checkpoint file are likely to appear again in a subsequent checkpoint file, and (2) many data patterns (on average, some 40 percent of them) stay unchanged from one L2 checkpoint file to the next. These insights enable REX to (1) coalesce IC files (by keeping only the last copy of every dirty page among the files) and (2) boost file compression across multiple L2 checkpoints. Time and storage overhead results of REX during normal job execution are gathered for 16 benchmarks from the SPEC, PARSEC, and NPB suites. The restore-time evaluation confirms that REX is fast, quickening restore by 4.5× compared with its IC counterpart (which does not use the two insights), while incurring the same execution time overhead.
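The two mechanisms named above can be illustrated with a minimal Python sketch. It is not the REX implementation: it assumes each incremental checkpoint is a mapping from page number to page bytes, and it stands in for "unchanged data patterns" with whole-page SHA-256 hashing, which may differ from the granularity the paper uses. The names `coalesce`, `dedup_checkpoint`, and `restore` are hypothetical.

```python
import hashlib


def coalesce(incremental_checkpoints):
    """Merge incremental checkpoints (ordered oldest to newest).

    Only the last copy of every dirty page survives, so a restore
    reads each page once instead of replaying every file.
    """
    merged = {}
    for checkpoint in incremental_checkpoints:
        merged.update(checkpoint)  # newer copies overwrite older ones
    return merged


def dedup_checkpoint(checkpoint, store):
    """Record a checkpoint as page -> digest references.

    `store` maps digest -> page bytes and is shared across successive
    checkpoints (an assumption of this sketch); content already in the
    store is not written again. Returns the references and the number
    of pages that actually had to be stored.
    """
    refs = {}
    newly_stored = 0
    for page_no, data in checkpoint.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest not in store:
            store[digest] = data
            newly_stored += 1
        refs[page_no] = digest
    return refs, newly_stored


def restore(refs, store):
    """Rebuild the page image from references and the shared store."""
    return {page_no: store[digest] for page_no, digest in refs.items()}
```

As a usage example, coalescing `{0: b"A", 1: b"B"}` followed by `{1: b"C", 2: b"D"}` keeps pages 0, 1, and 2 with page 1 holding `b"C"`. Passing the same `store` to `dedup_checkpoint` for successive L2 checkpoints means that only pages whose content has not been seen before are written.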