PerfEstimator: A Generic and Extensible Performance Estimator for Data Parallel DNN Training

2021 
Understanding the performance of data parallel DNN training at large scale is crucial for supporting efficient DNN cloud deployment and for facilitating the design and optimization of scalable DNN systems. Existing works adopt analytical modeling, which may fall short in capturing the system behaviors that result from fast-evolving DNN systems and continually proposed optimizations. In this paper, we present PerfEstimator, a generic and extensible estimator for accurate performance estimation of large-scale data parallel DNN training. PerfEstimator is driven by three major components: an extensible attributed-graph-based performance model, a computation and synchronization profiling and simulation tool for obtaining runtime costs on a single machine, and a computation-synchronization pipeline builder that derives the scaling factors. Our evaluation shows that PerfEstimator can accurately predict the performance of data parallel DNN training jobs, with a prediction error of 0.2-11%.
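The idea of deriving a scaling factor from a computation-synchronization pipeline can be illustrated with a minimal sketch. This is not PerfEstimator's implementation. It assumes a simplified model in which backward computation runs from the last layer to the first, each layer's gradient synchronization may start once that layer's backward pass finishes, and all synchronization shares one FIFO communication channel. The names `LayerCost`, `iteration_time_ms`, and `scaling_factor` are hypothetical, and the per-layer costs are made-up inputs standing in for profiled values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayerCost:
    """Hypothetical per-layer costs, e.g. obtained by profiling (milliseconds)."""
    forward_ms: float
    backward_ms: float
    sync_ms: float  # time to synchronize this layer's gradients


def iteration_time_ms(layers: List[LayerCost]) -> float:
    """Estimate one training iteration under the assumed pipeline model.

    Assumption: synchronization overlaps with backward computation but is
    serialized on a single communication channel.
    """
    compute_clock = sum(layer.forward_ms for layer in layers)
    comm_free_at = 0.0
    # Backward pass visits layers from last to first.
    for layer in reversed(layers):
        compute_clock += layer.backward_ms
        # Sync starts when the gradient is ready and the channel is free.
        sync_start = max(compute_clock, comm_free_at)
        comm_free_at = sync_start + layer.sync_ms
    # The iteration ends when both computation and synchronization are done.
    return max(compute_clock, comm_free_at)


def scaling_factor(layers: List[LayerCost]) -> float:
    """Ratio of compute-only time to the estimated distributed iteration time."""
    compute_only = sum(layer.forward_ms + layer.backward_ms for layer in layers)
    return compute_only / iteration_time_ms(layers)


if __name__ == "__main__":
    example = [LayerCost(1.0, 2.0, 3.0), LayerCost(1.0, 2.0, 1.0)]
    print(iteration_time_ms(example), scaling_factor(example))
```

Under this assumed model, a scaling factor of 1.0 means synchronization is fully hidden behind computation, and smaller values indicate exposed communication time.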