Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce

2021 
All-reduce is the key communication primitive in distributed data-parallel training because of its high performance in homogeneous environments. However, all-reduce is sensitive to stragglers and communication delays, which become more common as deep learning is increasingly deployed in heterogeneous environments such as the cloud. In this paper, we propose and analyze a novel variant of all-reduce, called partial-reduce, which provides high heterogeneity tolerance and performance by decomposing the synchronous all-reduce primitive into parallel, asynchronous partial-reduce operations. We provide theoretical guarantees, proving that partial-reduce converges to a stationary point at a sublinear rate similar to that of distributed SGD. To enforce convergence of the partial-reduce primitive, we further propose a dynamic staleness-aware distributed averaging algorithm and implement a novel group generation mechanism that prevents possible update isolation in heterogeneous environments. We build a prototype system on a real production cluster and validate its performance under different workloads. The experiments show that it is 1.21x-2x faster than other state-of-the-art baselines.
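
The following is a minimal, self-contained sketch of the idea the abstract describes: instead of one synchronous all-reduce over every worker, only the workers that are currently ready average their gradients, and stale contributions are down-weighted. The function names, the linear staleness weighting, and the max_staleness parameter are illustrative assumptions, not the paper's exact algorithm or communication layer.

    import numpy as np

    def staleness_weight(staleness: int, max_staleness: int = 8) -> float:
        """Hypothetical linear decay: fresher updates get weight closer to 1."""
        return max(0.0, 1.0 - staleness / max_staleness)

    def partial_reduce(local_grads: dict[int, np.ndarray],
                       ready_workers: list[int],
                       staleness: dict[int, int]) -> np.ndarray:
        """Average gradients over the ready subgroup only, not all workers."""
        weights = {w: staleness_weight(staleness[w]) for w in ready_workers}
        total = sum(weights.values()) or 1.0
        return sum(weights[w] * local_grads[w] for w in ready_workers) / total

    # Example: 4 workers, but only workers 0, 2, 3 are ready; worker 1 straggles.
    grads = {w: np.full(3, float(w)) for w in range(4)}
    lag = {0: 0, 1: 5, 2: 1, 3: 2}
    print(partial_reduce(grads, ready_workers=[0, 2, 3], staleness=lag))

In the paper's setting, several such subgroup reductions can proceed in parallel and asynchronously, while the group generation mechanism ensures no worker's updates stay isolated from the rest of the cluster.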