Gandiva: Introspective Cluster Scheduling For Deep Learning

Authors:
Wencong Xiao, Microsoft Research Asia
Romil Bhardwaj, Microsoft Research India
Ramachandran Ramjee, Microsoft Research India
Muthian Sivathanu, Microsoft Research India
Nipun Kwatra, Microsoft Research India
Zhenhua Han, Microsoft Research Asia
Pratyush Patel, Microsoft Research India
Xuan Peng, Microsoft Research Asia
Hanyu Zhao, Microsoft Research Asia
Quanlu Zhang, Microsoft Research Asia
Fan Yang, Microsoft Research Asia
Lidong Zhou, Microsoft Research Asia

Introduction:

The authors introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific knowledge to improve the latency and efficiency of training deep learning models in a GPU cluster.

Abstract:

We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific knowledge to improve latency and efficiency of training deep learning models in a GPU cluster. One key characteristic of deep learning is feedback-driven exploration, where a user often runs a set of jobs (or a multi-job) to achieve the best result for a specific mission and uses early feedback on accuracy to dynamically prioritize or kill a subset of jobs; simultaneous early feedback on the entire multi-job is critical. A second characteristic is the heterogeneity of deep learning jobs in terms of resource usage, making it hard to achieve best-fit a priori. Gandiva addresses these two challenges by exploiting a third key characteristic of deep learning: intra-job predictability, as they perform numerous repetitive iterations called mini-batch iterations. Gandiva exploits intra-job predictability to time-slice GPUs efficiently across multiple jobs, thereby delivering low-latency. This predictability is also used for introspecting job performance and dynamically migrating jobs to better-fit GPUs, thereby improving cluster efficiency. We show via a prototype implementation and micro-benchmarks that Gandiva can speed up hyper-parameter searches during deep learning by up to an order of magnitude, and achieves better utilization by transparently migrating and time-slicing jobs to achieve better job-to-resource fit. We also show that, in a real workload of jobs running in a 180-GPU cluster, Gandiva improves aggregate cluster utilization by 26%, pointing to a new way of managing large GPU clusters for deep learning.
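The abstract's central mechanism is time-slicing GPUs across jobs by suspending and resuming them at mini-batch boundaries, where predictability makes a switch cheap. The following is a minimal illustrative sketch of that idea, not Gandiva's actual implementation: the names `training_job`, `time_slice`, and `quantum_minibatches` are hypothetical, and jobs are modeled as Python generators that yield at each mini-batch boundary rather than as real GPU workloads.

```python
# Illustrative sketch (assumed, not Gandiva's code): round-robin time-slicing
# of training jobs at mini-batch boundaries, the point at which suspend/resume
# is cheapest because transient GPU state is minimal.

from collections import deque


def training_job(name, total_minibatches):
    """Model of a DL training job that yields control after every mini-batch."""
    for step in range(total_minibatches):
        # ... one mini-batch of forward/backward/update would run here ...
        yield f"{name}: finished mini-batch {step + 1}/{total_minibatches}"


def time_slice(jobs, quantum_minibatches=2):
    """Run each job for a quantum of mini-batches, suspend it at the
    mini-batch boundary, and switch to the next job in round-robin order."""
    queue = deque(jobs)
    while queue:
        job = queue.popleft()
        for _ in range(quantum_minibatches):
            try:
                print(next(job))
            except StopIteration:
                break  # job finished; do not requeue it
        else:
            queue.append(job)  # quantum expired; requeue the suspended job


if __name__ == "__main__":
    time_slice([training_job("jobA", 5), training_job("jobB", 3)])
```

In this toy model, interleaved output from jobA and jobB stands in for the early, simultaneous feedback on a multi-job that the abstract describes; the paper's scheduler additionally uses the same mini-batch introspection to decide when to migrate a job to a better-fitting GPU.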
