Job Placement Strategy with Opportunistic Resource Sharing for Distributed Deep Learning Clusters

2020 
Distributed deep learning frameworks train large deep learning workloads as multiple training jobs on shared distributed GPU servers, which raises new challenges for resource scheduling. Modern deep learning training jobs tend to consume large amounts of GPU memory, and their iterative nature causes memory usage to fluctuate over time. Jobs sharing a host may therefore suffer significant performance degradation from memory overload at runtime. Moreover, even without memory overload, deep learning training jobs experience different levels of performance interference when sharing a GPU device. This paper studies these two issues. We introduce an opportunistic memory sharing model to allocate resources for training jobs with time-varying memory requirements. Based on this model, we formulate the Opportunistic Job Placement Problem (OJPP) for shared GPU clusters, which seeks placement configurations that use the minimum number of GPU devices while guaranteeing user-defined performance requirements. We propose a greedy algorithm and a heuristic algorithm, with computational complexities of $O(n\log n)$ and $O(n^{2}\log n)$ respectively, to solve the problem. Extensive experiments on a GPU cluster verify the correctness, effectiveness, and scalability of our approaches. The proposed approach achieves over 80% of standalone performance, in terms of average job completion time, with less than 30% extra resource consumption.
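
The abstract does not spell out the placement algorithms, so the following is only a minimal first-fit-decreasing style sketch of how a greedy memory-aware placement could look. The `Job` class, the `peak_mem` field, the `gpu_capacity` parameter, and the demo job names are assumptions made for illustration; the paper's actual greedy algorithm and OJPP formulation additionally model time-varying memory usage and user-defined performance requirements.

```python
# Hypothetical sketch, NOT the paper's algorithm: greedy first-fit-decreasing
# placement of training jobs onto GPUs subject to a per-device memory cap.
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    peak_mem: float  # assumed estimate of peak GPU memory demand (GB)


def greedy_placement(jobs: List[Job], gpu_capacity: float) -> List[List[Job]]:
    """Pack jobs onto as few GPUs as possible without exceeding gpu_capacity.

    The sort costs O(n log n); the naive linear scan over open GPUs here can
    add more, whereas the paper states its greedy algorithm runs in
    O(n log n) overall.
    """
    gpus: List[List[Job]] = []   # jobs assigned to each GPU
    free: List[float] = []       # remaining memory on each GPU

    # Consider the most memory-hungry jobs first (first-fit decreasing).
    for job in sorted(jobs, key=lambda j: j.peak_mem, reverse=True):
        for i, room in enumerate(free):
            if job.peak_mem <= room:
                gpus[i].append(job)
                free[i] -= job.peak_mem
                break
        else:
            # No existing GPU can host the job; open a new one.
            gpus.append([job])
            free.append(gpu_capacity - job.peak_mem)
    return gpus


if __name__ == "__main__":
    demo = [Job("resnet50", 10.0), Job("bert-base", 14.0), Job("lstm", 6.0)]
    for i, placed in enumerate(greedy_placement(demo, gpu_capacity=16.0)):
        print(f"GPU {i}: {[j.name for j in placed]}")
```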