Detailed Performance Analysis of Distributed Tensorflow on a GPU Cluster using Deep Learning Algorithms

2018 
Long training times for building high-accuracy deep neural networks (DNNs) are impeding research into new DNN architectures. For example, training GoogLeNet on the ImageNet dataset takes almost 25 days on a single Nvidia K20 GPU. There is therefore a great need in the AI community to speed up the training phase, especially when using a large dataset. This calls for distributed deep neural networks (DDNNs) that can scale well with more computation resources. However, this involves two challenges. First, the deep learning framework or training library must support inter-node communication. Second, the user must modify the code to take advantage of that communication; the required changes range from minimal to significant depending on the user's expertise in distributed systems. Current DNN frameworks support distributed learning using MPI, but they come with poorly understood overheads associated with communication and data management. TensorFlow provides APIs for distributed learning based on the MPI programming model and gRPC, but these APIs are not easy for a domain expert to use when designing an efficient distributed learning model. Recently, Uber Inc. released the Horovod framework, which provides a fast and easy way to support distributed learning with TensorFlow, PyTorch, and Keras. In this paper we provide a detailed performance analysis of distributed TensorFlow using Horovod. We implemented distributed learning for AlexNet, GoogLeNet, and ResNet-50 using Horovod, ran our experiments on Nvidia K40, K80, and P100 GPUs, and used synthetic image data with different runtime variables (batch size and number of GPUs). Our results show that the Horovod framework gives almost linear throughput (images/sec) scaling up to 256 GPUs.
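To illustrate the minimal code changes the abstract alludes to, the sketch below shows the typical pattern for distributed training with Horovod and TF1-era TensorFlow (as used in 2018). This is a generic illustration under assumed placeholder model and data, not the paper's AlexNet/GoogLeNet/ResNet-50 benchmark code.

```python
# Minimal Horovod + TensorFlow (TF1-style) sketch: one process per GPU,
# gradients averaged across workers via allreduce. The dense-layer model
# and synthetic tensors are placeholders, not the paper's benchmarks.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # initialize Horovod

# Pin each process to its own local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Placeholder model over synthetic image-sized data.
features = tf.random_normal([64, 224 * 224 * 3])
labels = tf.random_uniform([64], maxval=1000, dtype=tf.int32)
logits = tf.layers.dense(features, 1000)
loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)

# Scale the learning rate by the worker count, then wrap the optimizer
# so gradient averaging happens transparently during minimize().
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast initial variables from rank 0 so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(100):
        sess.run(train_op)
```

Launched with, for example, `horovodrun -np 4 python train.py` (the script name is hypothetical), this starts one process per GPU and averages gradients with ring-allreduce, which is the communication mechanism behind the near-linear scaling the paper reports.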