Time-Based Roofline for Deep Learning Performance Analysis

2020 
Deep learning applications based on neural networks are generating considerable interest in various fields due to their high accuracy. Such applications are usually very compute-intensive and thus require long run times. Researchers and engineers are actively exploring new solutions to this issue from both the hardware and the software/algorithm sides. However, little previous work has focused on providing a practical methodology to characterize deep learning performance bottlenecks and guide subsequent optimization efforts. In this paper, we introduce an extension of the Roofline model and use it to analyze two representative computation kernels in deep learning, 2D convolution and long short-term memory, on NVIDIA GPUs. This new time-based Roofline model incorporates both compute/bandwidth complexity and run time in its formulae to expose performance issues that the classic Roofline cannot capture. Factors such as arithmetic intensity, data transfer, kernel launch overhead, and Tensor Core usage are examined by varying parameters such as batch size and feature size. This work helps form a more systematic way to understand the performance issues of deep learning applications. Last but not least, this generic performance model can also be applied to a wide range of applications beyond deep learning.
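As a rough illustration of the idea (a minimal sketch, not the paper's exact formulation), the classic Roofline bound min(P_peak, I x B_peak) can be restated in terms of time: a kernel's run time is bounded below by max(FLOPs / P_peak, bytes / B_peak). The Python sketch below assumes hypothetical peak throughput, peak bandwidth, and kernel measurements; the function name and all numeric values are illustrative placeholders, not values from the paper.

    # Minimal sketch of a time-based Roofline estimate (illustrative only).
    # All peak numbers and kernel measurements are hypothetical placeholders.

    def time_based_roofline(flops, bytes_moved, runtime_s,
                            peak_flops=9.7e12, peak_bw=9.0e11):
        """Compare a kernel's measured run time against its Roofline time bound.

        flops       : floating-point operations executed by the kernel
        bytes_moved : bytes transferred to/from memory
        runtime_s   : measured kernel run time in seconds
        peak_flops  : peak compute throughput in FLOP/s (hypothetical value)
        peak_bw     : peak memory bandwidth in bytes/s (hypothetical value)
        """
        t_compute = flops / peak_flops      # time if compute-bound at peak
        t_memory = bytes_moved / peak_bw    # time if bandwidth-bound at peak
        t_bound = max(t_compute, t_memory)  # classic Roofline bound expressed as time
        return {
            "arithmetic_intensity": flops / bytes_moved,
            "time_bound_s": t_bound,
            "measured_s": runtime_s,
            "fraction_of_roofline": t_bound / runtime_s,
        }

    # Example: a hypothetical 2D convolution kernel
    print(time_based_roofline(flops=2.0e9, bytes_moved=5.0e7, runtime_s=1.2e-3))

In this time-centric view, the gap between the measured run time and the time bound directly shows how far a kernel is from its Roofline limit, which is the kind of information the paper uses to diagnose issues such as kernel launch overhead and underutilized Tensor Cores.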