Data Movement Is All You Need: A Case Study of Transformer Networks
2020
Transformer neural networks have become widely used for language modeling and
sequence learning tasks, and are one of the most important machine learning
workloads today. Training one is a very compute-intensive task, often taking
days or weeks, and significant attention has been given to optimizing
transformers. Despite this, existing implementations do not efficiently utilize
GPUs. We find that data movement is the key bottleneck when training. Due to
Amdahl's Law and massive improvements in compute performance, training has now
become memory-bound. Further, existing frameworks use suboptimal data layouts.
Using these insights, we present a recipe for globally optimizing data movement
in transformers. We reduce data movement by up to 22.91% and overall achieve a
1.30x performance improvement over state-of-the-art frameworks when training
BERT. Our approach is applicable more broadly to optimizing deep neural
networks, and offers insight into how to tackle emerging performance
bottlenecks.
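
To illustrate the memory-bound claim, here is a minimal roofline-style sketch (not from the paper): it compares the arithmetic intensity of a large matrix multiply against a pointwise operator and the GPU's flop/byte ridge point. The GPU figures are assumptions, roughly a V100 (~125 TFLOP/s half-precision tensor cores, ~900 GB/s HBM2); operator shapes are illustrative.

```python
# Back-of-the-envelope roofline check: operators whose arithmetic intensity
# falls below the GPU's flop/byte ratio are memory-bound.

PEAK_FLOPS = 125e12           # flop/s, assumed (V100-class tensor cores)
PEAK_BW = 900e9               # bytes/s, assumed (HBM2 bandwidth)
RIDGE = PEAK_FLOPS / PEAK_BW  # flop/byte needed to become compute-bound

def gemm_intensity(m, n, k, bytes_per_elem=2):
    """Arithmetic intensity of an m x k by k x n matrix multiply."""
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # read A, B; write C
    return flops / traffic

def pointwise_intensity(n_elems, flops_per_elem=1, bytes_per_elem=2):
    """Arithmetic intensity of an elementwise op (e.g., bias add, GELU)."""
    flops = flops_per_elem * n_elems
    traffic = 2 * bytes_per_elem * n_elems  # read input; write output
    return flops / traffic

print(f"ridge point:     {RIDGE:8.1f} flop/byte")
print(f"GEMM (4096^3):   {gemm_intensity(4096, 4096, 4096):8.1f} flop/byte")
print(f"pointwise op:    {pointwise_intensity(4096 * 4096):8.3f} flop/byte")
```

Under these assumptions the large GEMM sits well above the ridge point while the pointwise operator sits far below it, which is why fusing such operators and avoiding redundant reads and writes dominates the optimization recipe described above.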