RealFormer: Transformer Likes Residual Attention

2021 
Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple Residual Attention Layer Transformer architecture that significantly outperforms canonical Transformers on a spectrum of tasks including Masked Language Modeling, GLUE, and SQuAD. Qualitatively, RealFormer is easy to implement and requires minimal hyper-parameter tuning. It also stabilizes training and leads to models with sparser attention. Code will be open-sourced upon paper acceptance.
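
The abstract does not spell out the mechanism, but the "residual attention" in the title refers to carrying each layer's raw (pre-softmax) attention scores forward and adding them to the next layer's scores through a skip connection. The sketch below illustrates that idea in PyTorch; the class name, single-head formulation, and shapes are illustrative assumptions, not the authors' reference implementation.

```python
# A minimal sketch of residual attention, assuming the skip connection is
# applied to the pre-softmax attention scores between consecutive layers.
import math
import torch
import torch.nn as nn


class ResidualSelfAttention(nn.Module):
    """Single-head self-attention that accepts and returns raw attention scores."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x, prev_scores=None):
        # Raw (pre-softmax) attention scores for this layer.
        scores = self.q(x) @ self.k(x).transpose(-2, -1) * self.scale
        # Residual attention: add the previous layer's raw scores, if any.
        if prev_scores is not None:
            scores = scores + prev_scores
        attn = scores.softmax(dim=-1)
        out = attn @ self.v(x)
        # Return the pre-softmax scores so the next layer can reuse them.
        return out, scores


# Usage: thread the scores through a stack of layers.
layers = nn.ModuleList(ResidualSelfAttention(64) for _ in range(4))
x, scores = torch.randn(2, 16, 64), None
for layer in layers:
    x, scores = layer(x, scores)
```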