Transformer-based Arabic Dialect Identification
This paper presents a dialect identification (DID) system based on the transformer neural network architecture. Conventional convolutional neural network (CNN)-based systems use relatively short receptive fields. We believe that long-range information is equally important for language identification and DID, and the self-attention mechanism in the transformer captures such long-range dependencies. In addition, to reduce computational complexity, self-attention with downsampling is used to process the acoustic features. This process extracts sparse yet informative features. Our experimental results show that the transformer outperforms CNN-based networks on the Arabic dialect identification (ADI) dataset. We also report that the score-level fusion of the CNN- and transformer-based systems achieves an overall accuracy of 86.29% on the ADI17 database.
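The abstract does not specify the exact downsampling scheme, but one common way to realize "self-attention with downsampling" is to take queries from a strided subsampling of the input while keys and values span the full sequence, cutting the attention cost from O(T²) to O(T²/stride). The sketch below is a minimal NumPy illustration of this idea under that assumption; the function and parameter names are hypothetical, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def downsampled_self_attention(X, Wq, Wk, Wv, stride=2):
    """Self-attention where queries come from a strided (downsampled) view
    of the input, so only T/stride query positions attend over all T frames.
    X: (T, d) acoustic feature sequence."""
    Q = X[::stride] @ Wq                      # (T/stride, d) downsampled queries
    K = X @ Wk                                # (T, d) keys over the full sequence
    V = X @ Wv                                # (T, d) values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T/stride, T) scaled dot products
    A = softmax(scores, axis=-1)              # attention weights
    return A @ V                              # (T/stride, d) shorter, summarized output

# Toy usage: 8 frames of 4-dim features, downsampled by 2.
rng = np.random.default_rng(0)
T, d = 8, 4
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Y = downsampled_self_attention(X, Wq, Wk, Wv, stride=2)
print(Y.shape)  # (4, 4)
```

Because each retained query position still attends over every input frame, the output sequence is shorter (hence cheaper for subsequent layers) while each output vector can still integrate long-range context.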