Structured Sparse Attention for end-to-end Automatic Speech Recognition

2020 
The softmax-based attention mechanism is widely used in end-to-end automatic speech recognition (E2E ASR) models to tell the network where to focus within the input. However, because the softmax output is dense and strictly positive, the attention distribution becomes increasingly flat as the input sequence grows longer, which prevents it from highlighting the important information in speech. In this paper, we present two sparse attention mechanisms for ASR tasks with long utterances, which improve the attention mechanism by introducing sparse transformations. First, we propose to replace the softmax with Sparsemax, which normalizes the attention weights by projecting the scores onto the closest point in the probability simplex. Second, exploiting the structured nature of speech, in which each pronunciation unit has a relatively stable duration, we present a structured sparse transformation that forces the network to attend to a continuous segment of speech by applying an ℓ2 penalty. We also design a noniterative solution algorithm that can be used in backpropagation. Experiments show that our methods achieve better ASR results than a well-tuned attention-based baseline system on a character-level ASR task.
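The Sparsemax transformation mentioned above has a closed-form solution: sort the scores, find the support size, and threshold. A minimal NumPy sketch of this projection is shown below (function and variable names are illustrative, not taken from the paper's code; the structured ℓ2-penalized variant and its noniterative solver are paper-specific and not reproduced here):

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability
    simplex. Unlike softmax, the result can contain exact zeros, so the
    attention weight concentrates on a few input frames.
    """
    z = np.asarray(z, dtype=np.float64)
    z_sorted = np.sort(z)[::-1]          # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    # support size: largest k with 1 + k * z_(k) > sum of the top-k scores
    support = k[1 + k * z_sorted > cumsum]
    k_z = support[-1]
    tau = (cumsum[k_z - 1] - 1.0) / k_z  # threshold for the projection
    return np.maximum(z - tau, 0.0)

# Example: one score dominates, so the smallest score is zeroed out.
p = sparsemax([1.0, 0.5, -1.0])
print(p)  # [0.75, 0.25, 0.0] -- a valid, sparse probability vector
```

In an attention layer, `z` would be the (possibly scaled) alignment scores over encoder frames; replacing the usual softmax call with this projection yields weights that are exactly zero outside a small support, which is the behavior the paper exploits for long utterances.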