An Attention Based Speaker-Independent Audio-Visual Deep Learning Model for Speech Enhancement

2020 
Speech enhancement aims to improve speech quality in noisy environments. While most speech enhancement methods use only audio data as input, incorporating video information can achieve better results. In this paper, we present an attention-based speaker-independent audio-visual deep learning model for single-channel speech enhancement. We apply both time-wise attention and spatial attention in the video feature extraction module to focus on the more important features. Audio and video features are then concatenated along the time dimension to form the audio-visual features. The proposed video feature extraction module can be attached to an audio-only model without extensive modifications. Experiments show that the proposed method achieves better results than recent audio-visual speech enhancement methods.
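Below is a minimal sketch, not the authors' implementation, of the fusion idea described in the abstract: video features are reweighted by spatial attention within each frame and by time-wise attention across frames, then concatenated with per-frame audio features along the feature axis. All module names, tensor shapes, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AttentionVideoBranch(nn.Module):
    """Hypothetical video branch with spatial and time-wise attention (assumed shapes)."""

    def __init__(self, channels: int, feat_dim: int):
        super().__init__()
        # Spatial attention: one weight per spatial location of each frame.
        self.spatial_attn = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        # Project pooled frame features to a fixed feature dimension.
        self.project = nn.Linear(channels, feat_dim)
        # Time-wise attention: one weight per frame.
        self.time_attn = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, channels, height, width)
        b, t, c, h, w = video.shape
        frames = video.reshape(b * t, c, h, w)
        weights = self.spatial_attn(frames)             # (b*t, 1, h, w)
        pooled = (frames * weights).mean(dim=(2, 3))    # attention-weighted spatial pooling -> (b*t, c)
        feats = self.project(pooled).reshape(b, t, -1)  # (b, t, feat_dim)
        alpha = self.time_attn(feats)                   # (b, t, 1) per-frame weights
        return feats * alpha                            # emphasize informative frames


def fuse(audio_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
    """Concatenate audio and video features frame by frame (same number of time steps assumed)."""
    # audio_feats: (batch, time, audio_dim), video_feats: (batch, time, video_dim)
    return torch.cat([audio_feats, video_feats], dim=-1)
```

The concatenated audio-visual features would then feed the enhancement network in place of audio-only features, which is consistent with the claim that the video branch can be attached to an audio-only model without extensive modifications.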