A Robust Audio-Visual Speech Enhancement Model

2020 
Most existing audio-visual speech enhancement (AVSE) methods work well under strong noise, but serious performance degradation is often observed when they are applied to conditions with medium SNRs. This degradation can be partly attributed to feature-fusion (e.g., early-fusion) architectures that tightly couple the strong audio information with the relatively weak visual information. In this paper, we present a safe AVSE approach in which the visual stream contributes to audio-only speech enhancement (ASE) via late fusion across a wide range of SNRs. The key novelty is two-fold. First, we define power binary masks (PBMs) as a rough representation of speech signals; this rough representation acknowledges the weakness of the visual information and so can be predicted reliably from the visual stream. Second, we design a posterior augmentation architecture that integrates the visual-derived PBMs into the audio-derived masks via a gating network, so that the overall performance is lower-bounded by the audio-based component. Experiments on the Grid dataset demonstrate that the new approach consistently outperforms the audio-only system in all noise conditions, confirming that it is a safe way to incorporate visual knowledge into speech enhancement.
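
To make the two ingredients concrete, the sketch below illustrates one plausible reading of the abstract: a PBM obtained by thresholding time-frequency power, and a gating network that blends the visual-derived PBM into the audio-derived mask. This is a minimal illustration under assumptions, not the paper's implementation; the thresholding rule, network shapes, and names (power_binary_mask, GatedMaskFusion) are hypothetical.

```python
import torch
import torch.nn as nn


def power_binary_mask(power_spec, threshold_db=-40.0):
    """Hypothetical PBM: mark time-frequency bins whose log power lies within
    threshold_db of the utterance maximum as speech-dominant (1), else 0."""
    log_power = 10.0 * torch.log10(power_spec + 1e-10)
    ref = log_power.amax(dim=(-2, -1), keepdim=True)
    return (log_power > ref + threshold_db).float()


class GatedMaskFusion(nn.Module):
    """Late-fusion sketch: a gate decides, per time-frequency bin, how much the
    visual-derived PBM may adjust the audio-derived mask."""

    def __init__(self, num_freq_bins):
        super().__init__()
        # The gate sees both masks and outputs a weight in [0, 1] per bin.
        self.gate = nn.Sequential(
            nn.Linear(2 * num_freq_bins, num_freq_bins),
            nn.Sigmoid(),
        )

    def forward(self, audio_mask, visual_pbm):
        # audio_mask, visual_pbm: (batch, time, freq)
        g = self.gate(torch.cat([audio_mask, visual_pbm], dim=-1))
        # Convex combination: when g -> 0 the output reduces to the audio-only
        # mask, which is what lower-bounds performance by the audio component.
        return (1.0 - g) * audio_mask + g * visual_pbm


# Usage sketch with dummy shapes (batch=1, 100 frames, 257 frequency bins).
fusion = GatedMaskFusion(num_freq_bins=257)
audio_mask = torch.rand(1, 100, 257)
visual_pbm = torch.randint(0, 2, (1, 100, 257)).float()
fused_mask = fusion(audio_mask, visual_pbm)  # (1, 100, 257)
```

Because the gate can drive the visual contribution toward zero in any bin, the fused mask can always fall back to the audio-only mask, which is one way the "safe" lower-bound behaviour described in the abstract could be realized.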