Two Stage Audio-Video Speech Separation using Multimodal Convolutional Neural Networks

2019 
The performance of audio-only, neural-network-based monaural speech separation methods is still limited, particularly when multiple speakers are active. A recent method [1] uses an audio-video (AV) model to learn the non-linear relationship between the noisy mixture and the desired speech signal. However, the AV model is prone to over-fitting during training, which limits its separation performance. To address this limitation, we propose a system with two sequentially trained AV models to separate the desired speech signal. In the proposed system, after the first AV model is trained, its output is used to calculate the training target of the second AV model, which further improves the separation performance. The GRID audiovisual sentence corpus is used to generate the training and testing datasets. Results measured by signal-to-distortion ratio (SDR) and short-time objective intelligibility (STOI) show that the proposed system outperforms the state-of-the-art method.
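The abstract describes the two-stage scheme only at a high level. Below is a minimal PyTorch sketch of one plausible reading: each stage is a mask-estimating AV network, stage 1 is trained on the raw mixture, and the frozen stage-1 output then defines the stage-2 training data. The module shapes, the time-frequency masking objective, and the exact way the stage-1 output enters stage 2 are all illustrative assumptions; the abstract does not specify them.

```python
# Minimal two-stage AV separation sketch. All architecture and target
# choices below are assumptions for illustration, not the paper's design.
import torch
import torch.nn as nn

class AVSeparator(nn.Module):
    """Toy audio-visual separator: fuses audio and video features and
    predicts a time-frequency mask for the target speaker."""
    def __init__(self, n_freq=257, n_vis=128, hidden=256):
        super().__init__()
        self.audio_net = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        self.video_net = nn.Sequential(nn.Linear(n_vis, hidden), nn.ReLU())
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq), nn.Sigmoid(),  # mask in [0, 1]
        )

    def forward(self, audio_mag, vis_feat):
        a = self.audio_net(audio_mag)                  # (B, T, hidden)
        v = self.video_net(vis_feat)                   # (B, T, hidden)
        return self.fusion(torch.cat([a, v], dim=-1))  # (B, T, n_freq)

def train(model, batches, make_input, make_target, epochs=5):
    """Train one stage; make_input/make_target define that stage's data."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for mix_mag, vis_feat, clean_mag in batches:
            x = make_input(mix_mag, vis_feat)
            est = model(x, vis_feat) * x               # masked estimate
            loss = mse(est, make_target(mix_mag, vis_feat, clean_mag))
            opt.zero_grad()
            loss.backward()
            opt.step()

# Dummy random batch (mixture, video, clean magnitudes) to exercise the code.
batches = [(torch.rand(4, 50, 257), torch.rand(4, 50, 128),
            torch.rand(4, 50, 257))]

stage1 = AVSeparator()
stage2 = AVSeparator()

# Stage 1: estimate clean-speech magnitudes directly from the mixture.
train(stage1, batches,
      make_input=lambda mix, vis: mix,
      make_target=lambda mix, vis, clean: clean)

# Stage 2: the frozen stage-1 output defines the second stage's data;
# here it is read as stage 2 refining the stage-1 estimate toward clean speech.
def stage1_estimate(mix_mag, vis_feat):
    with torch.no_grad():
        return stage1(mix_mag, vis_feat) * mix_mag

train(stage2, batches,
      make_input=lambda mix, vis: stage1_estimate(mix, vis),
      make_target=lambda mix, vis, clean: clean)
```

Freezing stage 1 while training stage 2 mirrors the sequential training described in the abstract; in practice the stage-1 outputs could be cached offline rather than recomputed per batch.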