MuSE: Multi-modal target speaker extraction with visual cues.

2020 
A speaker extraction algorithm relies on reference speech to focus its attention on a target speaker. The reference speech is typically pre-registered as a speaker embedding. We believe that temporal synchronization between speech and lip movement is a useful cue, and that the target speaker embedding is equally important. Motivated by this belief, we study a novel technique that uses visual cues as the reference to extract the target speaker embedding, without the need for pre-registered reference speech. We propose a multi-modal speaker extraction network, named MuSE, that is conditioned only on a lip image sequence for target speaker extraction. MuSE not only improves over the AV-ConvTasnet baseline in terms of SI-SDR and PESQ, but also shows superior robustness in cross-domain evaluations.