Towards reliability-guided information integration in audio-visual speech recognition

2021 
Audio-visual speech recognition can improve the recognition rate in many small-vocabulary tasks. For large vocabularies, however, difficulties such as unsatisfactory lipreading accuracy make it hard to improve the recognition rate over audio-only baselines. In this work, we propose a new fusion strategy that fuses the state posteriors of separate stream recognizers through a bidirectional LSTM network. The proposed strategy outperforms all baselines as well as oracle dynamic stream-weighting, which gives a theoretical upper bound for dynamic stream-weighting approaches. The proposed system achieves a relative word error rate reduction of 42.18% compared to the audio-only setup and 34.73% compared to the non-oracle dynamic stream-weighting baseline.
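To make two quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows how a relative word error rate reduction is computed, and one common log-linear formulation of stream weighting for combining per-frame state posteriors from an audio and a video recognizer. The function names, the log-linear form, and all numeric values are illustrative assumptions; the abstract does not give the exact weighting formula, and the proposed system replaces such weighting with a bidirectional LSTM.

```python
import math


def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent: 100 * (baseline - system) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


def stream_weighted_posteriors(audio_post, video_post, weight):
    """Fuse one frame of state posteriors from two streams.

    Assumed (hypothetical) formulation: a log-linear combination
        log p(s) ~ weight * log p_audio(s) + (1 - weight) * log p_video(s)
    followed by renormalization. `weight` in [0, 1] is the audio stream
    weight; in dynamic stream weighting it would vary from frame to frame.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    eps = 1e-12  # avoids log(0) for states with zero posterior
    log_scores = [
        weight * math.log(a + eps) + (1.0 - weight) * math.log(v + eps)
        for a, v in zip(audio_post, video_post)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_scores)
    unnormalized = [math.exp(s - peak) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Illustrative values only, not taken from the paper.
audio = [0.7, 0.2, 0.1]
video = [0.3, 0.6, 0.1]
fused = stream_weighted_posteriors(audio, video, weight=0.8)
```

A weight near 1 trusts the audio stream and a weight near 0 trusts the video stream. The oracle variant mentioned in the abstract is the upper bound for this family of approaches, which is why outperforming it suggests the learned fusion does more than reweight the two streams.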