Towards reliability-guided information integration in audio-visual speech recognition

Wentao Yu, Steffen Zeiler, Dorothea Kolossa

2021

Audio-visual speech recognition can improve the recognition rate in many small-vocabulary tasks. For large vocabularies, however, improving on audio-only baselines remains difficult, in part because lipreading accuracy is still unsatisfactory. In this work, we propose a new fusion strategy that fuses the state posteriors of separate stream recognizers through a bidirectional LSTM network. The proposed strategy outperforms all baselines as well as oracle dynamic stream-weighting, which gives a theoretical upper bound for dynamic stream-weighting approaches. The proposed system achieves a relative word error rate reduction of 42.18% compared to the audio-only setup and 34.73% compared to the non-oracle dynamic stream-weighting baseline.
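To make the stream-weighting baseline and the reported metric concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common log-linear form of dynamic stream-weighting, in which each frame's audio and video state log-posteriors are combined with a per-frame weight and then renormalized. The function names, the toy posteriors, and the weights are all hypothetical. The proposed bidirectional LSTM fusion is not shown here.

```python
import math


def fuse_log_posteriors(log_p_audio, log_p_video, audio_weights):
    """Log-linear stream weighting (assumed form, for illustration only).

    log_p_audio, log_p_video: one list per frame, each holding the
        log-posteriors over the same set of states.
    audio_weights: one weight in [0, 1] per frame; the video stream
        receives (1 - weight).
    Returns renormalized fused log-posteriors, one list per frame.
    """
    fused = []
    for frame_a, frame_v, lam in zip(log_p_audio, log_p_video, audio_weights):
        if not 0.0 <= lam <= 1.0:
            raise ValueError("stream weight must lie in [0, 1]")
        scores = [lam * a + (1.0 - lam) * v for a, v in zip(frame_a, frame_v)]
        # Renormalize with a numerically stable log-sum-exp.
        peak = max(scores)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        fused.append([s - log_z for s in scores])
    return fused


def relative_wer_reduction(wer_baseline, wer_system):
    """Relative word error rate reduction in percent."""
    return 100.0 * (wer_baseline - wer_system) / wer_baseline


# Hypothetical two-state, two-frame example.
audio = [[math.log(0.7), math.log(0.3)], [math.log(0.6), math.log(0.4)]]
video = [[math.log(0.2), math.log(0.8)], [math.log(0.5), math.log(0.5)]]
weights = [0.9, 0.5]  # hypothetical per-frame reliability of the audio stream
fused = fuse_log_posteriors(audio, video, weights)
```

In this form, a weight of 1 reduces to the audio-only posteriors and a weight of 0 to the video-only posteriors. An oracle variant would choose each frame's weight using knowledge of the correct state, which is why it serves as an upper bound for weighting schemes of this kind.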

Keywords:

Word error rate
Audio-visual speech recognition
Oracle
Reliability (computer networking)
Information integration
Reduction (complexity)
Speech recognition
Computer science
Upper and lower bounds
State (computer science)
