Crossmodal learning for audio-visual speech event localization

2020 
An objective understanding of media depictions, such as inclusive portrayals of how much someone is heard and seen on screen in film and television, requires machines to automatically discern who is talking, when, how, and where. Media content is rich in multiple modalities, such as visuals and audio, which can be used to learn speaker activity in videos. In this work, we present visual representations that carry implicit information about when and where someone is talking. We propose a crossmodal neural network for audio speech event detection using visual frames. We use the learned representations for two downstream tasks: i) audio-visual voice activity detection and ii) active speaker localization in video frames. We present a state-of-the-art audio-visual voice activity detection system and demonstrate that the learned embeddings can effectively localize active speakers in the visual frames.
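The core idea is crossmodal supervision: a visual-only encoder is trained to predict speech activity labels derived from the audio track, so its per-frame embeddings implicitly encode when and where someone is talking. The sketch below illustrates that training setup in PyTorch; it is not the paper's architecture, and all layer choices, sizes, and names (e.g. `VisualSpeechEncoder`) are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' architecture): a visual encoder
# trained with crossmodal supervision. It sees only video frames but is
# optimized to predict frame-level voice-activity labels obtained from audio.
import torch
import torch.nn as nn

class VisualSpeechEncoder(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # 3D convolutions over short clips of RGB frames: (B, 3, T, H, W)
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep time axis
        )
        self.project = nn.Linear(64, embed_dim)    # per-frame embeddings
        self.classifier = nn.Linear(embed_dim, 1)  # speech / non-speech logit

    def forward(self, clips):
        # clips: (batch, 3, T, H, W)
        feats = self.backbone(clips).squeeze(-1).squeeze(-1)  # (B, 64, T)
        emb = self.project(feats.transpose(1, 2))             # (B, T, D)
        logits = self.classifier(emb).squeeze(-1)             # (B, T)
        return emb, logits

# Toy training step: targets are audio-derived VAD labels per frame.
model = VisualSpeechEncoder()
clips = torch.randn(2, 3, 16, 112, 112)            # two 16-frame clips
audio_vad = torch.randint(0, 2, (2, 16)).float()   # audio VAD targets
emb, logits = model(clips)
loss = nn.functional.binary_cross_entropy_with_logits(logits, audio_vad)
loss.backward()
```

After training, `emb` could be reused for the two downstream tasks described above: fused with audio features for audio-visual voice activity detection, or inspected spatially (e.g. via activation maps before pooling) to localize the active speaker in the frame.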