Joint Event Detection and Description in Continuous Video Streams

2019 
Dense video captioning involves first localizing events in a video and then generating captions for the identified events. We present the Joint Event Detection and Description Network (JEDDi-Net), which solves this task in an end-to-end fashion: it encodes the input video stream with three-dimensional convolutional layers, proposes variable-length temporal events based on pooled features, and then uses a two-level hierarchical LSTM module with context modeling to transcribe the event proposals into captions. We show the effectiveness of the proposed JEDDi-Net on the large-scale ActivityNet Captions dataset.
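To make the described pipeline concrete, below is a minimal, hypothetical sketch of the three stages the abstract names: a 3-D convolutional encoder over the frame stream, pooling of encoder features inside each candidate temporal segment, and a two-level (controller + word) LSTM that carries context from previously captioned events into the next caption. All class names, layer sizes, and the hard-coded proposal list are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a JEDDi-Net-style pipeline; shapes and modules are assumptions.
import torch
import torch.nn as nn


class VideoEncoder3D(nn.Module):
    """Encode a clip of shape (B, C, T, H, W) into per-timestep features."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep temporal axis, pool out space
        )

    def forward(self, clip):
        f = self.conv(clip)                  # (B, feat_dim, T, 1, 1)
        return f.squeeze(-1).squeeze(-1)     # (B, feat_dim, T)


def pool_proposal(features, start, end):
    """Max-pool encoder features over one proposed temporal segment."""
    return features[:, :, start:end].max(dim=2).values  # (B, feat_dim)


class HierarchicalCaptioner(nn.Module):
    """Two-level LSTM: a controller cell summarizes cross-event context,
    a word-level LSTM produces caption token logits for the current event."""
    def __init__(self, feat_dim=256, hidden=512, vocab=1000):
        super().__init__()
        self.controller = nn.LSTMCell(feat_dim, hidden)
        self.word_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.project = nn.Linear(hidden, vocab)

    def forward(self, event_feat, ctrl_state, max_len=20):
        h, c = self.controller(event_feat, ctrl_state)   # update running context
        words_in = h.unsqueeze(1).repeat(1, max_len, 1)  # stand-in for token inputs
        out, _ = self.word_lstm(words_in)
        return self.project(out), (h, c)                 # token logits, new context


if __name__ == "__main__":
    clip = torch.randn(1, 3, 32, 64, 64)       # toy (B, C, T, H, W) input
    enc, cap = VideoEncoder3D(), HierarchicalCaptioner()
    feats = enc(clip)                           # (1, 256, 32)
    proposals = [(0, 12), (10, 32)]             # stand-in for a learned proposal module
    state = (torch.zeros(1, 512), torch.zeros(1, 512))
    for s, e in proposals:
        logits, state = cap(pool_proposal(feats, s, e), state)
        print(logits.shape)                     # (1, 20, 1000) caption logits per event
```

The key structural point the sketch illustrates is that the controller state is threaded across events, so each caption is generated with awareness of what has already been described.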