Joint Event Detection and Description in Continuous Video Streams

2019 
Dense video captioning involves first localizing events in a video and then generating captions for the identified events. We present the Joint Event Detection and Description Network (JEDDi-Net), which solves this task in an end-to-end fashion: it encodes the input video stream with three-dimensional convolutional layers, proposes variable-length temporal events based on pooled features, and then uses a two-level hierarchical LSTM module with context modeling to transcribe the event proposals into captions. We show the effectiveness of the proposed JEDDi-Net on the large-scale ActivityNet Captions dataset.
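To make the described pipeline concrete, below is a minimal, hypothetical sketch of the three stages the abstract names: a 3-D convolutional encoder over the frame stream, pooling of encoder features inside each candidate temporal segment, and a two-level (controller + word) LSTM that carries context from previously captioned events into the next caption. All class names, layer sizes, and the hard-coded proposal list are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a JEDDi-Net-style pipeline; shapes and modules are assumptions.
import torch
import torch.nn as nn


class VideoEncoder3D(nn.Module):
    """Encode a clip of shape (B, C, T, H, W) into per-timestep features."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep temporal axis, pool out space
        )

    def forward(self, clip):
        f = self.conv(clip)                  # (B, feat_dim, T, 1, 1)
        return f.squeeze(-1).squeeze(-1)     # (B, feat_dim, T)


def pool_proposal(features, start, end):
    """Max-pool encoder features over one proposed temporal segment."""
    return features[:, :, start:end].max(dim=2).values  # (B, feat_dim)


class HierarchicalCaptioner(nn.Module):
    """Two-level LSTM: a controller cell summarizes cross-event context,
    a word-level LSTM produces caption token logits for the current event."""
    def __init__(self, feat_dim=256, hidden=512, vocab=1000):
        super().__init__()
        self.controller = nn.LSTMCell(feat_dim, hidden)
        self.word_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.project = nn.Linear(hidden, vocab)

    def forward(self, event_feat, ctrl_state, max_len=20):
        h, c = self.controller(event_feat, ctrl_state)   # update running context
        words_in = h.unsqueeze(1).repeat(1, max_len, 1)  # stand-in for token inputs
        out, _ = self.word_lstm(words_in)
        return self.project(out), (h, c)                 # token logits, new context


if __name__ == "__main__":
    clip = torch.randn(1, 3, 32, 64, 64)       # toy (B, C, T, H, W) input
    enc, cap = VideoEncoder3D(), HierarchicalCaptioner()
    feats = enc(clip)                           # (1, 256, 32)
    proposals = [(0, 12), (10, 32)]             # stand-in for a learned proposal module
    state = (torch.zeros(1, 512), torch.zeros(1, 512))
    for s, e in proposals:
        logits, state = cap(pool_proposal(feats, s, e), state)
        print(logits.shape)                     # (1, 20, 1000) caption logits per event
```

The key structural point the sketch illustrates is that the controller state is threaded across events, so each caption is generated with awareness of what has already been described.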