Multi-view Surgical Video Action Detection via Mixed Global View Attention

2021 
Automatic surgical activity detection in the operating room (OR) can enable intelligent systems that lead to more efficient surgical workflows. Real-world deployments of video activity detection in the OR will most likely rely on multiple video feeds observing the environment from different viewpoints in order to handle occlusion and clutter, yet this multi-view setting remains under-explored, perhaps due to the lack of a suitable dataset. As our first contribution, we therefore introduce the first large-scale multi-view surgical action detection dataset, comprising over 120 temporally annotated robotic surgery operations, each recorded from 4 different viewpoints, for a total of 480 full-length surgical videos. As our second contribution, we design a novel model architecture that detects surgical actions by exploiting multiple time-synchronized videos with a shared field of view. We explore early, hybrid, and late fusion methods for combining data from the different views, and settle on a late fusion model that is insensitive to camera placement and to the order in which views are fed, improving over single-view performance by mixing views with an attention-style mechanism. The model learns to dynamically weight and fuse information across all views, and we demonstrate improvements in mean Average Precision across the board.
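The abstract does not spell out the fusion implementation, but the key properties it names (attention-style weighting, insensitivity to view order) can be illustrated with a minimal PyTorch sketch. The class name ViewAttentionFusion, the feature dimension, and the class count below are illustrative assumptions, not the authors' code; per-view features are assumed to come from some frame-level backbone.

    # Hypothetical sketch (not the authors' implementation): attention-style
    # late fusion of per-view features from time-synchronized cameras.
    import torch
    import torch.nn as nn

    class ViewAttentionFusion(nn.Module):
        """Fuses features from V synchronized views.

        Each view's weight is computed from its own features, and the
        aggregation is a softmax-weighted sum, so the result does not
        depend on the order in which views are fed.
        """
        def __init__(self, feat_dim: int, num_classes: int):
            super().__init__()
            self.score = nn.Linear(feat_dim, 1)        # per-view attention score
            self.classifier = nn.Linear(feat_dim, num_classes)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (batch, num_views, feat_dim)
            weights = torch.softmax(self.score(feats), dim=1)  # (B, V, 1)
            fused = (weights * feats).sum(dim=1)               # (B, feat_dim)
            return self.classifier(fused)                      # (B, num_classes)

    # Usage: 4 views (as in the dataset), assumed 512-d features, 10 classes.
    model = ViewAttentionFusion(feat_dim=512, num_classes=10)
    logits = model(torch.randn(2, 4, 512))  # batch of 2 clips
    print(logits.shape)  # torch.Size([2, 10])

Because the scoring network sees one view at a time and the weighted sum is permutation-invariant, permuting the view axis of the input leaves the output unchanged, which is one simple way to realize the order-insensitivity the abstract claims.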