Personalized Multi-modal Video Retrieval on Mobile Devices

2021 
Current video retrieval systems on mobile devices cannot process complex natural language queries, especially if they contain personalized concepts such as proper names. To address these shortcomings, we propose an efficient and privacy-preserving video retrieval system that works well with personalized queries containing proper names, without requiring re-training on personalized labelled data from users. Our system first computes an initial ranking of a video collection using a generic attention-based video-text matching model (i.e., a model designed for non-personalized queries), and then uses a face detector to make personalized adjustments to these initial rankings. These adjustments are made by reasoning jointly over the face information from the detector and the attention information provided by the generic model. We show that our system significantly outperforms existing keyword-based retrieval systems, and achieves performance comparable to the generic matching model fine-tuned on a large amount of labelled data. Our results suggest that the proposed system can effectively capture both the semantic context and the personalized information in queries.
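The two-stage re-ranking described above could be sketched as follows. This is a minimal illustration, not the paper's implementation: the score-combination rule, the `alpha` weight, and the function names are all assumptions; the paper only states that face information and the generic model's attention are reasoned over jointly.

```python
import numpy as np

def rerank_with_faces(base_scores, face_match_scores, name_attention, alpha=0.5):
    """Hypothetical personalized re-ranking (illustrative only).

    base_scores:       (N,) generic video-text matching score per video
    face_match_scores: (N,) similarity between the queried person's face and
                       faces detected in each video (0 if none detected)
    name_attention:    scalar in [0, 1], how strongly the generic model
                       attends to the proper-name token in the query
    alpha:             weight of the personalized adjustment (assumed)
    """
    base = np.asarray(base_scores, dtype=float)
    face = np.asarray(face_match_scores, dtype=float)
    # Weight the face evidence by the salience of the name in the query,
    # then adjust the generic ranking scores.
    adjusted = base + alpha * name_attention * face
    # Return video indices sorted from best to worst.
    return np.argsort(-adjusted)

# Toy example: video 2 has a weak text match but a strong face match,
# so it is promoted to the top of the ranking.
ranking = rerank_with_faces(
    base_scores=[0.8, 0.6, 0.5],
    face_match_scores=[0.0, 0.1, 0.9],
    name_attention=0.7,
)
# ranking -> [2, 0, 1]
```

Because the adjustment only needs the detector's face scores and the generic model's attention weights at query time, no personalized labelled data or re-training is involved, which is consistent with the privacy-preserving, on-device setting described in the abstract.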