Multimodal Entity Linking: A New Dataset and A Baseline

2021 
In this paper, we introduce a new Multimodal Entity Linking (MEL) task. MEL discovers entities in multiple modalities and various forms within large-scale multimodal data and maps multimodal mentions in a document to entities in a structured knowledge base such as Wikipedia. Unlike the conventional Neural Entity Linking (NEL) task, which focuses solely on textual information, MEL aims at human-level disambiguation among entities in images, texts, and knowledge bases. Because labeled data for the MEL task is scarce, we release a large-scale multimodal entity linking dataset, M3EL (short for MultiModal Movie Entity Linking). Specifically, we collect reviews and images of 1,100 movies, extract textual and visual mentions, and label them with entities registered in Wikipedia. In addition, we construct a new baseline method for the MEL problem that models the alignment of textual and visual mentions as a bipartite graph matching problem and solves it with an optimal-transport-based linking method. Extensive experiments on the M3EL dataset verify the quality of the dataset and the effectiveness of the proposed method. We hope this work will solicit further research effort and applications in multimodal computing and inference. The dataset and the baseline algorithm are publicly available at https://jingrug.github.io/research/M3EL.
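To make the optimal-transport formulation concrete, below is a minimal sketch of how textual-visual mention alignment could be cast as entropically regularized optimal transport and solved with Sinkhorn iterations. The embedding inputs, cosine-similarity cost, uniform marginals, and hyperparameters (`eps`, `n_iters`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sinkhorn_alignment(text_emb, img_emb, eps=0.1, n_iters=200):
    """Soft alignment between textual and visual mentions via entropic
    optimal transport (Sinkhorn). Illustrative sketch, not the paper's code.

    text_emb: (m, d) textual mention embeddings (assumed precomputed).
    img_emb:  (n, d) visual mention embeddings (assumed precomputed).
    Returns:  (m, n) transport plan; entry (i, j) is the alignment mass
              between textual mention i and visual mention j.
    """
    # Cost matrix: 1 - cosine similarity between mention embeddings.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    cost = 1.0 - t @ v.T

    m, n = cost.shape
    # Uniform marginals: each mention carries equal mass on its side.
    a = np.full(m, 1.0 / m)
    b = np.full(n, 1.0 / n)

    # Gibbs kernel and alternating Sinkhorn scaling updates.
    K = np.exp(-cost / eps)
    u = np.ones(m)
    w = np.ones(n)
    for _ in range(n_iters):
        u = a / (K @ w)
        w = b / (K.T @ u)
    return u[:, None] * K * w[None, :]


# Example: align 5 textual mentions with 3 visual mentions, then read off
# a hard matching by taking the argmax of each row of the transport plan.
rng = np.random.default_rng(0)
plan = sinkhorn_alignment(rng.normal(size=(5, 64)), rng.normal(size=(3, 64)))
hard_match = plan.argmax(axis=1)  # visual mention index per textual mention
```

The entropic regularization keeps the plan smooth and cheap to compute; when a hard bipartite matching is needed, row-wise argmaxes (as above) or a Hungarian-style rounding of the plan can recover one.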