|Yang Yang||Nanjing University|
|Yi-Feng Wu||LAMDA Group, Nanjing University|
|De-Chuan Zhan||Nanjing University|
|Yuan Jiang||Nanjing University|
This paper studies multi-modal representations. The authors propose a novel Multi-modal Multi-instance Multi-label Deep Network (M3DN), which learns label prediction and exploits label correlation simultaneously based on Optimal Transport.
In real-world applications, complex objects usually carry multiple labels and can be described by multiple modal representations; e.g., complex articles contain both text and image information and are associated with multiple annotations. Previous methods assume that the homogeneous multi-modal data are consistent, whereas in real applications the raw data are disordered, i.e., an article is composed of a variable number of inconsistent text and image instances. To solve this problem, Multi-modal Multi-instance Multi-label (M3) learning provides a framework for handling such tasks and has exhibited excellent performance. Besides, how to effectively utilize label correlation is also a challenging issue. In this paper, we propose a novel Multi-modal Multi-instance Multi-label Deep Network (M3DN), which learns the label prediction and exploits label correlation simultaneously based on Optimal Transport, by considering the consistency principle between the bag-level predictions of different modalities and the learned latent ground label metric. Experiments on benchmark datasets and the real-world WKG Game-Hub dataset validate the effectiveness of the proposed method.
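The abstract's use of Optimal Transport to compare bag-level predictions across modalities can be illustrated with a minimal sketch. The snippet below is not the paper's M3DN implementation; it is a generic entropic-regularized Sinkhorn iteration, where the cost matrix `M` stands in (as an assumption) for the learned latent ground label metric, and `p_text`/`p_image` are hypothetical bag-level label predictions from two modalities.

```python
import numpy as np

def sinkhorn(a, b, M, reg=0.1, n_iters=200):
    """Entropic-regularized optimal transport via Sinkhorn iterations.

    a, b : source/target label distributions (each sums to 1)
    M    : ground cost matrix between labels (stand-in for a learned
           label metric; here a fixed toy matrix)
    Returns the transport plan P and the transport cost <P, M>.
    """
    K = np.exp(-M / reg)                # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)               # scale columns to match b
        u = a / (K @ v)                 # scale rows to match a
    P = u[:, None] * K * v[None, :]     # transport plan
    return P, float(np.sum(P * M))

# Toy example: bag-level label predictions from two modalities
p_text  = np.array([0.6, 0.3, 0.1])
p_image = np.array([0.5, 0.2, 0.3])
M = 1.0 - np.eye(3)                     # 0/1 cost: no label correlation

P, cost = sinkhorn(p_text, p_image, M)
```

A smaller transport cost indicates that the two modal predictions are consistent under the given label metric; replacing the 0/1 cost with a learned metric lets correlated labels be matched more cheaply, which is the intuition the abstract appeals to.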