Matching images and texts with multi-head attention network for cross-media hashing retrieval

2021 
Abstract Cross-media hashing retrieval encodes multimedia data into a common binary hash space, in which the correlation between samples from different modalities can be measured efficiently. However, supervised cross-media retrieval methods require large amounts of manually labeled data, which makes them labor-intensive in practice, while most unsupervised methods fail to adequately preserve inter-modal and intra-modal correlations and therefore perform poorly. To address these problems and further improve retrieval performance, this paper proposes an unsupervised cross-media hashing retrieval method based on a multi-head attention network, which captures rich semantic information to better match images and texts. Specifically, we use a multi-head attention network to generate higher-quality binary hash codes, and we construct an auxiliary similarity matrix that integrates the original neighborhood information from the different modalities. The method can therefore capture potential inter-modal and intra-modal correlations. Because it is unsupervised and requires no additional semantic labels, it is well suited to large-scale cross-media retrieval. In addition, two optimization strategies, batch normalization and replacing the hash code generation function, are adopted, and two loss functions are designed; together they allow our method to outperform many supervised cross-media hashing retrieval methods. Experiments on three benchmark datasets show that our method substantially outperforms many state-of-the-art methods, demonstrating its effectiveness and superiority.
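
To make the architecture described in the abstract concrete, the sketch below shows one plausible way to combine a multi-head attention encoder with relaxed binary hash codes and an auxiliary cosine-similarity matrix over a batch. This is not the authors' code: the module names, feature and code dimensions, mean pooling, the tanh relaxation, and the cosine construction of the similarity matrix are all illustrative assumptions.

# Hypothetical sketch (assumed details, not the paper's implementation):
# a multi-head attention hashing head per modality plus an auxiliary
# batch-wise cosine-similarity matrix used to preserve neighborhood structure.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHashHead(nn.Module):
    def __init__(self, feat_dim=512, num_heads=8, code_len=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.fc = nn.Linear(feat_dim, code_len)

    def forward(self, x):                      # x: (batch, seq_len, feat_dim)
        attended, _ = self.attn(x, x, x)       # self-attention over feature tokens
        pooled = attended.mean(dim=1)          # aggregate tokens into one vector
        return torch.tanh(self.fc(pooled))     # relaxed hash codes in (-1, 1)

def cosine_similarity_matrix(feats):
    # Auxiliary neighborhood matrix: pairwise cosine similarity within a batch.
    feats = F.normalize(feats, dim=1)
    return feats @ feats.t()

# Usage: encode image and text features, binarize with sign() at retrieval time.
img_head, txt_head = AttentionHashHead(), AttentionHashHead()
img_feats = torch.randn(8, 10, 512)            # e.g., 10 region features per image
txt_feats = torch.randn(8, 10, 512)            # e.g., 10 word features per sentence
b_img, b_txt = img_head(img_feats), txt_head(txt_feats)
binary_img = torch.sign(b_img)                 # final binary codes
intra_sim = cosine_similarity_matrix(b_img)    # intra-modal similarity estimate

In a setup like this, the loss terms would push the relaxed codes of matching image-text pairs together while keeping the auxiliary similarity matrix consistent with the original feature neighborhoods; the exact losses used in the paper are not reproduced here.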