Deep Cross-Modal Image-Voice Retrieval in Remote Sensing

2020 
With the rapid progress of satellite and aircraft technologies, cross-modal remote sensing image–voice retrieval has recently attracted attention in the geosciences. However, two bottlenecks remain: how to adequately model the characteristics of remote sensing data, and how to reduce memory consumption and improve retrieval efficiency on large-scale remote sensing data. In this article, we propose a novel deep cross-modal remote sensing image–voice retrieval approach, named deep image–voice retrieval (DIVR), which captures richer information from remote sensing data and generates hash codes with low memory cost and fast retrieval. In particular, DIVR introduces an inception dilated convolution module to capture multiscale contextual information from both remote sensing images and voices. Moreover, to enhance cross-modal similarity, a deep-feature similarity term is designed that pulls paired similar deep features as close together as possible and pushes paired dissimilar deep features as far apart as possible. In addition, a quantization error term drives the continuous hash-like codes toward the binary hash codes, effectively reducing the quantization error in hash-code learning. Extensive experiments on three remote sensing image–voice data sets show that the proposed DIVR approach outperforms other cross-modal retrieval approaches.
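The abstract does not give the exact layout of the inception dilated convolution module, so the following is a minimal sketch, assuming parallel 3x3 branches with dilation rates 1, 2, and 3 whose outputs are concatenated along the channel dimension; the paper's actual branch structure and rates may differ.

```python
# Hedged sketch of an inception-style dilated convolution block.
# Branch layout and dilation rates (1, 2, 3) are assumptions, not the
# paper's confirmed configuration.
import torch
import torch.nn as nn

class InceptionDilatedConv(nn.Module):
    def __init__(self, in_channels: int, branch_channels: int, dilations=(1, 2, 3)):
        super().__init__()
        # One branch per dilation rate; padding = dilation keeps spatial size.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate multiscale responses along the channel dimension.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

if __name__ == "__main__":
    block = InceptionDilatedConv(in_channels=64, branch_channels=32)
    features = block(torch.randn(2, 64, 56, 56))
    print(features.shape)  # torch.Size([2, 96, 56, 56])
```

Larger dilation rates widen the receptive field without extra parameters or downsampling, which is how the parallel branches gather contextual information at multiple scales.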
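Likewise, the abstract names a deep-feature similarity term and a quantization error term but not their formulas. Below is a minimal sketch under assumed forms: the similarity term compares inner products of deep features against a {0, 1} pairwise similarity matrix S (a hypothetical name), and the quantization term penalizes the gap between continuous hash-like codes (tanh outputs in (-1, 1)) and their binarized sign codes. The paper's exact loss formulations may differ.

```python
# Hedged sketch of the two loss terms; the concrete forms are assumptions.
import torch
import torch.nn.functional as F

def similarity_loss(img_feat, voice_feat, S):
    """Pull paired similar features together, push dissimilar ones apart.

    img_feat, voice_feat: (n, d) L2-normalized deep features.
    S: (n, n) matrix with S[i, j] = 1 if pair (i, j) is similar, else 0.
    """
    sim = img_feat @ voice_feat.t()   # (n, n) cosine similarities
    # Similar pairs should approach +1, dissimilar pairs should approach -1.
    target = 2.0 * S - 1.0
    return F.mse_loss(sim, target)

def quantization_loss(hash_like):
    """Drive continuous hash-like codes toward binary {-1, +1} hash codes."""
    binary = torch.sign(hash_like)
    return F.mse_loss(hash_like, binary)

if __name__ == "__main__":
    n, d, bits = 8, 128, 64
    img = F.normalize(torch.randn(n, d), dim=1)
    voice = F.normalize(torch.randn(n, d), dim=1)
    S = (torch.rand(n, n) > 0.5).float()          # toy similarity matrix
    codes = torch.tanh(torch.randn(n, bits))       # hash-like codes
    loss = similarity_loss(img, voice, S) + quantization_loss(codes)
    print(loss.item())
```

Shrinking the quantization term means the tanh outputs sit close to ±1, so taking their sign at retrieval time loses little information, which is what enables compact binary codes with fast Hamming-distance search.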