Cross-modal Semantic Matching Generative Adversarial Networks for Text-to-Image Synthesis
2021
Synthesizing photo-realistic images based on text descriptions is a challenging image generation problem. Although many recent approaches have significantly advanced the performance of text-to-image generation, to guarantee semantic matchings between the text description and synthesized image remains very challenging. In this paper, we propose a new model, Cross-modal Semantic Matching Generative Adversarial Networks (CSM-GAN), to improve the semantic consistency between text description and synthesized image for a fine-grained text-to-image generation. Two new modules are proposed in CSM-GAN: Text Encoder Module (TEM) and Textual-Visual Semantic Matching Module (TVSMM). TVSMM is aimed at making \textcolor{red}{the distance of the pairs of synthesized image and its corresponding text description closer}, in global semantic embedding space, than those of mismatched pairs. This improves the semantic consistency and consequently, the generalizability of CSM-GAN. In TEM, we introduce Text Convolutional Neural Networks (Text\_CNNs) to capture and highlight local visual features in textual descriptions. Thorough experiments on two public benchmark datasets demonstrated the superiority of CSM-GAN over other representative state-of-the-art methods.
Keywords:
- Correction
- Source
- Cite
- Save
- Machine Reading By IdeaReader
0
References
1
Citations
NaN
KQI