XGPT: Cross-modal Generative Pre-Training for Image Captioning

Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Ming Zhou

2021

In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train text-to-image caption generators through three novel generation tasks: Image-conditioned Masked Language Modeling (IMLM), Image-conditioned Denoising Autoencoding (IDA), and Text-conditioned Image Feature Generation (TIFG). As a result, the pre-trained XGPT obtains new state-of-the-art results on benchmark datasets, including COCO Captions and Flickr30k Captions. We also use XGPT to generate image captions as data augmentation for the image retrieval task and achieve significant improvement on all recall metrics.

Keywords:

Natural language processing
Image (mathematics)
Noise reduction
Task
Speech recognition
Closed captioning
Computer science
Language model
Benchmark (computing)
Modal
Image retrieval
Artificial intelligence
