🤖 AI Summary
To address the semantic misalignment that arises in zero-shot image captioning when no image-text pair supervision is available, this paper proposes a fully training-free cross-modal inference framework. It repurposes pre-trained text-to-image diffusion models (e.g., Stable Diffusion) as implicit vision-language priors, requiring neither supervised fine-tuning nor additional parameters. The method combines gradient-guided latent-space inversion, semantic reweighting within the diffusion process, alignment in the CLIP embedding space, and iterative decoding optimization. Evaluated on Flickr30K and COCO, it reaches a zero-shot BLEU-4 of 32.7, surpassing the prior state of the art by 4.2 points, while showing strong generalization and conceptual consistency. The core contribution is the first principled repurposing of text-to-image diffusion models as universal, training-free priors for image captioning, establishing a new paradigm for zero-shot multimodal generation without parameter adaptation.
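For intuition, below is a minimal, hypothetical sketch of the general idea of using a frozen text-to-image diffusion model as a training-free prior over captions: a candidate caption is scored by how well the UNet, conditioned on that caption, predicts noise injected into the image's VAE latents. It assumes the `diffusers` library, a Stable Diffusion v1.x checkpoint, and a `caption_score` helper named here for illustration; it does not reproduce the paper's actual pipeline (gradient-guided latent inversion, semantic reweighting, CLIP alignment, iterative decoding).

```python
# Sketch only: frozen diffusion model as a caption scorer (lower score = better fit).
# Checkpoint name and helper are assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
# Any Stable Diffusion v1.x checkpoint should work; this name is an assumption.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
).to(device)
for module in (pipe.vae, pipe.unet, pipe.text_encoder):
    module.requires_grad_(False)

preprocess = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),  # map pixels to [-1, 1] as the VAE expects
])

@torch.no_grad()
def caption_score(image: Image.Image, caption: str, n_steps: int = 8) -> float:
    """Average denoising error of the image latents conditioned on `caption`.

    A caption that explains the image well lets the UNet predict the injected
    noise more accurately, so the diffusion model acts as an implicit
    image-text prior without any training.
    """
    pixels = preprocess(image.convert("RGB")).unsqueeze(0).to(device)
    latents = pipe.vae.encode(pixels).latent_dist.mean * pipe.vae.config.scaling_factor

    tokens = pipe.tokenizer(
        caption, padding="max_length", truncation=True,
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    ).input_ids.to(device)
    text_emb = pipe.text_encoder(tokens)[0]

    total = 0.0
    t_max = pipe.scheduler.config.num_train_timesteps
    for t in torch.linspace(1, t_max - 1, n_steps).long():
        t = t.to(device).unsqueeze(0)
        noise = torch.randn_like(latents)
        noisy = pipe.scheduler.add_noise(latents, noise, t)
        pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample
        total += F.mse_loss(pred, noise).item()
    return total / n_steps

# Usage: re-rank candidate captions produced by any language model.
# image = Image.open("example.jpg")
# candidates = ["a dog playing in the snow", "a cat on a sofa"]
# best = min(candidates, key=lambda c: caption_score(image, c))
```

In a fuller system, such a denoising-based score could be combined with CLIP image-text similarity (the "CLIP embedding space alignment" mentioned above) and embedded in an iterative decoding loop rather than simple re-ranking.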