Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning

📅 2024-12-31
🏛️ European Conference on Computer Vision
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address critical semantic misalignment in zero-shot image captioning—stemming from the absence of image-text pair supervision—this paper proposes a fully training-free cross-modal inference framework. It innovatively leverages pre-trained text-to-image diffusion models (e.g., Stable Diffusion) as implicit vision-language priors, eliminating the need for supervised fine-tuning or additional parameters. The method integrates gradient-guided latent-space inversion, semantic reweighting within the diffusion process, CLIP embedding space alignment, and iterative decoding optimization. Evaluated on Flickr30K and COCO, it achieves a zero-shot BLEU-4 score of 32.7—surpassing prior state-of-the-art by 4.2 points—while demonstrating strong generalization and conceptual consistency. The core contribution lies in the first principled repurposing of text-to-image diffusion models as universal, training-free priors for image captioning, establishing a novel paradigm for zero-shot multimodal generation without parameter adaptation.
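The summary names CLIP embedding-space alignment as one ingredient of the framework. The sketch below is only a minimal, hedged illustration of that single idea, not the paper's actual pipeline: it reranks candidate captions by CLIP image-text cosine similarity. The function name `clip_rerank`, the ViT-B/32 backbone, and the reranking setup are assumptions made for illustration.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_rerank(image_path: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Hypothetical helper: score candidate captions against an image in CLIP space.

    This only illustrates CLIP embedding-space alignment; the paper's decoding
    and diffusion-prior machinery is not reproduced here.
    """
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize(candidates).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feats = model.encode_text(tokens)
        # Cosine similarity between the image and each candidate caption.
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        sims = (image_feat @ text_feats.T).squeeze(0)
    return sorted(zip(candidates, sims.tolist()), key=lambda x: -x[1])
```

Used this way, the highest-scoring candidate is the caption most aligned with the image in CLIP embedding space; an iterative decoder could repeatedly propose and rescore candidates along these lines.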

Problem

Research questions and friction points this paper is trying to address.

Zero-shot Image Captioning
Visual-Semantic Misalignment
Quality Improvement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Patch-wise Cross-modal feature Mix-up
PCM-Net
Zero-shot Image Captioning
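The innovation tags name Patch-wise Cross-modal feature Mix-up (PCM-Net). The exact formulation is defined in the paper; the sketch below is only a schematic guess at the idea, assuming image patch embeddings and a sentence embedding that already live in a shared space. The helper name `patchwise_crossmodal_mixup` and the similarity-weighted blending rule are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

def patchwise_crossmodal_mixup(patch_feats: torch.Tensor,
                               text_feat: torch.Tensor) -> torch.Tensor:
    """Schematic patch-wise cross-modal feature mix-up (illustrative only).

    patch_feats: (N, D) patch embeddings of a (possibly synthetic) image.
    text_feat:   (D,)   sentence embedding in the same feature space.
    Patches that agree poorly with the text are pulled toward the text
    feature, so noisy regions of a synthetic image contribute less.
    """
    p = F.normalize(patch_feats, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    sim = (p @ t).clamp(min=0.0)       # (N,) per-patch relevance in [0, 1]
    w = sim.unsqueeze(-1)              # per-patch mixing weight
    # Low-relevance patches lean more heavily on the text feature.
    return w * patch_feats + (1.0 - w) * text_feat
```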