🤖 AI Summary
To address the semantic misalignment that arises in zero-shot image captioning when no image-text pair supervision is available, this paper proposes a fully training-free cross-modal inference framework. It repurposes pre-trained text-to-image diffusion models (e.g., Stable Diffusion) as implicit vision-language priors, requiring neither supervised fine-tuning nor additional parameters. The method combines gradient-guided latent-space inversion, semantic reweighting within the diffusion process, alignment in the CLIP embedding space, and iterative decoding optimization. Evaluated on Flickr30K and COCO, it reaches a zero-shot BLEU-4 of 32.7, surpassing the prior state of the art by 4.2 points, while showing strong generalization and conceptual consistency. The core contribution is the first principled repurposing of text-to-image diffusion models as universal, training-free priors for image captioning, establishing a new paradigm for zero-shot multimodal generation without parameter adaptation.
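For intuition, below is a minimal, hypothetical sketch of the general idea of using a frozen text-to-image diffusion model as a training-free prior over captions: a candidate caption is scored by how well the UNet, conditioned on that caption, predicts noise injected into the image's VAE latents. It assumes the `diffusers` library, a Stable Diffusion v1.x checkpoint, and a `caption_score` helper named here for illustration; it does not reproduce the paper's actual pipeline (gradient-guided latent inversion, semantic reweighting, CLIP alignment, iterative decoding).

```python
# Sketch only: frozen diffusion model as a caption scorer (lower score = better fit).
# Checkpoint name and helper are assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
# Any Stable Diffusion v1.x checkpoint should work; this name is an assumption.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
).to(device)
for module in (pipe.vae, pipe.unet, pipe.text_encoder):
    module.requires_grad_(False)

preprocess = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),  # map pixels to [-1, 1] as the VAE expects
])

@torch.no_grad()
def caption_score(image: Image.Image, caption: str, n_steps: int = 8) -> float:
    """Average denoising error of the image latents conditioned on `caption`.

    A caption that explains the image well lets the UNet predict the injected
    noise more accurately, so the diffusion model acts as an implicit
    image-text prior without any training.
    """
    pixels = preprocess(image.convert("RGB")).unsqueeze(0).to(device)
    latents = pipe.vae.encode(pixels).latent_dist.mean * pipe.vae.config.scaling_factor

    tokens = pipe.tokenizer(
        caption, padding="max_length", truncation=True,
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    ).input_ids.to(device)
    text_emb = pipe.text_encoder(tokens)[0]

    total = 0.0
    t_max = pipe.scheduler.config.num_train_timesteps
    for t in torch.linspace(1, t_max - 1, n_steps).long():
        t = t.to(device).unsqueeze(0)
        noise = torch.randn_like(latents)
        noisy = pipe.scheduler.add_noise(latents, noise, t)
        pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample
        total += F.mse_loss(pred, noise).item()
    return total / n_steps

# Usage: re-rank candidate captions produced by any language model.
# image = Image.open("example.jpg")
# candidates = ["a dog playing in the snow", "a cat on a sofa"]
# best = min(candidates, key=lambda c: caption_score(image, c))
```

In a fuller system, such a denoising-based score could be combined with CLIP image-text similarity (the "CLIP embedding space alignment" mentioned above) and embedded in an iterative decoding loop rather than simple re-ranking.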