Diffusion Is Your Friend in Show, Suggest and Tell

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion models excel in image generation but remain inferior to autoregressive methods for discrete sequence generation tasks such as image captioning. To address this limitation, we propose Show, Suggest and Tell (SST), the first framework that integrates a diffusion-based denoising model not as a direct text generator but as a semantic *suggestion engine* within an autoregressive Transformer pipeline. Specifically, the diffusion module produces multi-granularity vision-language suggestions, which are fused with the decoder's input via a joint prompting mechanism. This design synergistically leverages diffusion's bidirectional contextual modeling capability and autoregression's strong sequential linguistic structure. Experiments on COCO demonstrate that SST achieves 125.1 CIDEr-D without reinforcement learning, surpassing prior autoregressive and diffusion-based state-of-the-art methods by +1.5 and +2.5 points, respectively. SST establishes a novel paradigm of collaborative generation, unifying diffusion and autoregressive modeling for captioning.
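The summary above describes a joint prompting mechanism that fuses diffusion-derived suggestions with the decoder's input. The paper's actual implementation is not reproduced on this page; the snippet below is only a minimal sketch of one plausible reading, assuming the suggestions arrive as embedding vectors and are prepended to the caption token embeddings as a soft prefix. The function name `fuse_suggestions` and all shapes are hypothetical, not taken from the paper.

```python
import numpy as np

def fuse_suggestions(token_embs, suggestion_embs):
    """Prepend diffusion-derived suggestion embeddings to the caption
    token embeddings, so an autoregressive decoder can attend to the
    suggestions as a soft prompt while generating the caption.

    token_embs:      (T, d) embeddings of caption tokens so far
    suggestion_embs: (S, d) embeddings produced by the diffusion module
    returns:         (S + T, d) fused decoder input
    """
    return np.concatenate([suggestion_embs, token_embs], axis=0)

# Toy example: 3 suggestion vectors and 5 token vectors, dimension 8.
rng = np.random.default_rng(0)
sugg = rng.standard_normal((3, 8))
toks = rng.standard_normal((5, 8))
fused = fuse_suggestions(toks, sugg)
print(fused.shape)  # (8, 8)
```

In a real Transformer decoder the prefix positions would be excluded from the caption loss and visible to every generated token through causal attention; this sketch only illustrates the input-side fusion.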

📝 Abstract
Diffusion denoising models have demonstrated impressive results across generative computer vision tasks, but they still fail to outperform standard autoregressive solutions in the discrete domain, and at best only match them. In this work, we propose a different paradigm: rather than replacing autoregressive models, we adopt diffusion models to provide suggestions to the autoregressive generation. By doing so, we combine the bidirectional and refining capabilities of the former with the strong linguistic structure provided by the latter. To showcase its effectiveness, we present Show, Suggest and Tell (SST), which achieves state-of-the-art results on COCO among models in a similar setting. In particular, SST achieves 125.1 CIDEr-D on the COCO dataset without reinforcement learning, outperforming autoregressive and diffusion-model state-of-the-art results by 1.5 and 2.5 points, respectively. On top of the strong results, we performed extensive experiments to validate the proposal and analyze the impact of the suggestion module. Results demonstrate a positive correlation between suggestion and caption quality, overall indicating a currently underexplored but promising research direction. Code will be available at: https://github.com/jchenghu/show_suggest_tell.
Problem

Research questions and friction points this paper is trying to address.

Combines diffusion and autoregressive models for image captioning
Proposes Show, Suggest and Tell to enhance caption quality
Achieves state-of-the-art results on COCO dataset without reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion models provide suggestions to the autoregressive generator
Combines diffusion's bidirectional refinement with autoregression's linguistic structure
Achieves state-of-the-art results without reinforcement learning