🤖 AI Summary
Diffusion models excel at image generation but remain inferior to autoregressive methods for discrete sequence generation tasks such as image captioning. To address this limitation, we propose Show, Suggest and Tell (SST), the first framework that integrates a diffusion-based denoising model into an autoregressive Transformer pipeline not as a direct text generator but as a semantic *suggestion engine*. Specifically, the diffusion module produces multi-granularity vision-language suggestions, which are fused with the decoder's input via a joint prompting mechanism. This design synergistically leverages diffusion's bidirectional contextual modeling capability and autoregression's strong sequential linguistic structure. Experiments on COCO demonstrate that SST achieves 125.1 CIDEr-D without reinforcement learning, surpassing prior autoregressive and diffusion-based state-of-the-art methods by +1.5 and +2.5 points, respectively. SST establishes a novel paradigm of collaborative generation, unifying diffusion and autoregressive modeling for captioning.
📝 Abstract
Denoising diffusion models have demonstrated impressive results across generative computer vision tasks, but they still fail to outperform standard autoregressive solutions in the discrete domain, matching them at best. In this work, we propose a different paradigm: rather than replacing autoregressive models, we use diffusion models to provide suggestions to the autoregressive generation. By doing so, we combine the bidirectional and refining capabilities of the former with the strong linguistic structure provided by the latter. To showcase its effectiveness, we present Show, Suggest and Tell (SST), which achieves state-of-the-art results on COCO among models in a comparable setting. In particular, SST achieves 125.1 CIDEr-D on the COCO dataset without reinforcement learning, outperforming the autoregressive and diffusion-based state-of-the-art results by 1.5 and 2.5 points, respectively. Beyond these strong results, we performed extensive experiments to validate the proposal and analyze the impact of the suggestion module. Results demonstrate a positive correlation between suggestion quality and caption quality, overall indicating a currently underexplored but promising research direction. Code will be available at: https://github.com/jchenghu/show_suggest_tell.
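To make the "joint prompting" idea concrete, here is a minimal, hypothetical sketch of how diffusion-derived suggestion embeddings might be fused with the decoder's input as a prefix. The function name `joint_prompt` and the shapes are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np


def joint_prompt(suggestion_emb: np.ndarray, caption_emb: np.ndarray) -> np.ndarray:
    """Prefix-style fusion (illustrative sketch, not the paper's code):
    prepend suggestion embeddings produced by a diffusion module to the
    autoregressive decoder's caption token embeddings.

    Shapes: suggestion_emb (S, d), caption_emb (T, d) -> (S + T, d).
    """
    assert suggestion_emb.shape[1] == caption_emb.shape[1], "embedding dims must match"
    return np.concatenate([suggestion_emb, caption_emb], axis=0)


rng = np.random.default_rng(0)
d = 8
suggestions = rng.normal(size=(3, d))  # e.g. multi-granularity suggestions
caption = rng.normal(size=(5, d))      # shifted caption token embeddings
fused = joint_prompt(suggestions, caption)
print(fused.shape)  # (8, 8)
```

In a real Transformer decoder, the suggestion prefix would be attended to by every caption position, giving the autoregressive model access to the diffusion module's bidirectional context.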