Diffusion Is Your Friend in Show, Suggest and Tell

Authors: Jia Cheng Hu, Roberto Cavicchioli, Alessandro Capotondi

Denoising diffusion models have demonstrated impressive results across generative Computer Vision tasks, but they still fail to outperform standard autoregressive solutions in the discrete domain, matching them at best. In this work, we propose a different paradigm: adopting diffusion models to provide suggestions to the autoregressive generation rather than replacing it. By doing so, we combine the bidirectional and refining capabilities of the former with the strong linguistic structure provided by the latter. To showcase its effectiveness, we present Show, Suggest and Tell (SST), which achieves State-of-the-Art results on COCO among models in a similar setting. In particular, SST achieves 125.1 CIDEr-D on the COCO dataset without Reinforcement Learning, outperforming the autoregressive and diffusion model State-of-the-Art results by 1.5 and 2.5 points, respectively. Beyond these results, we performed extensive experiments to validate the proposal and analyze the impact of the suggestion module. The results demonstrate a positive correlation between suggestion and caption quality, indicating a currently underexplored but promising research direction. Code will be available at: https://github.com/jchenghu/show_suggest_tell.

Subjects: Computer Vision and Pattern Recognition, Computation and Language

Publish: 2025-12-10 19:44:19 UTC

2512.10038
