DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning

📅 2026-04-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing video captioning approaches: autoregressive models suffer from slow generation and error propagation, while non-autoregressive models often yield suboptimal captions due to insufficient multimodal interaction modeling. To overcome these challenges, we introduce diffusion models to video captioning for the first time and propose a discriminative conditional non-autoregressive generation framework. Our method leverages visual features as conditions and employs a discriminative denoiser to enable high-quality text generation through parallel decoding. Evaluated on MSVD, MSR-VTT, and VATEX benchmarks, the proposed approach significantly outperforms current non-autoregressive methods, achieving gains of up to 9.9 in CIDEr and 2.6 in BLEU-4, while maintaining faster inference speed.
📝 Abstract
Current video captioning methods typically use an encoder-decoder structure to generate text autoregressively. However, autoregressive decoding has inherent limitations, such as slow generation and error accumulation. The few non-autoregressive counterparts, in turn, suffer from lower generation quality due to insufficient multimodal interaction modeling. We therefore propose a non-autoregressive framework based on a Diffusion model for Video Captioning (DiffVC) to address these issues. Its parallel decoding effectively tackles both generation speed and error accumulation, while our proposed discriminative conditional diffusion model produces higher-quality textual descriptions. Specifically, we first encode the video into a visual representation. During training, Gaussian noise is added to the textual representation of the ground-truth caption; a new textual representation is then generated by the discriminative denoiser, with the visual representation as a conditional constraint. Finally, this textual representation is fed into a non-autoregressive language model to generate the caption. During inference, we sample noise directly from a Gaussian distribution for generation. Experiments on MSVD, MSR-VTT, and VATEX show that our method outperforms previous non-autoregressive methods and achieves performance comparable to autoregressive methods, with improvements of up to 9.9 in CIDEr and 2.6 in B@4, while offering faster generation. The source code will be available soon.
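The training procedure in the abstract (add Gaussian noise to the caption's text representation, then denoise it conditioned on visual features) follows the standard diffusion forward process. The sketch below illustrates only that forward (noising) step in pure Python; the schedule, embedding dimension, and variable names are illustrative assumptions, and the paper's discriminative denoiser and non-autoregressive language model are not reproduced here.

```python
# Minimal sketch of the diffusion forward (noising) process: during training,
# Gaussian noise is progressively added to the caption's text representation.
# Schedule and dimensions are illustrative assumptions, not the paper's setup.
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_t, linearly spaced."""
    return [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
            for t in range(num_steps)]

def cumulative_alpha_bar(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s): how much signal survives at step t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bar, rng):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]

rng = random.Random(0)
T = 1000
betas = linear_beta_schedule(T)
alpha_bar = cumulative_alpha_bar(betas)

# Toy 4-d "text embedding" of a caption token. At small t it is barely
# perturbed; at t = T-1 it is close to pure Gaussian noise, which is why
# inference can start by sampling noise directly and denoising it under
# the visual condition.
x0 = [0.5, -1.0, 0.3, 0.8]
slightly_noisy = q_sample(x0, 10, alpha_bar, rng)
mostly_noise = q_sample(x0, T - 1, alpha_bar, rng)
```

At inference time, per the abstract, the process is run in reverse: noise sampled from the Gaussian is iteratively denoised (conditioned on the visual representation) into a text representation that the language model decodes in parallel.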
Problem

Research questions and friction points this paper is trying to address.

video captioning
autoregressive generation
non-autoregressive generation
multimodal interaction
generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

non-autoregressive
diffusion model
video captioning
multimodal interaction
parallel decoding
Junbo Wang
School of Software, Northwestern Polytechnical University, Xi’an 710129, China
Liangyu Fu
School of Software, Northwestern Polytechnical University, Xi’an 710129, China
Yuke Li
School of Software, Northwestern Polytechnical University, Xi’an 710129, China
Yining Zhu
School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, China
Ya Jing
ByteDance Research
Computer Vision · Robotics · Cross-modal Learning
Xuecheng Wu
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
Jiangbin Zheng
Zhejiang University & Westlake University
AI for Life Science · Natural Language Processing · Computer Vision · AI for Sign Language