DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning

📅 2026-04-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing video captioning approaches: autoregressive models suffer from slow generation and error propagation, while non-autoregressive models often yield suboptimal captions due to insufficient multimodal interaction modeling. To overcome these challenges, we introduce diffusion models to video captioning for the first time and propose a discriminative conditional non-autoregressive generation framework. Our method leverages visual features as conditions and employs a discriminative denoiser to enable high-quality text generation through parallel decoding. Evaluated on MSVD, MSR-VTT, and VATEX benchmarks, the proposed approach significantly outperforms current non-autoregressive methods, achieving gains of up to 9.9 in CIDEr and 2.6 in BLEU-4, while maintaining faster inference speed.
📝 Abstract
Current video captioning methods typically use an encoder-decoder structure to generate text autoregressively. However, autoregressive decoding has inherent limitations, such as slow generation and error accumulation. The few non-autoregressive counterparts, in turn, suffer from lower generation quality due to insufficient multimodal interaction modeling. We therefore propose a non-autoregressive framework based on a Diffusion model for Video Captioning (DiffVC) to address these issues. Its parallel decoding effectively tackles both generation speed and error accumulation, while our proposed discriminative conditional diffusion model produces higher-quality textual descriptions. Specifically, we first encode the video into a visual representation. During training, Gaussian noise is added to the textual representation of the ground-truth caption; a new textual representation is then generated by the discriminative denoiser, with the visual representation as a conditional constraint. Finally, this textual representation is fed into a non-autoregressive language model to generate the caption. During inference, we sample noise directly from a Gaussian distribution for generation. Experiments on MSVD, MSR-VTT, and VATEX show that our method outperforms previous non-autoregressive methods and achieves performance comparable to autoregressive methods, with improvements of up to 9.9 in CIDEr and 2.6 in B@4, while offering faster generation. The source code will be available soon.
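The training procedure in the abstract (add Gaussian noise to the caption's text representation, then denoise it conditioned on visual features) follows the standard diffusion forward process. The sketch below illustrates only that forward (noising) step in pure Python; the schedule, embedding dimension, and variable names are illustrative assumptions, and the paper's discriminative denoiser and non-autoregressive language model are not reproduced here.

```python
# Minimal sketch of the diffusion forward (noising) process: during training,
# Gaussian noise is progressively added to the caption's text representation.
# Schedule and dimensions are illustrative assumptions, not the paper's setup.
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_t, linearly spaced."""
    return [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
            for t in range(num_steps)]

def cumulative_alpha_bar(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s): how much signal survives at step t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bar, rng):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]

rng = random.Random(0)
T = 1000
betas = linear_beta_schedule(T)
alpha_bar = cumulative_alpha_bar(betas)

# Toy 4-d "text embedding" of a caption token. At small t it is barely
# perturbed; at t = T-1 it is close to pure Gaussian noise, which is why
# inference can start by sampling noise directly and denoising it under
# the visual condition.
x0 = [0.5, -1.0, 0.3, 0.8]
slightly_noisy = q_sample(x0, 10, alpha_bar, rng)
mostly_noise = q_sample(x0, T - 1, alpha_bar, rng)
```

At inference time, per the abstract, the process is run in reverse: noise sampled from the Gaussian is iteratively denoised (conditioned on the visual representation) into a text representation that the language model decodes in parallel.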
Problem

Research questions and friction points this paper is trying to address.

video captioning
autoregressive generation
non-autoregressive generation
multimodal interaction
generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

non-autoregressive
diffusion model
video captioning
multimodal interaction
parallel decoding
Junbo Wang
School of Software, Northwestern Polytechnical University, Xi’an 710129, China
Liangyu Fu
School of Software, Northwestern Polytechnical University, Xi’an 710129, China
Yuke Li
School of Software, Northwestern Polytechnical University, Xi’an 710129, China
Yining Zhu
School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, China
Ya Jing
ByteDance Research
Computer Vision · Robotics · Cross-modal Learning
Xuecheng Wu
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
Jiangbin Zheng
Zhejiang University & Westlake University
AI for Life Science · Natural Language Processing · Computer Vision · AI for Sign Language