DiaDem: Advancing Dialogue Descriptions in Audiovisual Video Captioning for Multimodal Large Language Models

📅 2026-01-27
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses ambiguous speaker attribution and inaccurate transcription in existing audiovisual captioning models when they describe dialogue. The authors propose DiaDem, a multimodal large language model that integrates audio, visual, and textual information. It is trained with supervised fine-tuning (SFT) on synthetically generated data, followed by a two-stage, difficulty-stratified GRPO reinforcement learning strategy. The contributions include DiaDemBench, presented as the first systematic evaluation benchmark for dialogue description, and a difficulty-aware optimization approach. Experimental results show that DiaDem outperforms the Gemini family of models on DiaDemBench in both the accuracy and the faithfulness of its dialogue descriptions, while remaining competitive on general audiovisual captioning tasks.

📝 Abstract
Accurate dialogue description in audiovisual video captioning is crucial for downstream understanding and generation tasks. However, existing models generally struggle to produce faithful dialogue descriptions within audiovisual captions. To mitigate this limitation, we propose DiaDem, a powerful audiovisual video captioning model capable of generating captions with more precise dialogue descriptions while maintaining strong overall performance. We first synthesize a high-quality dataset for supervised fine-tuning (SFT), then employ a difficulty-partitioned two-stage Group Relative Policy Optimization (GRPO) strategy to further enhance dialogue descriptions. To enable systematic evaluation of dialogue description capabilities, we introduce DiaDemBench, a comprehensive benchmark designed to evaluate models across diverse dialogue scenarios, emphasizing both speaker attribution accuracy and utterance transcription fidelity in audiovisual captions. Extensive experiments on DiaDemBench reveal that even commercial models still exhibit substantial room for improvement in dialogue-aware captioning. Notably, DiaDem not only outperforms the Gemini series in dialogue description accuracy but also achieves competitive performance on general audiovisual captioning benchmarks, demonstrating its overall effectiveness.
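
The abstract does not spell out how the difficulty-partitioned GRPO stage works, so the following is only a minimal sketch of the general idea, not the authors' implementation. The first function shows the standard GRPO step of normalizing each sampled caption's reward against its own group. The second function, `partition_by_difficulty`, and its `threshold` parameter are hypothetical: they assume that difficulty is judged from the mean reward a prompt's samples receive, which the source does not confirm.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampling group (GRPO-style advantage).

    Each reward is centered on the group mean and scaled by the group
    standard deviation; eps avoids division by zero when all rewards match.
    Population standard deviation is used here as a simplifying choice.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def partition_by_difficulty(group_rewards, threshold=0.5):
    """Hypothetical split of prompts into easy and hard buckets.

    Assumption (not from the paper): a prompt is 'hard' when the mean
    reward of its sampled captions falls below `threshold`.
    """
    easy, hard = [], []
    for prompt_id, rewards in group_rewards.items():
        if mean(rewards) < threshold:
            hard.append(prompt_id)
        else:
            easy.append(prompt_id)
    return easy, hard


if __name__ == "__main__":
    # Four captions sampled for one video, scored by some reward function.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # Two prompts with different average rewards.
    print(partition_by_difficulty({"clip_a": [1.0, 1.0], "clip_b": [0.0, 0.5]}))
```

Under this assumption, a two-stage schedule could train on one bucket first and the other second. Which bucket comes first, and how rewards are computed, would have to be taken from the paper itself.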
Problem

Research questions and friction points this paper is trying to address.

dialogue description
audiovisual video captioning
speaker attribution
utterance transcription
multimodal large language models
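
To make the two evaluation targets named above concrete, here is a small illustrative sketch of how speaker attribution accuracy and utterance transcription fidelity could be scored. It is not DiaDemBench's actual protocol, which this page does not describe. Both function names are hypothetical, word error rate is swapped in as a common stand-in for transcription fidelity, and the speaker metric assumes that reference and predicted utterances are already aligned one-to-one.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length.

    Assumption: lowercased, whitespace-split words are a good enough
    tokenization for illustration.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the processed part of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def speaker_attribution_accuracy(reference_speakers, predicted_speakers):
    """Fraction of reference utterances attributed to the correct speaker.

    Assumption: the two lists are aligned, so position k in each list
    refers to the same utterance. Missing predictions count as errors.
    """
    correct = sum(r == p for r, p in zip(reference_speakers, predicted_speakers))
    return correct / max(len(reference_speakers), 1)


if __name__ == "__main__":
    print(word_error_rate("where are you going", "where are we going"))
    print(speaker_attribution_accuracy(["man", "woman", "man"],
                                       ["man", "man", "man"]))
```

A real benchmark would also need to align predicted utterances with reference utterances before scoring, which this sketch deliberately leaves out.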
Innovation

Methods, ideas, or system contributions that make the work stand out.

audiovisual video captioning
dialogue description
difficulty-partitioned GRPO
DiaDemBench
multimodal large language models