video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of generating fine-grained, high-fidelity descriptions for audio-visual understanding. We propose a lightweight audio-visual large language model (7 billion parameters) designed for audio-visual comprehension. Methodologically: (1) we introduce multi-round direct preference optimization (MrDPO), which periodically updates the reference model and merges and re-initialises the LoRA adapter after each training round; (2) we fuse audio and visual modalities, incorporating guidance from ground-truth video captions to stabilise training; (3) we design new evaluation metrics balancing the completeness and accuracy of video descriptions. Experiments show a 28% reduction in captioning error rate, surpassing GPT-4o and Gemini-1.5-Pro in video captioning; on video question answering, the model remains highly competitive with the state-of-the-art among models of comparable scale. Key contributions include the MrDPO training framework, a lightweight audio-visual co-training paradigm, and an interpretable, multi-dimensional evaluation scheme.

📝 Abstract
Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through direct preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2's captioning accuracy, reducing the captioning error rates by 28%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance with the state-of-the-art on widely used video question-answering benchmarks among models of similar size. Code is available at https://github.com/bytedance/video-SALMONN-2.
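The MrDPO procedure described in the abstract (periodic reference refresh, per-round LoRA merge and re-initialisation) can be sketched as a toy training loop. Everything below is illustrative, not the authors' released implementation: the scalar "weights", learning rate, and hand-written gradient stand in for a real model and optimiser.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(policy_margin, reference_margin, beta=0.1):
    # Standard DPO loss for one preference pair, where each margin is
    # log p(preferred caption) - log p(rejected caption) under that model.
    return -math.log(sigmoid(beta * (policy_margin - reference_margin)))

def mrdpo(rounds=3, steps_per_round=100, lr=0.5, beta=0.1):
    # Toy MrDPO loop: each round trains a freshly initialised LoRA adapter,
    # merges it into the base model, then refreshes the frozen DPO reference
    # to the merged weights (the abstract's "periodic reference update").
    base = 0.0        # "merged" model weights, reduced to one scalar margin
    reference = base  # frozen reference policy for the current round
    history = []
    for _ in range(rounds):
        lora = 0.0    # re-initialise the LoRA adapter at every round
        for _ in range(steps_per_round):
            margin = base + lora  # policy = base weights + LoRA delta
            # gradient of dpo_loss with respect to the policy margin
            grad = -beta * (1.0 - sigmoid(beta * (margin - reference)))
            lora -= lr * grad
        base += lora        # merge the trained LoRA into the base model
        reference = base    # refresh the DPO reference for the next round
        history.append(base)
    return history
```

One plausible reading of why this stabilises training: refreshing the reference each round keeps the implicit KL anchor close to the current policy instead of an increasingly stale starting point.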
Problem

Research questions and friction points this paper is trying to address.

Enhancing video captioning accuracy with audio-visual LLMs
Developing new metrics to evaluate the completeness and accuracy of video descriptions
Improving training stability with a multi-round DPO approach
Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA-enhanced audio-visual LLM for video captioning
Multi-round DPO with periodic reference updates
Ground-truth guided training for stability
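Each MrDPO round optimises the standard DPO objective (Rafailov et al., 2023), with the preference pairs $(y_w, y_l)$ ranked by the paper's completeness and accuracy metrics and $\pi_{\text{ref}}$ refreshed after every round rather than kept fixed; the notation below is the generic form, not taken verbatim from the paper:

```latex
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
      \right)
    \right]
```

Here $x$ is the video (with paired audio), $y_w$ and $y_l$ the preferred and rejected captions, and $\beta$ the strength of the implicit KL constraint toward $\pi_{\text{ref}}$.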