video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reasoning-enhancement research predominantly targets mathematical problems or unimodal visual inputs, neglecting general-purpose audio-visual joint reasoning over video. This work introduces video-SALMONN-o1, the first open-source audio-visual LLM explicitly designed for general video understanding and enhanced with structured, step-by-step reasoning. Methodologically, the authors propose process direct preference optimization (pDPO), which pairs step-level multimodal reward modeling with contrastive step selection. They further construct RivaBench, the first reasoning-intensive audio-visual video understanding benchmark, which includes a synthetic video detection scenario. Experiments show 3-8% absolute accuracy gains over the LLaVA-OneVision baseline across multiple video reasoning benchmarks, and pDPO yields 6-8% improvements over supervised fine-tuning on RivaBench; the enhanced reasoning also enables zero-shot synthetic video detection. These results advance large-model capabilities in joint audio-visual reasoning.

📝 Abstract
While recent advancements in reasoning optimization have significantly enhanced the capabilities of large language models (LLMs), existing efforts to improve reasoning have been limited to solving mathematical problems and focusing on visual graphical inputs, neglecting broader applications in general video understanding. This paper proposes video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. To enhance its reasoning abilities, we develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also propose process direct preference optimization (pDPO), which leverages contrastive step selection to achieve efficient step-level reward modelling tailored for multimodal inputs. Additionally, we introduce RivaBench, the first reasoning-intensive video understanding benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs across scenarios such as standup comedy, academic presentations, and synthetic video detection. video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision baseline across different video reasoning benchmarks. In addition, pDPO achieves 6-8% improvements compared to the supervised fine-tuning model on RivaBench. Enhanced reasoning enables video-SALMONN-o1 to perform zero-shot synthetic video detection.
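The abstract's pDPO objective is a process-level variant of direct preference optimization: instead of scoring whole responses, it compares a preferred reasoning step against a contrastively selected rejected step. The paper's exact formulation is not reproduced here; the sketch below shows only the generic DPO-style preference loss on a single step pair, with all function and parameter names (and the beta value) being illustrative assumptions.

```python
import math

def step_preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style loss for one pair of reasoning steps (illustrative sketch).

    logp_w / logp_l: log-probability of the preferred / rejected step
    under the policy model; ref_logp_* are the same quantities under a
    frozen reference model. beta scales the implicit reward margin.
    """
    # Implicit reward margin: difference of policy-vs-reference log-ratios.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin: small when the preferred step
    # is already favoured, large when the rejected step is favoured.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A training loop would average this loss over step pairs mined by contrastive step selection and backpropagate through the policy's log-probabilities; when both models score the pair identically the loss sits at log 2, its indifference point.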
Problem

Research questions and friction points this paper is trying to address.

Enhance general video understanding
Develop reasoning-intensive audio-visual dataset
Introduce multimodal step-level reward modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source audio-visual LLM
Process direct preference optimization
Reasoning-intensive video benchmark