video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reasoning-enhancement research predominantly targets mathematical problems or unimodal visual inputs, neglecting general-purpose audio-visual joint reasoning over video. This work introduces video-SALMONN-o1, the first open-source audio-visual LLM explicitly designed for general video understanding and enhanced with structured, step-by-step reasoning. Methodologically, the authors propose process direct preference optimization (pDPO), which pairs step-level multimodal reward modeling with contrastive step selection. They further construct RivaBench, the first reasoning-intensive audio-visual video understanding benchmark, which includes a synthetic video detection scenario. Experiments show 3-8% absolute accuracy gains over the LLaVA-OneVision baseline across multiple video reasoning benchmarks, and pDPO yields 6-8% improvements over supervised fine-tuning on RivaBench; the enhanced reasoning also enables zero-shot synthetic video detection. These results advance large-model capabilities in joint audio-visual reasoning.

📝 Abstract
While recent advancements in reasoning optimization have significantly enhanced the capabilities of large language models (LLMs), existing efforts to improve reasoning have been limited to solving mathematical problems and focusing on visual graphical inputs, neglecting broader applications in general video understanding. This paper proposes video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. To enhance its reasoning abilities, we develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also propose process direct preference optimization (pDPO), which leverages contrastive step selection to achieve efficient step-level reward modelling tailored for multimodal inputs. Additionally, we introduce RivaBench, the first reasoning-intensive video understanding benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs across scenarios such as standup comedy, academic presentations, and synthetic video detection. video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision baseline across different video reasoning benchmarks. In addition, pDPO achieves 6-8% improvements compared to the supervised fine-tuning model on RivaBench. Enhanced reasoning enables video-SALMONN-o1 to perform zero-shot synthetic video detection.
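The abstract's pDPO objective is a process-level variant of direct preference optimization: instead of scoring whole responses, it compares a preferred reasoning step against a contrastively selected rejected step. The paper's exact formulation is not reproduced here; the sketch below shows only the generic DPO-style preference loss on a single step pair, with all function and parameter names (and the beta value) being illustrative assumptions.

```python
import math

def step_preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style loss for one pair of reasoning steps (illustrative sketch).

    logp_w / logp_l: log-probability of the preferred / rejected step
    under the policy model; ref_logp_* are the same quantities under a
    frozen reference model. beta scales the implicit reward margin.
    """
    # Implicit reward margin: difference of policy-vs-reference log-ratios.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin: small when the preferred step
    # is already favoured, large when the rejected step is favoured.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A training loop would average this loss over step pairs mined by contrastive step selection and backpropagate through the policy's log-probabilities; when both models score the pair identically the loss sits at log 2, its indifference point.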
Problem

Research questions and friction points this paper is trying to address.

Enhance general video understanding
Develop reasoning-intensive audio-visual dataset
Introduce multimodal step-level reward modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source audio-visual LLM
Process direct preference optimization
Reasoning-intensive video benchmark