OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video understanding benchmarks neglect collaborative reasoning between audio and visual modalities, limiting their ability to assess cross-modal complementarity and logical consistency. Method: We introduce OmniVideoBench, the first large-scale benchmark dedicated to audio-visual collaborative reasoning, comprising 1,000 human-verified question-answer pairs with step-by-step reasoning annotations. It covers 13 complex reasoning categories, including temporal reasoning, causal inference, and cross-modal localization, and rigorously enforces modality complementarity and reasoning coherence; all samples undergo multi-round manual validation to ensure quality and answer uniqueness. Contribution/Results: Evaluation of state-of-the-art multimodal large language models reveals substantial performance gaps: open-source models lag significantly behind proprietary ones, and both fall markedly short of human performance. These results confirm OmniVideoBench's high difficulty and strong diagnostic utility for probing fine-grained audio-visual reasoning capabilities.

📝 Abstract
Recent advances in multimodal large language models (MLLMs) have demonstrated substantial potential in video understanding. However, existing benchmarks fail to comprehensively evaluate synergistic reasoning capabilities across audio and visual modalities, often neglecting one of the modalities or integrating them in a logically inconsistent manner. To bridge this gap, we introduce OmniVideoBench, a large-scale and rigorously designed benchmark dedicated to assessing synergistic audio-visual understanding, with a strong emphasis on modality complementarity and logical consistency. Specifically, OmniVideoBench comprises 1000 high-quality question-answer (QA) pairs, each annotated with step-by-step reasoning traces, derived from 628 diverse videos ranging from several seconds to 30 minutes, and manually verified to guarantee complete correctness and uniqueness. Moreover, OmniVideoBench encompasses 13 carefully designed question types, covering temporal reasoning, spatial localization, counting, causal inference, summarization, and beyond, thereby capturing the essential challenges of video understanding. Evaluation of multiple MLLMs on OmniVideoBench reveals a pronounced gap between model performance and human reasoning, with open-source models lagging significantly behind their closed-source counterparts, underscoring the inherent difficulty of genuine audio-visual reasoning. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.
Problem

Research questions and friction points this paper is trying to address.

Evaluates synergistic audio-visual reasoning in multimodal models
Addresses gaps in logical consistency across video and audio
Assesses diverse video understanding tasks with human-verified QA pairs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a large-scale audio-visual evaluation benchmark
Features 1000 QA pairs with step-by-step reasoning traces
Encompasses 13 question types for comprehensive assessment
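To make the benchmark design concrete, below is a minimal sketch of how a QA benchmark of this shape might be scored, reporting overall accuracy and a per-question-type breakdown. The item field names (`video`, `audio`, `question`, `type`, `answer`) and the `predict` callable are illustrative assumptions, not the paper's actual data format or evaluation code.

```python
# Hypothetical scoring loop for an audio-visual QA benchmark.
# Each item names its video/audio sources, question type, and gold answer;
# accuracy is reported overall and per question type. All field names and
# the predict() interface are assumptions for illustration only.
from collections import defaultdict

def evaluate(items, predict):
    """Return overall accuracy and per-question-type accuracy."""
    correct = 0
    per_type = defaultdict(lambda: [0, 0])  # type -> [correct, total]
    for item in items:
        pred = predict(item["video"], item["audio"], item["question"])
        hit = pred.strip().lower() == item["answer"].strip().lower()
        correct += hit
        per_type[item["type"]][0] += hit
        per_type[item["type"]][1] += 1
    overall = correct / len(items) if items else 0.0
    breakdown = {t: c / n for t, (c, n) in per_type.items()}
    return overall, breakdown

# Toy usage with a dummy model that always answers "yes".
sample = [
    {"video": "v1.mp4", "audio": "v1.wav",
     "question": "Is the door knocked before the dog barks?",
     "type": "temporal reasoning", "answer": "yes"},
    {"video": "v2.mp4", "audio": "v2.wav",
     "question": "How many times does the phone ring?",
     "type": "counting", "answer": "three"},
]
overall, by_type = evaluate(sample, lambda v, a, q: "yes")
```

In a real harness, exact string matching would typically be replaced by multiple-choice option matching or an answer normalizer, but the per-category aggregation is the part that supports the diagnostic breakdown across the 13 question types.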
Authors
Caorui Li
Yu Chen
Yiyan Ji
Jin Xu
Zhenyu Cui (Associate Professor, School of Business, Stevens Institute of Technology; Financial Engineering, Financial Technology, Derivative Pricing, Insurance Analytics)
Shihao Li
Yuanxing Zhang (Kuaishou Technology; Recommender System, Large Language Model, Video Understanding)
Jiafu Tang
Zhenghao Song
Dingling Zhang
Ying He
Haoxiang Liu
Yuxuan Wang
Qiufeng Wang
Zhenhe Wu
Jiehui Luo (University of Notre Dame)
Zhiyu Pan (Department of Automation, Tsinghua University; Computer Vision, Biometrics)
Weihao Xie
Chenchen Zhang
Zhaohui Wang
Jiayi Tian (University of California, Santa Barbara; LLM Efficiency)
Yanghai Wang
Zhe Cao
Minxin Dai
Ke Wang