See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video benchmarks for multimodal large language models (MLLMs) lack fine-grained evaluation of audio-visual alignment, specifically the "who said what and when" capability. Method: We introduce AV-SpeakerBench, the first speaker-centric video benchmark for reasoning about speaker identity, spoken content, and precise temporal localization. It comprises 3,212 multiple-choice questions and establishes an evaluation framework in which the speaker serves as the fundamental reasoning unit. Questions are semantically designed to jointly encode audio-visual dependencies, with expert annotations ensuring millisecond-level temporal accuracy and cross-modal consistency. Contribution/Results: Experiments show that Gemini 2.5 Pro achieves the highest performance, while the open-weight Qwen3-Omni-30B approaches Gemini 2.0 Flash yet remains far behind Gemini 2.5 Pro, revealing that the bottleneck lies in audio-visual fusion rather than visual perception. This work provides the first systematic assessment of speaker-level audio-visual comprehension in MLLMs, advancing research on fine-grained multimodal alignment.
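The paper's data format is not reproduced on this page, but a minimal sketch of what a speaker-centric multiple-choice item could look like is given below. The `SpeakerTurn` and `AVSpeakerItem` names and fields are illustrative assumptions, not the released AV-SpeakerBench schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SpeakerTurn:
    """One annotated speaking turn: who spoke, what was said, and when (hypothetical)."""
    speaker_id: str      # identity label within the clip, e.g. "speaker_2"
    transcript: str      # spoken content of this turn
    start_sec: float     # turn start time in the video
    end_sec: float       # turn end time in the video

@dataclass
class AVSpeakerItem:
    """One multiple-choice question grounded in the clip's speaker turns (hypothetical)."""
    video_id: str
    turns: List[SpeakerTurn]   # speaker-level annotations for the clip
    question: str              # e.g. "What does the speaker in the blue shirt say last?"
    choices: List[str]         # candidate answers
    answer_index: int          # index of the correct choice
```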

📝 Abstract
Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.
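Since every item is a single-answer multiple-choice question, evaluation reduces to accuracy over the 3,212 questions. The helper below is a generic sketch of that computation under this assumption, not code released with the benchmark.

```python
from typing import Sequence

def multiple_choice_accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of questions where the predicted option index matches the annotated answer."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(int(p == g) for p, g in zip(predicted, gold)) / len(gold)
```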
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks lack fine-grained evaluation of audiovisual reasoning about human speech
Models struggle to align speakers, content, and timing in videos
Need rigorous evaluation for speaker-centric multimodal understanding in real-world contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speaker-centered reasoning unit for audiovisual analysis
Fusion-grounded question design embedding cross-modal dependencies
Expert-curated annotations ensuring temporal and cross-modal validity
Le Thien Phuc Nguyen
University of Wisconsin–Madison
Computer Vision, Deep Learning, Multimodality
Zhuoran Yu
University of Wisconsin–Madison
Computer Vision, Machine Learning
Samuel Low Yu Hang
Kookmin University
Subin An
Kookmin University
Jeongik Lee
Kookmin University
Yohan Ban
Kookmin University
SeungEun Chung
Kookmin University
Thanh-Huy Nguyen
Carnegie Mellon University
Medical Image Analysis, Computer Vision, Semi-Supervised Learning
JuWan Maeng
Kookmin University
Soochahn Lee
Kookmin University
Yong Jae Lee
University of Wisconsin–Madison