SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

📅 2026-01-29
🤖 AI Summary
This work addresses the lack of systematic evaluation of multimodal large language models (MLLMs) on real-world, temporally aligned audio-visual data, as most existing studies focus on static images. We introduce a high-quality, fully human-verified benchmark spanning 13 realistic conversational domains, featuring demographic metadata and supporting open-ended summarization, multiple-choice question answering, and temporal grounding with explicit reasoning justification. Through comprehensive multi-task evaluation and cross-model comparison, we reveal a performance gap of up to 22.6% between closed-source and open-source models on temporal grounding tasks, and demonstrate significant performance degradation across different demographic groups. These findings highlight critical limitations in current MLLMs regarding social robustness and temporal understanding in authentic audio-visual contexts.

📝 Abstract
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work targets static image understanding, while the ability of MLLMs to process sequential audio-video data remains underexplored. This gap motivates a high-quality benchmark for systematically evaluating MLLM performance in real-world settings. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on three key tasks: open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal clear limitations. While the gap in MCQ accuracy between the two model families is relatively small, we observe a substantial 22.6% difference in temporal localization between the best-performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding. We release SONIC-O1 for reproducibility and research:
Project page: https://vectorinstitute.github.io/sonic-o1/
Dataset: https://huggingface.co/datasets/vector-institute/sonic-o1
GitHub: https://github.com/vectorinstitute/sonic-o1
Leaderboard: https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard
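Temporal localization of the kind benchmarked here is commonly scored by comparing a predicted time span against the annotated span via temporal intersection-over-union (IoU). The sketch below is illustrative only: the function names are hypothetical and this is the standard community metric, not necessarily the exact scoring code used by SONIC-O1.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, thresh=0.5):
    """Fraction of annotated spans whose paired prediction reaches IoU >= thresh."""
    hits = sum(1 for p, g in zip(preds, gts) if temporal_iou(p, g) >= thresh)
    return hits / len(gts) if gts else 0.0

# Example: model localizes an event at 12-20 s, annotation says 10-18 s.
print(temporal_iou((12.0, 20.0), (10.0, 18.0)))  # 0.6
```

Under a metric like this, a 22.6% gap between model families corresponds to the closed-source model hitting the IoU threshold on substantially more events than its open-source counterpart.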
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Audio-Video Understanding
Benchmark
Temporal Localization
Real-World Evaluation
👥 Authors
Ahmed Y. Radwan · Vector Institute for Artificial Intelligence, MaRS Centre, Toronto, ON M5G 1L7, Canada
Christos Emmanouilidis · Associate Professor, University of Groningen
Hina Tabassum · Associate Professor, York University
D. Pandya · Vector Institute for Artificial Intelligence, MaRS Centre, Toronto, ON M5G 1L7, Canada
Shaina Raza · Vector Institute for Artificial Intelligence, MaRS Centre, Toronto, ON M5G 1L7, Canada