SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

📅 2026-01-29
🤖 AI Summary
This work addresses the lack of systematic evaluation of multimodal large language models (MLLMs) on real-world, temporally aligned audio-visual data, as most existing studies focus on static images. We introduce a high-quality, fully human-verified benchmark spanning 13 realistic conversational domains, featuring demographic metadata and supporting open-ended summarization, multiple-choice question answering, and temporal grounding with explicit reasoning justification. Through comprehensive multi-task evaluation and cross-model comparison, we reveal a performance gap of up to 22.6% between closed-source and open-source models on temporal grounding tasks, and demonstrate significant performance degradation across different demographic groups. These findings highlight critical limitations in current MLLMs regarding social robustness and temporal understanding in authentic audio-visual contexts.

📝 Abstract
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work targets static image understanding, while the ability of MLLMs to process sequential audio-video data remains underexplored. This gap motivates a high-quality benchmark for systematically evaluating MLLM performance in real-world settings. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on three key tasks: open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal clear limitations. While the gap in MCQ accuracy between the two model families is relatively small, we observe a substantial 22.6% difference in temporal localization between the best-performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding. We release SONIC-O1 for reproducibility and research:
Project page: https://vectorinstitute.github.io/sonic-o1/
Dataset: https://huggingface.co/datasets/vector-institute/sonic-o1
GitHub: https://github.com/vectorinstitute/sonic-o1
Leaderboard: https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard
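Temporal localization of the kind benchmarked here is commonly scored by comparing a predicted time span against the annotated span via temporal intersection-over-union (IoU). The sketch below is illustrative only: the function names are hypothetical and this is the standard community metric, not necessarily the exact scoring code used by SONIC-O1.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, thresh=0.5):
    """Fraction of annotated spans whose paired prediction reaches IoU >= thresh."""
    hits = sum(1 for p, g in zip(preds, gts) if temporal_iou(p, g) >= thresh)
    return hits / len(gts) if gts else 0.0

# Example: model localizes an event at 12-20 s, annotation says 10-18 s.
print(temporal_iou((12.0, 20.0), (10.0, 18.0)))  # 0.6
```

Under a metric like this, a 22.6% gap between model families corresponds to the closed-source model hitting the IoU threshold on substantially more events than its open-source counterpart.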
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Audio-Video Understanding
Benchmark
Temporal Localization
Real-World Evaluation
👥 Authors
Ahmed Y. Radwan · Vector Institute for Artificial Intelligence, MaRS Centre, Toronto, ON M5G 1L7, Canada
Christos Emmanouilidis · Associate Professor, University of Groningen
Hina Tabassum · Associate Professor, York University
D. Pandya · Vector Institute for Artificial Intelligence, MaRS Centre, Toronto, ON M5G 1L7, Canada
Shaina Raza · Vector Institute for Artificial Intelligence, MaRS Centre, Toronto, ON M5G 1L7, Canada