🤖 AI Summary
This work addresses the lack of systematic evaluation of multimodal large language models (MLLMs) on real-world, temporally aligned audio-visual data, as most existing studies focus on static images. We introduce a high-quality, fully human-verified benchmark spanning 13 realistic conversational domains, featuring demographic metadata and supporting open-ended summarization, multiple-choice question answering, and temporal grounding with explicit reasoning justification. Through comprehensive multi-task evaluation and cross-model comparison, we reveal a 22.6% performance gap on temporal grounding between the best-performing closed-source and open-source models, and show that performance degrades further across demographic groups. These findings highlight critical limitations of current MLLMs in social robustness and temporal understanding in authentic audio-visual contexts.
📝 Abstract
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work targets static image understanding, while the ability of MLLMs to process temporally aligned audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark that systematically evaluates MLLM performance in real-world settings. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal clear limitations. While the gap in MCQ accuracy between the two model families is relatively small, we observe a substantial 22.6% difference in temporal localization between the best-performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding. We release SONIC-O1 for reproducibility and research:

- Project page: https://vectorinstitute.github.io/sonic-o1/
- Dataset: https://huggingface.co/datasets/vector-institute/sonic-o1
- GitHub: https://github.com/vectorinstitute/sonic-o1
- Leaderboard: https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard
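For quick hands-on inspection, the released dataset can be pulled from the Hugging Face Hub with the `datasets` library. The following is a minimal sketch, assuming the repository loads under its default configuration; the split and field names are assumptions, not confirmed here, and should be checked against the dataset card:

```python
# Minimal sketch: load SONIC-O1 from the Hugging Face Hub and inspect one record.
# Assumptions (not confirmed by the dataset card): the repo has a default
# configuration, and each record is a single annotation with task-specific fields.
from datasets import load_dataset

# Repo ID taken from the abstract; if the repo defines multiple configurations,
# pass one explicitly as the second argument to load_dataset.
ds = load_dataset("vector-institute/sonic-o1")
print(ds)  # show the available splits and their sizes

split_name = next(iter(ds))   # first available split, whatever it is named
example = ds[split_name][0]   # a single annotation record
print(example.keys())         # field names vary; consult the dataset card
```

Printing the `DatasetDict` first, rather than hard-coding a split such as `"test"`, keeps the sketch robust to whatever split layout the repository actually uses.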