🤖 AI Summary
Existing audio-visual understanding benchmarks are largely confined to short video clips and fail to capture the cross-modal comprehension demands of real-world scenarios, where videos often run for tens of minutes. To bridge this gap, this work introduces LVOmniBench, the first benchmark specifically designed to evaluate multimodal large models on long-form audio-visual content. It comprises 275 high-quality videos ranging from 10 to 90 minutes in duration and 1,014 multidimensional question-answer pairs, enabling systematic assessment of models' capabilities in long-term memory retention, temporal localization, fine-grained semantic reasoning, and multimodal alignment. Built on carefully curated and meticulously annotated open-source data, LVOmniBench fills a critical void in long-video understanding evaluation. Empirical results show that current omnimodal large language models still struggle on this task: open-source models generally achieve accuracy below 35%, while the top-performing Gemini 3 Pro reaches only about 65%.
📝 Abstract
Recent advances in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations focus primarily on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for cross-modal comprehension of long-form audio and video. The dataset consists of high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through careful manual selection and annotation, LVOmniBench comprises 275 videos ranging in duration from 10 to 90 minutes, together with 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across multiple dimensions, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs: open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.
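To make the evaluation setup concrete, the sketch below shows a minimal multiple-choice scoring loop that computes overall accuracy and a per-dimension breakdown over QA items. This is an illustrative harness only, not the released LVOmniBench code: the file name (`lvomnibench_qa.json`), the item fields (`video_path`, `question`, `options`, `answer`, `dimension`), and the `answer_question` model hook are all hypothetical placeholders.

```python
import json
from collections import defaultdict


def answer_question(video_path: str, question: str, options: list[str]) -> str:
    """Hypothetical model hook (placeholder, not part of LVOmniBench).

    A real harness would feed the video's frames and audio track to the
    OmniLLM under test; here we return a constant guess so the loop runs.
    """
    return "A"


def evaluate(qa_file: str = "lvomnibench_qa.json") -> None:
    # Assumed item schema (illustrative only):
    # {"video_path": ..., "question": ..., "options": [...],
    #  "answer": "B", "dimension": "temporal_localization"}
    with open(qa_file) as f:
        items = json.load(f)

    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item in items:
        pred = answer_question(item["video_path"], item["question"], item["options"])
        dim = item["dimension"]
        total[dim] += 1
        correct[dim] += int(pred == item["answer"])

    # Overall accuracy plus a per-dimension breakdown (memory,
    # temporal localization, fine-grained understanding, perception).
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    print(f"overall accuracy: {overall:.1%}")
    for dim in sorted(total):
        print(f"{dim}: {correct[dim] / total[dim]:.1%} ({total[dim]} questions)")
```

Under a harness of this shape, the reported numbers (below 35% for open-source models, roughly 65% for Gemini 3 Pro) would correspond to the overall accuracy over the 1,014 QA pairs.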