MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing cross-modal evaluation frameworks lack a standardized benchmark for deep audio-visual fusion. To address this gap, the paper introduces MAVERIX (Multimodal Audio-Visual Evaluation Reasoning IndeX), presented as the first benchmark dedicated to audio-visual collaborative understanding, comprising 700 real-world videos and 2,556 questions that require joint audio-visual reasoning across tightly coupled tasks such as synchrony assessment, causal inference, and event localization. The benchmark systematically evaluates models' fine-grained audio-visual joint understanding, covering both perceptual grounding and higher-order reasoning, and is backed by rigorous human annotation protocols and an open-source evaluation toolkit for reproducibility and fairness. State-of-the-art models, including Gemini 1.5 Pro and o1, reach only around 70% accuracy, while human experts attain 95.1%, leaving a substantial headroom. MAVERIX thus fills a critical void in cross-modal perception evaluation and sets a demanding standard for multimodal intelligence assessment.

📝 Abstract
Frontier models have either been language-only or have primarily focused on vision and language modalities. Although recent advancements in models with vision and audio understanding capabilities have shown substantial progress, the field lacks a standardized evaluation framework for thoroughly assessing their cross-modality perception performance. We introduce MAVERIX~(Multimodal Audio-Visual Evaluation Reasoning IndeX), a novel benchmark with 700 videos and 2,556 questions explicitly designed to evaluate multimodal models through tasks that necessitate close integration of video and audio information. MAVERIX uniquely provides models with audiovisual tasks, closely mimicking the multimodal perceptual experiences available to humans during inference and decision-making processes. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration. Experiments with state-of-the-art models, including Gemini 1.5 Pro and o1, show performance approaching human levels (around 70% accuracy), while human experts reach near-ceiling performance (95.1%). With standardized evaluation protocols, a rigorously annotated pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence.
Problem

Research questions and friction points this paper is trying to address.

The field lacks a standardized evaluation framework for audio-visual cross-modality models
A benchmark is needed to assess video-audio integration in AI systems
Evaluation should mimic the multimodal perceptual tasks humans face during inference and decision-making
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MAVERIX benchmark for audiovisual evaluation
Includes 700 videos and 2,556 cross-modality questions
Standardizes assessment of audiovisual integration in AI
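The benchmark's headline numbers (~70% for frontier models vs. 95.1% for humans) come from scoring model answers against annotated multiple-choice questions. A minimal sketch of that kind of scoring is shown below; the function name, tuple layout, and normalization are illustrative assumptions, not the schema of the actual MAVERIX toolkit.

```python
from collections import defaultdict

def score_by_task(items):
    """Compute overall and per-task accuracy for multiple-choice answers.

    `items` is an iterable of (task, predicted, gold) string tuples.
    This layout is a hypothetical stand-in for the benchmark's format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, pred, gold in items:
        total[task] += 1
        # Normalize case/whitespace so "a " and "A" count as the same choice.
        if pred.strip().upper() == gold.strip().upper():
            correct[task] += 1
    per_task = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_task
```

Per-task breakdowns like this are what let a benchmark report where models fall short (e.g. synchrony vs. causal questions) rather than a single aggregate score.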