🤖 AI Summary
Existing benchmarks inadequately evaluate omni-modal large language models (Omni-LLMs): they lack strict multi-modal dependency, coverage of diverse audio information types (e.g., speech, sound events, music, vocal traits), and varying scene spans (single-, cross-, and full-scene). To address this, we introduce JointAVBench, a benchmark that strictly requires joint audio-visual understanding. Its design covers three dimensions: strict audio-video correlation, five cognitive dimensions, and three scene spans (single-, cross-, and full-scene). We further develop an automated QA synthesis pipeline in which vision-LLMs, audio-LLMs, and general-purpose LLMs collaborate to ensure that questions necessitate integrated audio-visual reasoning, with high-quality annotation supported by multi-granularity scene segmentation and cross-modal alignment. Experiments reveal that even the best state-of-the-art Omni-LLM achieves only 62.6% average accuracy, substantially below human performance, highlighting cross-scene joint reasoning as the critical bottleneck. JointAVBench thus provides a precise, challenging evaluation standard for advancing multimodal foundation models.
📝 Abstract
Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio, an effective benchmark must comprehensively cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered using vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans. However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation. To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation, spanning five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene). Given the high cost of manual annotation, we propose an automated pipeline that leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Results show that even the best-performing Omni-LLM achieves an average accuracy of only 62.6%, outperforming uni-modal baselines but revealing substantial room for improvement, especially in cross-scene reasoning.
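The abstract only describes the synthesis pipeline at a high level, so the following is a minimal illustrative sketch of how such a multi-LLM collaboration might be wired together. The three `*_llm` callables, the prompts, the `QAPair` structure, and the uni-modal "dependency check" are all assumptions introduced here for illustration; they are not the paper's actual models, prompts, or filtering criteria.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical sketch: `vision_llm`, `audio_llm`, and `general_llm` stand in for
# whichever models the pipeline actually uses; prompts and checks are illustrative.

@dataclass
class QAPair:
    question: str
    answer: str
    scene_span: str  # "single", "cross", or "full"

def synthesize_qa(
    video_clip: str,
    vision_llm: Callable[[str], str],   # clip -> visual description
    audio_llm: Callable[[str], str],    # clip -> speech/sound/music/vocal description
    general_llm: Callable[[str], str],  # prompt -> text
    scene_span: str = "single",
) -> Optional[QAPair]:
    """Generate one QA pair intended to require joint audio-visual reasoning."""
    visual_desc = vision_llm(video_clip)
    audio_desc = audio_llm(video_clip)

    # 1) Ask the general LLM for a question answerable only by combining both views.
    draft = general_llm(
        "Visual description:\n" + visual_desc + "\n\n"
        "Audio description:\n" + audio_desc + "\n\n"
        "Write one question and its answer that cannot be answered from either "
        "description alone. Format: QUESTION: ... ANSWER: ..."
    )
    question, _, answer = draft.partition("ANSWER:")
    question = question.replace("QUESTION:", "").strip()
    answer = answer.strip()

    # 2) Dependency check: discard the pair if either uni-modal view suffices.
    def answerable(description: str) -> bool:
        reply = general_llm(
            "Context:\n" + description + "\n\nQuestion: " + question + "\n"
            "Answer YES if the context alone is enough to answer, otherwise NO."
        )
        return reply.strip().upper().startswith("YES")

    if answerable(visual_desc) or answerable(audio_desc):
        return None  # fails the strict audio-video correlation requirement
    return QAPair(question=question, answer=answer, scene_span=scene_span)
```

In this sketch the same rejection logic also explains why the resulting questions resist vision-only and audio-only baselines: any draft that a single modality's description can answer is filtered out before it enters the benchmark.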