DAVE: Diagnostic benchmark for Audio Visual Evaluation

📅 2025-03-12
🤖 AI Summary
Existing audio-visual benchmarks suffer from strong visual bias, meaning answers can often be inferred from the visual input alone, and they report only aggregate scores that conflate errors in visual understanding, audio interpretation, and cross-modal alignment. To address this, the authors propose DAVE (Diagnostic Audio Visual Evaluation), a benchmark designed so that (i) both modalities are necessary to answer correctly, enforcing genuine audio-visual reasoning, and (ii) evaluation is decoupled into atomic subcategories, so that failures in visual understanding, audio interpretation, and audio-visual alignment can be diagnosed separately. A detailed analysis of state-of-the-art models under this framework exposes specific failure modes and yields targeted insights for improvement. The dataset and evaluation code are publicly released.

📝 Abstract
Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias -- where answers can be inferred from visual data alone -- and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE (Diagnostic Audio Visual Evaluation), a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled challenges. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models. The dataset is released at: https://github.com/gorjanradevski/dave
Problem

Research questions and friction points this paper is trying to address.

Addresses visual bias in audio-visual benchmarks
Decouples evaluation into atomic subcategories
Identifies specific failure modes in audio-visual models
Innovation

Methods, ideas, or system contributions that make the work stand out.

DAVE ensures both audio and visual data are essential.
DAVE decouples evaluation into specific subcategories.
DAVE provides detailed failure analysis for model improvement.
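The decoupled evaluation idea above can be sketched in a few lines: instead of one aggregate accuracy, each prediction is tagged with the atomic subcategory it tests, and scores are reported per subcategory. This is an illustrative sketch, not the paper's actual toolkit; the function name, the subcategory labels, and the `(subcategory, correct)` input format are all assumptions for demonstration.

```python
from collections import defaultdict


def diagnostic_report(results):
    """Aggregate per-example correctness into per-subcategory accuracies.

    `results` is a list of (subcategory, correct) pairs, where the
    subcategory labels (e.g. "visual_perception", "audio_interpretation",
    "audio_visual_alignment") are hypothetical stand-ins for DAVE's
    atomic subcategories. Returning one score per subcategory, rather
    than a single aggregate, is what lets a benchmark localize whether
    a model fails on vision, audio, or cross-modal alignment.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subcategory, correct in results:
        totals[subcategory] += 1
        hits[subcategory] += int(correct)
    return {sub: hits[sub] / totals[sub] for sub in totals}


# Example: a model that handles vision well but fails alignment.
report = diagnostic_report([
    ("visual_perception", True),
    ("visual_perception", True),
    ("audio_visual_alignment", False),
    ("audio_visual_alignment", True),
])
```

An aggregate score over these four examples would be 0.75 and hide the imbalance; the per-subcategory report makes the alignment weakness visible.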