🤖 AI Summary
The VGGSound benchmark suffers from label incompleteness, category overlap, and audio-visual misalignment, leading to distorted evaluation of audio-visual models. To address this, we propose VGGSounder—a rigorously reconstructed test set featuring human-curated, fine-grained, multi-label annotations that explicitly decouple auditory and visual cues. We further introduce the Modality Confusion Index (MCI), a quantitative metric measuring performance degradation under cross-modal interference. Within an audio-visual contrastive learning framework, we systematically evaluate mainstream models on VGGSounder. Experiments reveal that existing models consistently degrade in performance when redundant modalities are introduced, exposing critical bottlenecks in cross-modal understanding. VGGSounder thus provides a more reliable, interpretable, and rigorous evaluation benchmark for foundational multimodal models—enabling precise diagnosis of modality-specific reasoning capabilities and inter-modal integration fidelity.
📝 Abstract
The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, using our new modality confusion metric, we reveal model limitations by analysing the performance degradation that occurs when another input modality is added.
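To make the intuition behind the modality confusion idea concrete, below is a minimal sketch of how one could measure the accuracy drop that occurs when a second modality is added to the input. This is an illustrative assumption, not the paper's exact metric definition; the function names and toy data are hypothetical.

```python
import numpy as np

def accuracy(preds: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of samples whose predicted class is among the ground-truth labels.

    `preds` holds one predicted class index per sample; `labels` is a binary
    multi-label matrix of shape (num_samples, num_classes).
    """
    return float(np.mean(labels[np.arange(len(preds)), preds] > 0))

def modality_confusion(preds_single: np.ndarray,
                       preds_both: np.ndarray,
                       labels: np.ndarray) -> float:
    """Accuracy drop when the second modality is added to the input.

    A positive value means the extra modality confused the model: samples it
    classified correctly from one modality alone are classified worse once the
    other modality is also present. (Hypothetical formulation for illustration.)
    """
    return accuracy(preds_single, labels) - accuracy(preds_both, labels)

# Toy usage: 4 samples, 3 classes, multi-label ground truth.
labels = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [0, 0, 1],
                   [1, 1, 0]])
preds_audio_only  = np.array([0, 1, 2, 0])   # predictions from audio alone
preds_audio_video = np.array([0, 0, 2, 0])   # predictions with video added

print(modality_confusion(preds_audio_only, preds_audio_video, labels))  # 0.25
```

In this toy example the model answers all four samples correctly from audio alone but misclassifies one once video is added, so the sketched confusion score is 0.25, i.e. a 25-point accuracy drop attributable to the added modality.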