Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization

📅 2025-05-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a significant visual dominance bias in multimodal AI for sound source localization, causing substantial performance degradation—far below human capabilities—under audio-visual conflict or auditory-only conditions. To address this, the authors introduce the first 3D-simulated binaural-audio–image dataset explicitly designed for modality bias analysis, fine-tune state-of-the-art models (e.g., Audio-Visual Transformer), and develop a psychophysical experimental paradigm for rigorous human-AI comparison. The study provides the first systematic characterization of AI’s modality preference mechanisms and proposes a human-level azimuthal localization model inspired by interaural time difference (ITD) computation and anatomical alignment with human pinnae. Experiments demonstrate that the optimized model achieves a 42% improvement in localization accuracy under audio-visual conflict, successfully replicates the human left-right accuracy asymmetry, and exhibits markedly enhanced robustness.

Technology Category

Application Category

📝 Abstract
Imagine hearing a dog bark and turning toward the sound only to see a parked car, while the real, silent dog sits elsewhere. Such sensory conflicts test perception, yet humans reliably resolve them by prioritizing sound over misleading visuals. Despite advances in multimodal AI integrating vision and audio, little is known about how these systems handle cross-modal conflicts or whether they favor one modality. In this study, we systematically examine modality bias and conflict resolution in AI sound localization. We assess leading multimodal models and benchmark them against human performance in psychophysics experiments across six audiovisual conditions, including congruent, conflicting, and absent cues. Humans consistently outperform AI, demonstrating superior resilience to conflicting or missing visuals by relying on auditory information. In contrast, AI models often default to visual input, degrading performance to near chance levels. To address this, we finetune a state-of-the-art model using a stereo audio-image dataset generated via 3D simulations. Even with limited training data, the refined model surpasses existing benchmarks. Notably, it also mirrors human-like horizontal localization bias favoring left-right precision-likely due to the stereo audio structure reflecting human ear placement. These findings underscore how sensory input quality and system architecture shape multimodal representation accuracy.
Problem

Research questions and friction points this paper is trying to address.

AI models struggle with cross-modal conflicts in sound localization
Humans outperform AI in resolving conflicting audiovisual cues
Current AI models overly rely on visual input, reducing accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Finetuning AI with stereo audio-image dataset
Using 3D simulations for training data generation
Improving sound localization via human-like bias
🔎 Similar Papers
No similar papers found.
Yanhao Jia
Yanhao Jia
Nanyang Technological University
Artificial IntelligenceDeep LearningComputational Neuroscience
Ji Xie
Ji Xie
Research Intern, UC Berkeley
Computer VisionImage GenerationMulti-Modal
S
S. Jivaganesh
College of Computing and Data Science, Nanyang Technological University, Singapore
H
Hao Li
School of Electronic and Computer Engineering, Peking University, China
X
Xu Wu
College of Computing and Data Science, Nanyang Technological University, Singapore; College of Computer Science and Software Engineering, Shenzhen University, China
Mengmi Zhang
Mengmi Zhang
Assistant professor and PI of Deep NeuroCognition Lab, Nanyang Technological University, Singapore
neuroscience-inspired AIcomputer visioncomputational neurosciencecognitive science