Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning

📅 2026-04-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing sound source localization methods rely on contrastive learning for audio-visual alignment but lack explicit reasoning mechanisms, limiting their performance in complex acoustic environments. This work proposes the first training-free sound source localization framework, harnessing the metacognitive reasoning capabilities of multimodal large language models. It introduces a Generation-Analysis-Refinement (GAR) three-stage reasoning pipeline that integrates open-set role tagging, anchor voting, and adaptive gating to quantify audio-visual consistency and refine localization accuracy. Experimental results show that the method achieves competitive performance on both single-source and multi-source benchmarks, validating its robustness and generalization in challenging real-world scenarios.
📝 Abstract
The sound source localization (SSL) task aims to identify the locations of sound-emitting objects by leveraging correlations between audio and visual modalities. Most existing SSL methods rely on contrastive learning-based feature matching, but lack explicit reasoning and verification, limiting their effectiveness in complex acoustic scenes. Inspired by human meta-cognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies Audio-Visual Consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source and multi-source benchmarks demonstrate competitive performance. The source code is available at https://github.com/VisualAIKHU/GAR-SSL.
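The three-stage pipeline described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration: the function names, the averaging-based anchor-vote scoring, and the gate threshold and shrink factor are all assumptions for illustration, not the paper's actual MLLM prompts or interfaces.

```python
# Illustrative sketch of a Generation-Analysis-Refinement (GAR) loop.
# The MLLM calls are stubbed; only the control flow is shown.
from dataclasses import dataclass


@dataclass
class Candidate:
    box: tuple        # (x1, y1, x2, y2) bounding box from the Generation stage
    audio_label: str  # audio classification from the Generation stage


def generate(frame, audio):
    """Stage 1 (Generation): an MLLM proposes a bounding box and an audio
    class. Stubbed here with a fixed proposal for illustration."""
    return Candidate(box=(40, 30, 200, 180), audio_label="dog barking")


def anchor_vote_score(candidate, anchors):
    """Stage 2 (Analysis): each 'anchor' query asks whether the boxed object
    plausibly emits the classified sound; votes are averaged into an
    audio-visual consistency (AVC) score in [0, 1]."""
    votes = [1.0 if anchor(candidate) else 0.0 for anchor in anchors]
    return sum(votes) / len(votes)


def refine(candidate, avc, gate=0.5, shrink=0.9):
    """Stage 3 (Refinement): adaptive gating skips refinement when the AVC
    score is already high, preventing unnecessary box adjustments."""
    if avc >= gate:
        return candidate.box  # consistent: keep the initial localization
    # Low consistency: illustrative adjustment that shrinks the box
    # toward its center (a stand-in for an MLLM-driven correction).
    x1, y1, x2, y2 = candidate.box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * shrink, (y2 - y1) * shrink
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Usage: two of three anchors agree, so the gate keeps the original box.
cand = generate(frame=None, audio=None)
avc = anchor_vote_score(cand, [lambda c: True, lambda c: True, lambda c: False])
final_box = refine(cand, avc)
```

The gating step is the key efficiency idea: refinement is only triggered when the consistency score falls below the gate, so confidently localized sources pass through unchanged.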
Problem

Research questions and friction points this paper is trying to address.

sound source localization
audio-visual consistency
reasoning
complex acoustic scenes
feature matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
Multimodal Large Language Models
meta-reasoning
Audio-Visual Consistency
Sound Source Localization