Improving Sound Source Localization with Joint Slot Attention on Image and Audio

📅 2025-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses weakly supervised audio-visual sound source localization (SSL), where no precise spatial annotations are available. Existing contrastive learning approaches rely on global image and audio embeddings that are degraded by background noise and content irrelevant to the target. To overcome this, the paper proposes a framework combining joint slot attention with bidirectional cross-modal attention matching: an image-audio joint slot attention mechanism explicitly disentangles the target sound source from interfering components, while a bidirectional cross-modal local feature alignment module replaces conventional unidirectional embedding aggregation. Evaluated on three major SSL benchmarks, the method achieves state-of-the-art performance in both sound source localization accuracy and cross-modal retrieval, with significant improvements over prior methods.

📝 Abstract
Sound source localization (SSL) is the task of locating the source of sound within an image. Due to the lack of localization labels, the de facto standard in SSL has been to represent an image and audio as a single embedding vector each, and use them to learn SSL via contrastive learning. To this end, previous work samples one of the local image features as the image embedding and aggregates all local audio features to obtain the audio embedding, which is far from optimal due to the presence of noise and background irrelevant to the actual target in the input. We present a novel SSL method that addresses this chronic issue by joint slot attention on image and audio. To be specific, two slots competitively attend to image and audio features to decompose them into target and off-target representations, and only the target representations of image and audio are used for contrastive learning. We also introduce cross-modal attention matching to further align local features of image and audio. Our method achieved the best results in almost all settings on three public benchmarks for SSL, and substantially outperformed all prior work in cross-modal retrieval.
Problem

Research questions and friction points this paper is trying to address.

Improves sound source localization via joint slot attention
Addresses noise and irrelevant background in input features
Enhances cross-modal alignment for image and audio
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint slot attention on image and audio
Decompose features into target and off-target
Cross-modal attention matching for alignment
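To illustrate the core competition mechanism named above, here is a minimal NumPy sketch of slot attention with two slots (target vs. off-target). This is a toy simplification, not the paper's actual module: it omits the learned projections, GRU update, and layer normalization of standard slot attention, and all names (`joint_slot_attention`, `n_slots`, `n_iters`) are hypothetical. The key property it does reproduce is that the softmax is taken over the slot axis, so the two slots compete for each local feature.

```python
import numpy as np

def softmax(x, axis):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_slot_attention(feats, n_slots=2, n_iters=3, seed=0):
    """Toy slot attention over a set of local features.

    feats: (n_feats, dim) array of local image or audio features.
    Returns (slots, attn) where attn[(k, i)] is how strongly slot k
    claims feature i; columns of attn sum to 1 (slot competition).
    Hypothetical simplification: no learned maps or GRU update,
    slots are refined by attention-weighted means of the features.
    """
    rng = np.random.default_rng(seed)
    dim = feats.shape[1]
    slots = rng.normal(size=(n_slots, dim))  # random slot initialization
    for _ in range(n_iters):
        # attention logits: (n_slots, n_feats)
        logits = slots @ feats.T / np.sqrt(dim)
        # softmax over slots -> slots compete for each feature
        attn = softmax(logits, axis=0)
        # normalize per slot, then update each slot as a weighted mean
        w = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
        slots = w @ feats
    return slots, attn
```

With two well-separated feature clusters (e.g. "target" and "background" features), each slot converges toward one cluster, mirroring the target/off-target decomposition described above; in the paper this decomposition is applied jointly to image and audio features so the target slots of both modalities can be contrasted.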