🤖 AI Summary
To address the challenge of localizing sounding objects amid visually similar silent ones, this paper proposes an audio-visual joint localization method leveraging multimodal large language models (MLLMs). The method introduces two novel losses: (1) object-aware contrastive alignment (OCA) loss, which enhances semantic separability between sounding foregrounds and silent backgrounds in the cross-modal embedding space; and (2) object region isolation (ORI) loss, which constrains visual attention to genuine sound-source regions. Crucially, it is the first to incorporate fine-grained semantic context generated by MLLMs into sound source localization, significantly improving cross-modal alignment accuracy and object discriminability in complex scenes. Evaluated on MUSIC and VGGSound, the method achieves state-of-the-art performance under both single- and multi-source settings, with substantial gains in localization accuracy over prior approaches.
📝 Abstract
The audio-visual sound source localization task aims to spatially localize sound-making objects within visual scenes by integrating visual and audio cues. However, existing methods struggle to accurately localize sound-making objects in complex scenes, particularly when visually similar silent objects coexist. This limitation arises primarily from their reliance on simple audio-visual correspondence, which does not capture fine-grained semantic differences between sound-making and silent objects. To address these challenges, we propose a novel sound source localization framework leveraging Multimodal Large Language Models (MLLMs) to generate detailed contextual information that explicitly distinguishes sound-making foreground objects from silent background objects. To effectively integrate this detailed information, we introduce two novel loss functions: Object-aware Contrastive Alignment (OCA) loss and Object Region Isolation (ORI) loss. Extensive experimental results on the MUSIC and VGGSound datasets demonstrate the effectiveness of our approach, which significantly outperforms existing methods in both single-source and multi-source localization scenarios. Code and the generated detailed contextual information are available at: https://github.com/VisualAIKHU/OA-SSL.
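The abstract does not give the exact form of the OCA loss, but an object-aware contrastive alignment objective of the kind described can be sketched as a temperature-scaled InfoNCE-style loss: the audio embedding is pulled toward embeddings of sounding (foreground) objects and pushed away from silent (background) objects. The function name, temperature value, and masking scheme below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def oca_loss(audio_emb, obj_embs, sounding_mask, tau=0.07):
    """Hypothetical sketch of an object-aware contrastive alignment loss.

    audio_emb:     (d,) audio embedding for the clip
    obj_embs:      (n, d) embeddings of candidate visual objects
    sounding_mask: (n,) boolean mask, True for sounding (positive) objects
    tau:           temperature (assumed value; not from the paper)
    """
    # L2-normalise so dot products are cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb)
    v = obj_embs / np.linalg.norm(obj_embs, axis=1, keepdims=True)
    sims = (v @ a) / tau                       # similarity of each object to the audio
    log_p = sims - np.log(np.exp(sims).sum())  # log-softmax over all candidate objects
    # average negative log-likelihood over the sounding (positive) objects:
    # silent objects only appear in the softmax denominator, so they act as negatives
    return -log_p[sounding_mask].mean()

# Toy usage: 4 candidate objects, 2 of them sounding
rng = np.random.default_rng(0)
audio = rng.normal(size=8)
objs = rng.normal(size=(4, 8))
mask = np.array([True, False, False, True])
loss = oca_loss(audio, objs, mask)
print(f"OCA loss (sketch): {loss:.4f}")
```

Under this formulation, aligning the sounding-object embeddings with the audio embedding drives the loss down, while similarity to silent objects drives it up, matching the stated goal of enhancing semantic separability between sounding foregrounds and silent backgrounds.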