Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes

📅 2025-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing audio-visual localization models struggle to simultaneously localize sound sources and discriminate their semantic types in scenes containing mixed speech and non-speech acoustic events. To address this, we propose a "mix-and-separate" joint learning framework that unifies audio-visual alignment (correspondence) and source-type disentanglement (discriminability) via multimodal contrastive learning and cross-modal embedding disentanglement. We introduce the first benchmark dataset supporting simultaneous grounding of mixed-audio scenes and design a joint visual-acoustic representation learning mechanism. Experiments demonstrate that our method achieves significant improvements over prior approaches on mixed-audio grounding, while attaining comparable or better performance on sound-source segmentation and cross-modal retrieval. The core innovation lies in formulating localization and type discrimination as a single, unified optimization objective, enabling end-to-end joint learning of the two tasks.
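
The summary does not spell out the alignment objective, so the snippet below is a minimal, hypothetical sketch of the "mix" half: additive waveform mixing and a standard symmetric InfoNCE loss between image and mixed-audio embeddings, assuming a PyTorch-style setup. Function names such as mix_audio and correspondence_loss are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def mix_audio(speech_wave, sound_wave):
    # Additive mixing of two equal-length waveforms (an assumption; the paper's
    # exact mixing strategy may differ).
    return speech_wave + sound_wave

def correspondence_loss(image_emb, mixed_audio_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch: matching image/audio pairs sit on the diagonal.
    image_emb = F.normalize(image_emb, dim=-1)
    mixed_audio_emb = F.normalize(mixed_audio_emb, dim=-1)
    logits = image_emb @ mixed_audio_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```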

📝 Abstract
We present a unified model capable of simultaneously grounding both spoken language and non-speech sounds within a visual scene, addressing key limitations in current audio-visual grounding models. Existing approaches are typically limited to handling either speech or non-speech sounds independently, or at best, together but sequentially without mixing. This limitation prevents them from capturing the complexity of real-world audio sources that are often mixed. Our approach introduces a 'mix-and-separate' framework with audio-visual alignment objectives that jointly learn correspondence and disentanglement using mixed audio. Through these objectives, our model learns to produce distinct embeddings for each audio type, enabling effective disentanglement and grounding across mixed audio sources. Additionally, we created a new dataset to evaluate simultaneous grounding of mixed audio sources, demonstrating that our model outperforms prior methods. Our approach also achieves comparable or better performance in standard segmentation and cross-modal retrieval tasks, highlighting the benefits of our mix-and-separate approach.
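
For the "separate" half, here is a hedged illustration of how distinct per-type embeddings might be produced and supervised: two assumed projection heads (SeparationHeads) split the mixed-audio embedding into speech and non-speech embeddings, which are pulled toward the embeddings of the corresponding unmixed clips and pushed away from the opposite type. All names and loss choices below are assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparationHeads(nn.Module):
    # Two projection heads that split a mixed-audio embedding into a speech
    # embedding and a non-speech embedding (purely illustrative architecture).
    def __init__(self, dim=512):
        super().__init__()
        self.speech_head = nn.Linear(dim, dim)
        self.sound_head = nn.Linear(dim, dim)

    def forward(self, mixed_emb):
        return self.speech_head(mixed_emb), self.sound_head(mixed_emb)

def disentanglement_loss(pred_speech, pred_sound, ref_speech, ref_sound):
    # Pull each separated embedding toward the embedding of its unmixed source...
    pull = (1 - F.cosine_similarity(pred_speech, ref_speech)).mean() \
         + (1 - F.cosine_similarity(pred_sound, ref_sound)).mean()
    # ...and discourage it from collapsing onto the other audio type.
    push = F.cosine_similarity(pred_speech, ref_sound).clamp(min=0).mean() \
         + F.cosine_similarity(pred_sound, ref_speech).clamp(min=0).mean()
    return pull + push
```

In this reading, the correspondence loss handles grounding while the disentanglement loss keeps speech and non-speech representations separable, which is how the abstract's joint correspondence-and-disentanglement objective could fit together.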
Problem

Research questions and friction points this paper is trying to address.

Grounding mixed speech and non-speech sounds in visual scenes
Disentangling mixed audio sources for accurate localization
Improving audio-visual alignment with a unified model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mix-and-separate framework for audio-visual alignment
Joint learning of correspondence and disentanglement
New dataset for evaluating mixed audio grounding