🤖 AI Summary
To address three key challenges in audio-visual sound source localization (AV-SSL) under multi-source scenarios—poor target-source selectivity, misalignment between semantic visual features and spatial acoustic features, and excessive reliance on paired audio-visual data—this paper proposes Visual-Prompted Selective Direction-of-Arrival (VP-SelDoA). It introduces a novel cross-instance audio-visual learning (CI-AVL) paradigm built on semantic-level modality fusion, substantially reducing dependence on paired data. The method employs a Frequency-Temporal ConMamba architecture to generate target-selective masks, and a Semantic-Spatial Matching mechanism that integrates cross- and self-attention to align the heterogeneous semantic and spatial features. Evaluated on the authors' large-scale spatial audio dataset VGG-SSL, VP-SelDoA achieves a mean absolute error of 12.04° and a localization accuracy of 78.23%, outperforming state-of-the-art methods and demonstrating superior selectivity and generalization.
📝 Abstract
Audio-visual sound source localization (AV-SSL) identifies the position of a sound source by exploiting the complementary strengths of auditory and visual signals. However, existing AV-SSL methods encounter three major challenges: 1) inability to selectively isolate the target sound source in multi-source scenarios, 2) misalignment between semantic visual features and spatial acoustic features, and 3) overreliance on paired audio-visual data. To overcome these limitations, we introduce Cross-Instance Audio-Visual Localization (CI-AVL), a novel task that leverages images from different instances of the same sound event category to localize target sound sources, thereby reducing dependence on paired data while enhancing generalization capabilities. Our proposed VP-SelDoA tackles this challenging task through semantic-level modality fusion and employs a Frequency-Temporal ConMamba architecture to generate target-selective masks for sound isolation. We further develop a Semantic-Spatial Matching mechanism that aligns the heterogeneous semantic and spatial features via integrated cross- and self-attention mechanisms. To facilitate CI-AVL research, we construct a large-scale dataset named VGG-SSL, comprising 13,981 spatial audio clips across 296 sound event categories. Extensive experiments show that our proposed method outperforms state-of-the-art audio-visual localization methods, achieving a mean absolute error (MAE) of 12.04° and an accuracy (ACC) of 78.23%.
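The abstract describes aligning semantic visual features with spatial acoustic features via cross- and self-attention, then producing a target-selective mask. The paper's actual architecture is not reproduced here; below is a minimal NumPy sketch of that general pattern, with all dimensions, the sigmoid mask head, and the feature shapes being illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no projections, for brevity).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

# Hypothetical sizes: T audio time frames, N visual tokens from the prompt
# image, d shared embedding dimension.
rng = np.random.default_rng(0)
T, N, d = 8, 4, 16
audio = rng.standard_normal((T, d))    # stand-in for spatial acoustic features
visual = rng.standard_normal((N, d))   # stand-in for semantic visual features

# Cross-attention: each audio frame queries the visual semantics,
# injecting target-category information into the acoustic stream.
fused = attention(audio, visual, visual)

# Self-attention over the fused sequence refines the alignment across frames.
aligned = attention(fused, fused, fused)

# A sigmoid head (assumed here) turns aligned features into a per-frame
# selective mask in (0, 1), which gates the acoustic features.
w = rng.standard_normal(d) / np.sqrt(d)
mask = 1 / (1 + np.exp(-(aligned @ w)))
selected = audio * mask[:, None]
```

In the actual method the masking operates on frequency-temporal representations via the ConMamba blocks; this sketch only illustrates the attention-based semantic-spatial fusion and gating idea at a toy scale.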