🤖 AI Summary
To address three key challenges in audio-visual sound source localization (AV-SSL) under multi-source scenarios—poor target-source selectivity, misalignment between semantic visual features and spatial acoustic features, and excessive reliance on paired audio-visual data—this paper proposes Visual-Prompted Selective Direction-of-Arrival (VP-SelDoA). It introduces a novel cross-instance audio-visual learning (CI-AVL) paradigm built on semantic-level modality fusion, substantially reducing dependence on paired data. The method employs a Frequency-Temporal ConMamba architecture to generate target-selective masks, and a Semantic-Spatial Matching mechanism that integrates cross- and self-attention to align the heterogeneous semantic and spatial features. Evaluated on the authors' large-scale spatial audio dataset VGG-SSL, VP-SelDoA achieves a mean absolute error of 12.04° and a localization accuracy of 78.23%, outperforming state-of-the-art methods and demonstrating superior selectivity and generalization.
📝 Abstract
Audio-visual sound source localization (AV-SSL) identifies the position of a sound source by exploiting the complementary strengths of auditory and visual signals. However, existing AV-SSL methods encounter three major challenges: 1) inability to selectively isolate the target sound source in multi-source scenarios, 2) misalignment between semantic visual features and spatial acoustic features, and 3) overreliance on paired audio-visual data. To overcome these limitations, we introduce Cross-Instance Audio-Visual Localization (CI-AVL), a novel task that leverages images from different instances of the same sound event category to localize target sound sources, thereby reducing dependence on paired data while enhancing generalization capabilities. Our proposed VP-SelDoA tackles this challenging task through semantic-level modality fusion and employs a Frequency-Temporal ConMamba architecture to generate target-selective masks for sound isolation. We further develop a Semantic-Spatial Matching mechanism that aligns the heterogeneous semantic and spatial features via integrated cross- and self-attention mechanisms. To facilitate CI-AVL research, we construct a large-scale dataset named VGG-SSL, comprising 13,981 spatial audio clips across 296 sound event categories. Extensive experiments show that our proposed method outperforms state-of-the-art audio-visual localization methods, achieving a mean absolute error (MAE) of 12.04° and an accuracy (ACC) of 78.23%.
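The abstract describes aligning semantic visual features with spatial acoustic features via cross- and self-attention, then producing a target-selective mask. The paper's actual architecture is not reproduced here; below is a minimal NumPy sketch of that general pattern, with all dimensions, the sigmoid mask head, and the feature shapes being illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no projections, for brevity).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

# Hypothetical sizes: T audio time frames, N visual tokens from the prompt
# image, d shared embedding dimension.
rng = np.random.default_rng(0)
T, N, d = 8, 4, 16
audio = rng.standard_normal((T, d))    # stand-in for spatial acoustic features
visual = rng.standard_normal((N, d))   # stand-in for semantic visual features

# Cross-attention: each audio frame queries the visual semantics,
# injecting target-category information into the acoustic stream.
fused = attention(audio, visual, visual)

# Self-attention over the fused sequence refines the alignment across frames.
aligned = attention(fused, fused, fused)

# A sigmoid head (assumed here) turns aligned features into a per-frame
# selective mask in (0, 1), which gates the acoustic features.
w = rng.standard_normal(d) / np.sqrt(d)
mask = 1 / (1 + np.exp(-(aligned @ w)))
selected = audio * mask[:, None]
```

In the actual method the masking operates on frequency-temporal representations via the ConMamba blocks; this sketch only illustrates the attention-based semantic-spatial fusion and gating idea at a toy scale.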