🤖 AI Summary
This work addresses the problem of three-dimensional (3D) binaural sound source localization for overlapping speakers under noisy and reverberant conditions, without assuming prior knowledge of the number of sources. To this end, we propose a gated coarse-to-fine architecture. Methodologically, we introduce multi-head self-attention into 3D binaural signal modeling for the first time, integrate sector-based spatial partitioning with joint coarse classification and fine-grained regression, and design a masked multi-task loss function that enables adjustable spatial resolution and strong robustness. Compared to existing approaches, our method achieves significant improvements in both accuracy and stability for concurrent azimuth and elevation estimation. It attains state-of-the-art (SOTA) performance across diverse real-world noise–reverberation scenarios. The proposed framework establishes a new paradigm for calibration-free, low-latency 3D auditory perception.
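The sector-based coarse-to-fine idea described above can be sketched as a simple decoding step: a coarse classifier selects a spatial sector, and a fine-grained regressor predicts an offset within it. The sector counts and angle conventions below are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sector layout (an assumption for illustration, not the
# paper's configuration): partition azimuth into N_AZ equal sectors and
# elevation into N_EL equal sectors.
N_AZ, N_EL = 8, 4                 # 8 x 4 = 32 sectors total
AZ_WIDTH = 360.0 / N_AZ           # degrees per azimuth sector
EL_WIDTH = 180.0 / N_EL           # degrees per elevation sector

def sector_to_angles(az_idx, el_idx, az_off, el_off):
    """Decode a coarse sector index plus fine offsets in [0, 1) into an
    absolute azimuth in [0, 360) and elevation in [-90, 90) degrees."""
    azimuth = (az_idx + az_off) * AZ_WIDTH
    elevation = -90.0 + (el_idx + el_off) * EL_WIDTH
    return azimuth, elevation
```

Choosing N_AZ and N_EL trades coarse-stage granularity against the range each regressor must cover, which is how sector partitioning yields an adjustable spatial resolution.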
📝 Abstract
We propose AuralNet, a novel 3D multi-source binaural sound source localization approach that localizes overlapping sources in both azimuth and elevation without prior knowledge of the number of sources. AuralNet employs a gated coarse-to-fine architecture, combining a coarse classification stage with a fine-grained regression stage, allowing for flexible spatial resolution through sector partitioning. The model incorporates a multi-head self-attention mechanism to capture spatial cues in binaural signals, enhancing robustness in noisy-reverberant environments. A masked multi-task loss function is designed to jointly optimize sound detection, azimuth, and elevation estimation. Extensive experiments in noisy-reverberant conditions demonstrate the superiority of AuralNet over recent methods.
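The masked multi-task loss can be illustrated with a minimal sketch: a per-sector detection term is always computed, while the azimuth/elevation regression terms are masked so they contribute only for sectors that actually contain a source. The specific choice of binary cross-entropy plus masked MSE is an assumption for illustration, not the paper's exact formulation.

```python
import math

def masked_multitask_loss(det_probs, az_pred, el_pred,
                          det_labels, az_true, el_true):
    """Illustrative masked multi-task loss (an assumed formulation):
    binary cross-entropy on per-sector source detection, plus MSE on
    azimuth/elevation predictions masked to active sectors only."""
    eps = 1e-7
    det_loss, reg_loss, n_active = 0.0, 0.0, 0
    for p, a_hat, e_hat, y, a, e in zip(det_probs, az_pred, el_pred,
                                        det_labels, az_true, el_true):
        p = min(max(p, eps), 1.0 - eps)        # clamp for numerical safety
        det_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        if y == 1:                             # mask: regress only where a source exists
            reg_loss += (a_hat - a) ** 2 + (e_hat - e) ** 2
            n_active += 1
    det_loss /= len(det_probs)
    if n_active:
        reg_loss /= n_active
    return det_loss + reg_loss
```

Masking the regression terms prevents empty sectors from pulling the angle estimates toward arbitrary targets, which is what lets a single network handle an unknown number of concurrent sources.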