AuralNet: Hierarchical Attention-based 3D Binaural Localization of Overlapping Speakers

📅 2025-06-03

🤖 AI Summary

This work addresses three-dimensional (3D) binaural sound source localization of overlapping speakers under noisy and reverberant conditions, without assuming prior knowledge of the number of sources. It proposes AuralNet, a gated coarse-to-fine architecture: a coarse classification stage operates over sector-based spatial partitions, and a fine-grained regression stage refines the estimate, with the sector partitioning providing adjustable spatial resolution. Multi-head self-attention is applied to the binaural signals to capture spatial cues, and a masked multi-task loss jointly optimizes sound detection, azimuth estimation, and elevation estimation. In experiments under noisy-reverberant conditions, the authors report that AuralNet outperforms recent methods at joint azimuth and elevation estimation.

📝 Abstract

We propose AuralNet, a novel 3D multi-source binaural sound source localization approach that localizes overlapping sources in both azimuth and elevation without prior knowledge of the number of sources. AuralNet employs a gated coarse-to-fine architecture, combining a coarse classification stage with a fine-grained regression stage, allowing for flexible spatial resolution through sector partitioning. The model incorporates a multi-head self-attention mechanism to capture spatial cues in binaural signals, enhancing robustness in noisy-reverberant environments. A masked multi-task loss function is designed to jointly optimize sound detection, azimuth, and elevation estimation. Extensive experiments in noisy-reverberant conditions demonstrate the superiority of AuralNet over recent methods.
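The abstract does not give the decoding details, so the following is only a minimal Python sketch of how a gated coarse-to-fine readout over sectors could work for azimuth. The function name `decode_sectors`, the uniform sector layout, the 0.5 gate threshold, and the convention that the fine stage outputs an offset in [-1, 1] are all assumptions made for illustration, not taken from the paper.

```python
import numpy as np


def decode_sectors(probs, offsets, az_min=-180.0, az_max=180.0, threshold=0.5):
    """Turn per-sector outputs into azimuth estimates in degrees.

    probs   : coarse-stage probability that each sector contains a source
    offsets : fine-stage offset per sector, assumed to lie in [-1, 1]
    All names and conventions here are hypothetical.
    """
    probs = np.asarray(probs, dtype=float)
    offsets = np.asarray(offsets, dtype=float)

    n_sectors = probs.shape[0]
    width = (az_max - az_min) / n_sectors            # more sectors = finer coarse grid
    centers = az_min + width * (np.arange(n_sectors) + 0.5)

    # Gate: only sectors the coarse stage marks as active produce an estimate.
    active = probs >= threshold

    # Fine stage: shift each sector centre by up to half a sector width.
    angles = centers + np.clip(offsets, -1.0, 1.0) * (width / 2.0)
    return angles[active]
```

For example, with four 90-degree sectors, `decode_sectors([0.1, 0.9, 0.2, 0.8], [0.0, 0.5, 0.0, -1.0])` yields two azimuths, -22.5 and 90.0 degrees. The source count falls out of how many sectors pass the gate rather than being supplied in advance, which is one way to read the "without prior knowledge of the number of sources" property.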

Problem

Research questions and friction points this paper is trying to address.

Localizing overlapping sound sources in 3D space without a prior source count

Enhancing robustness in noisy environments through hierarchical attention

Jointly optimizing detection and spatial estimation via multi-task learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gated coarse-to-fine hierarchical architecture

Multi-head self-attention for spatial cues

Joint optimization with a masked multi-task loss
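As a rough illustration of the masked multi-task idea, the sketch below combines a detection term with azimuth and elevation regression terms that are counted only in sectors where a source is actually present. The exact loss used by AuralNet is not given in this summary, so the choice of binary cross-entropy plus squared error, the weights `w_az` and `w_el`, and the function name are assumptions.

```python
import numpy as np


def masked_multitask_loss(det_prob, az_pred, el_pred,
                          det_true, az_true, el_true,
                          w_az=1.0, w_el=1.0, eps=1e-7):
    """Detection loss plus regression losses masked by ground-truth activity.

    All inputs are 1-D arrays with one entry per sector (hypothetical layout).
    """
    det_prob = np.clip(np.asarray(det_prob, dtype=float), eps, 1.0 - eps)
    det_true = np.asarray(det_true, dtype=float)
    az_pred, az_true = np.asarray(az_pred, dtype=float), np.asarray(az_true, dtype=float)
    el_pred, el_true = np.asarray(el_pred, dtype=float), np.asarray(el_true, dtype=float)

    # Detection: binary cross-entropy averaged over every sector.
    bce = -np.mean(det_true * np.log(det_prob)
                   + (1.0 - det_true) * np.log(1.0 - det_prob))

    # Mask: regression errors count only where a source is truly present.
    mask = det_true
    n_active = max(mask.sum(), 1.0)                  # avoid division by zero
    az_loss = np.sum(mask * (az_pred - az_true) ** 2) / n_active
    el_loss = np.sum(mask * (el_pred - el_true) ** 2) / n_active

    return bce + w_az * az_loss + w_el * el_loss
```

The point of the mask is that angle predictions in empty sectors have no target, so they should not contribute to the loss. In this sketch, changing `az_pred` or `el_pred` in a sector where `det_true` is 0 leaves the loss unchanged.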
