AuralNet: Hierarchical Attention-based 3D Binaural Localization of Overlapping Speakers

📅 2025-06-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the problem of three-dimensional (3D) binaural sound source localization for overlapping speakers under noisy and reverberant conditions, without assuming prior knowledge of the number of sources. To this end, we propose a gated coarse-to-fine architecture. Methodologically, we introduce multi-head self-attention into 3D binaural signal modeling for the first time, integrate sector-based spatial partitioning with joint coarse classification and fine-grained regression, and design a masked multi-task loss function that enables adjustable spatial resolution and strong robustness. Compared to existing approaches, our method achieves significant improvements in both accuracy and stability for concurrent azimuth and elevation estimation. It attains state-of-the-art (SOTA) performance across diverse real-world noise–reverberation scenarios. The proposed framework establishes a new paradigm for calibration-free, low-latency 3D auditory perception.

Technology Category

Application Category

📝 Abstract
We propose AuralNet, a novel 3D multi-source binaural sound source localization approach that localizes overlapping sources in both azimuth and elevation without prior knowledge of the number of sources. AuralNet employs a gated coarse-tofine architecture, combining a coarse classification stage with a fine-grained regression stage, allowing for flexible spatial resolution through sector partitioning. The model incorporates a multi-head self-attention mechanism to capture spatial cues in binaural signals, enhancing robustness in noisy-reverberant environments. A masked multi-task loss function is designed to jointly optimize sound detection, azimuth, and elevation estimation. Extensive experiments in noisy-reverberant conditions demonstrate the superiority of AuralNet over recent methods
Problem

Research questions and friction points this paper is trying to address.

Localizes overlapping sound sources in 3D space without prior source count
Uses hierarchical attention to enhance robustness in noisy environments
Jointly optimizes detection and spatial estimation via multi-task learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gated coarse-to-fine hierarchical architecture
Multi-head self-attention for spatial cues
Masked multi-task loss joint optimization
🔎 Similar Papers
No similar papers found.
Linya Fu
Linya Fu
The Hong Kong Polytechnic University
Robot Audition
Y
Yu Liu
School of Automation and Intelligent Manufacturing, Southern University of Science and Technology (SUSTech), China
Z
Zhijie Liu
School of Automation and Intelligent Manufacturing, Southern University of Science and Technology (SUSTech), China
Z
Zedong Yang
School of Automation and Intelligent Manufacturing, Southern University of Science and Technology (SUSTech), China
Zhong-Qiu Wang
Zhong-Qiu Wang
Associate Professor, Southern University of Science and Technology
Computer AuditionSpeech SeparationMicrophone ArrayAudio Signal ProcessingDeep Learning
Youfu Li
Youfu Li
Professor of Mechanical Engineering, City University of Hong Kong
Robot visionvisual trackingrobot sensingmechatronics and automation
H
He Kong
School of Automation and Intelligent Manufacturing, Southern University of Science and Technology (SUSTech), China