🤖 AI Summary
Traditional beamforming offers interpretability but suffers from high computational cost and poor generalization; supervised deep learning achieves efficiency and robustness yet depends heavily on large-scale labeled data and lacks interpretability. To address these limitations, we propose the self-supervised Latent Acoustic Mapping (LAM) model, the first to bring self-supervised learning to acoustic mapping. LAM integrates physical priors with end-to-end deep learning to perform high-resolution direction-of-arrival estimation without annotated data, supports diverse microphone array configurations and complex acoustic environments, and produces latent acoustic maps that are physically interpretable and generalizable enough to serve as universal acoustic features for downstream tasks. Evaluated on the LOCATA and STARSS benchmarks, LAM achieves localization accuracy comparable to or exceeding state-of-the-art supervised methods, while demonstrating exceptional cross-device adaptability.
📝 Abstract
Acoustic mapping techniques have long been used in spatial audio processing for direction-of-arrival estimation (DoAE). Traditional beamforming methods for acoustic mapping, while interpretable, often rely on iterative solvers that are computationally intensive and sensitive to acoustic variability. Recent supervised deep learning approaches, by contrast, offer feedforward speed and robustness but require large labeled datasets and lack interpretability. Despite their respective strengths, both families of methods struggle to generalize consistently across diverse acoustic setups and array configurations, limiting their broader applicability. We introduce the Latent Acoustic Mapping (LAM) model, a self-supervised framework that combines the interpretability of traditional methods with the adaptability and efficiency of deep learning. LAM generates high-resolution acoustic maps, adapts to varying acoustic conditions, and operates efficiently across different microphone arrays. We assess its robustness on DoAE using the LOCATA and STARSS benchmarks, where LAM achieves localization performance comparable or superior to existing supervised methods. Additionally, we show that LAM's acoustic maps can serve as effective features for supervised models, further enhancing DoAE accuracy and underscoring its potential to advance adaptive, high-performance sound localization systems.
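For readers unfamiliar with the traditional acoustic-mapping baseline the abstract contrasts against, a classical steered-response power map with PHAT weighting (SRP-PHAT) can be sketched in a few lines. This is a minimal, illustrative sketch, not the LAM model: it assumes a planar microphone array, a far-field plane-wave source, and an azimuth-only search grid, and all function names, array geometry, and parameters here are our own choices.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau):
    """GCC-PHAT cross-correlation; returns lags -max_shift..max_shift."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12                  # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = int(np.ceil(fs * max_tau))  # physically possible lag range
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return cc, max_shift

def srp_map(signals, mic_pos, fs, c=343.0, n_az=360):
    """Steered-response power over azimuth for a planar far-field model."""
    az = np.linspace(0.0, 2 * np.pi, n_az, endpoint=False)
    u = np.stack([np.cos(az), np.sin(az)], axis=1)  # candidate directions
    power = np.zeros(n_az)
    m = len(signals)
    for i in range(m):
        for j in range(i + 1, m):
            d = mic_pos[i] - mic_pos[j]
            max_tau = np.linalg.norm(d) / c
            cc, max_shift = gcc_phat(signals[i], signals[j], fs, max_tau)
            # expected TDOA of a plane wave from each candidate direction
            lag = np.round(-fs * (u @ d) / c).astype(int) + max_shift
            power += cc[np.clip(lag, 0, len(cc) - 1)]
    return az, power

# Toy demo: 4-mic square array, white-noise source at 60 degrees azimuth.
fs, c, r = 16000, 343.0, 0.5
mic_pos = np.array([[r, 0.0], [0.0, r], [-r, 0.0], [0.0, -r]])
rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
u_true = np.array([np.cos(np.deg2rad(60.0)), np.sin(np.deg2rad(60.0))])
delays = np.round(-fs * (mic_pos @ u_true) / c).astype(int)
signals = [np.roll(s, k) for k in delays]   # integer-sample plane-wave delays
az, power = srp_map(signals, mic_pos, fs)
az_est = np.degrees(az[np.argmax(power)])   # map should peak near 60 degrees
```

The map `power` over `az` is exactly the kind of interpretable acoustic map the abstract refers to: its peaks mark candidate source directions, but computing it requires an exhaustive grid search over directions and all microphone pairs, which is the computational cost LAM's feedforward latent maps aim to avoid.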