🤖 AI Summary
To address low multi-channel sound source localization accuracy in complex acoustic environments, this paper proposes a bidirectional Mamba architecture that fuses time-domain and frequency-domain features. It is the first work to introduce the state-space model Mamba into sound source localization, establishing a time–frequency dual-path bidirectional modeling framework that jointly represents the spatial information in multi-channel audio. Departing from conventional recurrent or self-attention mechanisms, the method leverages Mamba's linear computational complexity and long-range modeling capability to significantly enhance the robustness of spatial feature extraction. Evaluated on both synthetic data and the real-world LOCATA dataset, the proposed approach reduces localization error by 15%–28% compared to state-of-the-art LSTM- and Transformer-based methods, demonstrating superior effectiveness and generalization.
📝 Abstract
Sound source localization (SSL) determines the position of sound sources from multi-channel audio data and is commonly used to improve speech enhancement and separation. Extracting spatial features is crucial for SSL, especially in challenging acoustic environments. Previous studies performed well using long short-term memory (LSTM) models. Recently, a scalable state-space model (SSM) called Mamba has demonstrated notable performance across various sequence-based modalities, including audio and speech. This study introduces Mamba to the SSL task. We design a Mamba-based model that analyzes spatial features in speech signals by fusing time and frequency features, and we develop an SSL system called TF-Mamba. The system integrates time and frequency fusion, with bidirectional Mamba blocks handling both time-wise and frequency-wise processing. We conduct experiments on a simulated dataset and the real-world LOCATA dataset. Results show that TF-Mamba significantly outperforms other advanced methods on both simulated and real-world data.