🤖 AI Summary
This work addresses the limitation of existing Mamba-based speech separation approaches, which model spectrograms along only a single dimension and thus struggle to capture two-dimensional global dependencies. To overcome this, the study introduces omnidirectional attention into the Mamba architecture for the first time, enabling efficient modeling of global time-frequency dependencies by traversing the spectrogram in ten distinct directions. The proposed method integrates selective state space models with omnidirectional attention to construct an end-to-end time-frequency domain speech separation system. Evaluated on three public datasets, it consistently outperforms current baselines and state-of-the-art methods, demonstrating both its effectiveness and scalability while maintaining linear computational complexity.
📝 Abstract
Mamba, a selective state-space model (SSM), has emerged as an efficient alternative to Transformers for speech modeling, enabling long-sequence processing with linear complexity. While effective in speech separation, existing approaches, whether in the time or time-frequency domain, typically decompose the input along a single dimension into short one-dimensional sequences before processing them with Mamba, which restricts Mamba to local 1D modeling and limits its ability to capture global dependencies across the 2D spectrogram. In this work, we propose an efficient omni-directional attention (OA) mechanism built upon unidirectional Mamba, which models global dependencies across the spectrogram from ten different directions. We integrate the proposed mechanism into two baseline separation models and evaluate them on three public datasets. Experimental results show that our approach consistently achieves significant performance gains over the baselines while preserving linear complexity, outperforming existing state-of-the-art (SOTA) systems.
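The core idea of multi-directional scanning can be sketched as follows: flatten the 2D (time × frequency) spectrogram into several 1D sequences, each following a different traversal order, so that a unidirectional sequence model like Mamba sees global 2D context one direction at a time. This is a minimal NumPy sketch under stated assumptions: the five scan orders shown (time forward/backward, frequency forward/backward, one diagonal sweep) are illustrative choices, not the paper's actual direction set, which comprises ten directions the abstract does not enumerate.

```python
import numpy as np

def directional_scans(spec: np.ndarray) -> dict:
    """Flatten a (time x freq) spectrogram into 1D sequences along several
    hypothetical scan directions. Each sequence visits every bin exactly
    once, just in a different global order."""
    T, F = spec.shape
    # Diagonal sweep: concatenate diagonals from lower-left to upper-right.
    diag_fwd = np.concatenate(
        [np.diagonal(spec, offset=k) for k in range(-(T - 1), F)]
    )
    return {
        "time_fwd": spec.reshape(-1),        # row-major: frame by frame, low to high freq
        "time_bwd": spec.reshape(-1)[::-1],  # same path, reversed
        "freq_fwd": spec.T.reshape(-1),      # column-major: bin by bin, over time
        "freq_bwd": spec.T.reshape(-1)[::-1],
        "diag_fwd": diag_fwd,
    }

# Example: a tiny 2x3 "spectrogram" with entries 0..5
spec = np.arange(6).reshape(2, 3)
scans = directional_scans(spec)
```

Each direction yields a different permutation of the same T·F bins; a 1D SSM run over each sequence (with outputs fused afterwards) is one plausible way such a mechanism could expose global time-frequency dependencies to a model that is otherwise strictly unidirectional.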