🤖 AI Summary
To address the challenge of modeling long-range temporal dependencies in audio with intermittent vocal activity, where conventional Transformer-based models struggle, this paper proposes a vocal separation method built on the Mamba2 state-space model. The approach combines a frequency-band splitting strategy with a dual-path processing architecture, improving both long-sequence modeling capability and robustness to sparse vocal regions. By jointly preserving high-resolution time-frequency representations and capturing long-range dependencies efficiently, the method achieves a state-of-the-art cSDR of 11.03 dB on public benchmarks, along with substantial gains in uSDR. The core contribution is the first application of Mamba2 to singing voice separation, overcoming the temporal limitations of Transformers in modeling intermittently occurring vocals.
📝 Abstract
We introduce a new music source separation model tailored for accurate vocal isolation. Unlike Transformer-based approaches, which often fail to capture intermittently occurring vocals, our model leverages Mamba2, a recent state-space model, to better capture long-range temporal dependencies. To handle long input sequences efficiently, we combine a band-splitting strategy with a dual-path architecture. Experiments show that our approach outperforms recent state-of-the-art models, achieving a cSDR of 11.03 dB, the best reported to date, and delivering substantial gains in uSDR. Moreover, the model exhibits stable and consistent performance across varying input lengths and vocal occurrence patterns. These results demonstrate the effectiveness of Mamba-based models for high-resolution audio processing and open up new directions for broader applications in audio research.
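For intuition, below is a minimal PyTorch sketch of the band-split plus dual-path pattern the abstract describes: the frequency axis of the spectrogram is split into bands, each band is embedded, and blocks then alternate between modeling along time within each band and across bands within each time frame. The band boundaries, embedding dimension, and module names are illustrative assumptions, and bidirectional GRUs stand in for the paper's Mamba2 layers so the example runs anywhere; this is a sketch of the general architecture pattern, not the authors' implementation.

```python
import torch
import torch.nn as nn


class BandSplitEncoder(nn.Module):
    """Split the STFT frequency axis into bands and embed each band."""

    def __init__(self, band_edges: list[int], dim: int):
        super().__init__()
        self.band_edges = band_edges  # hypothetical bin boundaries, e.g. [0, 64, 128, 257]
        widths = [band_edges[i + 1] - band_edges[i] for i in range(len(band_edges) - 1)]
        # One linear embedding per band: flattened real+imag bins -> dim.
        self.embed = nn.ModuleList(nn.Linear(2 * w, dim) for w in widths)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, freq, time, 2) with real/imag channels.
        bands = []
        for i, proj in enumerate(self.embed):
            lo, hi = self.band_edges[i], self.band_edges[i + 1]
            chunk = spec[:, lo:hi].permute(0, 2, 1, 3).flatten(2)  # (b, t, 2w)
            bands.append(proj(chunk))
        return torch.stack(bands, dim=1)  # (batch, bands, time, dim)


class DualPathBlock(nn.Module):
    """One dual-path block: sequence modeling along time within each band,
    then across bands within each time frame. GRUs are stand-ins; in the
    paper's setting Mamba2 state-space layers would take their place."""

    def __init__(self, dim: int):
        super().__init__()
        self.time_norm = nn.LayerNorm(dim)
        self.time_rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.time_proj = nn.Linear(2 * dim, dim)
        self.band_norm = nn.LayerNorm(dim)
        self.band_rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.band_proj = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, bands, time, dim)
        b, k, t, d = x.shape
        # Intra-band path: long-range modeling along the time axis.
        y = self.time_norm(x).reshape(b * k, t, d)
        y = self.time_proj(self.time_rnn(y)[0]).reshape(b, k, t, d)
        x = x + y  # residual connection
        # Inter-band path: exchange information across frequency bands.
        y = self.band_norm(x).permute(0, 2, 1, 3).reshape(b * t, k, d)
        y = self.band_proj(self.band_rnn(y)[0]).reshape(b, t, k, d)
        return x + y.permute(0, 2, 1, 3)


if __name__ == "__main__":
    spec = torch.randn(1, 257, 100, 2)             # toy complex spectrogram
    enc = BandSplitEncoder([0, 64, 128, 257], 32)  # assumed band layout
    block = DualPathBlock(32)
    print(block(enc(spec)).shape)                  # torch.Size([1, 3, 100, 32])
```

Because each path only ever sees one axis at a time (time steps within a band, or bands within a frame), very long mixtures decompose into many short sequences, which is what makes the dual-path layout pair well with a linear-time sequence model such as Mamba2 in place of quadratic-cost attention.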