Mamba2 Meets Silence: Robust Vocal Source Separation for Sparse Regions

📅 2025-08-20
🤖 AI Summary
Conventional Transformer-based models struggle to model long-range temporal dependencies in audio where vocals occur only intermittently. To address this, the paper proposes a vocal separation method built on the Mamba2 state space model. The approach combines a band-splitting strategy with a dual-path architecture, which keeps long input sequences tractable and improves robustness in regions where vocals are sparse. By preserving high-resolution time-frequency representations while capturing long-range dependencies efficiently, the method reaches a state-of-the-art cSDR of 11.03 dB on public benchmarks, along with substantial gains in uSDR. The core contribution is the first application of Mamba2 to singing voice separation, addressing the difficulty Transformers have with discontinuous vocal signals.

📝 Abstract
We introduce a new music source separation model tailored for accurate vocal isolation. Unlike Transformer-based approaches, which often fail to capture intermittently occurring vocals, our model leverages Mamba2, a recent state space model, to better capture long-range temporal dependencies. To handle long input sequences efficiently, we combine a band-splitting strategy with a dual-path architecture. Experiments show that our approach outperforms recent state-of-the-art models, achieving a cSDR of 11.03 dB (the best reported to date) and delivering substantial gains in uSDR. Moreover, the model exhibits stable and consistent performance across varying input lengths and vocal occurrence patterns. These results demonstrate the effectiveness of Mamba-based models for high-resolution audio processing and open up new directions for broader applications in audio research.
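
The abstract reports cSDR and uSDR without defining them. As a rough illustration only, the sketch below assumes the common convention in which uSDR is a plain signal-to-distortion ratio over a whole track and cSDR is a median over chunks of about one second. The function names `sdr_db` and `chunked_median_sdr_db` are hypothetical, and the chunked version uses the plain SDR formula rather than the BSS Eval style computation that reference toolkits typically apply, so its values would not match published cSDR numbers exactly.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-8):
    """Plain signal-to-distortion ratio in dB: 10*log10(|s|^2 / |s - s_hat|^2)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    # eps keeps the ratio finite when either power is exactly zero
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))

def chunked_median_sdr_db(reference, estimate, sample_rate, chunk_seconds=1.0):
    """Median of per-chunk SDR values, skipping chunks whose reference is silent.

    Assumption: a simplified stand-in for a chunk-level metric, not the
    exact procedure used in the paper.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    chunk = int(sample_rate * chunk_seconds)
    scores = []
    for start in range(0, len(reference) - chunk + 1, chunk):
        ref = reference[start:start + chunk]
        if not np.any(ref):
            # SDR is not meaningful when the reference chunk has no energy
            continue
        scores.append(sdr_db(ref, estimate[start:start + chunk]))
    return float(np.median(scores)) if scores else float("nan")
```

Skipping silent reference chunks is one reason a chunk-level score and a whole-track score can diverge on tracks where vocals appear only intermittently, which is the setting the abstract emphasizes.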
Problem

Research questions and friction points this paper is trying to address.

Accurate vocal isolation in music source separation
Handling intermittently occurring vocals with long-range dependencies
Efficient processing of long input audio sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba2 state space model for vocals
Band-splitting dual-path architecture
Long-range temporal dependency capture
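
The band-splitting and dual-path ideas listed above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the band edges, feature size, and random projections are made up, and `linear_scan` is a toy causal recurrence used as a placeholder where the paper uses Mamba2 layers. All function names are hypothetical.

```python
import numpy as np

def split_bands(spec, band_edges):
    """Split a (freq, time) spectrogram into sub-bands along the frequency axis."""
    return [spec[lo:hi] for lo, hi in zip(band_edges[:-1], band_edges[1:])]

def band_features(spec, band_edges, proj):
    """Project every band to a shared feature size, giving (bands, time, dim)."""
    bands = split_bands(spec, band_edges)
    # band.T is (time, band_width); each w is (band_width, dim)
    return np.stack([band.T @ w for band, w in zip(bands, proj)], axis=0)

def linear_scan(x, decay=0.9):
    """Toy recurrence h[t] = decay * h[t-1] + x[t] along axis 0.

    Assumption: a placeholder sequence model, NOT Mamba2.
    """
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out[t] = h
    return out

def dual_path_block(feats, decay=0.9):
    """Apply a time path, then a band path, each with a residual connection."""
    # Time path: scan along time independently for each band
    time_first = np.moveaxis(feats, 1, 0)            # (time, bands, dim)
    feats = feats + np.moveaxis(linear_scan(time_first, decay), 0, 1)
    # Band path: scan across bands independently for each frame
    feats = feats + linear_scan(feats, decay)        # axis 0 is the band axis
    return feats

rng = np.random.default_rng(0)
spec = rng.standard_normal((16, 50))                 # 16 frequency bins, 50 frames
band_edges = [0, 2, 4, 8, 16]                        # hypothetical band layout
dim = 8
proj = [rng.standard_normal((hi - lo, dim))
        for lo, hi in zip(band_edges[:-1], band_edges[1:])]
out = dual_path_block(band_features(spec, band_edges, proj))
print(out.shape)  # (4, 50, 8)
```

Splitting the spectrum into a few bands shortens the sequence seen by the band path, while the time path carries state across frames, so each path handles a much shorter or simpler sequence than a model over the full time-frequency grid would.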