🤖 AI Summary
To address the challenge of modeling long-range temporal dependencies in audio with intermittent vocal activity, where conventional Transformer-based models struggle, this paper proposes a vocal separation method built on the Mamba2 state-space model. The approach combines a frequency-band splitting strategy with a dual-path processing architecture, improving both long-sequence modeling capability and robustness to sparse vocal regions. By jointly preserving high-resolution time-frequency representations and capturing long-range dependencies efficiently, the method achieves a state-of-the-art cSDR of 11.03 dB on public benchmarks, along with substantial gains in uSDR. The core contribution is the first application of Mamba2 to singing voice separation, overcoming the temporal limitations of Transformers in modeling intermittently occurring vocals.
📝 Abstract
We introduce a new music source separation model tailored for accurate vocal isolation. Unlike Transformer-based approaches, which often fail to capture intermittently occurring vocals, our model leverages Mamba2, a recent state-space model, to better capture long-range temporal dependencies. To handle long input sequences efficiently, we combine a band-splitting strategy with a dual-path architecture. Experiments show that our approach outperforms recent state-of-the-art models, achieving a cSDR of 11.03 dB, the best reported to date, and delivering substantial gains in uSDR. Moreover, the model exhibits stable and consistent performance across varying input lengths and vocal occurrence patterns. These results demonstrate the effectiveness of Mamba-based models for high-resolution audio processing and open up new directions for broader applications in audio research.
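For intuition, below is a minimal PyTorch sketch of the band-split plus dual-path pattern the abstract describes: the frequency axis of the spectrogram is split into bands, each band is embedded, and blocks then alternate between modeling along time within each band and across bands within each time frame. The band boundaries, embedding dimension, and module names are illustrative assumptions, and bidirectional GRUs stand in for the paper's Mamba2 layers so the example runs anywhere; this is a sketch of the general architecture pattern, not the authors' implementation.

```python
import torch
import torch.nn as nn


class BandSplitEncoder(nn.Module):
    """Split the STFT frequency axis into bands and embed each band."""

    def __init__(self, band_edges: list[int], dim: int):
        super().__init__()
        self.band_edges = band_edges  # hypothetical bin boundaries, e.g. [0, 64, 128, 257]
        widths = [band_edges[i + 1] - band_edges[i] for i in range(len(band_edges) - 1)]
        # One linear embedding per band: flattened real+imag bins -> dim.
        self.embed = nn.ModuleList(nn.Linear(2 * w, dim) for w in widths)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, freq, time, 2) with real/imag channels.
        bands = []
        for i, proj in enumerate(self.embed):
            lo, hi = self.band_edges[i], self.band_edges[i + 1]
            chunk = spec[:, lo:hi].permute(0, 2, 1, 3).flatten(2)  # (b, t, 2w)
            bands.append(proj(chunk))
        return torch.stack(bands, dim=1)  # (batch, bands, time, dim)


class DualPathBlock(nn.Module):
    """One dual-path block: sequence modeling along time within each band,
    then across bands within each time frame. GRUs are stand-ins; in the
    paper's setting Mamba2 state-space layers would take their place."""

    def __init__(self, dim: int):
        super().__init__()
        self.time_norm = nn.LayerNorm(dim)
        self.time_rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.time_proj = nn.Linear(2 * dim, dim)
        self.band_norm = nn.LayerNorm(dim)
        self.band_rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.band_proj = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, bands, time, dim)
        b, k, t, d = x.shape
        # Intra-band path: long-range modeling along the time axis.
        y = self.time_norm(x).reshape(b * k, t, d)
        y = self.time_proj(self.time_rnn(y)[0]).reshape(b, k, t, d)
        x = x + y  # residual connection
        # Inter-band path: exchange information across frequency bands.
        y = self.band_norm(x).permute(0, 2, 1, 3).reshape(b * t, k, d)
        y = self.band_proj(self.band_rnn(y)[0]).reshape(b, t, k, d)
        return x + y.permute(0, 2, 1, 3)


if __name__ == "__main__":
    spec = torch.randn(1, 257, 100, 2)             # toy complex spectrogram
    enc = BandSplitEncoder([0, 64, 128, 257], 32)  # assumed band layout
    block = DualPathBlock(32)
    print(block(enc(spec)).shape)                  # torch.Size([1, 3, 100, 32])
```

Because each path only ever sees one axis at a time (time steps within a band, or bands within a frame), very long mixtures decompose into many short sequences, which is what makes the dual-path layout pair well with a linear-time sequence model such as Mamba2 in place of quadratic-cost attention.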