AI Summary
To address the quadratic computational complexity of full-sequence self-attention in source separation models such as Mel-Band-Roformer with respect to audio length, this work proposes Windowed Sink Attention (WSA): exploiting the empirically observed strong temporal locality of attention weights in pre-trained models, WSA restricts self-attention to small time windows and introduces a learnable "sink" structure for efficient cross-window information aggregation. The method combines a Mel-band Transformer architecture, fine-tuning from pre-trained models, and chunked/windowed inference. On vocal separation, fine-tuned WSA recovers 92% of the original SDR while reducing computational cost by 44.5×, substantially improving inference efficiency. Code and models are publicly released.
Abstract
State-of-the-art vocal separation models such as Mel-Band-Roformer rely on full temporal self-attention, where each temporal frame interacts with every other frame. This incurs a heavy computational cost that scales quadratically with input audio length, motivating chunking and windowing approaches. Through analysis of a pre-trained vocal separation model, we discovered that temporal attention patterns are highly localized. Building on this insight, we replaced full attention with windowed sink attention (WSA), which uses a small temporal attention window together with attention sinks. We show empirically that fine-tuning from the original checkpoint recovers 92% of the original SDR while reducing FLOPs by 44.5×. We release our code and checkpoints under the MIT license at https://github.com/smulelabs/windowed-roformer.
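To make the idea concrete, here is a minimal NumPy sketch of windowed attention with attention sinks. It is an illustrative assumption, not the paper's implementation: the sink is modeled as the first `n_sink` positions of the sequence being visible to every window (in the paper the sink structure is learnable), and the function names and window handling are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def windowed_sink_attention(q, k, v, window, n_sink=1):
    """Single-head attention where each frame attends only to frames
    in its own temporal window, plus `n_sink` global sink positions.

    q, k, v: arrays of shape (T, d). Cost per window is O(window^2),
    so the total is O(T * window) instead of O(T^2) for full attention.
    """
    T, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, T, window):
        end = min(start + window, T)
        # Keys/values visible to this window: sinks + local frames.
        k_loc = np.concatenate([k[:n_sink], k[start:end]], axis=0)
        v_loc = np.concatenate([v[:n_sink], v[start:end]], axis=0)
        scores = q[start:end] @ k_loc.T / np.sqrt(d)
        out[start:end] = softmax(scores, axis=-1) @ v_loc
    return out
```

With `window=T` and `n_sink=0` this reduces to ordinary full attention, which makes the FLOP comparison in the abstract easy to see: shrinking `window` shrinks each quadratic term while the sinks carry cross-window information.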