Efficient Vocal Source Separation Through Windowed Sink Attention

πŸ“… 2025-10-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the quadratic computational complexity of full-sequence self-attention with respect to audio length in source separation models such as Mel-Band-Roformer, this work proposes Windowed Sink Attention (WSA). Motivated by the empirical observation that attention weights in pre-trained models are strongly temporally localized, WSA restricts self-attention to small time windows and adds a learnable "sink" structure for efficient cross-window information aggregation. The method combines the Mel-band Transformer architecture, fine-tuning from the pre-trained checkpoint, and chunked/windowed inference. On vocal separation, fine-tuned WSA recovers 92% of the original SDR while reducing FLOPs by 44.5x, significantly improving inference efficiency. Code and models are publicly released.

πŸ“ Abstract
State-of-the-art vocal separation models like Mel-Band-Roformer rely on full temporal self-attention mechanisms, where each temporal frame interacts with every other frame. This incurs a heavy computational cost that scales quadratically with input audio length, motivating chunking and windowing approaches. Through analysis of a pre-trained vocal separation model, we discovered that temporal attention patterns are highly localized. Building on this insight, we replaced full attention with windowed sink attention (WSA), which uses small temporal attention windows and attention sinks. We show empirically that fine-tuning from the original checkpoint recovers 92% of the original SDR performance while reducing FLOPs by 44.5x. We release our code and checkpoints under MIT license at https://github.com/smulelabs/windowed-roformer.
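A minimal numpy sketch of the windowed-sink-attention pattern the abstract describes: attention is computed independently within small non-overlapping temporal windows, and every window additionally attends to a shared set of sink key/value pairs. This is an illustrative assumption of the mechanism, not the released implementation (which operates per Mel band, with multiple heads and learnable sink parameters).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def windowed_sink_attention(q, k, v, sink_k, sink_v, window):
    """Self-attention restricted to non-overlapping temporal windows,
    with shared "sink" key/value pairs visible to every window.

    q, k, v:          (T, d) per-frame queries/keys/values
    sink_k, sink_v:   (S, d) sink tokens (S may be 0)
    window:           window length in frames

    Hypothetical sketch of the WSA idea; cost per window is
    O(window * (window + S) * d) instead of O(T^2 * d) overall.
    """
    T, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, T, window):
        end = min(start + window, T)
        # Each window sees its own frames plus the shared sinks.
        kw = np.concatenate([sink_k, k[start:end]], axis=0)  # (S + w, d)
        vw = np.concatenate([sink_v, v[start:end]], axis=0)
        scores = q[start:end] @ kw.T / np.sqrt(d)            # (w, S + w)
        out[start:end] = softmax(scores) @ vw
    return out
```

With `window >= T` and no sinks, this degenerates to ordinary full attention, which makes the restriction easy to sanity-check.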
Problem

Research questions and friction points this paper is trying to address.

Full temporal self-attention in vocal separation models incurs cost quadratic in audio length
State-of-the-art models like Mel-Band-Roformer are therefore expensive to run on long inputs
How to cut FLOPs substantially without sacrificing separation quality (SDR)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaced full attention with windowed sink attention
Utilized small temporal windows and attention sinks
Achieved a 44.5x FLOP reduction while recovering 92% of the original SDR via fine-tuning
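Back-of-the-envelope arithmetic for why windowing helps: restricting each query to its window plus a few sinks turns the quadratic score/context matmul cost into a linear one. The frame count and head dimension below are assumptions for illustration; the paper's 44.5x figure is for the whole model, not just these matmuls.

```python
def attention_flops(T, d, window=None, sinks=0):
    """Rough FLOPs for the QK^T and attn@V matmuls of one attention head.
    Full attention: every query sees all T keys.
    Windowed sink attention: every query sees window + sinks keys.
    Illustrative only; projections, bands, and heads are ignored."""
    keys_per_query = T if window is None else window + sinks
    return 2 * 2 * T * keys_per_query * d  # 2 matmuls, 2 FLOPs per MAC

# Assumed numbers, not from the paper:
T, d = 10_000, 64  # temporal frames, head dimension
full = attention_flops(T, d)
wsa = attention_flops(T, d, window=64, sinks=4)
print(f"score-matmul FLOP ratio: {full / wsa:.1f}x")  # grows linearly with T
```

Because the ratio is roughly `T / (window + sinks)`, the savings grow with input length, which is exactly the regime (long audio) that motivated chunking in the first place.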