AI Summary
To address the quadratic computational complexity of full-sequence self-attention in source separation models such as Mel-Band-Roformer with respect to audio length, this work proposes Windowed Sink Attention (WSA): exploiting the empirically observed strong temporal locality of attention weights in pre-trained models, WSA restricts self-attention to small time windows and introduces a learnable "sink" structure for efficient cross-window information aggregation. The method combines a Mel-band Transformer architecture, fine-tuning from pre-trained models, and chunked/windowed inference. On vocal separation, fine-tuned WSA recovers 92% of the original SDR while reducing computational cost by 44.5×, substantially improving inference efficiency. Code and models are publicly released.
Abstract
State-of-the-art vocal separation models such as Mel-Band-Roformer rely on full temporal self-attention, where each temporal frame interacts with every other frame. This incurs a heavy computational cost that scales quadratically with input audio length, motivating chunking and windowing approaches. Through analysis of a pre-trained vocal separation model, we discovered that temporal attention patterns are highly localized. Building on this insight, we replaced full attention with windowed sink attention (WSA), which uses a small temporal attention window together with attention sinks. We show empirically that fine-tuning from the original checkpoint recovers 92% of the original SDR while reducing FLOPs by 44.5×. We release our code and checkpoints under the MIT license at https://github.com/smulelabs/windowed-roformer.
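To make the idea concrete, here is a minimal NumPy sketch of windowed attention with attention sinks. It is an illustrative assumption, not the paper's implementation: the sink is modeled as the first `n_sink` positions of the sequence being visible to every window (in the paper the sink structure is learnable), and the function names and window handling are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def windowed_sink_attention(q, k, v, window, n_sink=1):
    """Single-head attention where each frame attends only to frames
    in its own temporal window, plus `n_sink` global sink positions.

    q, k, v: arrays of shape (T, d). Cost per window is O(window^2),
    so the total is O(T * window) instead of O(T^2) for full attention.
    """
    T, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, T, window):
        end = min(start + window, T)
        # Keys/values visible to this window: sinks + local frames.
        k_loc = np.concatenate([k[:n_sink], k[start:end]], axis=0)
        v_loc = np.concatenate([v[:n_sink], v[start:end]], axis=0)
        scores = q[start:end] @ k_loc.T / np.sqrt(d)
        out[start:end] = softmax(scores, axis=-1) @ v_loc
    return out
```

With `window=T` and `n_sink=0` this reduces to ordinary full attention, which makes the FLOP comparison in the abstract easy to see: shrinking `window` shrinks each quadratic term while the sinks carry cross-window information.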