🤖 AI Summary
This work addresses the high computational and memory overhead of self-attention in self-supervised speech models under low-resource conditions, where existing linear-complexity approaches struggle to capture local contextual information effectively. To this end, the authors propose Windowed SummaryMixing (WSM), a mechanism that integrates local neighborhood summaries alongside global utterance-level representations. Coupled with a selective fine-tuning strategy that updates only the WSM module parameters, the approach significantly enhances temporal modeling while preserving linear time complexity. Experimental results demonstrate that WSM improves recognition performance on low-resource automatic speech recognition (ASR) tasks and reduces peak GPU memory consumption by 40%, while also lowering computational cost and inference latency.
📝 Abstract
Self-supervised learning (SSL) has advanced speech processing but suffers from quadratic complexity due to self-attention. To address this, SummaryMixing (SM) has been proposed as a linear-time alternative that summarizes entire utterances using mean pooling, but it lacks sufficient local context. In this work, we introduce Windowed SummaryMixing (WSM), which enhances SM by integrating local neighborhood summaries alongside the global summary, maintaining efficiency while improving the modeling of temporal dependencies. Additionally, we introduce a selective fine-tuning approach, replacing self-attention layers in SSL models with WSM blocks and fine-tuning only these blocks in low-resource settings. Our approach improves ASR performance while reducing peak VRAM usage by 40% in the SSL models. WSM blocks have linear-time complexity with enhanced context awareness, and selectively replacing some attention layers reduces compute, memory, and latency, making the approach well suited to low-resource speech recognition.
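To make the core idea concrete, the following is a minimal, illustrative sketch of how a WSM-style block can combine a global utterance summary with windowed local summaries in linear time. This is an assumption-based simplification, not the authors' implementation: the real SummaryMixing uses learned per-frame and summary transformations, whereas here plain mean pooling stands in for both, and the function name and `window` parameter are hypothetical.

```python
import numpy as np

def windowed_summary_mixing(x, window=3):
    """Illustrative WSM-style sketch (not the paper's actual code).

    x: (T, D) array of frame features for one utterance.
    Each frame is concatenated with (a) a mean over a local window
    around it and (b) a mean over the whole utterance, so the block
    sees both local and global context. The windowed means are
    computed with prefix sums, keeping the whole pass O(T).
    """
    T, D = x.shape
    # Global summary: one mean over all frames, as in SummaryMixing.
    global_summary = x.mean(axis=0, keepdims=True)               # (1, D)

    # Local summaries: sliding-window mean via cumulative sums, O(T).
    csum = np.cumsum(np.vstack([np.zeros((1, D)), x]), axis=0)   # (T+1, D)
    lo = np.clip(np.arange(T) - window, 0, T)
    hi = np.clip(np.arange(T) + window + 1, 0, T)
    local_summary = (csum[hi] - csum[lo]) / (hi - lo)[:, None]   # (T, D)

    # Combine per-frame features with local and global summaries.
    return np.concatenate(
        [x, local_summary, np.broadcast_to(global_summary, (T, D))], axis=1
    )
```

In a trained model, each of the three components would pass through its own learned projection before being mixed; the sketch only shows why the cost stays linear in the sequence length while local neighborhoods are still represented.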