Early Attentive Sparsification Accelerates Neural Speech Transcription

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the slow inference speed of Transformer-based speech transcription models. We propose a fine-tuning-free early temporal sparsification method that dynamically prunes hidden states in the encoder's initial layers based on self-attention weights, enabling interpretable and efficient temporal sparsity. Our key contributions are: (i) the first systematic investigation of attention-guided early sparsification strategies for speech Transformers; (ii) a joint search framework optimizing both the sparsification location and the compression ratio; and (iii) an end-to-end GPU-accelerated implementation. Evaluated on the Whisper architecture, our method achieves 40–60% sparsity with less than 1% WER degradation on English speech transcription, while accelerating inference by up to 1.6×, substantially outperforming existing post-hoc or late-stage sparsification approaches.
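The core mechanism, pruning encoder time frames whose self-attention importance is low, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the importance score (mean attention received across heads and queries), the `sparsity` knob, and the function name are all hypothetical.

```python
import numpy as np

def attention_guided_prune(hidden, attn, sparsity=0.5):
    """Drop a fraction of time frames from an encoder hidden state,
    keeping the frames that receive the most self-attention.

    hidden:   (T, d) hidden states after an early encoder layer.
    attn:     (H, T, T) self-attention weights from that layer
              (rows are softmax-normalized over the last axis).
    sparsity: fraction of frames to prune (hypothetical knob).
    """
    # Importance of frame t = average attention it receives,
    # pooled over heads and query positions.
    importance = attn.mean(axis=(0, 1))                      # (T,)
    n_keep = max(1, int(round(hidden.shape[0] * (1.0 - sparsity))))
    # Keep the top-n_keep frames, restoring temporal order.
    keep = np.sort(np.argsort(importance)[-n_keep:])
    return hidden[keep], keep

# Toy example: 10 frames, 4-dim states, 2 attention heads.
rng = np.random.default_rng(0)
h = rng.normal(size=(10, 4))
a = rng.random(size=(2, 10, 10))
a /= a.sum(axis=-1, keepdims=True)  # normalize rows like softmax
pruned, kept = attention_guided_prune(h, a, sparsity=0.5)
print(pruned.shape)  # (5, 4)
```

Because the pruned sequence is shorter, every subsequent encoder layer processes fewer frames, which is where the runtime savings would come from.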

📝 Abstract
Transformer-based neural speech processing has achieved state-of-the-art performance. Since speech audio signals are known to be highly compressible, here we seek to accelerate neural speech transcription by time-domain signal sparsification early in the neural encoding stage, taking advantage of the interpretability of the self-attention mechanism in Transformer audio encoders. With the Whisper family of models, we perform a systematic architecture search over the joint space of sparsification stage (a chosen encoder layer) and compression ratio (sparsity). We find that the best solutions under 1% accuracy degradation sparsify the hidden state to 40–60% sparsity at an early encoding stage, and thereby achieve up to 1.6× runtime acceleration in English speech transcription tasks on NVIDIA GPUs without any fine-tuning.
Problem

Research questions and friction points this paper is trying to address.

Accelerate neural speech transcription via sparsification
Jointly optimize the sparsification stage and compression ratio
Achieve runtime speedup with negligible (<1% WER) accuracy loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early time-domain signal sparsification in the encoding stage
Sparsifies the hidden state to 40–60% sparsity at an early encoder layer
Achieves up to 1.6× runtime acceleration without fine-tuning
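The joint search the paper describes, over where to sparsify (encoder layer) and how much (compression ratio), amounts to picking the fastest configuration that stays within an accuracy budget. A minimal sketch, assuming a precomputed table of hypothetical (WER degradation, speedup) measurements; the numbers and the 1% budget below are illustrative, not the paper's results:

```python
# Hypothetical measurements: (encoder layer, sparsity) -> (WER delta %, speedup).
# In practice each entry would come from evaluating the pruned model.
results = {
    (1, 0.4): (0.3, 1.30), (1, 0.6): (0.9, 1.60),
    (2, 0.4): (0.2, 1.25), (2, 0.6): (1.4, 1.55),
    (4, 0.6): (2.0, 1.40),
}

def best_config(results, wer_budget=1.0):
    """Return the (config, speedup) with the highest speedup whose
    WER degradation stays within the budget, or None if none qualify."""
    feasible = [(cfg, spd) for cfg, (delta, spd) in results.items()
                if delta <= wer_budget]
    return max(feasible, key=lambda x: x[1]) if feasible else None

cfg, speedup = best_config(results)
print(cfg, speedup)  # (1, 0.6) 1.6
```

This two-axis grid search reflects the trade-off the abstract reports: sparsifying earlier and harder buys more speedup, but only some (layer, ratio) pairs stay under the 1% WER budget.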