Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion

πŸ“… 2026-02-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the excessive key-value (KV) cache memory consumption of the Whisper model in long-form automatic speech recognition, which stems from its multi-head attention (MHA) mechanism and hinders efficient deployment. To mitigate this, the authors propose the first integration of Multi-head Latent Attention (MLA) into Whisper by replacing the original MHA in the decoder self-attention modules and adapting it to Whisper’s absolute positional encoding scheme. Systematic experiments demonstrate that applying MLA solely within the decoder self-attention reduces KV cache usage by up to 87.5% on LibriSpeech while maintaining competitive word error rates. This architectural modification substantially enhances inference efficiency and memory economy without compromising recognition accuracy.

πŸ“ Abstract
The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism results in significant GPU memory consumption due to the linearly growing Key-Value (KV) cache, which is problematic for many applications, especially with long-form audio. To address this, we introduce Whisper-MLA, a novel architecture that incorporates Multi-Head Latent Attention (MLA) into the Whisper model. Specifically, we adapt MLA for Whisper's absolute positional embeddings and systematically investigate its application across encoder self-attention, decoder self-attention, and cross-attention modules. Empirical results indicate that applying MLA exclusively to decoder self-attention yields the desired balance between performance and memory efficiency. Our proposed approach allows conversion of a pretrained Whisper model to Whisper-MLA with minimal fine-tuning. Extensive experiments on the LibriSpeech benchmark validate the effectiveness of this conversion, demonstrating that Whisper-MLA reduces the KV cache size by up to 87.5% while maintaining competitive accuracy.
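The core idea behind MLA's memory savings can be sketched in a few lines: instead of caching full per-head keys and values, the model caches a single low-dimensional latent per token and up-projects it to K/V at attention time. The sketch below is a minimal illustration under assumed dimensions, not the paper's implementation; all weight names (`W_dkv`, `W_uk`, `W_uv`, `W_q`) and sizes are hypothetical, chosen so that the latent is 1/8 of the full KV width, mirroring the reported 87.5% cache reduction.

```python
import numpy as np

# Illustrative MLA-style decode step: dimensions and weights are assumptions,
# not taken from the paper.
rng = np.random.default_rng(0)
d_model, n_heads, d_head, d_latent = 512, 8, 64, 64  # d_latent << n_heads * d_head

W_dkv = rng.standard_normal((d_model, d_latent)) * 0.02          # shared down-projection
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # per-head K up-projection
W_uv = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # per-head V up-projection
W_q = rng.standard_normal((d_model, n_heads * d_head)) * 0.02

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mla_step(x_t, latent_cache):
    """One autoregressive step: only the d_latent-dim latent is cached."""
    c_t = x_t @ W_dkv                       # compress this token's KV to a latent
    latent_cache.append(c_t)
    C = np.stack(latent_cache)              # (T, d_latent) -- the entire KV cache
    K = (C @ W_uk).reshape(len(C), n_heads, d_head)  # reconstruct keys on the fly
    V = (C @ W_uv).reshape(len(C), n_heads, d_head)  # reconstruct values on the fly
    q = (x_t @ W_q).reshape(n_heads, d_head)
    scores = np.einsum('hd,thd->ht', q, K) / np.sqrt(d_head)
    out = np.einsum('ht,thd->hd', softmax(scores), V)
    return out.reshape(-1)

cache = []
for _ in range(4):  # a few decode steps
    y = mla_step(rng.standard_normal(d_model), cache)

# Per-token cache: d_latent floats (MLA) vs n_heads * d_head floats (standard MHA).
saved = 1 - d_latent / (n_heads * d_head)
print(saved)  # 0.875
```

In a production implementation the K/V up-projections can be folded into the query and output projections so the latents are attended to directly; the sketch keeps them explicit for clarity.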
Problem

Research questions and friction points this paper is trying to address.

Automatic Speech Recognition
GPU memory consumption
Multi-Head Attention
KV cache
long-form audio
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Head Latent Attention
KV cache compression
Whisper model
Automatic Speech Recognition
Memory-efficient ASR