AudioKV: KV Cache Eviction in Efficient Large Audio Language Models

📅 2026-04-08
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high memory overhead of KV caching in large audio language models during long-context reasoning. Existing generic compression methods ignore the temporal continuity inherent in audio signals, which leads to significant performance degradation. To tackle this, the study introduces a semantic-acoustic alignment mechanism, tailored to the audio modality, for identifying critical attention heads, combined with an FFT-based spectral score smoothing strategy. Together these enable dynamic, precise KV cache pruning and budget allocation. The proposed approach is hardware-friendly and, at a 40% compression rate, incurs only a 0.45% accuracy drop on Qwen3-Omni-30B, substantially outperforming baseline methods while effectively mitigating performance collapse and repetitive generation issues.
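The summary's head-preferential budget allocation can be sketched as follows. This is an illustrative assumption, not the paper's published algorithm: heads with higher audio-alignment scores (e.g. attention mass on audio tokens measured during an ASR pass) receive proportionally larger KV caches, with a small per-head floor. The function name and the proportional rule are hypothetical.

```python
import numpy as np

def allocate_head_budgets(head_scores, total_budget, min_budget=4):
    """Split a total KV-cache token budget across attention heads.

    Hedged sketch: budgets are proportional to each head's
    audio-alignment score, clipped below by `min_budget`. The
    proportional rule is an assumption for illustration.
    """
    scores = np.asarray(head_scores, dtype=np.float64)
    weights = scores / scores.sum()
    budgets = np.maximum(min_budget, np.floor(weights * total_budget)).astype(int)
    # Hand any leftover slots to the highest-scoring heads first.
    leftover = int(total_budget - budgets.sum())
    order = np.argsort(-scores)
    i = 0
    while leftover > 0:
        budgets[order[i % len(budgets)]] += 1
        leftover -= 1
        i += 1
    return budgets
```

Note that if `min_budget` times the head count exceeds `total_budget`, the floor wins and the total budget is overshot; a real allocator would renormalize.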
๐Ÿ“ Abstract
Large Audio-Language Models (LALMs) have set new benchmarks in speech processing, yet their deployment is hindered by the memory footprint of the Key-Value (KV) cache during long-context inference. While general KV cache compression techniques excel in LLMs, they often fail in the audio domain by overlooking the intrinsic temporal continuity of acoustic signals. To bridge this gap, we propose AudioKV, a novel framework that robustly prioritizes audio-critical attention heads through a hardware-friendly semantic-acoustic alignment mechanism. Specifically, we identify these modality-specialized heads by analyzing attention scores in ASR tasks and dynamically allocate KV cache budgets preferentially to them. Furthermore, we introduce Spectral Score Smoothing (SSS), an FFT-based global filtering strategy designed to suppress high-frequency noise and recover smooth global trends from importance scores, ensuring more balanced token selection with unprecedented precision. Extensive evaluations across multiple LALMs, including Qwen and Gemma series, demonstrate that AudioKV significantly outperforms baselines while enhancing computational efficiency. Notably, at a 40% compression ratio, AudioKV maintains near-full accuracy on Qwen3-Omni-30B with only a 0.45% drop, whereas traditional methods suffer from catastrophic performance degradation and repetition. Our code will be released after acceptance.
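The Spectral Score Smoothing idea from the abstract, an FFT-based global filter over token importance scores, can be sketched in a few lines. This is a minimal sketch under assumptions: the paper does not specify the cutoff rule, so the `keep_ratio` parameter and the hard low-pass are illustrative.

```python
import numpy as np

def spectral_score_smoothing(scores, keep_ratio=0.1):
    """Low-pass filter per-token importance scores via the FFT.

    Hypothetical sketch of SSS: zero out high-frequency components
    of the score sequence so only the smooth global trend survives,
    suppressing spiky high-frequency noise before token selection.
    `keep_ratio` is the fraction of low-frequency bins retained
    (an assumed knob, not from the paper).
    """
    spectrum = np.fft.rfft(scores)
    cutoff = max(1, int(len(spectrum) * keep_ratio))
    spectrum[cutoff:] = 0.0  # suppress high-frequency noise
    return np.fft.irfft(spectrum, n=len(scores))
```

Tokens would then be selected by ranking the smoothed scores, which favors contiguous spans and matches the temporal continuity of audio that the abstract emphasizes.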
Problem

Research questions and friction points this paper is trying to address.

KV cache eviction
Large Audio-Language Models
memory footprint
temporal continuity
audio inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

AudioKV
KV cache eviction
Spectral Score Smoothing
large audio-language models
attention head prioritization