🤖 AI Summary
This work systematically investigates the impact of KV cache compression on the foundational capabilities of large language models (LLMs), addressing a critical gap in prior research, which predominantly optimizes for long-context compression ratios while neglecting capability robustness. To this end, the authors develop a comprehensive evaluation framework spanning world knowledge, commonsense and arithmetic reasoning, code generation, safety, and long-context understanding. Their analysis reveals that arithmetic reasoning is particularly sensitive to KV compression. Building on this insight, they propose ShotKV, a dynamic KV compression method that decouples the prefill and decoding phases and enforces shot-level semantic consistency. Experiments demonstrate that ShotKV improves long-context generation performance by 9%-18% under aggressive compression. Furthermore, the DeepSeek R1 Distill model exhibits superior compression robustness, suffering only 9.67%-25.53% performance degradation, highlighting the critical role of architectural design in compression adaptability.
📝 Abstract
This paper investigates an under-explored challenge in large language models (LLMs): the impact of KV cache compression methods on LLMs' fundamental capabilities. While existing methods achieve impressive compression ratios on long-context benchmarks, their effects on core model capabilities remain understudied. We present a comprehensive empirical study evaluating prominent KV cache compression methods across diverse tasks, spanning world knowledge, commonsense reasoning, arithmetic reasoning, code generation, safety, and long-context understanding and generation. Our analysis reveals that KV cache compression methods exhibit task-specific performance degradation. Arithmetic reasoning tasks prove particularly sensitive to aggressive compression, with different methods showing performance drops of 17.4%-43.3%. Notably, the DeepSeek R1 Distill model exhibits more robust compression tolerance compared to instruction-tuned models, showing only 9.67%-25.53% performance degradation. Based on our analysis of attention patterns and cross-task compression performance, we propose ShotKV, a novel compression approach that distinctly handles prefill and decoding phases while maintaining shot-level semantic coherence. Empirical results show that ShotKV achieves 9%-18% performance improvements on long-context generation tasks under aggressive compression ratios.
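To make "shot-level semantic coherence" concrete: rather than evicting individual KV cache entries by per-token importance (which can leave a few-shot example half-deleted), eviction can operate on whole shots. The sketch below is a loose illustration under our own assumptions, not the authors' implementation; the function name `select_shots`, the contiguous-span representation, and the greedy mean-score heuristic are all hypothetical.

```python
# Hypothetical sketch of shot-level KV selection (NOT the ShotKV source code).
# Assumes each prompt "shot" (few-shot example) occupies a contiguous token
# range and that a per-token importance score (e.g., accumulated attention
# weight) is available. Whole shots are kept or dropped so every retained
# example stays semantically intact.

def select_shots(shot_spans, token_scores, budget):
    """Greedily keep the highest-scoring shots within a total token budget.

    shot_spans: list of (start, end) half-open token ranges, one per shot.
    token_scores: per-token importance scores for the whole prompt.
    budget: maximum number of KV entries to retain.
    Returns sorted indices of the shots whose KV entries are kept.
    """
    # Score each shot by the mean importance of its tokens.
    shot_scores = []
    for i, (start, end) in enumerate(shot_spans):
        span = token_scores[start:end]
        shot_scores.append((sum(span) / len(span), i))

    # Admit whole shots, best first, while they fit in the budget.
    kept, used = [], 0
    for score, i in sorted(shot_scores, reverse=True):
        start, end = shot_spans[i]
        if used + (end - start) <= budget:
            kept.append(i)
            used += end - start
    return sorted(kept)
```

Keeping or dropping each shot atomically is what distinguishes this from token-level eviction: a partially evicted in-context example can be worse than no example at all, which is consistent with the paper's observation that few-shot reasoning tasks are especially compression-sensitive.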