🤖 AI Summary
To address the prohibitive memory and computational overhead caused by the linear growth of key-value (KV) caches in long-context reasoning with large language models, this paper proposes a dynamic cache compression method based on learnable compression tokens. The core innovation lies in introducing dedicated, trainable compression tokens and an end-to-end optimization framework that jointly leverages knowledge distillation and reinforcement learning: an RL policy directly outputs fine-grained compression actions, such as retain, aggregate, or discard, to guide dynamic cache pruning and updating. Experiments across mainstream benchmarks demonstrate that the proposed method reduces KV cache memory consumption by up to 5.3× while preserving or even improving downstream task accuracy. The resulting Pareto frontier dominates both uncompressed baselines and heuristic compression approaches, establishing a scalable, learnable paradigm for efficient long-context inference.
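The per-entry actions described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the `KVEntry` container, the `apply_actions` helper, and mean-pooling as the "aggregate" operation are all illustrative stand-ins for what the trained policy and compression tokens would actually do.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class KVEntry:
    key: List[float]
    value: List[float]

def _mean(vecs: List[List[float]]) -> List[float]:
    # Elementwise mean of same-length vectors (stand-in aggregation).
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]

def apply_actions(cache: List[KVEntry], actions: List[str]) -> List[KVEntry]:
    """Apply per-entry actions from a (hypothetical) RL policy.

    'retain'    -> keep the entry unchanged
    'aggregate' -> pool a run of consecutive entries into one slot
    'discard'   -> drop the entry entirely
    """
    assert len(cache) == len(actions)
    out: List[KVEntry] = []
    pool: List[KVEntry] = []

    def flush() -> None:
        # Collapse any open aggregation run into a single compressed slot.
        if pool:
            out.append(KVEntry(_mean([e.key for e in pool]),
                               _mean([e.value for e in pool])))
            pool.clear()

    for entry, act in zip(cache, actions):
        if act == "aggregate":
            pool.append(entry)
            continue
        flush()
        if act == "retain":
            out.append(entry)
        # 'discard': fall through and skip the entry
    flush()
    return out
```

For example, actions `["retain", "aggregate", "aggregate", "discard"]` over four entries leave two slots: the first entry unchanged and one pooled slot for the middle pair, so the cache shrinks from four entries to two.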
📝 Abstract
The scalability of large language models for long-context reasoning is severely constrained by the linear growth of their Transformer key-value cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache with a learned, special-purpose token and evict the compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method adds minimal overhead to the conventional RL process, as it reuses the RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the uncompressed model and training-free compression techniques.
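The periodic compress-and-evict scheme from the abstract can be illustrated with a toy cache. This is a minimal sketch under stated assumptions: the `CompressingCache` class and `window` parameter are hypothetical, and mean pooling stands in for the learned compression token, whose summary would in practice be produced by the model itself.

```python
from typing import List

def compress_window(entries: List[List[float]]) -> List[float]:
    # Stand-in for the learned compression token's summary:
    # here, a simple elementwise mean of the window's entries.
    n = len(entries)
    return [sum(col) / n for col in zip(*entries)]

class CompressingCache:
    """Toy KV cache that, every `window` generated entries,
    writes one compressed summary slot and evicts the originals."""

    def __init__(self, window: int):
        self.window = window
        self.slots: List[List[float]] = []    # compressed summary slots
        self.pending: List[List[float]] = []  # recent uncompressed entries

    def append(self, kv: List[float]) -> None:
        self.pending.append(kv)
        if len(self.pending) == self.window:
            # Periodic compression step: summarize, then evict.
            self.slots.append(compress_window(self.pending))
            self.pending.clear()

    def __len__(self) -> int:
        return len(self.slots) + len(self.pending)
```

With `window=4`, eight appended entries occupy only two slots instead of eight, which is the source of the memory savings: cache size grows by one slot per window rather than one per token.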