🤖 AI Summary
To address the prohibitive memory and computational overhead caused by the linear growth of key-value (KV) caches in long-context reasoning with large language models, this paper proposes a dynamic cache compression method based on learnable compression tokens. The core innovation lies in introducing dedicated, trainable compression tokens and an end-to-end optimization framework that jointly leverages knowledge distillation and reinforcement learning: an RL policy directly outputs fine-grained compression actions, such as retain, aggregate, or discard, to guide dynamic cache pruning and updating. Experiments across mainstream benchmarks demonstrate that the proposed method reduces KV cache memory consumption by up to 5.3× while preserving or even improving downstream task accuracy. The resulting Pareto frontier dominates both uncompressed baselines and heuristic compression approaches, establishing a scalable, learnable paradigm for efficient long-context inference.
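The per-entry actions described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the `KVEntry` container, the `apply_actions` helper, and mean-pooling as the "aggregate" operation are all illustrative stand-ins for what the trained policy and compression tokens would actually do.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class KVEntry:
    key: List[float]
    value: List[float]

def _mean(vecs: List[List[float]]) -> List[float]:
    # Elementwise mean of same-length vectors (stand-in aggregation).
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]

def apply_actions(cache: List[KVEntry], actions: List[str]) -> List[KVEntry]:
    """Apply per-entry actions from a (hypothetical) RL policy.

    'retain'    -> keep the entry unchanged
    'aggregate' -> pool a run of consecutive entries into one slot
    'discard'   -> drop the entry entirely
    """
    assert len(cache) == len(actions)
    out: List[KVEntry] = []
    pool: List[KVEntry] = []

    def flush() -> None:
        # Collapse any open aggregation run into a single compressed slot.
        if pool:
            out.append(KVEntry(_mean([e.key for e in pool]),
                               _mean([e.value for e in pool])))
            pool.clear()

    for entry, act in zip(cache, actions):
        if act == "aggregate":
            pool.append(entry)
            continue
        flush()
        if act == "retain":
            out.append(entry)
        # 'discard': fall through and skip the entry
    flush()
    return out
```

For example, actions `["retain", "aggregate", "aggregate", "discard"]` over four entries leave two slots: the first entry unchanged and one pooled slot for the middle pair, so the cache shrinks from four entries to two.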
📝 Abstract
The scalability of large language models for long-context reasoning is severely constrained by the linear growth of their Transformer key-value cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache with a learned, special-purpose token and evict the compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method adds minimal overhead to the conventional RL process, as it reuses the RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the uncompressed model and training-free compression techniques.
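The periodic compress-and-evict scheme from the abstract can be illustrated with a toy cache. This is a minimal sketch under stated assumptions: the `CompressingCache` class and `window` parameter are hypothetical, and mean pooling stands in for the learned compression token, whose summary would in practice be produced by the model itself.

```python
from typing import List

def compress_window(entries: List[List[float]]) -> List[float]:
    # Stand-in for the learned compression token's summary:
    # here, a simple elementwise mean of the window's entries.
    n = len(entries)
    return [sum(col) / n for col in zip(*entries)]

class CompressingCache:
    """Toy KV cache that, every `window` generated entries,
    writes one compressed summary slot and evicts the originals."""

    def __init__(self, window: int):
        self.window = window
        self.slots: List[List[float]] = []    # compressed summary slots
        self.pending: List[List[float]] = []  # recent uncompressed entries

    def append(self, kv: List[float]) -> None:
        self.pending.append(kv)
        if len(self.pending) == self.window:
            # Periodic compression step: summarize, then evict.
            self.slots.append(compress_window(self.pending))
            self.pending.clear()

    def __len__(self) -> int:
        return len(self.slots) + len(self.pending)
```

With `window=4`, eight appended entries occupy only two slots instead of eight, which is the source of the memory savings: cache size grows by one slot per window rather than one per token.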