Neural Garbage Collection: Learning to Forget while Learning to Reason

📅 2026-04-20

🤖 AI Summary

This work addresses the compute and memory bottleneck in chain-of-thought reasoning caused by the key-value (KV) cache, which grows with every reasoning step and limits how far the paradigm can scale. The authors propose an end-to-end learnable cache management mechanism, described as the first of its kind, in which the language model itself decides which KV entries to evict during inference. Reasoning tokens and cache-eviction decisions are both treated as discrete actions sampled from the model, so a single policy can be optimized jointly with reinforcement learning from the outcome-based task reward alone, without supervised fine-tuning or proxy objectives. Evaluated on Countdown, AMC, and AIME, the approach compresses peak KV cache size by 2–3× while staying close to the accuracy of the full-cache upper bound, and it substantially outperforms existing eviction baselines.

📝 Abstract

Chain-of-thought reasoning has driven striking advances in language model capability, yet every reasoning step grows the KV cache, creating a bottleneck to scaling this paradigm further. Current approaches manage these constraints on the model's behalf using hand-designed criteria. A more scalable approach would let end-to-end learning subsume this design choice entirely, following a broader pattern in deep learning. After all, if a model can learn to reason, why can't it learn to forget? We introduce Neural Garbage Collection (NGC), in which a language model learns to forget while learning to reason, trained end-to-end from outcome-based task reward alone. As the model reasons, it periodically pauses, decides which KV cache entries to evict, and continues to reason conditioned on the remaining cache. By treating tokens in a chain-of-thought and cache-eviction decisions as discrete actions sampled from the language model, we can use reinforcement learning to jointly optimize how the model reasons and how it manages its own memory: what the model evicts shapes what it remembers, what it remembers shapes its reasoning, and the correctness of that reasoning determines its reward. Crucially, the model learns this behavior entirely from a single learning signal - the outcome-based task reward - without supervised fine-tuning or proxy objectives. On Countdown, AMC, and AIME tasks, NGC maintains strong accuracy relative to the full-cache upper bound at 2-3x peak KV cache size compression and substantially outperforms eviction baselines. Our results are a first step towards a broader vision where end-to-end optimization drives both capability and efficiency in language models.
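
The loop described in the abstract (reason, periodically pause, sample eviction decisions, continue on the remaining cache) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how eviction decisions are parameterized or which reinforcement learning algorithm is used. The sketch assumes an independent keep/evict Bernoulli decision per cache entry and a plain REINFORCE-style surrogate, and the names `simulate_episode`, `keep_logit_fn`, `evict_every`, and `reinforce_surrogate` are hypothetical. In the paper's setting the log-probabilities of the reasoning tokens would be added to the same sum; they are omitted here to keep the example small.

```python
import math
import random


def simulate_episode(num_steps, evict_every, keep_logit_fn, rng):
    """Toy rollout of periodic cache eviction (illustrative assumption only).

    One cache entry is appended per reasoning step. Every `evict_every`
    steps, a keep/evict decision is sampled for each entry, and the
    log-probability of every sampled decision is accumulated.
    """
    cache = []        # ids of the entries currently kept
    log_prob = 0.0    # summed log-probability of all eviction decisions
    peak = 0          # largest cache size seen during the rollout
    for step in range(num_steps):
        cache.append(step)
        peak = max(peak, len(cache))
        if (step + 1) % evict_every == 0:
            kept = []
            for entry in cache:
                # Sigmoid of a hypothetical per-entry keep logit.
                p_keep = 1.0 / (1.0 + math.exp(-keep_logit_fn(entry, step)))
                keep = rng.random() < p_keep
                log_prob += math.log(p_keep if keep else 1.0 - p_keep)
                if keep:
                    kept.append(entry)
            cache = kept  # later steps only see the surviving entries
    return cache, log_prob, peak


def reinforce_surrogate(reward, baseline, log_prob):
    """REINFORCE-style surrogate loss: minimizing it raises the probability
    of decision sequences whose outcome reward exceeds the baseline."""
    return -(reward - baseline) * log_prob


if __name__ == "__main__":
    rng = random.Random(0)
    # Uniform keep probability of 0.5 for every entry (logit 0).
    cache, log_prob, peak = simulate_episode(12, 4, lambda e, s: 0.0, rng)
    outcome_reward = 1.0  # stand-in for "final answer was correct"
    loss = reinforce_surrogate(outcome_reward, 0.5, log_prob)
    print(len(cache), peak, loss)
```

The point of the sketch is the credit path: the only learning signal is the scalar outcome reward, and it reaches the eviction decisions through their summed log-probability, exactly as it would reach sampled tokens.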

Problem

Research questions and friction points this paper is trying to address.

KV cache

chain-of-thought reasoning

memory management

language models

scalability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural Garbage Collection

KV cache compression

chain-of-thought reasoning