Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of improving the reasoning capabilities of large language models under distribution shift at test time, where performance often plateaus despite increased inference budgets. To overcome this limitation, the authors propose Reasoning Cache (RC), an approach that integrates reinforcement learning with iterative decoding. RC exploits the asymmetry between a model's generation and summarization abilities to construct reasoning chains amenable to iterative refinement, enabling long-horizon reasoning performance to extrapolate beyond a fixed training budget. Notably, RC is the first method to achieve consistent gains on reasoning tasks far exceeding training sequence lengths, and it substantially improves the efficiency of existing reasoning scaffolds. On the HMMT 2025 benchmark, a 4B-parameter model trained with only a 16k-token budget attains nearly 70% accuracy at a test length of 500k tokens (up from 40%), outperforming not only same-scale but also larger models.

📝 Abstract
Large Language Models (LLMs) that can continually improve beyond their training budgets are able to solve increasingly difficult problems by adapting at test time, a property we refer to as extrapolation. However, standard reinforcement learning (RL) operates over fixed problem distributions and training budgets, which limits extrapolation amidst distribution shift at test time. To address this, we introduce RC, an iterative decoding algorithm that replaces standard autoregressive decoding during both training and inference. RC exploits an asymmetry between the response generation and summarization capabilities of LLMs to construct reasoning chains that consistently improve across iterations. Models trained to use RC can extrapolate and continually improve over reasoning horizons more than an order of magnitude longer than those seen during training. Empirically, training a 4B model with RC using a 16k-token training budget improves performance on HMMT 2025 from 40% to nearly 70% with 0.5m tokens at test time, outperforming both comparably sized models and many larger reasoning LLMs. Finally, we also show that models trained with RC can more effectively leverage existing scaffolds to further scale test-time performance, due to the improved summary-conditioned generation abilities learned through training.
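The abstract describes RC as replacing one long autoregressive pass with an iterate-and-compress loop: generate a short reasoning segment conditioned on a cached summary, then summarize that segment to seed the next iteration. A minimal sketch of that loop structure follows; the function names, budgets, and stand-in bodies are illustrative assumptions, not the authors' implementation, where `generate` and `summarize` would be calls to the trained model.

```python
# Hypothetical sketch of RC-style iterative decoding. The `generate` and
# `summarize` functions are toy stand-ins for model calls, so the control
# flow is runnable; in the paper these would be summary-conditioned
# generation and summarization by the same LLM.

def generate(problem: str, summary: str, budget: int) -> str:
    """Stand-in for a short-horizon generation pass conditioned on the cached summary."""
    return f"[reasoning about '{problem}' given '{summary}']"[:budget]

def summarize(problem: str, segment: str, budget: int) -> str:
    """Stand-in for the summarization step (assumed stronger than generation)."""
    return f"[summary of {len(segment)}-char segment for '{problem}']"[:budget]

def reasoning_cache_decode(problem: str, iterations: int = 4,
                           gen_budget: int = 16_000, sum_budget: int = 1_000) -> str:
    """Iteratively refine a compact summary instead of one long autoregressive rollout."""
    summary = ""  # the "reasoning cache" carried across iterations
    for _ in range(iterations):
        segment = generate(problem, summary, gen_budget)   # short-horizon rollout
        summary = summarize(problem, segment, sum_budget)  # compress into the cache
    return summary
```

Because each iteration stays within the short training budget (e.g. 16k tokens) while the cache accumulates progress, total test-time reasoning can extend an order of magnitude beyond the trained horizon.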
Problem

Research questions and friction points the paper addresses.

extrapolation
reasoning horizon
distribution shift
test-time adaptation
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reasoning Cache
iterative decoding
extrapolation
short-horizon RL
summary-conditioned generation