🤖 AI Summary
This work addresses the high memory and computational cost of autoregressive decoding when training large reasoning models with reinforcement learning, as well as the loss of long-context coherence caused by sliding-window caching. To overcome these limitations, the authors propose Progressive Thought Encoding (PTE), a method that compresses intermediate reasoning steps into compact vector representations within a fixed-size cache, removing the need to backpropagate through the full sequence. This substantially reduces training memory while keeping inference memory constant. PTE enables, for the first time, efficient long-horizon reasoning under parameter-efficient fine-tuning, overcoming the coherence limitations of existing caching strategies. Experiments on Qwen2.5-3B/7B and DeepSeek-R1-Distill-Llama-8B demonstrate average accuracy gains of 19.3% over LoRA and 29.9% over non-finetuned LRMs, with up to a +23.4-point improvement on the AIME2024/2025 benchmarks.
📝 Abstract
Large reasoning models (LRMs) excel on complex problems but face a critical efficiency barrier: reinforcement learning (RL) training requires long rollouts for outcome-based rewards, where autoregressive decoding dominates time and memory usage. While sliding-window cache strategies can bound memory, they disrupt long-context reasoning and degrade performance. We introduce Progressive Thought Encoding, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches. By progressively encoding intermediate reasoning into fixed-size vector representations, our approach eliminates the need to backpropagate through full-cache rollouts, reducing training memory usage while maintaining constant memory during inference. Experiments on three models (Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B) across six widely used, challenging mathematical benchmarks show consistent gains: our method achieves a +19.3% average improvement over LoRA-based fine-tuning and +29.9% over LRMs without fine-tuning, with up to +23.4 accuracy points on AIME2024/2025 under the same tight cache budgets. These results demonstrate that Progressive Thought Encoding not only improves reasoning accuracy but also makes RL training of LRMs substantially more efficient and scalable under real-world memory constraints.
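The core memory behavior described above, where a bounded cache holds compact summaries of older reasoning steps plus raw recent states, can be sketched as follows. This is a minimal illustration only: the abstract does not specify the encoder, so mean pooling stands in for the learned compression, and the class and method names are our own assumptions, not the paper's API.

```python
import numpy as np

class ProgressiveCache:
    """Hypothetical sketch: a fixed-size cache that progressively folds
    old reasoning states into compact summary vectors (mean pooling here
    is a stand-in for the learned encoding in the actual method)."""

    def __init__(self, capacity: int, summary_slots: int):
        assert summary_slots < capacity
        self.capacity = capacity            # total cache budget, in vectors
        self.summary_slots = summary_slots  # bound on the compressed region
        self.summaries = []                 # compact encodings of old steps
        self.recent = []                    # raw recent hidden states

    def append(self, h: np.ndarray) -> None:
        self.recent.append(h)
        if len(self.summaries) + len(self.recent) > self.capacity:
            self._compress()

    def _compress(self) -> None:
        # Fold the older half of the recent window into one summary vector.
        k = len(self.recent) // 2
        chunk, self.recent = self.recent[:k], self.recent[k:]
        self.summaries.append(np.mean(chunk, axis=0))
        # Keep the summary region itself bounded by merging the oldest pair.
        if len(self.summaries) > self.summary_slots:
            merged = np.mean(self.summaries[:2], axis=0)
            self.summaries = [merged] + self.summaries[2:]

    def state(self) -> np.ndarray:
        # What the model would attend over: summaries then raw recent states.
        return np.stack(self.summaries + self.recent)

cache = ProgressiveCache(capacity=8, summary_slots=3)
for _ in range(100):                     # many more steps than the budget
    cache.append(np.random.randn(4))
assert cache.state().shape[0] <= 8       # memory stays constant with sequence length
```

The point of the sketch is the invariant in the final assertion: however long the rollout grows, the attended state never exceeds the fixed budget, which is what removes the need to backpropagate through full-cache rollouts.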