Reinforcement Learning for Chain of Thought Compression with One-Domain-to-All Generalization

📅 2025-12-19
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the inefficiency of chain-of-thought reasoning in large language models, where excessive deliberation adds redundant steps that raise latency and cost without reliably improving accuracy. The authors propose a sample-level, soft reinforcement learning compression method with a mastery-based gate: it penalizes a long reasoning rollout only when the model already solves the problem and has produced a shorter rollout, so reasoning that is still needed is not suppressed. Experiments show a 20–40% reduction in response length with comparable or higher accuracy. The compression generalizes from single-domain training: a model trained only on math shortens its outputs on unseen tasks such as code generation, instruction following, and general-knowledge QA without hurting accuracy. It also transfers in both directions between non-agent and tool-use agent settings: non-agent training reduces SWE-Bench Verified rounds by 13%, while compressing a thinking agent cuts SWE trajectory tokens and rounds by 67% and 52%, respectively, and shortens non-agent outputs by up to 44%.

📝 Abstract
Chain-of-thought reasoning in large language models can trigger an "overthinking trap": longer rollouts raise cost and latency yet often yield unreliable accuracy gains. Existing methods use global, static controls that may suppress needed reasoning. We propose mastery-gated, sample-level, soft reinforcement learning compression that penalizes long rollouts only when the model already solves the problem and has produced a shorter rollout. Across benchmarks, it cuts response length by 20-40% with comparable or higher accuracy and generalizes across domains: a model trained on math spontaneously shortens unseen tasks (code, instruction following, general-knowledge QA) without hurting accuracy. We further show two-way transfer between non-agent CoT and tool-use agents: non-agent training reduces SWE-Bench Verified rounds by 13%, while compressing a thinking agent cuts SWE trajectories by 67% in tokens and 52% in rounds and shortens non-agent outputs by up to 44%. Compression is thus not cosmetic brevity, but an inherent computation policy -- what to keep, and what to forget.
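
The gating idea in the abstract can be illustrated with a small reward-shaping sketch. This is not the paper's implementation: the `Rollout` type, the `gated_length_rewards` function, the `mastery_threshold` and `penalty_weight` parameters, and the linear penalty shape are all assumptions chosen for illustration. The sketch only shows the stated principle, that a long rollout is penalized only when the prompt already looks solved and a shorter correct rollout exists in the same sample group.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One sampled response for a prompt (hypothetical container)."""
    correct: bool  # whether the final answer was judged correct
    length: int    # response length in tokens


def gated_length_rewards(
    rollouts: List[Rollout],
    mastery_threshold: float = 0.75,  # assumed: share of correct rollouts that counts as "mastered"
    penalty_weight: float = 0.5,      # assumed: largest penalty a correct rollout can receive
) -> List[float]:
    """Return one scalar reward per rollout sampled for a single prompt."""
    if not rollouts:
        return []

    # Base reward: 1 for a correct answer, 0 otherwise.
    base = [1.0 if r.correct else 0.0 for r in rollouts]
    correct_lengths = [r.length for r in rollouts if r.correct]
    accuracy = len(correct_lengths) / len(rollouts)

    # Gate: if the prompt is not yet mastered, apply no length pressure at all.
    if not correct_lengths or accuracy < mastery_threshold:
        return base

    shortest = min(correct_lengths)
    longest = max(correct_lengths)
    if longest == shortest:
        return base  # no shorter correct rollout to compare against

    rewards = []
    for rollout, reward in zip(rollouts, base):
        if rollout.correct and rollout.length > shortest:
            # Soft penalty: scales linearly with excess length over the
            # shortest correct rollout, capped at penalty_weight.
            excess = (rollout.length - shortest) / (longest - shortest)
            reward -= penalty_weight * excess
        rewards.append(reward)
    return rewards


if __name__ == "__main__":
    mastered = [Rollout(True, 100), Rollout(True, 200),
                Rollout(True, 300), Rollout(False, 400)]
    print(gated_length_rewards(mastered))

    unmastered = [Rollout(True, 100), Rollout(True, 300),
                  Rollout(False, 250), Rollout(False, 400)]
    print(gated_length_rewards(unmastered))
```

In this sketch incorrect rollouts are never penalized for length, and prompts below the assumed mastery threshold keep their plain correctness rewards, so length pressure is applied per sample rather than as a global, static budget.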
Problem

Research questions and friction points this paper is trying to address.

Chain-of-thought
overthinking trap
reasoning compression
reinforcement learning
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

reinforcement learning
chain-of-thought compression
mastery-gated control
cross-domain generalization
computation policy