Reinforcement Learning for Chain of Thought Compression with One-Domain-to-All Generalization

📅 2025-12-19
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of chain-of-thought reasoning in large language models, where excessive deliberation often introduces redundant steps that increase latency and cost without improving accuracy. The authors propose a sample-level soft reinforcement learning compression method guided by a proficiency-aware gating mechanism: it penalizes overly long reasoning sequences only when the model can already produce a correct answer via a shorter path, enabling precise and adaptive compression. Notably, this approach achieves cross-domain generalization from single-domain training and demonstrates bidirectional transferability between non-agent and tool-augmented agent settings. Experiments show a 20-40% reduction in response length with maintained or improved accuracy; models trained solely on mathematical reasoning spontaneously compress unseen tasks such as code generation and instruction following, and compressing a tool-augmented thinking agent reduces trajectory tokens by 67% and interaction rounds by 52%.

📝 Abstract
Chain-of-thought reasoning in large language models can trigger an "overthinking trap": longer rollouts raise cost and latency yet often yield unreliable accuracy gains. Existing methods use global, static controls that may suppress needed reasoning. We propose mastery-gated, sample-level, soft reinforcement learning compression that penalizes long rollouts only when the model already solves the problem and has produced a shorter rollout. Across benchmarks, it cuts response length by 20-40% with comparable or higher accuracy and generalizes across domains: a model trained on math spontaneously shortens unseen tasks (code, instruction following, general-knowledge QA) without hurting accuracy. We further show two-way transfer between non-agent CoT and tool-use agents: non-agent training reduces SWE-Bench Verified rounds by 13%, while compressing a thinking agent cuts SWE trajectory tokens by 67% and rounds by 52%, and shortens non-agent outputs by up to 44%. Compression is thus not cosmetic brevity, but an inherent computation policy: what to keep, and what to forget.
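The abstract does not name the underlying RL algorithm. As one plausible instantiation (an assumption on our part, not stated in the paper), the gated reward sketched earlier could feed a GRPO-style group-normalized advantage, so that within a group sampled for the same prompt, concise correct rollouts are reinforced over verbose correct ones:

```python
def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantages (assumed; the paper does not
    specify its RL algorithm). Rewards are normalized within the group
    sampled for one prompt, so a correct-but-verbose rollout earns a
    lower advantage than a correct-and-concise one."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]


# Continuing the example above: rewards [0.7, 1.0, 0.0] give the concise
# correct rollout the largest positive advantage, the verbose correct one
# a smaller positive advantage, and the incorrect one a negative advantage.
print(group_advantages([0.7, 1.0, 0.0]))
```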
Problem

Research questions and friction points this paper is trying to address.

Chain-of-thought · overthinking trap · reasoning compression · reinforcement learning · generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

reinforcement learning · chain-of-thought compression · mastery-gated control · cross-domain generalization · computation policy
Authors

Hanyu Li
LLM-Core Xiaomi
Jiangshan Duo
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Bofei Gao
Peking University (Natural Language Processing)
Hailin Zhang
LLM-Core Xiaomi
Sujian Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Xiaotie Deng
Chair Professor of Computer Science, Peking University, Beijing, China (Algorithmic Game Theory, Approximate Computing, Parallel Computing, Combinatorial Optimization)
Liang Zhao
StepFun (MLLM, LLM)