SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem that large language models often generate redundant, over-elaborated chains of thought (CoT) on complex reasoning tasks, while existing approaches struggle to adjust output length to problem difficulty, frequently causing excessive compression and accuracy degradation. To overcome this, the authors propose SmartThinker, an efficient reasoning framework based on Group Relative Policy Optimization (GRPO) that progressively calibrates CoT length during training: it dynamically estimates the optimal reasoning depth and adaptively tunes the length-based reward coefficient to avoid penalizing correct reasoning paths. Experimental results show that SmartThinker reduces average reasoning length by up to 52.5% while achieving up to a 16.6% absolute accuracy improvement on challenging benchmarks such as AIME25.

📝 Abstract
Large reasoning models (LRMs) like OpenAI o1 and DeepSeek-R1 achieve high accuracy on complex tasks by adopting long chain-of-thought (CoT) reasoning paths. However, the inherent verbosity of these processes frequently results in redundancy and overthinking. To address this issue, existing works leverage Group Relative Policy Optimization (GRPO) to reduce LRM output length, but their static length reward design cannot dynamically adapt according to the relative problem difficulty and response length distribution, causing over-compression and compromised accuracy. Therefore, we propose SmartThinker, a novel GRPO-based efficient reasoning method with progressive CoT length calibration. SmartThinker makes a two-fold contribution: First, it dynamically estimates the optimal length with peak accuracy during training and guides overlong responses toward it to reduce response length while sustaining accuracy. Second, it dynamically modulates the length reward coefficient to avoid the unwarranted penalization of correct reasoning paths. Extensive experimental results show that SmartThinker achieves up to 52.5% average length compression with improved accuracy, and achieves up to 16.6% accuracy improvement on challenging benchmarks like AIME25. The source code can be found at https://github.com/SJTU-RTEAS/SmartThinker.
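The abstract's two mechanisms, estimating an optimal length and adaptively scaling the length-reward coefficient, could be sketched roughly as follows. This is a minimal illustration, not the paper's actual reward: the optimal-length estimator (shortest correct response in the GRPO group) and the accuracy-scaled coefficient are assumptions made for this sketch; see the linked repository for the real design.

```python
# Illustrative sketch only: SmartThinker's actual formulas live in the paper
# and repository; the estimator and coefficient schedule here are assumptions.

def length_calibrated_rewards(lengths, correct, base_coef=0.5):
    """Reward one GRPO group of sampled responses with a soft length penalty.

    lengths: token counts of the responses in the group
    correct: booleans, whether each response answered correctly
    """
    correct_lengths = [n for n, c in zip(lengths, correct) if c]
    if not correct_lengths:
        # No correct sample in the group: plain correctness reward,
        # no length shaping to avoid pushing toward degenerate brevity.
        return [1.0 if c else 0.0 for c in correct]

    # Crude stand-in for the "optimal length with peak accuracy":
    # the shortest correct response observed in this group.
    target = min(correct_lengths)

    # Shrink the penalty when few responses are correct, so correct
    # reasoning paths are not over-penalized (adaptive coefficient).
    accuracy = sum(correct) / len(correct)
    coef = base_coef * accuracy

    rewards = []
    for n, c in zip(lengths, correct):
        r = 1.0 if c else 0.0
        if c and n > target:
            # Penalize only the overlong portion, normalized by the target,
            # nudging correct-but-verbose responses toward the target length.
            r -= coef * (n - target) / max(target, 1)
        rewards.append(r)
    return rewards
```

In this sketch a correct-but-verbose response keeps a positive reward that decays with its excess length, while incorrect responses are unaffected by length, matching the abstract's goal of compressing CoT without penalizing correct reasoning paths outright.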
Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought
Length Calibration
Large Reasoning Models
Redundancy
Accuracy Compromise
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought
Length Calibration
Group Relative Policy Optimization
Efficient Reasoning
Dynamic Reward Modulation