🤖 AI Summary
To address the verbosity and inefficiency of chain-of-thought (CoT) reasoning in large language models (LLMs), which often contains redundant inference steps, this paper proposes an iterative length-pruning framework based on reinforcement learning. Methodologically, it applies Proximal Policy Optimization (PPO) under a hard token-length constraint: any rollout truncated beyond the budget receives zero reward, incentivizing the model to autonomously eliminate non-essential reasoning steps rather than relying on forced early termination. A dynamic token budget and multi-round progressive pruning further enable joint optimization of inference length and task performance. The key contribution is the first application of hard-constraint RL to CoT compression, preserving reasoning fidelity while improving efficiency. On the AIME24 benchmark, DeepSeek-R1-Distill-Qwen-1.5B achieves a 50% reduction in average reasoning length with only a 2% accuracy drop, substantially improving the length-performance trade-off.
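The hard constraint described above can be sketched as a reward rule. This is a minimal illustration, not the paper's implementation: the function name, the binary correctness signal, and the specific budget value are all assumptions for exposition.

```python
def length_constrained_reward(num_tokens: int, token_budget: int, is_correct: bool) -> float:
    """Hard token-budget reward (illustrative sketch).

    A rollout that exceeds the budget is treated as truncated: its
    unfinished thoughts/answer are discarded and it earns zero reward,
    regardless of whether the final answer would have been correct.
    Within budget, the reward is the usual correctness signal.
    """
    if num_tokens > token_budget:
        return 0.0  # truncated beyond budget: discard, zero reward
    return 1.0 if is_correct else 0.0
```

Because there is no explicit length penalty inside the budget, the only way for the policy to raise its expected reward is to finish correct reasoning before the cutoff, which is what pushes it to drop non-essential steps rather than merely stop early.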
📝 Abstract
We present ThinkPrune, a simple yet effective method for pruning the thinking length of long-thinking LLMs, which have been found to often produce inefficient and redundant thinking processes. Existing preliminary explorations of reducing thinking length primarily focus on forcing the thinking process to exit early, rather than adapting the LLM to optimize and consolidate the thinking process, and therefore the length-performance tradeoff observed so far is sub-optimal. To fill this gap, ThinkPrune offers a simple solution that continuously trains the long-thinking LLMs via reinforcement learning (RL) with an added token limit, beyond which any unfinished thoughts and answers will be discarded, resulting in a zero reward. To further preserve model performance, we introduce an iterative length pruning approach, where multiple rounds of RL are conducted, each with an increasingly stringent token limit. We observe that ThinkPrune yields a remarkable performance-length tradeoff -- on the AIME24 dataset, the reasoning length of DeepSeek-R1-Distill-Qwen-1.5B can be reduced by half with only a 2% drop in performance. We also observe that after pruning, the LLMs can bypass unnecessary steps while keeping the core reasoning process complete. Code is available at https://github.com/UCSB-NLP-Chang/ThinkPrune.
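The iterative pruning procedure from the abstract -- multiple RL rounds, each with a tighter token limit -- can be sketched as a simple outer loop. Everything here is illustrative: `train_fn` stands in for one round of RL fine-tuning under the zero-reward-on-truncation rule, and the budget schedule is an assumed example, not the paper's actual hyperparameters.

```python
from typing import Callable, Iterable, TypeVar

Model = TypeVar("Model")

def thinkprune_loop(
    model: Model,
    train_fn: Callable[[Model, int], Model],
    budgets: Iterable[int],
) -> Model:
    """Run successive RL rounds with progressively tighter token limits.

    `train_fn(model, budget)` is a placeholder for one full round of RL
    training in which any rollout longer than `budget` tokens gets zero
    reward. Budgets are applied from loosest to tightest so each round
    starts from a model already adapted to the previous, looser limit.
    """
    for budget in sorted(budgets, reverse=True):
        model = train_fn(model, budget)
    return model
```

Example usage with a dummy "model" that just records the budgets it was trained under:

```python
trained = thinkprune_loop([], lambda m, b: m + [b], [2000, 4000, 3000])
# budgets are visited in decreasing order: [4000, 3000, 2000]
```

Decreasing the limit gradually, rather than imposing the tightest budget at once, is what lets the model consolidate its reasoning at each stage instead of losing accuracy to aggressive truncation.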