Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization

📅 2025-08-13

🤖 AI Summary

Large reasoning models (LRMs) incur high computational cost and tend to overthink when they generate long chain-of-thought (CoT) reasoning. To address this, we propose Length Controlled Preference Optimization (LCPO), a preference optimization method derived from an analysis of existing objectives under a Bradley-Terry loss framework. LCPO directly balances the implicit reward related to the negative log-likelihood (NLL) loss, so the model learns a preference for shorter reasoning from limited data and training. We prepare the preference data by analyzing generation path distributions and filtering generated trajectories through difficulty estimation. Across multiple benchmarks, LCPO reduces average output length by over 50% while maintaining reasoning performance, which translates into lower inference cost. Our key contribution is a length-preference objective built into the preference optimization framework that yields shorter CoT outputs without extensive data or compute.

📝 Abstract

Recent advances in Large Reasoning Models (LRMs) have demonstrated strong performance on complex tasks through long Chain-of-Thought (CoT) reasoning. However, their lengthy outputs increase computational costs and may lead to overthinking, raising challenges in balancing reasoning effectiveness and efficiency. Current methods for efficient reasoning often compromise reasoning quality or require extensive resources. This paper investigates efficient methods to reduce the generation length of LRMs. We analyze generation path distributions and filter generated trajectories through difficulty estimation. Subsequently, we analyze the convergence behaviors of the objectives of various preference optimization methods under a Bradley-Terry loss-based framework. Based on this analysis, we propose Length Controlled Preference Optimization (LCPO), which directly balances the implicit reward related to the NLL loss. LCPO can effectively learn length preference with limited data and training. Extensive experiments demonstrate that our approach reduces the average output length by over 50% across multiple benchmarks while maintaining reasoning performance. Our work highlights the potential for computationally efficient approaches in guiding LRMs toward efficient reasoning.
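For readers unfamiliar with the Bradley-Terry framing mentioned in the abstract, the following is a minimal sketch of the generic pairwise loss that such preference optimization methods share. It is not the LCPO objective, which this page does not spell out. The function name and the idea of passing in scalar rewards are assumptions made for illustration; in practice the rewards would be implicit quantities computed from model log-probabilities.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Generic Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    Hypothetical helper for illustration only; this is NOT the LCPO objective.
    The rewards are plain floats here; a real method would derive them from
    the policy's (and possibly a reference model's) log-probabilities.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The loss shrinks as the chosen response's reward exceeds the rejected one's.
    for margin in (-2.0, 0.0, 2.0):
        print(margin, bradley_terry_loss(margin, 0.0))
```

Under this loss, a pair in which the shorter response is labeled as chosen pushes the model to assign it a higher reward than the longer one; how that reward is defined and balanced against the NLL term is where methods in this family differ.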

Problem

Research questions and friction points this paper is trying to address.

Reducing lengthy outputs of Large Reasoning Models

Balancing reasoning effectiveness and efficiency

Maintaining performance while cutting output length

Innovation

Methods, ideas, or system contributions that make the work stand out.

Length Controlled Preference Optimization (LCPO) method

Analyzes generation path distributions and filters generated trajectories by estimated difficulty

Directly balances the implicit reward related to the NLL loss
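One way to picture difficulty-based trajectory filtering and length-preference pairs is the sketch below. Everything in it is an assumption for illustration: the `Trajectory` type, the difficulty estimate (fraction of incorrect samples), the `max_difficulty` threshold, and the rule of pairing the shortest correct sample against the longest correct one. The paper's actual filtering and pairing criteria are not given on this page.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Trajectory:
    """One sampled reasoning trace for a question (hypothetical structure)."""
    text: str
    num_tokens: int
    is_correct: bool


def estimate_difficulty(samples: List[Trajectory]) -> float:
    """Assumed difficulty proxy: fraction of sampled trajectories that are wrong."""
    if not samples:
        raise ValueError("need at least one sampled trajectory")
    return 1.0 - sum(t.is_correct for t in samples) / len(samples)


def build_length_pair(
    samples: List[Trajectory], max_difficulty: float = 0.5
) -> Optional[Tuple[Trajectory, Trajectory]]:
    """Return (chosen, rejected) = (shortest correct, longest correct), or None.

    Questions above the assumed difficulty threshold are skipped, as are
    questions without two correct trajectories of different lengths.
    """
    if estimate_difficulty(samples) > max_difficulty:
        return None
    correct = [t for t in samples if t.is_correct]
    if len(correct) < 2:
        return None
    chosen = min(correct, key=lambda t: t.num_tokens)
    rejected = max(correct, key=lambda t: t.num_tokens)
    if chosen.num_tokens == rejected.num_tokens:
        return None
    return chosen, rejected
```

Restricting pairs to correct trajectories keeps the preference signal about length rather than correctness, which is one plausible reading of "maintaining performance while cutting output length".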
