🤖 AI Summary
Chain-of-thought (CoT) reasoning in large reasoning models (LRMs) often yields redundant, excessively long inference chains, incurring high computational cost; existing fine-tuning methods lack standardized evaluation metrics to reliably quantify their efficiency gains. Method: We introduce the "reasoning efficiency frontier", an empirical upper bound on the accuracy–inference-length trade-off, and the unified "Reasoning Efficiency Gap" (REG) metric, which quantifies a model's deviation from that frontier. We further propose REO-RL, a reinforcement learning algorithm that approximates the full efficiency objective via numerical integration over a sparse, exponentially spaced set of token budgets. Results: On Qwen3-4B/8B, REO-RL consistently reduces REG by ≥50% and approaches the efficiency frontier under a 16K-token budget while incurring <0.5% accuracy degradation. REG correlates strongly with human evaluation (Spearman's ρ > 0.92), supporting its fidelity as an efficiency proxy.
📝 Abstract
Large Reasoning Models (LRMs) demonstrate remarkable problem-solving capabilities through extended Chain-of-Thought (CoT) reasoning but often produce excessively verbose and redundant reasoning traces. This inefficiency incurs high inference costs and limits practical deployment. While existing fine-tuning methods aim to improve reasoning efficiency, assessing their efficiency gains remains challenging due to inconsistent evaluations. In this work, we introduce the reasoning efficiency frontiers, empirical upper bounds derived from fine-tuning base LRMs across diverse approaches and training configurations. Based on these frontiers, we propose the Reasoning Efficiency Gap (REG), a unified metric quantifying the deviation of any fine-tuned LRM from these frontiers. Systematic evaluation on challenging mathematical benchmarks reveals significant gaps in current methods: they either sacrifice accuracy for shorter outputs or remain inefficient under tight token budgets. To reduce the efficiency gap, we propose REO-RL, a class of Reinforcement Learning algorithms that minimizes REG by targeting a sparse set of token budgets. Leveraging numerical integration over strategically selected budgets, REO-RL approximates the full efficiency objective with low error. Through systematic benchmarking, we demonstrate that our efficiency metric, REG, effectively captures the accuracy-length trade-off, with low-REG methods reducing length while maintaining accuracy. Our approach, REO-RL, consistently reduces REG by ≥50% across all evaluated LRMs and matches the Qwen3-4B/8B efficiency frontiers under a 16K token budget with minimal accuracy loss. Ablation studies confirm the effectiveness of our exponential token budget strategy. Finally, our findings highlight that fine-tuning LRMs to perfectly align with the efficiency frontiers remains an open challenge.
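To make the metric concrete, the core idea behind REG — the accuracy deficit relative to the efficiency frontier, integrated over token budgets — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the exponential budget grid, and all accuracy numbers are assumed for the example.

```python
def exponential_budgets(max_budget=16_384, n=6):
    """Sparse, exponentially spaced token budgets, e.g. 512 ... 16384,
    in the spirit of the paper's exponential budget strategy (illustrative)."""
    return [max_budget >> (n - 1 - i) for i in range(n)]

def reasoning_efficiency_gap(budgets, frontier_acc, model_acc):
    """Illustrative REG: the (non-negative) accuracy gap between the
    efficiency frontier and a fine-tuned model, numerically integrated
    over token budgets by the trapezoidal rule, then normalized by the
    budget range so the result lies on the accuracy scale."""
    gaps = [max(f - m, 0.0) for f, m in zip(frontier_acc, model_acc)]
    area = sum((g0 + g1) / 2.0 * (b1 - b0)
               for b0, b1, g0, g1 in zip(budgets, budgets[1:],
                                         gaps, gaps[1:]))
    return area / (budgets[-1] - budgets[0])

budgets = exponential_budgets()  # [512, 1024, 2048, 4096, 8192, 16384]
# Hypothetical accuracy-vs-budget curves (frontier vs. a fine-tuned model):
frontier = [0.40, 0.55, 0.68, 0.78, 0.84, 0.88]
model    = [0.25, 0.42, 0.60, 0.74, 0.83, 0.88]
reg = reasoning_efficiency_gap(budgets, frontier, model)
```

A sparse grid works here because accuracy-vs-budget curves are smooth and saturating, so a handful of exponentially spaced evaluation points captures most of the area under the gap — the property REO-RL exploits to approximate the full efficiency objective cheaply.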