🤖 AI Summary
Large reasoning models often suffer from redundant computation and inefficiency in mathematical tasks due to fixed-length chain-of-thought reasoning. To address this, we propose a length-adaptive reasoning framework that internalizes inference-depth control as an intrinsic model capability. First, we model the distribution of successful reasoning lengths via reinforcement learning; second, we introduce a meta-cognitive guidance mechanism that dynamically adjusts the number of reasoning steps based on contextual cues during inference. Crucially, our approach requires no external controller or manually specified thresholds, enabling end-to-end autonomous depth decisions. Evaluated on mainstream mathematical reasoning benchmarks, our method reduces token consumption by up to 40.9% over strong baselines while improving accuracy by 2.3%, demonstrating the effectiveness and generalizability of allocating computational resources in proportion to problem complexity.
📝 Abstract
Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model's reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.
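The abstract's two-stage process can be illustrated with a minimal sketch. The paper does not give the exact reward here, so the functions below are assumptions for illustration only: stage 1 is approximated by taking a summary statistic (the median) of successful solution lengths, and stage 2 by a correctness reward shaped with a penalty for deviating from that discovered length budget (the `alpha` weight is a hypothetical hyperparameter).

```python
import statistics

def target_length(successful_lengths):
    """Stage 1 (sketch): summarize the distribution of token lengths
    observed in *successful* rollouts; the median serves as the
    length budget the model should aim for."""
    return statistics.median(successful_lengths)

def length_aware_reward(correct, length, budget, alpha=0.001):
    """Stage 2 (sketch): correctness reward minus a penalty that grows
    with the absolute deviation from the stage-1 length budget, so the
    policy is pushed toward efficient but still-correct reasoning."""
    base = 1.0 if correct else 0.0
    penalty = alpha * abs(length - budget)
    return base - penalty

# Example: budget discovered from three successful rollouts.
budget = target_length([100, 200, 300])          # -> 200
on_budget = length_aware_reward(True, 200, budget)   # -> 1.0
too_long = length_aware_reward(True, 500, budget)    # -> 0.7
```

In the actual method the budget is embedded in the model's reasoning context as meta-cognitive guidance rather than applied as an external constraint, which is what makes the depth control intrinsic.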