🤖 AI Summary
Large reasoning models often suffer from redundant computation and inefficiency in mathematical tasks due to fixed-length chain-of-thought reasoning. To address this, we propose a length-adaptive reasoning framework that internalizes inference-depth control as an intrinsic model capability. First, we model the distribution of successful reasoning lengths via reinforcement learning; second, we introduce a meta-cognitive guidance mechanism that dynamically adjusts the number of reasoning steps based on contextual cues during inference. Crucially, our approach requires no external controller or manually specified thresholds, enabling end-to-end autonomous depth decisions. Evaluated on mainstream mathematical reasoning benchmarks, our method reduces token consumption by up to 40.9% over strong baselines while improving accuracy by 2.3%, demonstrating the effectiveness and generalizability of allocating computational resources in proportion to problem complexity.
📝 Abstract
Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model's reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.
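The abstract's two-stage process can be illustrated with a minimal sketch. The paper does not give the exact reward here, so the functions below are assumptions for illustration only: stage 1 is approximated by taking a summary statistic (the median) of successful solution lengths, and stage 2 by a correctness reward shaped with a penalty for deviating from that discovered length budget (the `alpha` weight is a hypothetical hyperparameter).

```python
import statistics

def target_length(successful_lengths):
    """Stage 1 (sketch): summarize the distribution of token lengths
    observed in *successful* rollouts; the median serves as the
    length budget the model should aim for."""
    return statistics.median(successful_lengths)

def length_aware_reward(correct, length, budget, alpha=0.001):
    """Stage 2 (sketch): correctness reward minus a penalty that grows
    with the absolute deviation from the stage-1 length budget, so the
    policy is pushed toward efficient but still-correct reasoning."""
    base = 1.0 if correct else 0.0
    penalty = alpha * abs(length - budget)
    return base - penalty

# Example: budget discovered from three successful rollouts.
budget = target_length([100, 200, 300])          # -> 200
on_budget = length_aware_reward(True, 200, budget)   # -> 1.0
too_long = length_aware_reward(True, 500, budget)    # -> 0.7
```

In the actual method the budget is embedded in the model's reasoning context as meta-cognitive guidance rather than applied as an external constraint, which is what makes the depth control intrinsic.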