Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies and formalizes a critical issue in large language reasoning models trained via reinforcement learning: reward-induced “length bias,” wherein models generate unnecessarily verbose reasoning for simple problems, substantially increasing deployment costs. To address this, the authors propose Dynamic Outlier Truncation (DOT), a lightweight intervention that selectively suppresses tail tokens of excessively long responses—but only when the full generation is correct—thereby preserving the model’s capacity for long-chain reasoning on complex tasks. DOT is combined with KL-divergence regularization and predictive dynamic sampling to ensure training stability. Experiments across multiple model scales demonstrate that DOT significantly advances the efficiency–accuracy Pareto frontier, reducing inference tokens by 78% on AIME-24 while outperforming both the initial policy and existing efficient reasoning methods in accuracy.
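The core mechanism described above — truncating only the extreme length outliers, and only inside rollout groups where every response is correct — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the IQR-based outlier threshold (`k = 1.5`) and the group-size guard are assumptions, since the summary does not specify the exact truncation criterion.

```python
def dot_keep_lengths(lengths, all_correct, k=1.5):
    """Sketch of Dynamic Outlier Truncation (DOT) for one rollout group.

    lengths: per-response token counts in the group.
    all_correct: True only if every response in the group is correct.
    Returns the number of tokens of each response to keep in the
    training loss; tail tokens beyond the cutoff would be suppressed.

    The IQR outlier rule below is a hypothetical stand-in for the
    paper's threshold, which this summary does not specify.
    """
    # Truncate only within fully correct groups, so long-chain
    # reasoning on hard (not-yet-solved) problems is left untouched.
    if not all_correct or len(lengths) < 4:
        return list(lengths)
    s = sorted(lengths)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]
    # Responses beyond Q3 + k * IQR are treated as length outliers.
    cutoff = int(q3 + k * (q3 - q1))
    return [min(length, cutoff) for length in lengths]
```

For example, in a fully correct group with lengths `[100, 120, 110, 105, 900]`, only the 900-token outlier is clipped (to 142 under these assumed parameters); an incorrect group passes through unchanged.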

📝 Abstract
Large reasoning models enhanced by reinforcement learning with verifiable rewards have achieved significant performance gains by extending their chain-of-thought. However, this paradigm incurs substantial deployment costs, as models often exhibit excessive verbosity on simple queries. Existing efficient reasoning methods that rely on explicit length penalties often introduce optimization conflicts and leave the generative mechanisms driving overthinking largely unexamined. In this paper, we identify a phenomenon termed length shift, where models increasingly generate unnecessary reasoning on trivial inputs during training. To address this, we introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. This method targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities for complex problems. To complement this intervention and ensure stable convergence, we further incorporate auxiliary KL regularization and predictive dynamic sampling. Experimental results across multiple model scales demonstrate that our approach significantly pushes the efficiency-performance Pareto frontier outward. Notably, on AIME-24, our method reduces inference token usage by 78% while simultaneously increasing accuracy compared to the initial policy and surpassing state-of-the-art efficient reasoning methods.
Problem

Research questions and friction points this paper is trying to address.

length shift
efficient reasoning
overthinking
redundant reasoning
reasoning verbosity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Outlier Truncation
Length Shift
Efficient Reasoning
KL Regularization
Pareto Frontier