Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR

📅 2025-11-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) often generate excessively long reasoning chains when solving mathematical problems, inflating computational overhead. Existing reinforcement learning with verifiable rewards (RLVR) pipelines exacerbate the "longer is better" bias by filtering out easy samples and training solely on hard ones. This work proposes an implicit length-regularization strategy: within the RLVR framework, moderately easy samples are retained and modestly up-weighted, so that the difficulty distribution of the training mix itself acts as a natural length regulator, removing the need for explicit length penalties while still encouraging concise and accurate reasoning paths. Evaluated on AIME25, the method maintains baseline accuracy while cutting average reasoning-chain length by 47%, substantially improving inference efficiency. The core contribution is uncovering and harnessing this implicit regularizing effect of the sample difficulty distribution on generated reasoning length, pointing to a practical recipe for efficient and controllable reasoning training.
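The data-curation side of this idea can be made concrete with a minimal sketch. Everything below is an assumption for illustration, not the paper's implementation: `policy`, `verifier`, the solve-rate band `easy_band`, and the weight `easy_weight` are hypothetical stand-ins. The point is simply that problems the current policy already solves fairly often are kept in the mix with a modest extra weight instead of being filtered out.

```python
def estimate_pass_rate(policy, problem, verifier, n_samples=8):
    """Empirical solve rate under the current policy: the fraction of
    sampled solutions that the verifier accepts."""
    correct = sum(
        verifier(problem, policy.generate(problem)) for _ in range(n_samples)
    )
    return correct / n_samples


def build_training_mix(problems, policy, verifier,
                       easy_band=(0.6, 0.95), easy_weight=1.5):
    """Curate an RLVR training mix that keeps moderately easy problems.

    Instead of discarding everything above a difficulty cutoff, problems
    whose solve rate falls in `easy_band` are kept and modestly up-weighted,
    so their short correct solutions keep regularizing output length.
    """
    mix = []
    for problem in problems:
        rate = estimate_pass_rate(policy, problem, verifier)
        if rate >= 1.0:  # fully saturated: no gradient signal in RLVR
            continue
        weight = easy_weight if easy_band[0] <= rate < easy_band[1] else 1.0
        mix.append((problem, weight))
    return mix
```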

📝 Abstract
Large language models (LLMs) trained for step-by-step reasoning often become excessively verbose, raising inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out "easy" problems for training efficiency, leaving the model to train primarily on harder problems that require longer reasoning chains. This skews the output length distribution upward, resulting in a model that conflates "thinking longer" with "thinking better". In this work, we show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer. Exposing the model to solvable short-chain tasks constrains its output distribution and prevents runaway verbosity. The result is emergent brevity for free: the model learns to solve harder problems without inflating the output length, despite the absence of any explicit length penalization. RLVR experiments using this approach on Qwen3-4B-Thinking-2507 (with a 16k token limit) achieve baseline pass@1 AIME25 accuracy while generating solutions that are, on average, nearly twice as short. The code is available at https://github.com/MBZUAI-Paris/Frugal-AI, with datasets and models at https://huggingface.co/collections/MBZUAI-Paris/k2-think-mini-68dcfa8b114686a4bd3dc2bc.
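To illustrate where such a per-problem weight would enter the objective, here is a minimal GRPO-style sketch. It is simplified (no clipping ratio, no KL term), and the function name, shapes, and weighting scheme are assumptions for illustration, not the authors' loss:

```python
import torch


def weighted_grpo_loss(logprobs, rewards, sample_weight, eps=1e-6):
    """One group-relative policy-gradient step with a per-problem weight.

    logprobs:      (G, T) token log-probabilities for G sampled solutions
    rewards:       (G,)   verifiable 0/1 rewards (float tensor)
    sample_weight: scalar weight assigned to this problem in the training mix
    """
    # Group-normalized advantages, as in GRPO-style RLVR.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Policy-gradient surrogate; the per-problem weight scales the whole
    # update, so up-weighted easy problems pull harder on the length
    # distribution even though the reward contains no length penalty.
    per_completion = -(adv.unsqueeze(1) * logprobs).mean(dim=1)
    return sample_weight * per_completion.mean()
```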
Problem

Research questions and friction points this paper is trying to address.

LLMs trained for reasoning become excessively verbose
Standard RLVR training skews output length distribution upward
Model conflates longer reasoning chains with better performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retaining easy problems as implicit length regularizer
Up-weighting solvable short-chain tasks to constrain the output length distribution
Achieving emergent brevity without explicit length penalization
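Tying these points to the earlier sketches, a toy training loop could apply the curated weight problem by problem. Here `score` and `optimizer` are assumed stand-ins, and the group size of 8 is arbitrary:

```python
mix = build_training_mix(problems, policy, verifier)
for problem, weight in mix:
    completions = [policy.generate(problem) for _ in range(8)]
    logprobs, rewards = score(policy, problem, completions)  # assumed helper
    loss = weighted_grpo_loss(logprobs, rewards, weight)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```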
Abdelaziz Bounhar
MBZUAI
Hadi Abdine
Institute of Foundation Models - MBZUAI Paris
LLM, NLP, Deep Learning, Transformer, Machine Learning
Evan Dufraisse
MBZUAI
Ahmad Chamma
MBZUAI
Amr Mohamed
MBZUAI
Dani Bouch
MBZUAI
M. Vazirgiannis
MBZUAI, École Polytechnique
Guokan Shang
MBZUAI-IFM Paris Lab