The Art of Efficient Reasoning: Data, Reward, and Optimization

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of balancing efficiency and accuracy in chain-of-thought reasoning for large language models, which typically incurs heavy computational cost. The authors propose a reinforcement learning–based mechanism that uses reward shaping to guide models toward concise yet accurate reasoning paths, and identify a two-stage training paradigm: length adaptation followed by reasoning refinement. They find that training on relatively easier prompts maintains a high density of positive reward signals and thereby mitigates length collapse, and that the learned length preferences generalize across domains. The approach proves robust across the Qwen3 model family (0.6B–30B), validated through approximately 200,000 GPU hours of experimentation. The study also introduces fine-grained evaluation metrics, such as length distributions conditioned on correctness, and distills practical, reproducible guidelines for training efficient reasoning models.

📝 Abstract
Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but also suffer from heavy computational overhead. To address this issue, efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL). In this paper, we systematically investigate the mechanics of efficient reasoning for LLMs. For comprehensive evaluation, we advocate for more fine-grained metrics, including length distribution conditioned on correctness and performance across a wide spectrum of token budgets ranging from 2k to 32k. First, we reveal that the training process follows a two-stage paradigm: length adaptation and reasoning refinement. We then conduct extensive experiments (about 0.2 million GPU hours) under a unified protocol, deconstructing training prompts and rollouts, reward shaping, and optimization strategies. A key finding is to train on relatively easier prompts, which ensures the density of positive reward signals and thus avoids length collapse. Meanwhile, the learned length bias generalizes across domains. We distill all findings into valuable insights and practical guidelines, and further validate them across the Qwen3 series, ranging from 0.6B to 30B, demonstrating robustness and generalization.
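The abstract describes reward shaping that incentivizes short yet accurate trajectories without rewarding brevity at the cost of correctness. The paper's exact reward function is not reproduced on this page; the following is a minimal sketch of one common length-aware shaping scheme consistent with that description, where `budget` and `alpha` are illustrative parameters, not values from the paper:

```python
def shaped_reward(correct: bool, length: int, budget: int = 4096,
                  alpha: float = 0.5) -> float:
    """Toy length-aware reward shaping (illustrative, not the paper's scheme).

    Correct answers earn a base reward of 1.0 plus a bonus of up to `alpha`
    that grows as the reasoning trace shortens relative to `budget`.
    Incorrect answers earn 0.0, so brevity is never rewarded over accuracy;
    this keeps positive-reward density tied to solvable (easier) prompts,
    which the paper argues helps avoid length collapse.
    """
    if not correct:
        return 0.0
    # Fraction of the token budget left unused, clipped to [0, 1].
    saved = max(0.0, min(1.0, 1.0 - length / budget))
    return 1.0 + alpha * saved
```

Under this toy scheme, a correct 2,048-token trace with the defaults scores 1.25, while the same trace at the full 4,096-token budget scores 1.0, so gradient pressure favors shorter correct rollouts.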
Problem

Research questions and friction points this paper is trying to address.

efficient reasoning
Chain-of-Thought
computational overhead
reasoning length
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient Reasoning
Chain-of-Thought
Reinforcement Learning
Length Collapse
Reward Shaping