Learn to Reason Efficiently with Adaptive Length-based Reward Shaping

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) trained with reinforcement learning (RL) often produce long chain-of-thought (CoT) traces with substantial redundancy, which limits their efficiency. To address this, we first present a unified framework that formulates existing efficient-reasoning methods as length-based reward shaping. Building on it, we propose LASER, which uses a step function controlled by a target length as the reward and achieves a better Pareto trade-off between accuracy and response length than previous methods. We further present LASER-D, a dynamic and difficulty-aware variant: it adapts the reward specification as the model's reasoning behavior evolves during training, and it penalizes lengthy CoTs more for easy queries. On AIME2024, LASER-D and its variant achieve a +6.1 improvement while reducing token usage by 63%, and the resulting reasoning contains fewer redundant "self-reflections".

📝 Abstract
Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel Length-bAsed StEp Reward shaping method (LASER), which employs a step function as the reward, controlled by a target length. LASER surpasses previous methods, achieving a superior Pareto-optimal balance between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) The reasoning behavior of the model evolves during training, necessitating reward specifications that are also adaptive and dynamic; (2) Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware, i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (Dynamic and Difficulty-aware). Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D and its variant achieve a +6.1 improvement on AIME2024 while reducing token usage by 63%. Further analysis reveals our RL-based compression produces more concise reasoning patterns with fewer redundant "self-reflections". Resources are at https://github.com/hkust-nlp/Laser.
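The abstract describes the LASER reward only as "a step function ... controlled by a target length". The sketch below is one plausible reading of that description, not the paper's exact formula. The function name, the `bonus` value, and the choice to grant the bonus only to correct responses are all assumptions made for illustration.

```python
def step_length_reward(is_correct: bool,
                       num_tokens: int,
                       target_length: int,
                       bonus: float = 0.5) -> float:
    """Correctness reward plus a step-function length bonus (illustrative only).

    Assumptions not stated in the abstract: the bonus magnitude, and that
    only correct responses are eligible for the bonus.
    """
    correctness = 1.0 if is_correct else 0.0
    # Step function: the full bonus applies at or below the target length,
    # and no bonus applies above it.
    within_target = num_tokens <= target_length
    # Gate the bonus on correctness so short but wrong answers earn nothing.
    length_bonus = bonus if (is_correct and within_target) else 0.0
    return correctness + length_bonus
```

Under these assumptions, a correct response within the target length scores higher than a correct response that overshoots it, and an incorrect response scores zero regardless of its length.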
Problem

Research questions and friction points this paper is trying to address.

Reducing redundancy in the outputs of Large Reasoning Models
Improving reasoning efficiency via adaptive reward shaping
Balancing performance and efficiency in reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Length-bAsed StEp Reward shaping (LASER) method
Dynamic and Difficulty-aware (LASER-D) approach
Adaptive length-based reward shaping for efficiency
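The difficulty-aware idea (penalize lengthy CoTs more for easy queries) can be illustrated by letting the target length depend on an estimate of query difficulty. The sketch below is a hypothetical illustration and not the scheduling rule used by LASER-D. It assumes difficulty is estimated from the pass rate of sampled responses and that the target length is interpolated linearly between two made-up bounds.

```python
from typing import Sequence


def estimate_pass_rate(correct_flags: Sequence[bool]) -> float:
    """Fraction of sampled responses to one query that were correct."""
    if not correct_flags:
        raise ValueError("need at least one sampled response")
    return sum(correct_flags) / len(correct_flags)


def difficulty_aware_target(pass_rate: float,
                            easy_target: int = 1024,
                            hard_target: int = 4096) -> int:
    """Map an estimated pass rate to a target length (hypothetical rule).

    A high pass rate (easy query) yields a short target, so long responses
    lose the length bonus sooner; a low pass rate (hard query) yields a
    long target. The bounds 1024 and 4096 are placeholder values.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be in [0, 1]")
    # Linear interpolation between the hard and easy targets.
    return round(hard_target - pass_rate * (hard_target - easy_target))
```

Recomputing the pass rate from fresh samples during training would also make the target dynamic, since it would follow the model's changing behavior. This is again an assumption about how the two intuitions could be combined.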