Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning

πŸ“… 2026-01-08
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the susceptibility of hybrid reasoning models trained with reinforcement learning to "reward hacking," wherein a response that engages in thinking is misjudged as non-thinking and receives an incorrect reward. The authors propose a training method that requires no supervised fine-tuning and instead dynamically analyzes solution-relevant information within thinking trajectories to adaptively cap the maximum generation length of non-thinking responses per query. This approach effectively mitigates reward hacking while maintaining computational efficiency. Evaluated across five mathematical benchmarks, the method significantly outperforms DeepSeek-R1-Distill-Qwen-1.5B/7B and DeepScaleR-1.5B, achieving higher accuracy while reducing token consumption by approximately 50% and constraining the probability of reward hacking to below 10%.

πŸ“ Abstract
Large reasoning models (LRMs) have attracted much attention due to their exceptional performance. However, their performance mainly stems from thinking, a long Chain of Thought (CoT), which significantly increases computational overhead. To address this overthinking problem, existing work focuses on using reinforcement learning (RL) to train hybrid reasoning models that automatically decide whether to engage in thinking based on the complexity of the query. Unfortunately, RL suffers from the reward hacking problem, e.g., the model engages in thinking but is judged as not doing so, resulting in incorrect rewards. To mitigate this problem, existing works either employ supervised fine-tuning (SFT), which incurs high computational costs, or enforce uniform token limits on non-thinking responses, which yields limited mitigation. In this paper, we propose Thinking-Based Non-Thinking (TNT). It does not employ SFT, and it sets a different maximum token budget for non-thinking responses across queries by leveraging information from the solution component of thinking responses. Experiments on five mathematical benchmarks demonstrate that TNT reduces token usage by around 50% compared to DeepSeek-R1-Distill-Qwen-1.5B/7B and DeepScaleR-1.5B, while significantly improving accuracy. In fact, TNT achieves the optimal trade-off between accuracy and efficiency among all tested methods. Additionally, the probability of the reward hacking problem in TNT's responses classified as not using thinking remains below 10% across all tested datasets.
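The abstract's core idea, deriving a per-query length cap for non-thinking responses from the solution part of thinking responses, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `<think>`/`</think>` tag convention, the whitespace tokenizer, and the slack-scaled maximum rule are all assumptions, since the page does not give the exact cap formula.

```python
# Hypothetical sketch of TNT's adaptive length cap. Assumptions (not from
# the paper page): thinking is delimited by <think>...</think> tags, tokens
# are whitespace-split words, and the cap is the longest observed solution
# length scaled by a slack factor.

def split_thinking_response(response: str) -> tuple[str, str]:
    """Split a response into its thinking part (inside <think> tags)
    and its solution part (everything after </think>)."""
    open_tag, close_tag = "<think>", "</think>"
    if open_tag in response and close_tag in response:
        start = response.index(open_tag) + len(open_tag)
        end = response.index(close_tag)
        return response[start:end], response[end + len(close_tag):]
    return "", response  # no thinking section detected

def adaptive_cap(thinking_responses: list[str],
                 tokenize=str.split,
                 slack: float = 1.2) -> int:
    """Per-query max token budget for non-thinking responses: the longest
    solution section among the thinking rollouts, scaled by `slack`."""
    solution_lengths = [
        len(tokenize(split_thinking_response(r)[1]))
        for r in thinking_responses
    ]
    return int(slack * max(solution_lengths)) if solution_lengths else 0
```

During RL training, such a cap would be recomputed per query from that query's thinking rollouts, so easy queries (short solutions) get tight non-thinking budgets while hard ones get more room, in place of a uniform token limit.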
Problem

Research questions and friction points this paper is trying to address.

reward hacking
hybrid reasoning models
reinforcement learning
Chain of Thought
computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward hacking
hybrid reasoning models
Chain of Thought
reinforcement learning
token efficiency
Siyuan Gan
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Jiaheng Liu
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Boyan Wang
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Tianpei Yang
Nanjing University
Reinforcement Learning, Transfer Learning, Multiagent Systems, AI Agents
Runqing Miao
Jiutian Research, Beijing, China
Yuyao Zhang
Renmin University of China
Artificial Intelligence
Fanyu Meng
Jiutian Research, Beijing, China
Junlan Feng
Chief Scientist at China Mobile Research
Natural Language, Machine Learning, Speech Processing, Data Mining
Linjian Meng
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Jing Huo
Nanjing University
Machine Learning, Computer Vision
Yang Gao
Nanjing University, China
Artificial Intelligence, Machine Learning, Multi-agent Systems, Big Data, Image/Video Processing