Precision over Diversity: High-Precision Reward Generalizes to Robust Instruction Following

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work challenges the conventional paradigm in instruction-following methods of mixing hard and soft constraints to improve generalization, a paradigm that often suffers from reward hacking due to imprecise reward signals. Instead, the authors propose a data optimization strategy centered on reward precision: they fine-tune models via reinforcement learning using only high-precision hard-constraint rewards, augmented by an LLM-based critic and attention analysis to filter for high-quality training data. Experiments show that this approach achieves an average performance gain of 13.4% across five benchmarks, reduces training time by 58%, and generalizes more strongly to both unseen instructions and non-instruction tasks. These findings underscore reward precision as a critical factor for effective alignment and reveal its role in cultivating transferable instruction-following meta-skills.

📝 Abstract
A central belief in scaling reinforcement learning with verifiable rewards for instruction-following (IF) tasks is that a diverse mixture of verifiable hard and unverifiable soft constraints is essential for generalizing to unseen instructions. In this work, we challenge this prevailing consensus through a systematic empirical investigation. Counter-intuitively, we find that models trained on hard-only constraints consistently outperform those trained on mixed datasets. Extensive experiments reveal that reward precision, rather than constraint diversity, is the primary driver of effective alignment. The LLM judge suffers from a low recall rate in detecting false responses, which leads to severe reward hacking and thereby undermines the benefits of diversity. Furthermore, analysis of the attention mechanism reveals that high-precision rewards develop a transferable meta-skill for IF. Motivated by these insights, we propose a simple yet effective data-centric refinement strategy that prioritizes reward precision. Evaluated on five benchmarks, our approach outperforms competitive baselines by 13.4% in performance while achieving a 58% reduction in training time and maintaining strong generalization beyond instruction following. Our findings advocate for a paradigm shift: moving away from the indiscriminate pursuit of data diversity toward high-precision rewards.
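The "verifiable hard constraints" the abstract contrasts with LLM-judged soft constraints can be checked programmatically, which is what makes their reward signal high-precision. Below is a minimal illustrative sketch of such binary constraint verifiers; the specific constraint functions are hypothetical examples, not the paper's actual implementation.

```python
import json

# Hypothetical hard-constraint verifiers: each check is deterministic and
# programmatic, so the resulting reward has high precision -- a response
# either satisfies the constraint or it does not, with no judge involved.
def check_max_words(response: str, limit: int) -> bool:
    """Verify the response stays within a word budget."""
    return len(response.split()) <= limit

def check_contains_keyword(response: str, keyword: str) -> bool:
    """Verify a required keyword appears in the response."""
    return keyword.lower() in response.lower()

def check_valid_json(response: str) -> bool:
    """Verify the response parses as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def hard_constraint_reward(response: str, constraints: list) -> float:
    """Binary reward: 1.0 only if every verifiable constraint passes."""
    return 1.0 if all(fn(response) for fn in constraints) else 0.0

constraints = [
    lambda r: check_max_words(r, 10),
    lambda r: check_contains_keyword(r, "reward"),
]
print(hard_constraint_reward("Precise reward signals aid alignment.", constraints))  # 1.0
```

A soft constraint (e.g. "be polite") would instead need an LLM judge, whose low recall on false responses is what the paper identifies as the source of reward hacking.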
Problem

Research questions and friction points this paper is trying to address.

instruction following
reward precision
constraint diversity
reinforcement learning
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward precision
instruction following
reinforcement learning
reward hacking
data-centric refinement
Yirong Zeng
Harbin Institute of Technology, SCIR
Yufei Liu
Peking University
Xiao Ding
Harbin Institute of Technology
Natural Language Processing, Artificial Intelligence
Yutai Hou
Huawei
LLM, NLP, Dialogue, Alignment, Meta Learning
Yuxian Wang
Huawei Technologies Co., Ltd
Haonan Song
Huawei Technologies Co., Ltd
Wu Ning
Huawei Technologies Co., Ltd
Dandan Tu
Huawei Technologies Co., Ltd
Qixun Zhang
Professor, Beijing University of Posts and Telecommunications
wireless communication
Bibo Cai
Harbin Institute of Technology
NLP
Yuxiang He
Harbin Institute of Technology, SCIR
Ting Liu
Shanghai Jiao Tong University
artificial intelligence