🤖 AI Summary
This work challenges the conventional instruction-following paradigm of training on a mixture of hard and soft constraints to improve generalization, a practice that often suffers from reward hacking due to imprecise reward signals. Instead, the authors propose a data optimization strategy centered on reward precision: they fine-tune models via reinforcement learning using only high-precision, hard-constraint rewards, and use an LLM-based critic together with attention analysis to filter for high-quality training data. Experiments show that this approach achieves an average performance gain of 13.4% across five benchmarks, reduces training time by 58%, and generalizes better to both unseen instructions and non-instruction tasks. These findings underscore reward precision as a critical factor for effective alignment and reveal its role in cultivating transferable instruction-following meta-skills.
📝 Abstract
A central belief in scaling reinforcement learning with verifiable rewards for instruction-following (IF) tasks is that a diverse mixture of verifiable hard constraints and unverifiable soft constraints is essential for generalizing to unseen instructions. In this work, we challenge this prevailing consensus through a systematic empirical investigation. Counter-intuitively, we find that models trained on hard-only constraints consistently outperform those trained on mixed datasets. Extensive experiments reveal that reward precision, rather than constraint diversity, is the primary driver of effective alignment. The LLM judge suffers from a low recall rate in detecting false responses, which leads to severe reward hacking and thereby undermines the benefits of diversity. Furthermore, analysis of the attention mechanism reveals that high-precision rewards cultivate a transferable meta-skill for IF. Motivated by these insights, we propose a simple yet effective data-centric refinement strategy that prioritizes reward precision. Evaluated on five benchmarks, our approach outperforms competitive baselines by 13.4% while reducing training time by 58%, and it maintains strong generalization beyond instruction following. Our findings advocate for a paradigm shift: away from the indiscriminate pursuit of data diversity and toward high-precision rewards.
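To make the "verifiable hard constraint" idea concrete, here is a minimal sketch of what a high-precision, programmatically checkable reward might look like. The constraint types (word-count limit, required keyword, forbidden word) and all names are illustrative assumptions, not taken from the paper; the point is that each check is deterministic, so the reward signal has no judge-induced false positives, unlike an LLM judge scoring soft constraints.

```python
import re

def hard_constraint_reward(response: str, constraints: list[dict]) -> float:
    """Binary reward: 1.0 only if every verifiable constraint passes.

    Each constraint is a dict with a 'type' key plus parameters. These
    toy constraint types are illustrative stand-ins for the kinds of
    hard constraints that can be checked exactly in code.
    """
    words = response.split()
    for c in constraints:
        if c["type"] == "max_words" and len(words) > c["limit"]:
            return 0.0  # response too long: deterministic fail
        if c["type"] == "must_include" and c["keyword"].lower() not in response.lower():
            return 0.0  # required keyword missing
        if c["type"] == "forbid_word" and re.search(
            rf"\b{re.escape(c['word'])}\b", response, re.IGNORECASE
        ):
            return 0.0  # forbidden word present
    return 1.0

constraints = [
    {"type": "max_words", "limit": 20},
    {"type": "must_include", "keyword": "reward"},
    {"type": "forbid_word", "word": "maybe"},
]
print(hard_constraint_reward("Precise reward signals help.", constraints))  # 1.0
print(hard_constraint_reward("Maybe the reward works.", constraints))       # 0.0
```

Because every check either passes or fails exactly, precision of the positive reward is 1 by construction; soft constraints (e.g. "be polite") have no such checker and must fall back to an imprecise LLM judge.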