Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reinforcement fine-tuning, policy models often "hack" the reward signal to achieve high scores while producing low-quality outputs, a consequence of insufficient discriminative power in the high-reward tail of the reward signal. To address this, we propose a rubric-based reward modeling approach that explicitly targets the high-reward tail. Leveraging off-policy samples, it constructs a fine-grained, multi-dimensional reward function capable of precisely distinguishing exceptional responses from merely high-quality ones. Our method introduces a robust, high-quality diversity sampling workflow that uses strong models for offline sample generation and rewriting, supported by theoretical analysis and empirical validation. Experiments demonstrate that our approach significantly mitigates reward over-optimization, improving both response quality and alignment across multiple tasks. The implementation is publicly available.
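The multi-dimensional reward described above can be sketched as a weighted aggregation of per-criterion rubric scores. The criteria, weights, and scoring functions below are hypothetical illustrations, not the paper's actual rubric; in practice each criterion would typically be scored by an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricCriterion:
    """One fine-grained scoring dimension of the rubric (illustrative)."""
    name: str
    weight: float
    score: Callable[[str], float]  # maps a response to a score in [0, 1]

def rubric_reward(response: str, rubric: list[RubricCriterion]) -> float:
    """Aggregate per-criterion scores into a single scalar reward in [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * c.score(response) for c in rubric) / total_weight

# Toy criteria standing in for LLM-judged dimensions (hypothetical):
rubric = [
    RubricCriterion("cites_evidence", 2.0, lambda r: float("because" in r)),
    RubricCriterion("concise", 1.0, lambda r: float(len(r.split()) < 50)),
]
print(rubric_reward("It works because of X.", rubric))  # → 1.0
```

Because each dimension is scored independently, a fine-grained rubric like this can separate responses that a single scalar judge would cluster together at the top of the reward range.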

📝 Abstract
Reinforcement fine-tuning (RFT) often suffers from "reward over-optimization", where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail: the inability to reliably distinguish Excellent responses from merely Great ones. This motivates us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g., from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among great and diverse responses, and introduce a workflow to implement this idea. We empirically demonstrate that rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements. Our code can be accessed at https://github.com/Jun-Kai-Zhang/rubrics.git.
Problem

Research questions and friction points this paper is trying to address.

Addressing reward over-optimization in reinforcement fine-tuning of language models
Distinguishing excellent from great responses in high-reward tail regions
Mitigating reward misspecification using rubric-based reward modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rubric-based rewards mitigate reward over-optimization
Leveraging off-policy examples with rubric design
Distinguishing excellent from great responses effectively