MARS: Margin-Aware Reward-Modeling with Self-Refinement

📅 2026-02-19
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Reward model training is hindered by the high cost and scarcity of human preference data, and existing data augmentation methods often overlook the discriminative difficulty of samples, limiting improvements in model robustness. This work proposes MARS, a novel framework that, for the first time, introduces margin-awareness and self-refinement mechanisms into reward modeling. By adaptively sampling ambiguous preference instances where the model exhibits high uncertainty, and by iteratively augmenting hard examples, MARS dynamically reshapes the training distribution. Theoretically, this approach improves the curvature and condition number of the loss landscape. Empirically, it substantially outperforms uniform augmentation strategies across multiple benchmarks, enhancing the reward model's discriminative power, generalization capability, and optimization efficiency.
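The curvature claim can be made concrete under a standard Bradley–Terry pairwise loss (an assumption here; the paper's exact objective is not shown on this page). Writing the margin of a preference pair as $m = r(x, y_w) - r(x, y_l)$, the per-pair loss and its curvature are

```latex
\ell(m) = -\log \sigma(m), \qquad
\ell''(m) = \sigma(m)\,\bigl(1 - \sigma(m)\bigr),
```

and $\ell''(m)$ is maximized at $m = 0$ (value $1/4$), decaying toward zero as $|m|$ grows. Low-margin (ambiguous) pairs therefore contribute the most curvature to the training objective, which is consistent with the summary's claim that concentrating sampling on them improves the curvature of the loss landscape.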

πŸ“ Abstract
Reward modeling is a core component of modern alignment pipelines such as RLHF and RLAIF, underpinning policy-optimization methods such as PPO and TRPO. However, training reliable reward models relies heavily on human-labeled preference data, which is costly and limited, motivating the use of data augmentation. Existing augmentation approaches typically operate at the representation or semantic level and remain agnostic to the reward model's estimation difficulty. In this paper, we propose MARS, an adaptive, margin-aware augmentation and sampling strategy that explicitly targets the ambiguous cases and failure modes of the reward model. MARS concentrates augmentation on low-margin (ambiguous) preference pairs, where the reward model is most uncertain, and iteratively refines the training distribution via hard-sample augmentation. We provide theoretical guarantees showing that this strategy increases the average curvature of the loss function, thereby increasing the information it carries and improving its conditioning, along with empirical results demonstrating consistent gains over uniform augmentation for robust reward modeling.
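A minimal sketch of what margin-aware sampling could look like, assuming a scalar reward model scored on chosen/rejected pairs; the function names, the softmax weighting, and the temperature parameter are illustrative choices, not the paper's implementation.

```python
import numpy as np

def preference_margins(r_chosen, r_rejected):
    """Margin of each preference pair: reward gap between chosen and rejected."""
    return np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)

def sample_low_margin(pairs, r_chosen, r_rejected, k, temperature=1.0):
    """Sample k pairs, favoring low-|margin| (ambiguous) ones.

    Weights follow a softmax over -|margin| / temperature, so pairs with
    margins near zero (where the reward model is most uncertain) are
    selected more often than confidently separated pairs.
    """
    margins = preference_margins(r_chosen, r_rejected)
    logits = -np.abs(margins) / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    idx = np.random.choice(len(pairs), size=k, replace=False, p=weights)
    return [pairs[i] for i in idx]
```

In an iterative self-refinement loop, the sampled low-margin pairs would then be augmented (e.g., paraphrased or perturbed) and folded back into the training set, reshaping the distribution toward hard examples.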
Problem

Research questions and friction points this paper is trying to address.

reward modeling
data augmentation
preference data
margin-aware
alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

margin-aware
reward modeling
data augmentation
hard-sample mining
self-refinement