When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Weak-to-strong (W2S) preference learning can perform well in-distribution yet lose much of that gain under zero-shot distribution shift across preference datasets. This work presents evidence that weakly supervised fine-tuning can pull the strong reward model toward source-domain features, and that the resulting representation drift degrades out-of-domain transfer. To mitigate this, the authors propose Representation Anchoring (Anchor), a regularizer that limits how far the model's representations drift from their pretrained state during fine-tuning while still allowing task-relevant adaptation. Experiments across preference domains, datasets, and model families show that Anchor consistently improves out-of-distribution transfer while maintaining competitive in-distribution performance.
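
To make the in-distribution versus out-of-distribution comparison concrete, here is a minimal sketch of one way such a gap could be measured. It is not the paper's evaluation protocol or its transfer-aware metrics, which the summary does not specify. The names `pairwise_accuracy` and `transfer_gap` are hypothetical, the toy reward function (response length) stands in for a trained reward model, and the preference pairs are made up for illustration.

```python
from typing import Callable, Sequence, Tuple

# A preference pair is (chosen_response, rejected_response).
Pair = Tuple[str, str]


def pairwise_accuracy(reward_fn: Callable[[str], float], pairs: Sequence[Pair]) -> float:
    """Fraction of pairs where the reward model scores the chosen response higher."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(1 for chosen, rejected in pairs if reward_fn(chosen) > reward_fn(rejected))
    return correct / len(pairs)


def transfer_gap(
    reward_fn: Callable[[str], float],
    source_pairs: Sequence[Pair],
    target_pairs: Sequence[Pair],
) -> float:
    """In-distribution accuracy minus zero-shot accuracy on a shifted dataset."""
    return pairwise_accuracy(reward_fn, source_pairs) - pairwise_accuracy(reward_fn, target_pairs)


# Toy stand-in for a reward model that has latched onto a source-domain
# feature: it simply prefers longer responses (an illustrative assumption).
def length_reward(text: str) -> float:
    return float(len(text))


# Source domain: longer responses happen to be the preferred ones.
source_pairs = [("a detailed answer", "short"), ("thorough reply here", "meh")]
# Shifted domain: preference no longer tracks length.
target_pairs = [("concise", "a long rambling answer"), ("clear and complete", "no")]

source_acc = pairwise_accuracy(length_reward, source_pairs)
target_acc = pairwise_accuracy(length_reward, target_pairs)
gap = transfer_gap(length_reward, source_pairs, target_pairs)
```

In this toy setup the length-based scorer looks perfect on the source pairs and drops to chance on the shifted pairs, which is the kind of hidden brittleness a matched train-test evaluation would not reveal.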

📝 Abstract

Weak-to-strong (W2S) generalization is a promising framework for scalable oversight, yet existing evaluations often test students under matched train-test distributions. We therefore study W2S preference learning under zero-shot distribution shift and find that strong students trained on weak preference labels can appear successful in-distribution while failing to transfer across preference datasets. We provide evidence for a representational failure mode in which weakly supervised fine-tuning can pull the strong model toward source-domain features instead of maintaining broadly transferable preference representations. To mitigate this, we propose Representation Anchoring (Anchor), a simple yet effective regularizer that constrains excessive drift from the pretrained strong model's representation space during fine-tuning, while still allowing task-relevant adaptation. Across preference domains, datasets, and model families, Anchor consistently improves out-of-distribution transfer while maintaining competitive in-distribution performance. Together, our evaluation protocol, transfer-aware metrics, and method expose hidden brittleness in current W2S reward modeling and provide a practical path toward more robust preference transfer.
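
The abstract does not give the exact form of the Anchor regularizer, so the following is only a sketch of the general idea under stated assumptions: a standard Bradley-Terry pairwise loss on weak-teacher preference labels, plus a mean squared distance between the fine-tuned model's representation and the frozen pretrained model's representation of the same input. The function names, the choice of squared L2 distance, and the weight `lam` are all hypothetical, and plain Python lists stand in for hidden-state tensors.

```python
import math
from typing import Sequence


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    Computed as softplus(-margin) in a numerically stable way.
    """
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def anchor_penalty(h_student: Sequence[float], h_pretrained: Sequence[float]) -> float:
    """Mean squared distance between fine-tuned and frozen pretrained representations.

    Assumption: the paper may use a different distance or a different layer.
    """
    if len(h_student) != len(h_pretrained):
        raise ValueError("representations must have the same dimension")
    return sum((a - b) ** 2 for a, b in zip(h_student, h_pretrained)) / len(h_student)


def anchored_loss(
    r_chosen: float,
    r_rejected: float,
    h_student: Sequence[float],
    h_pretrained: Sequence[float],
    lam: float = 0.1,  # hypothetical regularization weight
) -> float:
    """Preference loss on weak labels plus a penalty on representation drift."""
    return bradley_terry_loss(r_chosen, r_rejected) + lam * anchor_penalty(h_student, h_pretrained)


# Example: "chosen" and "rejected" are decided by the weak teacher's label.
loss = anchored_loss(
    r_chosen=1.2,
    r_rejected=0.4,
    h_student=[0.5, -0.1, 0.9],
    h_pretrained=[0.4, 0.0, 1.0],
    lam=0.1,
)
```

With `lam = 0` this reduces to ordinary weakly supervised reward-model fine-tuning. Larger values trade adaptation to the source preference data against staying close to the pretrained representation space, which is the tension the abstract describes.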

Problem

Research questions and friction points this paper is trying to address.

weak-to-strong generalization

preference shift

distribution shift

reward modeling

representation drift

Innovation

Methods, ideas, or system contributions that make the work stand out.

Weak-to-Strong Generalization

Preference Shift

Representation Anchoring