🤖 AI Summary
In reward modeling, the Bradley–Terry (BT) loss suffers from representation-distance interference: hard-to-distinguish pairs with small embedding distances yield weak gradients, while pairs with large distances dominate optimization, degrading fine-grained ranking ability. We identify this "representation-distance bias" as the root cause of the gradient imbalance. To address it, we propose NormBT, a lightweight, plug-and-play pairwise gradient-normalization method that removes the influence of representation distance on gradient norms, so that update magnitudes depend solely on predicted reward differences. Theoretically, NormBT preserves the original BT objective while adding negligible computational overhead. Extensive experiments across multiple large language models and datasets show that NormBT consistently improves reward-modeling accuracy, with an average gain of over 5% on RewardBench reasoning tasks.
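A short worked equation makes the claimed decomposition concrete. This is a sketch assuming a linear reward head $r(x) = w^\top h(x)$ on top of the final-layer representation $h(x)$, a standard reward-model setup; the head parameterization is our assumption, not stated above. The per-pair BT gradient then factorizes as:

```latex
% BT loss for one pair (chosen c, rejected r) under r(x) = w^T h(x):
\mathcal{L}_{\mathrm{BT}} = -\log \sigma(r_c - r_r),
\qquad r_c = w^\top h_c, \quad r_r = w^\top h_r.

% Gradient with respect to the head weights w:
\nabla_w \mathcal{L}_{\mathrm{BT}} = -\,\sigma(r_r - r_c)\,(h_c - h_r)
\;\;\Longrightarrow\;\;
\bigl\lVert \nabla_w \mathcal{L}_{\mathrm{BT}} \bigr\rVert
= \underbrace{\sigma(r_r - r_c)}_{\text{prediction error}}
\cdot
\underbrace{\lVert h_c - h_r \rVert}_{\text{representation distance}}.
```

If $h_c \approx h_r$, the gradient norm vanishes no matter how badly the pair is misranked, which is exactly the failure mode described above.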
📝 Abstract
Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective in reward modeling is the Bradley–Terry (BT) loss, which learns from pairwise data in which each example consists of a chosen and a rejected response. In this work, we analyze the per-sample gradient of the BT loss and show that its norm scales with two distinct factors: (1) the difference in predicted rewards between the chosen and rejected responses, which reflects the prediction error, and, critically, (2) the representation distance between the pair, measured in the output space of the final layer. While the first factor captures the intended training signal, we show that the second can significantly distort update magnitudes and misalign learning. Specifically, pairs with small representation distance often receive vanishingly weak updates, even when misranked, while pairs with large distance receive disproportionately strong ones. As a result, gradients from large-distance pairs overshadow those from small-distance pairs, precisely the pairs for which fine-grained distinctions matter most. To overcome this limitation, we propose NormBT, an adaptive pairwise normalization scheme that balances out representation-driven effects and focuses the learning signal on prediction error. NormBT is a lightweight, drop-in modification of the BT loss with negligible overhead. Across various LLM backbones and datasets, NormBT consistently improves reward model performance, with notable gains of over 5% on the Reasoning category of RewardBench, which contains many small-distance pairs. This work reveals a key limitation of the widely used BT objective and provides a simple, effective correction.
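For concreteness, here is a minimal PyTorch sketch of one plausible reading of this pairwise normalization: dividing each pair's loss by its detached representation distance cancels the distance factor in the gradient norm. The function name `normbt_loss` and this particular scaling are our illustration under the linear-head assumption, not the paper's verified implementation.

```python
import torch
import torch.nn.functional as F

def normbt_loss(h_chosen: torch.Tensor,
                h_rejected: torch.Tensor,
                reward_head: torch.nn.Linear,
                eps: float = 1e-8) -> torch.Tensor:
    """Pairwise-normalized Bradley-Terry loss (illustrative sketch).

    h_chosen, h_rejected: final-layer representations, shape (batch, dim).
    reward_head: linear map from representations to scalar rewards.

    The standard BT gradient norm scales with both the prediction error
    sigma(r_rej - r_cho) and the representation distance ||h_cho - h_rej||.
    Dividing each pair's loss by its detached representation distance
    removes the second factor, so update magnitudes depend only on the
    prediction error. This exact scaling is a hypothetical reading of NormBT.
    """
    r_chosen = reward_head(h_chosen).squeeze(-1)      # (batch,)
    r_rejected = reward_head(h_rejected).squeeze(-1)  # (batch,)

    # Standard per-pair BT loss: -log sigma(r_chosen - r_rejected),
    # computed stably as softplus(-(r_chosen - r_rejected)).
    per_pair = F.softplus(-(r_chosen - r_rejected))

    # Representation distance per pair; detached so it acts as a per-pair
    # constant rescaling the gradient, not a new term in the objective.
    dist = (h_chosen - h_rejected).norm(dim=-1).detach()

    return (per_pair / (dist + eps)).mean()
```

Because the distance is detached, it enters only as a per-pair constant: each gradient keeps its direction and each pair's minimizer is unchanged, consistent with the claim that NormBT preserves the original BT objective.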