🤖 AI Summary
This work addresses the sensitivity of reward models (RMs) to spurious features, such as response length, which biases preference judgments when aligning large language models. The authors propose a causally motivated, neuron-level intervention applied at inference time: it identifies neurons whose activations are strongly correlated with predefined bias attributes and suppresses those signals. Unlike existing inference-time approaches, which typically target response length alone and incur performance trade-offs, the method mitigates multiple bias types without such trade-offs. On RM benchmarks, it reduces sensitivity to spurious features across diverse bias types. When used for preference annotation, 2B and 7B RMs with the intervention, which edits less than 2% of their neurons, enable LLMs to reach alignment performance on AlpacaEval and MT-Bench comparable to that obtained with a state-of-the-art 70B RM. Further analysis indicates that bias signals are encoded primarily by neurons in early layers.
📝 Abstract
Reward models (RMs) play a central role in aligning large language models (LLMs) with human preferences. However, RMs are often sensitive to spurious features such as response length. Existing inference-time approaches for mitigating these biases typically focus exclusively on response length and incur performance trade-offs. In this paper, we propose a causally motivated intervention for mitigating multiple types of biases in RMs at inference time. Our method first identifies neurons whose activations are strongly correlated with predefined bias attributes, and then applies a neuron-level intervention that suppresses these signals. We evaluate our method on RM benchmarks and observe reductions in sensitivity to spurious features across diverse bias types, without inducing performance trade-offs. Moreover, when used for preference annotation, small RMs (2B and 7B) with our method, which edits fewer than 2% of all the neurons in the RMs, enable LLMs to improve alignment, achieving performance comparable to that of a state-of-the-art 70B RM on AlpacaEval and MT-Bench. Further analysis reveals that bias signals are primarily encoded by neurons in early layers, shedding light on the internal mechanisms of bias exploitation in RMs.
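The two steps described in the abstract (identify bias-correlated neurons, then suppress them) can be illustrated with a small NumPy sketch on synthetic activations. This is not the authors' implementation: the abstract does not specify the correlation measure or the suppression rule, so Pearson correlation, top-fraction selection, and replacement with the mean activation are all assumptions here, and the function names `select_bias_neurons` and `suppress` are hypothetical.

```python
import numpy as np


def select_bias_neurons(activations, bias_attribute, fraction=0.02):
    """Return indices of the neurons most correlated with a bias attribute.

    activations: (n_samples, n_neurons) array of hidden activations.
    bias_attribute: (n_samples,) array, e.g. response length in tokens.
    fraction: share of neurons to select (assumed value, not from the paper).
    """
    acts = activations - activations.mean(axis=0)
    attr = bias_attribute - bias_attribute.mean()
    # Pearson correlation of every neuron with the attribute (assumed measure).
    denom = np.linalg.norm(acts, axis=0) * np.linalg.norm(attr) + 1e-12
    corr = acts.T @ attr / denom
    k = max(1, int(round(fraction * activations.shape[1])))
    # Rank by absolute correlation so negatively correlated neurons count too.
    return np.argsort(-np.abs(corr))[:k]


def suppress(activations, neuron_idx, baseline):
    """Overwrite the selected neurons with a baseline value (assumed rule)."""
    edited = activations.copy()
    edited[:, neuron_idx] = baseline[neuron_idx]
    return edited


# Synthetic demo: neurons 3 and 7 leak "length"; all others are pure noise.
rng = np.random.default_rng(0)
n_samples, n_neurons = 500, 100
length = rng.uniform(10, 500, size=n_samples)
acts = rng.normal(size=(n_samples, n_neurons))
acts[:, 3] += 0.01 * length
acts[:, 7] -= 0.02 * length

idx = select_bias_neurons(acts, length, fraction=0.02)
edited = suppress(acts, idx, baseline=acts.mean(axis=0))
print("selected neurons:", sorted(idx.tolist()))
```

In an actual RM, the activations would be collected from a chosen layer and the edit applied during the forward pass at inference time; the sketch only shows the selection and suppression logic on a plain array.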