Interpreting Language Reward Models via Contrastive Explanations

📅 2024-11-25
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Reward models (RMs), critical for aligning large language models (LLMs), suffer from low interpretability due to their “black-box” nature, undermining trust. To address this, we propose the first contrastive explanation attribution framework specifically designed for RMs. Our method generates contrastive response pairs via attribute-controllable perturbations—e.g., factual accuracy or politeness—and analyzes the RM’s local decision logic in binary preference judgments. It requires no architectural modification or retraining, enabling plug-and-play deployment. Moreover, it supports both global sensitivity analysis and cross-model behavioral comparison across multiple RMs. Experiments quantitatively validate explanation fidelity and qualitatively reveal RMs’ differential sensitivity to distinct evaluation dimensions. By demystifying RM preferences, our framework significantly enhances the interpretability and trustworthiness of LLM alignment. (149 words)

Technology Category

Application Category

📝 Abstract
Reward models (RMs) are a crucial component in the alignment of large language models' (LLMs) outputs with human values. RMs approximate human preferences over possible LLM responses to the same prompt by predicting and comparing reward scores. However, as they are typically modified versions of LLMs with scalar output heads, RMs are large black boxes whose predictions are not explainable. More transparent RMs would enable improved trust in the alignment of LLMs. In this work, we propose to use contrastive explanations to explain any binary response comparison made by an RM. Specifically, we generate a diverse set of new comparisons similar to the original one to characterise the RM's local behaviour. The perturbed responses forming the new comparisons are generated to explicitly modify manually specified high-level evaluation attributes, on which analyses of RM behaviour are grounded. In quantitative experiments, we validate the effectiveness of our method for finding high-quality contrastive explanations. We then showcase the qualitative usefulness of our method for investigating global sensitivity of RMs to each evaluation attribute, and demonstrate how representative examples can be automatically extracted to explain and compare behaviours of different RMs. We see our method as a flexible framework for RM explanation, providing a basis for more interpretable and trustworthy LLM alignment.
Problem

Research questions and friction points this paper is trying to address.

Explain reward models' predictions
Improve trust in language model alignment
Analyze sensitivity to evaluation attributes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses contrastive explanations for clarity
Generates diverse response comparisons
Analyzes sensitivity to evaluation attributes
🔎 Similar Papers
No similar papers found.