🤖 AI Summary
Existing reward model evaluation relies heavily on fixed pairwise ranking benchmarks and lacks fine-grained, interpretable assessment across distinct preference dimensions.
Method: We propose a multidimensional preference evaluation framework featuring (i) MRMBench—a novel benchmark comprising six diagnostic tasks that systematically quantify reward model performance across key dimensions (e.g., factual consistency, safety, conciseness); and (ii) an inference-time probing technique that dynamically disentangles preference representations and outputs dimension-level confidence scores.
Results: MRMBench scores exhibit strong correlation with downstream LLM alignment performance, providing effective guidance for reward model development. The probing method significantly improves prediction reliability and interpretability, uncovering latent trade-offs and biases in multi-objective optimization. Collectively, our framework establishes a new paradigm for diagnosing reward models and advancing preference-aligned AI development.
📄 Abstract
Previous methods evaluate reward models on fixed pairwise ranking test sets, but these typically provide no performance information on individual preference dimensions. In this work, we address the evaluation challenge of reward models by probing preference representations. To confirm the effectiveness of this evaluation method, we construct a Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks covering different preference dimensions. We design it to favor and encourage reward models that better capture preferences across these dimensions. Furthermore, we introduce an analysis method, inference-time probing, which identifies which dimensions a reward model relies on during reward prediction, enhancing interpretability. Through extensive experiments, we find that MRMBench strongly correlates with the alignment performance of large language models (LLMs), making it a reliable reference for developing advanced reward models. Our analysis of MRMBench evaluation results reveals that reward models often struggle to capture preferences across multiple dimensions, highlighting the potential of multi-objective optimization in reward modeling. Additionally, our findings show that the proposed inference-time probing method offers a reliable metric for assessing the confidence of reward predictions, which ultimately improves the alignment of LLMs.
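To make the probing idea concrete, a minimal sketch of a dimension-level probe is shown below: one linear classifier per preference dimension fitted over reward-model hidden states, mapped through a sigmoid to produce per-dimension confidence scores. Everything here (the dimension names, the least-squares probe, and the synthetic data standing in for real hidden states) is an illustrative assumption, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(z):
    # Squash a raw probe score into a (0, 1) confidence.
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden_dim = 32
# Hypothetical preference dimensions; MRMBench covers six such tasks.
dimensions = ["helpfulness", "safety", "conciseness"]

# Synthetic stand-in for hidden states extracted from a reward model.
X = rng.normal(size=(200, hidden_dim))
# Synthetic binary labels per dimension ("preferred on this dimension").
labels = {d: (X[:, i] > 0).astype(float) for i, d in enumerate(dimensions)}

# Fit one linear probe per dimension via least squares (a simple
# stand-in for whatever probe training procedure is actually used).
probes = {d: np.linalg.lstsq(X, y, rcond=None)[0] for d, y in labels.items()}

def probe_confidences(hidden_state):
    """Map a single hidden state to a per-dimension confidence score."""
    return {d: float(sigmoid(hidden_state @ w)) for d, w in probes.items()}

conf = probe_confidences(rng.normal(size=hidden_dim))
print(conf)
```

At inference time, the per-dimension confidences can be inspected alongside the scalar reward to see which dimensions dominated a given prediction.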