🤖 AI Summary
This paper addresses the reliability challenge in multi-criteria reward modeling for safety alignment of large language models (LLMs). The authors identify a significant negative correlation between the rating entropy of a safety rule and its consistency with human preferences. Leveraging this insight, they propose ENCORE, an entropy-guided, training-free, and interpretable multi-head reward aggregation method within the Bradley–Terry framework: safety rules with higher rating entropy, and hence lower reliability, are automatically assigned lower weights, improving overall preference prediction accuracy. The method is theoretically grounded and lightweight, requiring no fine-tuning. On RewardBench safety benchmarks it substantially outperforms baselines including random weighting, uniform weighting, single-head Bradley–Terry models, and LLM-based judges, while also demonstrating strong cross-dataset generalization and practical deployability.
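Below is a minimal sketch of how such entropy-guided aggregation could work, assuming each rule produces discrete ratings over a calibration set and that weights are a softmax over negative entropies. The exact normalization used by ENCORE may differ; the function names, the `beta` temperature, and the data layout are illustrative assumptions, not the paper's code:

```python
import numpy as np

def rating_entropy(ratings, num_levels):
    """Shannon entropy of one rule's empirical rating distribution."""
    counts = np.bincount(ratings, minlength=num_levels).astype(float)
    probs = counts / counts.sum()
    probs = probs[probs > 0]  # drop empty bins to avoid log(0)
    return -np.sum(probs * np.log(probs))

def entropy_guided_weights(per_rule_ratings, num_levels, beta=1.0):
    """Downweight high-entropy (less reliable) safety rules.

    per_rule_ratings: list of 1-D int arrays, one per rule, holding
    that rule's discrete ratings over a calibration set.
    """
    entropies = np.array(
        [rating_entropy(r, num_levels) for r in per_rule_ratings]
    )
    # Softmax over negative entropies: higher entropy -> smaller weight.
    logits = -beta * entropies
    weights = np.exp(logits - logits.max())  # stable softmax
    return weights / weights.sum()

def aggregate_reward(head_scores, weights):
    """Weighted sum of per-rule reward-head scores for one response."""
    return float(np.dot(weights, head_scores))
```

Here `beta` controls how aggressively high-entropy rules are suppressed; setting it to zero recovers uniform weighting.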
📝 Abstract
Aligning large language models (LLMs) with safety guidelines typically relies on reinforcement learning from human feedback (RLHF), which in turn depends on human-generated preference annotations. However, assigning consistent overall quality ratings is difficult, prompting recent research to shift toward fine-grained evaluations based on multiple specific safety criteria. This paper reports a consistent empirical finding: safety rules with high rating entropy are generally less reliable at identifying responses preferred by humans. Leveraging this finding, we introduce ENCORE, a straightforward entropy-guided approach that composes multi-head rewards by downweighting rules with high rating entropy. Theoretically, we show that high-entropy rules naturally receive minimal weight under Bradley–Terry optimization, justifying our entropy-based penalization. In extensive experiments on RewardBench safety tasks, our method significantly outperforms several competitive baselines, including random weighting, uniform weighting, single-head Bradley–Terry models, and LLM-based judges. The approach is training-free, generalizes across datasets, and remains interpretable, offering a practical and effective solution for multi-attribute reward modeling.
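Since the claims above center on preference prediction accuracy, a hedged sketch of the standard Bradley–Terry pairwise evaluation may help. This is not the paper's code; the array shapes and names are assumptions, and `entropy_guided_weights` refers to the sketch above:

```python
import numpy as np

def bt_preference_prob(reward_chosen, reward_rejected):
    """Bradley-Terry probability that the chosen response wins."""
    return 1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected)))

def pairwise_accuracy(chosen_scores, rejected_scores, weights):
    """Accuracy on preference pairs under entropy-guided weights.

    chosen_scores, rejected_scores: (num_pairs, num_rules) arrays of
    per-rule reward-head scores for each response in a pair.
    weights: (num_rules,) aggregation weights, e.g. from
    entropy_guided_weights above.
    """
    chosen = chosen_scores @ weights      # aggregated reward, chosen
    rejected = rejected_scores @ weights  # aggregated reward, rejected
    # A pair counts as correct when the model assigns > 0.5 win
    # probability to the human-preferred response.
    return float(np.mean(bt_preference_prob(chosen, rejected) > 0.5))
```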