🤖 AI Summary
This paper addresses the reliability challenge in multi-criteria reward modeling for safety alignment of large language models (LLMs). The authors identify a significant negative correlation between the rating entropy of a safety rule and its consistency with human preferences. Leveraging this insight, they propose ENCORE, an entropy-guided, training-free, and interpretable multi-head reward aggregation method within the Bradley–Terry framework: safety rules with higher rating entropy, and hence lower reliability, are automatically assigned lower weights, improving overall preference prediction accuracy. The method is theoretically grounded and lightweight, requiring no fine-tuning. On RewardBench safety benchmarks it substantially outperforms baselines including random weighting, uniform weighting, single-head Bradley–Terry models, and LLM-based judges, while also demonstrating strong cross-dataset generalization and practical deployability.
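Below is a minimal sketch of how such entropy-guided aggregation could work, assuming each rule produces discrete ratings over a calibration set and that weights are a softmax over negative entropies. The exact normalization used by ENCORE may differ; the function names, the `beta` temperature, and the data layout are illustrative assumptions, not the paper's code:

```python
import numpy as np

def rating_entropy(ratings, num_levels):
    """Shannon entropy of one rule's empirical rating distribution."""
    counts = np.bincount(ratings, minlength=num_levels).astype(float)
    probs = counts / counts.sum()
    probs = probs[probs > 0]  # drop empty bins to avoid log(0)
    return -np.sum(probs * np.log(probs))

def entropy_guided_weights(per_rule_ratings, num_levels, beta=1.0):
    """Downweight high-entropy (less reliable) safety rules.

    per_rule_ratings: list of 1-D int arrays, one per rule, holding
    that rule's discrete ratings over a calibration set.
    """
    entropies = np.array(
        [rating_entropy(r, num_levels) for r in per_rule_ratings]
    )
    # Softmax over negative entropies: higher entropy -> smaller weight.
    logits = -beta * entropies
    weights = np.exp(logits - logits.max())  # stable softmax
    return weights / weights.sum()

def aggregate_reward(head_scores, weights):
    """Weighted sum of per-rule reward-head scores for one response."""
    return float(np.dot(weights, head_scores))
```

Here `beta` controls how aggressively high-entropy rules are suppressed; setting it to zero recovers uniform weighting.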
📝 Abstract
Aligning large language models (LLMs) with safety guidelines typically relies on reinforcement learning from human feedback (RLHF), which in turn depends on human-generated preference annotations. However, assigning consistent overall quality ratings is difficult, prompting recent research to shift toward fine-grained evaluations based on multiple specific safety criteria. This paper reports a consistent empirical finding: safety rules with high rating entropy are generally less reliable at identifying responses preferred by humans. Leveraging this finding, we introduce ENCORE, a straightforward entropy-guided approach that composes multi-head rewards by downweighting rules with high rating entropy. Theoretically, we show that high-entropy rules naturally receive minimal weight under Bradley–Terry optimization, justifying our entropy-based penalization. In extensive experiments on RewardBench safety tasks, our method significantly outperforms several competitive baselines, including random weighting, uniform weighting, single-head Bradley–Terry models, and LLM-based judges. The approach is training-free, generalizes across datasets, and remains interpretable, offering a practical and effective solution for multi-attribute reward modeling.
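Since the claims above center on preference prediction accuracy, a hedged sketch of the standard Bradley–Terry pairwise evaluation may help. This is not the paper's code; the array shapes and names are assumptions, and `entropy_guided_weights` refers to the sketch above:

```python
import numpy as np

def bt_preference_prob(reward_chosen, reward_rejected):
    """Bradley-Terry probability that the chosen response wins."""
    return 1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected)))

def pairwise_accuracy(chosen_scores, rejected_scores, weights):
    """Accuracy on preference pairs under entropy-guided weights.

    chosen_scores, rejected_scores: (num_pairs, num_rules) arrays of
    per-rule reward-head scores for each response in a pair.
    weights: (num_rules,) aggregation weights, e.g. from
    entropy_guided_weights above.
    """
    chosen = chosen_scores @ weights      # aggregated reward, chosen
    rejected = rejected_scores @ weights  # aggregated reward, rejected
    # A pair counts as correct when the model assigns > 0.5 win
    # probability to the human-preferred response.
    return float(np.mean(bt_preference_prob(chosen, rejected) > 0.5))
```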