🤖 AI Summary
Existing reward models (RMs) rely on static preference datasets, which limits their adaptability to diverse human preferences (e.g., conciseness vs. comprehensiveness) and incurs high retraining costs and bias risks. This work proposes a principle-following RM architecture that calibrates scores dynamically according to natural-language principles. Our contributions are threefold: (1) we introduce the first principle-conditioned modeling paradigm, trained via instruction tuning on multi-task synthetic data generated from diverse principles; (2) we construct RABench, the first benchmark explicitly designed to evaluate the principle-generalization capability of RMs; (3) we achieve zero-shot principle adaptation, enabling immediate response to novel principles without fine-tuning. Experiments demonstrate state-of-the-art performance on standard RM benchmarks and substantial gains over prior methods on RABench. Moreover, our RM supports plug-and-play reinforcement learning from human feedback (RLHF), enabling efficient and controllable preference optimization.
📝 Abstract
Reward models (RMs), essential for guiding large language model (LLM) optimization, are typically trained on fixed preference datasets, resulting in rigid alignment to a single, implicit preference distribution. This prevents adaptation to diverse real-world needs, from conciseness in one task to detailed explanations in another. The standard practice of collecting task-specific preference data and retraining reward models is resource-intensive, often produces biased rewards, and limits practical application. We introduce generalizable, principle-following reward models. We propose that RMs should understand and adhere to dynamically provided natural-language specifications of reward principles, analogous to instruction following in LLMs. To measure this capability, we develop RABench, a comprehensive benchmark for RMs focused on generalization across diverse principles. Evaluations on RABench reveal that current RMs generalize poorly. As a solution, we present RewardAnything, a novel RM designed and trained to explicitly follow natural-language principles. RewardAnything achieves state-of-the-art performance on traditional RM benchmarks simply by being given a well-defined principle, and results on RABench show that it excels at adapting to novel principles without retraining. Furthermore, RewardAnything integrates seamlessly with existing RLHF methods, and we demonstrate through a case study how to automatically and efficiently align LLMs using only natural language principles.
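To make the principle-conditioned scoring idea concrete, here is a minimal sketch of what such an interface might look like. The function names, prompt format, and toy scoring rule below are illustrative assumptions, not the paper's actual API: a real principle-following RM would be an LLM scoring responses conditioned on the formatted principle, whereas the stand-in scorer here just keys off response length.

```python
# Hypothetical sketch of a principle-conditioned reward interface.
# build_rm_prompt and toy_rm_score are illustrative names, not the
# paper's real API.

def build_rm_prompt(principle: str, query: str, responses: list[str]) -> str:
    """Format one evaluation prompt: the natural-language principle is
    prepended so the RM can condition its scores on it."""
    lines = [f"Principle: {principle}", f"Query: {query}"]
    for i, r in enumerate(responses, 1):
        lines.append(f"Response {i}: {r}")
    lines.append("Score each response by how well it satisfies the principle.")
    return "\n".join(lines)

def toy_rm_score(principle: str, query: str, responses: list[str]) -> list[float]:
    """Stand-in scorer: under a 'concise' principle, shorter responses
    score higher; otherwise longer ones do. A real principle-following
    RM would instead run an LLM on build_rm_prompt(...)."""
    if "concise" in principle.lower():
        return [1.0 / (1 + len(r.split())) for r in responses]
    return [float(len(r.split())) for r in responses]

scores = toy_rm_score(
    "Prefer concise answers.",
    "What is 2 + 2?",
    ["4", "The answer to 2 + 2 is 4, as taught in arithmetic."],
)
# With the conciseness principle, the shorter first response wins.
assert scores[0] > scores[1]
```

The key design point is that the principle is an input at inference time, not baked into training data, which is what allows zero-shot adaptation: swapping the principle string changes the ranking without any retraining.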