From Demonstrations to Rewards: Alignment Without Explicit Human Preferences

📅 2025-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high cost and complexity of human preference annotation in large language model (LLM) alignment, this paper proposes a novel inverse reinforcement learning (IRL) paradigm that relies solely on demonstration data. We theoretically establish that demonstration data implicitly encodes human preferences, thereby eliminating the need for explicit preference labeling in conventional RLHF and enabling end-to-end alignment. Methodologically, we introduce a joint optimization framework that simultaneously trains the policy and infers the reward function, ensuring compatibility with mainstream LLM architectures and open-source evaluation benchmarks. Experiments demonstrate that our approach achieves performance on par with or superior to current state-of-the-art demonstration-only methods across the Hugging Face Open LLM Leaderboard, MT-Bench, and public reward modeling benchmarks—while substantially reducing both data curation effort and engineering overhead.

📝 Abstract
One of the challenges of aligning large models with human preferences lies in both the data requirements and the technical complexity of current approaches. Predominant methods, such as RLHF, involve multiple steps, each demanding distinct types of data, including demonstration data and preference data. In RLHF, human preferences are typically modeled through a reward model, which serves as a proxy to guide policy learning during the reinforcement learning stage, ultimately producing a policy aligned with human preferences. In this paper, we propose a fresh perspective on learning alignment based on inverse reinforcement learning principles: the optimal policy is still derived from reward maximization, but instead of relying on preference data, we learn the reward model directly from demonstration data. This formulation can be applied even when only demonstration data is available, a capability that current RLHF methods lack, and it shows that demonstration data offers more utility than conventional wisdom suggests. Our extensive evaluation on public reward benchmarks, the Hugging Face Open LLM Leaderboard, and MT-Bench demonstrates that our approach compares favorably to state-of-the-art methods that rely solely on demonstration data.
Problem

Research questions and friction points this paper is trying to address.

Aligning large models with human preferences efficiently
Reducing reliance on explicit human preference data
Learning reward models directly from demonstration data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses inverse reinforcement learning principles
Learns reward model from demonstration data
Applies without needing explicit preference data
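The joint reward-and-policy optimization sketched by these points can be illustrated in a toy setting. The snippet below is a minimal sketch, not the paper's actual algorithm: on a hypothetical discrete bandit, it alternates a max-entropy-IRL-style reward update (raising reward on demonstrated actions, lowering it on actions the current policy favors) with an exact softmax policy-gradient step, so the reward function is inferred from demonstrations alone and the policy is trained against that inferred reward.

```python
import numpy as np

n_actions = 5
demo_action = 2  # demonstrations concentrate on one "expert" action (toy data)
demos = np.full(200, demo_action)

# Jointly learned parameters: a reward per action and policy logits.
reward = np.zeros(n_actions)
logits = np.zeros(n_actions)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr = 0.5
demo_freq = np.bincount(demos, minlength=n_actions) / len(demos)

for _ in range(200):
    pi = softmax(logits)
    # IRL step: max-entropy-style gradient pushes reward up on demonstrated
    # actions and down on actions the current policy samples.
    reward += lr * (demo_freq - pi)
    # RL step: exact policy gradient of expected reward for a softmax policy,
    # with the policy's mean reward as baseline.
    logits += lr * pi * (reward - (pi * reward).sum())

pi = softmax(logits)
```

After training, the policy concentrates on the demonstrated action even though no preference labels were ever provided; the demonstration distribution alone shaped the reward.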