Reward Generalization in RLHF: A Topological Perspective

📅 2024-02-15
📈 Citations: 4
Influential: 0
🤖 AI Summary
Existing RLHF methods rely on a single, largely uncharacterized information-flow topology (human feedback → preference modeling → model alignment), which leads to low data efficiency and unreliable generalization, and alternative topologies remain unexplored. Method: The paper introduces a topology-aware formalization of RLHF information flow as a two-level framework: at the macro level, an autoencoding view over behavior distributions; at the micro level, induced Bayesian networks that bring fine-grained dataset topology into generalization bounds, together exposing where performance bottlenecks arise. Building on this analysis, it proposes reward modeling from tree-structured preference information, shown theoretically to reduce reward uncertainty by a factor of up to Θ(log n / log log n) relative to baselines (where n is the dataset size), yielding generalization gains at no extra cost. Contribution/Results: On three NLP tasks, the tree-based reward model achieves an average win rate of 65% against baseline methods, improving reward generalization purely through topology design.

📝 Abstract
Existing alignment methods share a common topology of information flow, where reward information is collected from humans, modeled with preference learning, and used to tune language models. However, this shared topology has not been systematically characterized, nor have its alternatives been thoroughly explored, leaving the problems of low data efficiency and unreliable generalization unaddressed. As a solution, we introduce a theoretical framework for investigating reward generalization in reinforcement learning from human feedback (RLHF), focusing on the topology of information flow at both macro and micro levels. At the macro level, we portray the RLHF information flow as an autoencoding process over behavior distributions, formalizing the RLHF objective of distributional consistency between human preference and model behavior. At the micro level, we present induced Bayesian networks as a theory of reward generalization in RLHF, introducing fine-grained dataset topologies into generalization bounds. Combining analysis on both levels, we propose reward modeling from tree-structured preference information. It is shown to reduce reward uncertainty by up to $\Theta(\log n / \log\log n)$ times compared to baselines, where $n$ is the dataset size. Validation on three NLP tasks shows that our tree-based reward model achieves an average win rate of 65% against baseline methods, thus improving reward generalization for free via topology design.
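A compact restatement of that bound may help. The symbols below (σ²_base, σ²_tree for the reward uncertainty of a baseline-topology and a tree-topology reward model trained on $n$ comparisons) are illustrative notation for this note, not the paper's own:

```latex
% Hedged restatement of the abstract's claim; \sigma^2_{\mathrm{base}} and
% \sigma^2_{\mathrm{tree}} are illustrative symbols, not the paper's notation.
\[
  \frac{\sigma^2_{\mathrm{base}}(n)}{\sigma^2_{\mathrm{tree}}(n)}
  \;=\; \Theta\!\left(\frac{\log n}{\log\log n}\right)
  \quad \text{in the best case, where } n \text{ is the dataset size.}
\]
```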
Problem

Research questions and friction points this paper is trying to address.

Characterize shared topology in RLHF alignment methods
Address low data efficiency and unreliable generalization
Propose reward modeling from tree-structured preference information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Macro-level autoencoding for distributional consistency
Micro-level induced Bayesian networks
Tree-structured preference modeling reduces uncertainty (a toy sketch of such a preference tree follows below)
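The innovation above centers on where preference comparisons come from. Below is a minimal sketch, assuming that the tree topology means organizing candidate responses as a tree and eliciting comparisons between siblings that share a common parent; all names here (PreferenceNode, collect_tree_comparisons, the toy length-based judge) are hypothetical, and this illustrates the topology idea rather than the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PreferenceNode:
    """A response (or partial response) in a tree of candidate completions.

    Hypothetical structure for illustration; not the paper's data format.
    """
    text: str
    children: list["PreferenceNode"] = field(default_factory=list)


def collect_tree_comparisons(root: PreferenceNode, judge) -> list[tuple[str, str]]:
    """Walk the tree and collect (preferred, rejected) pairs among siblings.

    Siblings share the same parent/prefix, so judgments are gathered over a
    correlated neighborhood of the tree rather than over independent,
    unrelated pairs.
    """
    pairs: list[tuple[str, str]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        kids = node.children
        # Compare adjacent siblings under the same parent.
        for a, b in zip(kids, kids[1:]):
            winner, loser = (a, b) if judge(a.text, b.text) else (b, a)
            pairs.append((winner.text, loser.text))
        stack.extend(kids)
    return pairs


if __name__ == "__main__":
    # Toy judge: prefer the longer response (a stand-in for human feedback).
    judge = lambda x, y: len(x) >= len(y)
    root = PreferenceNode(
        "Q: explain RLHF",
        children=[
            PreferenceNode("short answer"),
            PreferenceNode(
                "a much more detailed answer",
                children=[PreferenceNode("detail A"),
                          PreferenceNode("longer detail B")],
            ),
        ],
    )
    for preferred, rejected in collect_tree_comparisons(root, judge):
        print(f"preferred={preferred!r}  rejected={rejected!r}")
```

The point of the sketch: because sibling comparisons are correlated through their shared parent, the choice of topology changes how far each human judgment propagates through the dataset, which is the kind of effect the paper's induced-Bayesian-network analysis is set up to quantify.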
👥 Authors

Tianyi Qiu
Center for AI Safety and Governance, Institute for AI, Peking University

Fanzhi Zeng
UT Austin
Reinforcement Learning · AI Alignment

Jiaming Ji
Center for AI Safety and Governance, Institute for AI, Peking University

Dong Yan
AI Chief Expert, Bosch
Reinforcement Learning · Foundation Model

Kaile Wang
Peking University

Jiayi Zhou
Center for AI Safety and Governance, Institute for AI, Peking University

Yang Han
Center for AI Safety and Governance, Institute for AI, Peking University

Josef Dai
Zhejiang University
Alignment

Xuehai Pan
Peking University
Multi-Agent Learning · Reinforcement Learning · AI Alignment · AI Agents

Yaodong Yang
Center for AI Safety and Governance, Institute for AI, Peking University