Thompson Sampling in Online RLHF with General Function Approximation

📅 2025-05-29
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper studies statistically efficient online reinforcement learning from human feedback (RLHF), focusing on value-function approximation and regret minimization under streaming preference data. We propose the first model-free posterior-sampling algorithm for online RLHF based on Thompson sampling, compatible with general function approximators. Our theoretical contributions are threefold: (i) the first incorporation of Thompson sampling into a rigorous online RLHF framework; (ii) the introduction of the Bellman eluder dimension to characterize the complexity of function classes in RLHF; and (iii) a novel concentration inequality for the squared Bellman error, derived from maximum likelihood estimation generalization bounds, enabling an eluder-type regret analysis. We establish an $\tilde{O}(\sqrt{T})$ regret upper bound over $T$ rounds whose multiplicative factors depend on the horizon, the Bellman eluder dimension, and the log-bracketing number of the function class, yielding the first model-agnostic online RLHF algorithm with provable statistical efficiency.
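
The algorithm itself is stated over an abstract function class, so any concrete code is necessarily an assumption-laden sketch. Below is a minimal Python illustration of the posterior-sampling loop the summary describes: maintain preference comparisons, approximate the Bradley-Terry posterior around the MLE, sample a value-function parameter, act greedily, and query a preference. The linear value functions, the Gaussian posterior approximation, and the names `features` and `bradley_terry_pref` are our hypothetical stand-ins, not the paper's construction.

```python
# Minimal sketch of a Thompson-sampling loop for online RLHF.
# All concrete choices (linear value functions, Gaussian perturbed-MLE
# posterior, synthetic Bradley-Terry rater) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 200                      # feature dimension, number of rounds
theta_star = rng.normal(size=d)    # latent "true" reward parameters

def features(action):
    """Hypothetical trajectory features for a discrete action."""
    return np.eye(d)[action]

def bradley_terry_pref(phi1, phi0):
    """Simulated rater: P(traj 1 preferred) = sigmoid(r1 - r0)."""
    p = 1.0 / (1.0 + np.exp(-(phi1 - phi0) @ theta_star))
    return rng.random() < p

data = []                          # (phi_winner, phi_loser) pairs
for t in range(T):
    # Posterior-sampling step (assumption: crude Gaussian approximation
    # to the Bradley-Terry posterior via a one-step regularized MLE).
    if data:
        X = np.array([w - l for w, l in data])
        H = X.T @ X / 4.0 + np.eye(d)            # Fisher-style precision
        theta_hat = np.linalg.solve(H, X.sum(axis=0) / 2.0)
        theta_t = rng.multivariate_normal(theta_hat, np.linalg.inv(H))
    else:
        theta_t = rng.normal(size=d)
    # Act greedily under the sampled value function.
    a1 = int(np.argmax([features(a) @ theta_t for a in range(d)]))
    a0 = rng.integers(d)           # comparator trajectory (assumed uniform)
    phi1, phi0 = features(a1), features(a0)
    # Query preference feedback and update the dataset.
    if bradley_terry_pref(phi1, phi0):
        data.append((phi1, phi0))
    else:
        data.append((phi0, phi1))

print("estimated best action:", int(np.argmax(theta_hat)))
```

In the paper, the MLE and the greedy step range over a general function class, and exploration comes entirely from the posterior sample, which is what makes the template model-free.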

๐Ÿ“ Abstract
Reinforcement learning from human feedback (RLHF) has achieved great empirical success in aligning large language models (LLMs) with human preferences, and it is of great importance to study the statistical efficiency of RLHF algorithms from a theoretical perspective. In this work, we consider the online RLHF setting, where preference data is revealed during the learning process, and study action value function approximation. We design a model-free posterior sampling algorithm for online RLHF inspired by Thompson sampling and provide its theoretical guarantee. Specifically, we adopt the Bellman eluder (BE) dimension as the complexity measure of the function class and establish an $O(\sqrt{T})$ regret bound for the proposed algorithm, with multiplicative factors depending on the horizon, the BE dimension, and the $\log$-bracketing number of the function class. Further, in the analysis, we first establish a concentration-type inequality for the squared Bellman error based on the maximum likelihood estimator (MLE) generalization bound, which plays a crucial role in obtaining the eluder-type regret bound and may be of independent interest.
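
For context, online RLHF analyses of this type typically assume trajectory-pair preferences generated by a Bradley-Terry model, with the reward estimated by MLE on past comparisons; a standard form (our reconstruction, not a quote of the paper's setup) is

$$
\mathbb{P}\big(\tau^1 \succ \tau^0\big) \;=\; \sigma\big(r(\tau^1) - r(\tau^0)\big) \;=\; \frac{\exp\big(\sum_{h=1}^{H} r(s_h^1, a_h^1)\big)}{\exp\big(\sum_{h=1}^{H} r(s_h^1, a_h^1)\big) + \exp\big(\sum_{h=1}^{H} r(s_h^0, a_h^0)\big)},
$$

with the MLE at round $t$ maximizing $\sum_{i<t} \log \mathbb{P}_r(o_i \mid \tau_i^1, \tau_i^0)$ over the reward class, where $o_i$ records which trajectory was preferred.
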
Problem

Research questions and friction points this paper is trying to address.

Studies the statistical efficiency of online RLHF algorithms from a theoretical perspective
Designs a model-free Thompson sampling algorithm for action value function approximation
Establishes a regret bound via a Bellman eluder dimension analysis (definition sketched after this list)
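
For reference, the Bellman eluder (BE) dimension named above is, in its standard form (following Jin, Liu, and Miryoosefi, 2021), the distributional eluder dimension of the Bellman-residual class, maximized over steps; a compact statement (our paraphrase, and the paper's variant may differ in details) is

$$
\dim_{\mathrm{BE}}(\mathcal{F}, \Pi, \varepsilon) \;=\; \max_{h \in [H]} \dim_{\mathrm{DE}}\big( (I - \mathcal{T}_h)\mathcal{F},\; \Pi_h,\; \varepsilon \big),
$$

where $\mathcal{T}_h$ is the Bellman operator at step $h$ and $\dim_{\mathrm{DE}}$ is the length of the longest sequence of distributions $\mu_1, \dots, \mu_n \in \Pi_h$ such that, for some $\varepsilon' \ge \varepsilon$, each $\mu_i$ is $\varepsilon'$-independent of its predecessors: some residual $\phi$ satisfies $\sqrt{\sum_{j<i} (\mathbb{E}_{\mu_j}[\phi])^2} \le \varepsilon'$ yet $|\mathbb{E}_{\mu_i}[\phi]| > \varepsilon'$.
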
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-free posterior sampling for RLHF
Bellman eluder dimension as complexity measure
$\tilde{O}(\sqrt{T})$ regret bound via an MLE generalization analysis (schematic form after this list)
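
Read literally from the abstract, the guarantee has the following schematic shape; $\mathrm{poly}(H)$ and the exact arguments are placeholders for dependencies the abstract mentions but does not specify:

$$
\mathrm{Reg}(T) \;\le\; \tilde{O}\Big( \mathrm{poly}(H) \cdot \sqrt{\, d_{\mathrm{BE}} \cdot \log N_{[\,]}(\mathcal{F}) \cdot T \,} \Big),
$$

where $d_{\mathrm{BE}}$ is the Bellman eluder dimension and $N_{[\,]}(\mathcal{F})$ is the bracketing number of the function class.
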
Songtao Feng
The University of Florida
Jie Fu
Department of Electrical and Computer Engineering, University of Florida, FL 32603, USA