Data-dependent Exploration for Online Reinforcement Learning from Human Feedback

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

📝 Abstract

Online reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs) by continuously collecting new preference feedback during training. A foundational challenge in this setting is exploration: the algorithm must steer the LLM toward generating informative comparisons so that each round of feedback improves sample efficiency. Existing exploration strategies often derive bonuses via on-policy expectations, which are difficult to estimate reliably from the limited historical preference data available during training; as a result, the policy can prematurely down-weight under-explored regions that may contain high-value behaviors. In this paper, we propose data-dependent exploration for preference optimization (DEPO), a simple and scalable method that leverages historical data to construct an extra uncertainty bonus for high-uncertainty regions, encouraging exploration toward potentially high-value data. Theoretically, we provide a data-dependent regret bound for the proposed algorithm, showing that it adapts to the hardness of the learning task itself and can be tighter than worst-case bounds in practice. Empirically, the proposed method consistently outperforms strong baselines across benchmarks, demonstrating improved sample efficiency.
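
To make the idea of an uncertainty bonus built from historical data concrete, the following is a minimal sketch and not the paper's DEPO algorithm. It assumes a toy setting with discrete prompts and responses and swaps in a simple count-based bonus, beta / sqrt(1 + n), where n is how often a (prompt, response) pair appears in past comparisons. The names `build_counts`, `uncertainty_bonus`, `pick_exploratory_response`, `reward_estimate`, and `beta` are all hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def build_counts(history):
    """Count how often each (prompt, response) pair appears in past comparisons.

    `history` is a list of (prompt, response_a, response_b) tuples.
    """
    counts = Counter()
    for prompt, resp_a, resp_b in history:
        counts[(prompt, resp_a)] += 1
        counts[(prompt, resp_b)] += 1
    return counts


def uncertainty_bonus(counts, prompt, response, beta=1.0):
    """Count-based bonus (an assumption, not the paper's formula).

    The bonus is largest for pairs never seen in the historical data
    and shrinks as more comparisons involving the pair are collected.
    """
    return beta / math.sqrt(1 + counts[(prompt, response)])


def pick_exploratory_response(counts, prompt, candidates, reward_estimate, beta=1.0):
    """Return the candidate with the highest estimated reward plus bonus."""
    return max(
        candidates,
        key=lambda r: reward_estimate(prompt, r)
        + uncertainty_bonus(counts, prompt, r, beta),
    )


# Usage: with a flat reward estimate, the never-compared response is chosen.
history = [("p", "a", "b"), ("p", "a", "c")]
counts = build_counts(history)
chosen = pick_exploratory_response(
    counts, "p", ["a", "b", "d"], reward_estimate=lambda prompt, r: 0.0
)
```

In this sketch the bonus depends only on the data collected so far, not on an expectation under the current policy, which is the contrast the abstract draws with existing strategies. A real LLM setting would need a notion of uncertainty over continuous representations rather than exact-match counts.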

Problem

Research questions and friction points this paper is trying to address.

online reinforcement learning

human feedback

exploration

sample efficiency

preference optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

data-dependent exploration

online reinforcement learning

human feedback