General Exploratory Bonus for Optimistic Exploration in RLHF

📅 2025-09-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In RLHF, existing KL- or α-divergence-based exploration incentives suffer from regularization bias, steering policies toward high-probability regions of the reference model and resulting in overly conservative behavior and insufficient exploration. To address this, we propose the General Exploratory Bonus (GEB), the first theoretically grounded method ensuring optimistic exploration: GEB explicitly counteracts divergence-induced bias via reference-dependent reward shaping, unifying and generalizing multiple heuristic exploration rewards. Crucially, GEB is compatible with the full α-divergence family of regularizers, enabling flexible design of exploration mechanisms. Extensive experiments across diverse alignment tasks—including instruction following, math reasoning, and safety alignment—and across language models of varying scales demonstrate that GEB significantly improves both sample efficiency and final performance. These results empirically validate the theoretical guarantees of GEB, confirming its effectiveness and practical utility in real-world RLHF settings.

📝 Abstract
Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods often fail to realize optimism. We provide a theoretical analysis showing that current formulations, under KL or $\alpha$-divergence regularization, unintentionally bias exploration toward high-probability regions of the reference model, thereby reinforcing conservative behavior instead of promoting discovery of uncertain regions. To address this pitfall, we introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle. GEB counteracts divergence-induced bias via reference-dependent reward regulation and unifies prior heuristic bonuses as special cases, while extending naturally across the full $\alpha$-divergence family. Empirically, GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones. These results demonstrate that GEB offers both a principled and practical solution for optimistic exploration in RLHF.
Problem

Research questions and friction points this paper is trying to address.

Addresses biased exploration in RLHF due to divergence regularization
Introduces General Exploratory Bonus framework for optimistic exploration
Improves alignment performance across divergence settings and model architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

GEB framework counteracts divergence-induced bias in RLHF
GEB unifies prior heuristic bonuses as special cases
GEB extends naturally across full α-divergence family
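The core idea of reference-dependent reward shaping can be sketched in a few lines. This is an illustrative toy, not the paper's exact GEB formula: it assumes a bonus proportional to $-\log \pi_{\mathrm{ref}}(a \mid x)$, so that responses the reference model considers unlikely receive a larger incentive, offsetting the KL penalty's pull toward high-probability reference regions. The function name `shaped_reward` and the coefficient `beta` are hypothetical choices for this sketch.

```python
import numpy as np

def shaped_reward(reward, logp_ref, beta=0.1):
    """Illustrative reference-dependent reward shaping.

    NOTE: this is a simplified stand-in for the paper's GEB bonus. The bonus
    -beta * log pi_ref(a|x) grows for responses that are unlikely under the
    reference model, counteracting the regularizer's bias toward
    high-probability reference regions.
    """
    bonus = -beta * logp_ref
    return reward + bonus

# Two candidate responses with equal base reward: the one that is less
# likely under the reference model receives the larger shaped reward.
base = np.array([1.0, 1.0])
logp_ref = np.array([-0.5, -5.0])  # likely vs. unlikely under pi_ref
print(shaped_reward(base, logp_ref))
```

Under this toy shaping, exploration pressure is concentrated exactly where a KL- or α-divergence penalty would otherwise suppress it; the paper's contribution is a general, provably optimistic family of such bonuses across α-divergence regularizers.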