Regularized Online RLHF with Generalized Bilinear Preferences

📅 2026-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of identifying Nash equilibria in contextual online reinforcement learning from human feedback (RLHF) under general, potentially non-transitive, preference structures. It proposes the first online RLHF framework that accommodates arbitrary strongly convex regularizers, overcoming the prior limitation to reverse-KL regularization. Building on a Generalized Bilinear Preference Model (GBPM) and leveraging assumptions of low-rank skew-symmetric structure and feature diversity, the authors design both a greedy sampling strategy and an Explore-Then-Commit algorithm. The theoretical analysis establishes a quadratic relationship between the duality gap and the estimation error, yielding a polylogarithmic regret bound of Õ(ηd⁴(log T)²) that is free of exponential dependence on the regularization parameter η. Moreover, in high-dimensional settings, the framework achieves a poly(d)-free regret bound of Õ(√(ηrT)), the first statistically efficient guarantee for online RLHF in high dimensions.
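
To make these statements concrete, the sketch below writes out the standard regularized preference game and its duality gap. The notation ($J$, $R$, $\mathrm{Gap}$) is assumed from the regularized-RLHF literature rather than copied from the paper; only the quadratic shape of the final bound is taken from the summary above, and the norm and constant are left unspecified.

```latex
% Regularized two-player value: P is the preference probability and R a
% strongly convex regularizer with strength \eta^{-1} (notation assumed).
J(\pi, \pi') = \mathbb{E}_{x}\,
  \mathbb{E}_{y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
  \bigl[ P(y \succ y' \mid x) \bigr]
  - \eta^{-1} R(\pi) + \eta^{-1} R(\pi')

% Duality gap of a policy \pi; the regularized Nash equilibrium has gap zero.
\mathrm{Gap}(\pi) = \max_{\pi'} J(\pi', \pi) - \min_{\pi'} J(\pi, \pi')

% Shape of the key structural result: for the greedy policy \hat{\pi}
% computed from an estimated model \hat{P}, the gap scales with the
% *squared* estimation error (exact norm and constant are the paper's).
\mathrm{Gap}(\hat{\pi}) \le C \cdot \| \hat{P} - P \|^{2}
```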

📝 Abstract
We consider the problem of contextual online RLHF with general preferences, where the goal is to identify the Nash Equilibrium. We adopt the Generalized Bilinear Preference Model (GBPM) to capture potentially intransitive preferences via low-rank, skew-symmetric matrices. We investigate general preference learning with any strongly convex regularizer (where $\eta^{-1}$ is the regularization strength), generalizing beyond prior works limited to reverse-KL regularization. Central to our analysis is proving that the dual gap of the greedy policy is bounded by the square of the estimation error, a result derived solely from strong convexity and the skew-symmetry of the GBPM. Building on this insight and a feature diversity assumption, we establish two regret bounds via two simple algorithms: (1) Greedy Sampling achieves polylogarithmic, $e^{O(\eta)}$-free regret $\tilde{O}(\eta d^4 (\log T)^2)$; (2) Explore-Then-Commit achieves $\mathrm{poly}(d)$-free regret $\tilde{O}(\sqrt{\eta r T})$ by exploiting the low-rank structure. This is the first statistically efficient guarantee for online RLHF in high dimensions.
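
As a concrete illustration of the preference model in the abstract, here is a minimal NumPy sketch of a bilinear preference with a low-rank skew-symmetric matrix. The logistic link and the factorization A = UVᵀ − VUᵀ are assumptions chosen for illustration, not necessarily the paper's exact parametrization; all names are hypothetical.

```python
import numpy as np

def make_skew_lowrank(U: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Build a skew-symmetric matrix of rank at most 2r from factors U, V in R^{d x r}.

    A = U V^T - V U^T satisfies A^T = -A; this is one standard way to realize
    the low-rank skew-symmetric structure the GBPM assumes (the paper's exact
    parametrization may differ).
    """
    return U @ V.T - V @ U.T

def gbpm_preference(phi_y: np.ndarray, phi_yp: np.ndarray, A: np.ndarray) -> float:
    """Probability that response y is preferred to y' under a bilinear model.

    P(y > y' | x) = sigma(phi(x, y)^T A phi(x, y')) with a logistic link sigma
    (an assumption here). Skew-symmetry of A gives P(y > y') + P(y' > y) = 1
    while still allowing intransitive cycles, unlike Bradley-Terry.
    """
    score = float(phi_y @ A @ phi_yp)
    return 1.0 / (1.0 + np.exp(-score))

# Toy check with d = 4 features and rank budget r = 1.
rng = np.random.default_rng(0)
d, r = 4, 1
A = make_skew_lowrank(rng.standard_normal((d, r)), rng.standard_normal((d, r)))
phi_a, phi_b = rng.standard_normal(d), rng.standard_normal(d)
p = gbpm_preference(phi_a, phi_b, A)
# Consistency: the two orderings' probabilities sum to one.
assert abs(p + gbpm_preference(phi_b, phi_a, A) - 1.0) < 1e-9
print(f"P(a > b) = {p:.3f}")
```

Skew-symmetry is what lets the model encode rock-paper-scissors style preference cycles that no single scalar reward can represent, which is why the solution concept is a Nash Equilibrium rather than a reward maximizer.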
Problem

Research questions and friction points this paper is trying to address.

online RLHF
general preferences
Nash Equilibrium
intransitive preferences
contextual bandits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalized Bilinear Preference Model
Online RLHF
Strongly Convex Regularization
Low-rank Skew-symmetric Matrix
Regret Bound