Learning Kernel-Based MDPs from Episodic Preferential Feedback

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of efficiently learning optimal policies in kernelized Markov decision processes (MDPs) when only episode-level pairwise preference feedback, rather than true rewards, is available. Trajectory preferences are modeled with a Bradley–Terry–Luce (BTL) link applied to the difference of the two trajectories' cumulative rewards, so rewards are captured only implicitly. The paper introduces the first theoretical reinforcement learning framework tailored to kernelized MDPs under pure preference feedback. It develops a preference-based value estimator and a high-probability confidence set construction aligned with end-of-episode comparisons. By jointly modeling the reward and transition functions using kernel methods and integrating the BTL link function with confidence-bound analysis, the study establishes, for the first time, a high-probability regret bound that is sublinear in the number of episodes, which implies that the learned policy’s value converges to the optimal value.

📝 Abstract

Human feedback often arrives as preferences rather than calibrated numeric rewards, motivating reinforcement learning from preferential feedback, also referred to as reinforcement learning from human feedback (RLHF). We present a rigorous theoretical study of preference-only learning in episodic kernel MDPs. In each episode, the learner deploys two policies from a common start state and receives a single binary label indicating which trajectory is preferred, modeled by a Bradley–Terry–Luce link on the difference of cumulative (unobserved) rewards. Under kernel-based assumptions on the reward and transition functions (one of the most general models amenable to theoretical analysis), we develop preference-based value estimation and confidence sets tailored to end-of-episode comparisons. We prove high-probability regret bounds that scale sublinearly in the number of episodes, implying that the value of the learned policy converges to that of the optimal policy.
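
The feedback model described in the abstract can be illustrated with a minimal sketch. It assumes the Bradley–Terry–Luce link is the standard logistic function applied to the difference of the two trajectories' cumulative rewards. The function names and the example reward values are hypothetical and chosen only for illustration; they are not taken from the paper. In the actual setting the per-step rewards are never shown to the learner, which sees only the binary label.

```python
import math
import random


def btl_preference_prob(rewards_a, rewards_b):
    """Probability that trajectory A is preferred over trajectory B.

    Assumption: the BTL link is the logistic function of the difference
    of cumulative rewards. The rewards are hidden from the learner.
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Numerically stable logistic function.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def sample_preference(rewards_a, rewards_b, rng):
    """Draw the single binary label for one episode (1 means A preferred)."""
    return 1 if rng.random() < btl_preference_prob(rewards_a, rewards_b) else 0


# Hypothetical per-step rewards for the two trajectories of one episode.
rng = random.Random(0)
traj_a = [0.5, 0.2, 0.3]
traj_b = [0.1, 0.1, 0.1]

p = btl_preference_prob(traj_a, traj_b)
label = sample_preference(traj_a, traj_b, rng)
```

Because the label depends only on the difference of cumulative rewards, adding the same constant to both trajectories' returns leaves the preference probability unchanged. This is one reason the rewards are identified only through comparisons.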

Problem

Research questions and friction points this paper is trying to address.

preference-based reinforcement learning

episodic MDPs

kernel methods

human feedback

Bradley-Terry-Luce model

Innovation

Methods, ideas, or system contributions that make the work stand out.

preference-based reinforcement learning

kernel MDPs

Bradley-Terry-Luce model