🤖 AI Summary
This work addresses the challenge of manually designing reward functions in reinforcement learning by proposing R4, a method that automatically learns reward functions from discrete human ratings (e.g., “poor,” “medium,” “good”) assigned to trajectories. R4 introduces a differentiable ranking operator to produce soft rankings and formulates a ranking mean squared error (rMSE) loss function. Notably, it provides the first theoretical guarantees—specifically, minimality and completeness—on the solution set for rating-based reward learning. Empirical evaluations on robotic control tasks from OpenAI Gym and DeepMind Control Suite demonstrate that R4 achieves performance on par with or superior to existing rating- and preference-based methods, using only a small amount of human feedback.
📝 Abstract
Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from human feedback in the form of ratings, rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce a new rating-based RL method, Ranked Return Regression for RL (R4). At its core, R4 employs a novel ranking mean squared error (rMSE) loss, which treats teacher-provided ratings as ordinal targets. Our approach learns from a dataset of trajectory-rating pairs, where each trajectory is labeled with a discrete rating (e.g., "bad," "neutral," "good"). At each training step, we sample a set of trajectories, predict their returns, and rank them using a differentiable sorting operator (soft ranks). We then optimize a mean squared error loss between the resulting soft ranks and the teacher's ratings. Unlike prior rating-based approaches, R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using simulated human feedback, we demonstrate that R4 consistently matches or outperforms existing rating- and preference-based RL methods on robotic locomotion benchmarks from OpenAI Gym and the DeepMind Control Suite, while requiring significantly less feedback.
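To make the training step described above concrete, the sketch below shows one way a ranking MSE over soft ranks could be implemented in PyTorch. The pairwise-sigmoid soft-rank operator, the `temperature` parameter, and the linear mapping from discrete ratings to rank-scale targets are illustrative assumptions, not the paper's exact operator or loss; they are only meant to convey the idea of regressing differentiable ranks onto ordinal ratings.

```python
import torch


def soft_rank(scores: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Differentiable rank approximation via pairwise sigmoids (illustrative, not the paper's operator).

    Each item's soft rank is 1 plus the soft count of items it outscores,
    so higher predicted returns receive higher soft ranks.
    """
    # Pairwise differences: diff[i, j] = scores[i] - scores[j]
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)
    # Summing sigmoids approximates how many items each score exceeds;
    # subtract 0.5 to cancel the self-comparison term sigmoid(0).
    return 1.0 + torch.sigmoid(diff / temperature).sum(dim=1) - 0.5


def rmse_loss(predicted_returns: torch.Tensor,
              ratings: torch.Tensor,
              num_levels: int,
              temperature: float = 1.0) -> torch.Tensor:
    """Mean squared error between soft ranks of predicted returns and rating-derived targets.

    `ratings` are integer labels in {0, ..., num_levels - 1}; mapping them linearly
    onto the [1, batch_size] rank scale is an assumption made for this sketch.
    """
    n = predicted_returns.shape[0]
    ranks = soft_rank(predicted_returns, temperature)
    targets = 1.0 + ratings.float() / (num_levels - 1) * (n - 1)
    return torch.mean((ranks - targets) ** 2)


# Example: a batch of 4 trajectories with predicted returns and ratings in {0, 1, 2}.
returns = torch.tensor([0.3, 1.2, -0.5, 0.9], requires_grad=True)
ratings = torch.tensor([1, 2, 0, 2])
loss = rmse_loss(returns, ratings, num_levels=3)
loss.backward()  # gradients flow through the soft ranks back to the return predictions
```

Because the soft-rank operator is differentiable, gradients of the rMSE loss propagate through the ranking step into the return (reward) model, which is what allows ordinal ratings to supervise reward learning end to end.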