Fusing Rewards and Preferences in Reinforcement Learning

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenges of jointly leveraging individual reward signals and pairwise preference feedback in reinforcement learning: heterogeneous signals are hard to fuse, and training is often unstable. The authors propose the Dual-Feedback Actor (DFA) framework, which unifies the two feedback modalities. DFA directly parameterizes Bradley–Terry preference probabilities with policy log-probabilities, bypassing explicit reward modeling, and generates preference samples online from off-policy Q-value estimates, supporting both human annotations and synthetic data. Theoretically, the authors prove that DFA's policy update is equivalent to that of entropy-regularized Soft Actor-Critic (SAC), establishing a rigorous unification of reward- and preference-based learning within a single policy-gradient framework. Empirically, DFA matches or surpasses SAC across six continuous-control benchmarks while significantly reducing training variance; on a stochastic GridWorld task, it outperforms classical RLHF methods and approaches the performance of a true-reward oracle.

📝 Abstract
We present Dual-Feedback Actor (DFA), a reinforcement learning algorithm that fuses individual rewards and pairwise preferences (when available) into a single update rule. DFA uses the policy's log-probabilities directly to model the preference probability, avoiding a separate reward-modeling step. Preferences can be provided by human annotators (at the state or trajectory level) or synthesized online from Q-values stored in an off-policy replay buffer. Under a Bradley–Terry model, we prove that minimizing DFA's preference loss recovers the entropy-regularized Soft Actor-Critic (SAC) policy. Our simulation results show that DFA trained on generated preferences matches or exceeds SAC on six control environments while exhibiting more stable training. With only a semi-synthetic preference dataset under the Bradley–Terry model, our algorithm outperforms reward-modeling reinforcement learning from human feedback (RLHF) baselines in a stochastic GridWorld and approaches the performance of an oracle with true rewards.
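The abstract's core idea, modeling the Bradley–Terry preference probability directly with policy log-probabilities so that no separate reward model is needed, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the use of plain NumPy arrays of log-probabilities are assumptions for exposition.

```python
import numpy as np

def bradley_terry_pref_prob(logp_a, logp_b):
    """P(a preferred over b) under a Bradley-Terry model whose scores are
    the policy log-probabilities; with log-prob scores this reduces to a
    sigmoid of the log-probability difference."""
    return 1.0 / (1.0 + np.exp(logp_b - logp_a))

def preference_loss(logp_winners, logp_losers):
    """Negative log-likelihood of observed preference labels. Minimizing
    this pushes probability mass toward preferred actions directly,
    with no intermediate reward-modeling step."""
    p = bradley_terry_pref_prob(logp_winners, logp_losers)
    return -np.mean(np.log(p))
```

With equal log-probabilities the loss sits at log 2, and it decreases as the policy assigns more probability to the preferred action; the paper's theoretical result is that minimizing this loss recovers the entropy-regularized SAC policy.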
Problem

Research questions and friction points this paper is trying to address.

How to fuse heterogeneous reward and preference signals in a single RL update
How to avoid a separate reward-modeling step when learning from preferences
How to match reward-modeling baselines with only semi-synthetic preference data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses rewards and preferences in RL updates
Uses policy log-probabilities for preference modeling
Incorporates human or synthetic Q-value preferences
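The last innovation bullet, generating preference labels from off-policy Q-values when human annotations are unavailable, can be sketched as below. The soft-labeling scheme (a sigmoid over the Q-value gap with a temperature) and all names are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_preferences(q_a, q_b, temperature=1.0):
    """Synthetic preference labels from a critic's Q-values: action a is
    marked preferred over action b with probability
    sigmoid((Q(s,a) - Q(s,b)) / temperature). The Q-values would come
    from an off-policy replay buffer; this sampling scheme is an
    illustrative assumption."""
    p_a_wins = 1.0 / (1.0 + np.exp(-(q_a - q_b) / temperature))
    return rng.random(p_a_wins.shape) < p_a_wins
```

Sampling labels stochastically (rather than always picking the higher-Q action) keeps the synthetic data consistent with a Bradley–Terry generative model, which is the setting in which the paper's equivalence result holds.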
Sadegh Khorasani
School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland
Saber Salehkaleybar
Leiden University
Causal Inference, Stochastic Optimization, Reinforcement Learning
Negar Kiyavash
École polytechnique fédérale de Lausanne (EPFL)
Causality, Applied Probability, Network Forensics, Random Graphs, Time Series
Matthias Grossglauser
School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland