Fusing Rewards and Preferences in Reinforcement Learning

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenges of jointly leveraging individual reward signals and pairwise preference feedback in reinforcement learning: heterogeneous signals are hard to fuse, and training is often unstable. The authors propose the Dual-Feedback Actor (DFA) framework, which unifies the two feedback modalities. DFA directly parameterizes Bradley–Terry preference probabilities with policy log-probabilities, bypassing explicit reward modeling, and generates preference samples online from off-policy Q-value estimates, supporting both human annotations and synthetic data. Theoretically, the authors prove that DFA's policy update is equivalent to that of entropy-regularized Soft Actor-Critic (SAC), establishing a rigorous unification of reward- and preference-based learning within a single policy-gradient framework. Empirically, DFA matches or surpasses SAC across six continuous-control benchmarks while significantly reducing training variance; on a stochastic GridWorld task, it outperforms classical RLHF methods and approaches the performance of a true-reward oracle.

📝 Abstract
We present Dual-Feedback Actor (DFA), a reinforcement learning algorithm that fuses individual rewards and pairwise preferences (when available) into a single update rule. DFA uses the policy's log-probabilities directly to model the preference probability, avoiding a separate reward-modeling step. Preferences can be provided by human annotators (at the state or trajectory level) or synthesized online from Q-values stored in an off-policy replay buffer. Under a Bradley–Terry model, we prove that minimizing DFA's preference loss recovers the entropy-regularized Soft Actor-Critic (SAC) policy. Our simulation results show that DFA trained on generated preferences matches or exceeds SAC on six control environments while exhibiting more stable training. With only a semi-synthetic preference dataset under the Bradley–Terry model, our algorithm outperforms reward-modeling reinforcement learning from human feedback (RLHF) baselines in a stochastic GridWorld and approaches the performance of an oracle with true rewards.
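The abstract's core idea, modeling the Bradley–Terry preference probability directly with policy log-probabilities so that no separate reward model is needed, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the use of plain NumPy arrays of log-probabilities are assumptions for exposition.

```python
import numpy as np

def bradley_terry_pref_prob(logp_a, logp_b):
    """P(a preferred over b) under a Bradley-Terry model whose scores are
    the policy log-probabilities; with log-prob scores this reduces to a
    sigmoid of the log-probability difference."""
    return 1.0 / (1.0 + np.exp(logp_b - logp_a))

def preference_loss(logp_winners, logp_losers):
    """Negative log-likelihood of observed preference labels. Minimizing
    this pushes probability mass toward preferred actions directly,
    with no intermediate reward-modeling step."""
    p = bradley_terry_pref_prob(logp_winners, logp_losers)
    return -np.mean(np.log(p))
```

With equal log-probabilities the loss sits at log 2, and it decreases as the policy assigns more probability to the preferred action; the paper's theoretical result is that minimizing this loss recovers the entropy-regularized SAC policy.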
Problem

Research questions and friction points this paper is trying to address.

How to fuse heterogeneous reward and preference signals in a single RL update
How to avoid a separate reward-modeling step when learning from preferences
How to match reward-modeling baselines with only semi-synthetic preference data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses rewards and preferences in RL updates
Uses policy log-probabilities for preference modeling
Incorporates human or synthetic Q-value preferences
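The last innovation bullet, generating preference labels from off-policy Q-values when human annotations are unavailable, can be sketched as below. The soft-labeling scheme (a sigmoid over the Q-value gap with a temperature) and all names are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_preferences(q_a, q_b, temperature=1.0):
    """Synthetic preference labels from a critic's Q-values: action a is
    marked preferred over action b with probability
    sigmoid((Q(s,a) - Q(s,b)) / temperature). The Q-values would come
    from an off-policy replay buffer; this sampling scheme is an
    illustrative assumption."""
    p_a_wins = 1.0 / (1.0 + np.exp(-(q_a - q_b) / temperature))
    return rng.random(p_a_wins.shape) < p_a_wins
```

Sampling labels stochastically (rather than always picking the higher-Q action) keeps the synthetic data consistent with a Bradley–Terry generative model, which is the setting in which the paper's equivalence result holds.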
Sadegh Khorasani
School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland
Saber Salehkaleybar
Leiden University
Causal Inference, Stochastic Optimization, Reinforcement Learning
Negar Kiyavash
École polytechnique fédérale de Lausanne (EPFL)
Causality, Applied Probability, Network Forensics, Random Graphs, Time Series
Matthias Grossglauser
School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland