Asymmetric REINFORCE for Off-Policy Reinforcement Learning: Balancing Positive and Negative Rewards

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the imbalance between positive and negative reward signals in off-policy reinforcement learning, which limits its effectiveness for aligning large language models (LLMs). The authors propose an asymmetric policy update mechanism grounded in the advantage $A = r - V$, proving that policy improvement is guaranteed whenever the baseline $V$ lower-bounds the expected reward, which motivates weighting positive rewards more heavily than negative ones in the off-policy regime. Methodologically, they analyze a simple off-policy REINFORCE algorithm with a tunable baseline that interpolates between standard off-policy RL and supervised fine-tuning. They validate the findings in controlled stochastic bandit environments and by fine-tuning LLMs on reasoning tasks, observing improved training stability and performance relative to symmetric off-policy baselines.

📝 Abstract
Reinforcement learning (RL) is increasingly used to align large language models (LLMs). Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques, but often result in suboptimal performance. In this work, we study the intermediate range of algorithms between off-policy RL and supervised fine-tuning by analyzing a simple off-policy REINFORCE algorithm, where the advantage is defined as $A=r-V$, with $r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily. We first provide a theoretical analysis of this off-policy REINFORCE algorithm, showing that when the baseline $V$ lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones. We validate our findings experimentally in a controlled stochastic bandit setting and through fine-tuning state-of-the-art LLMs on reasoning tasks.
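The bandit setting described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it runs plain off-policy REINFORCE (no importance weights) on a two-armed Bernoulli bandit, sampling actions from a fixed, stale behavior policy while updating a softmax target policy with advantage $A = r - V$. The arm probabilities, learning rate, and step count are illustrative assumptions. Setting $V = 0$ lower-bounds the expected reward, so only positive rewards drive updates, matching the regime where the paper's policy improvement guarantee applies.

```python
import numpy as np

rng = np.random.default_rng(0)

def off_policy_reinforce(V=0.0, steps=2000, lr=0.1):
    """Off-policy REINFORCE on a 2-armed Bernoulli bandit with a tunable
    baseline V. A sketch of the paper's setting; all hyperparameters here
    are illustrative, not taken from the paper."""
    p_reward = np.array([0.3, 0.7])   # true success probability of each arm
    logits = np.zeros(2)              # parameters of the softmax target policy
    behavior = np.array([0.5, 0.5])   # fixed (stale) behavior policy

    for _ in range(steps):
        a = rng.choice(2, p=behavior)            # act with the behavior policy
        r = float(rng.random() < p_reward[a])    # Bernoulli reward in {0, 1}
        pi = np.exp(logits - logits.max())
        pi /= pi.sum()
        # REINFORCE update with advantage A = r - V; for a softmax policy,
        # grad of log pi(a) w.r.t. the logits is onehot(a) - pi.
        grad_logpi = -pi
        grad_logpi[a] += 1.0
        logits += lr * (r - V) * grad_logpi

    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    return pi

# With V = 0, only positive rewards update the policy, and the target
# policy concentrates on the higher-reward arm.
print(off_policy_reinforce(V=0.0))
```

Raising $V$ toward (or above) the expected reward shifts the emphasis onto penalizing low-reward samples, the regime the paper shows is riskier off-policy because the behavior policy may no longer cover the actions being pushed down.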
Problem

Research questions and friction points this paper is trying to address.

Balancing positive and negative rewards in off-policy RL
Improving performance of off-policy REINFORCE algorithms
Aligning large language models using reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric REINFORCE for off-policy RL
Balancing positive and negative rewards
Tunable baseline V for reward emphasis