🤖 AI Summary
Reinforcement learning (RL) training is inherently unstable, and RLHF and RLAIF add further difficulty: heterogeneous human preferences complicate alignment, and reward-model prediction errors can grow more severe as the LLM generates unseen outputs. To address this, we adapt reverse cross-entropy (RCE), a supervised-learning loss for noisy data, into a symmetric RL loss for the first time, improving robustness to moving targets and high gradient variance. Building on this loss, we design two algorithms, Symmetric A2C (SA2C) and Symmetric PPO (SPPO), and evaluate them, with and without added noise, across Atari, MuJoCo, Box2D, and LLM-based RLHF tasks, including IMDB positive sentiment generation and TL;DR summarization, demonstrating improvements in training stability and final performance across tasks and scales. Notably, SPPO is exceptionally robust under hyperparameter perturbations.
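For reference, the symmetric loss in supervised learning combines the usual cross-entropy with its reverse, in which the roles of the prediction $p$ and the (possibly noisy) target $q$ are swapped and $\log 0$ is clamped to a negative constant $A$ so the reverse term stays finite. A sketch of that supervised form (the weights $\alpha$, $\beta$ and the clamp constant $A$ are conventions from the noisy-label literature, not values taken from this paper):

$$
\mathcal{L}_{\mathrm{sym}} \;=\; \alpha \Big(-\sum_{k} q_k \log p_k\Big) \;+\; \beta \Big(-\sum_{k} p_k \log q_k\Big), \qquad \log 0 := A < 0.
$$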
📝 Abstract
Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) can introduce additional difficulty: differing preferences can complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. To enhance training robustness, RL has adopted techniques from supervised learning, such as ensembles and layer normalization. In this work, we improve the stability of RL training by adapting reverse cross-entropy (RCE), a supervised-learning loss for noisy data, to define a symmetric RL loss. We demonstrate performance improvements across various tasks and scales. We conduct experiments on discrete action tasks (Atari games) and continuous action space tasks (the MuJoCo benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO), with and without added noise; SPPO shows especially notable performance across different hyperparameters. Furthermore, we validate the benefits of the symmetric RL loss for large language models through improved SPPO performance on RLHF tasks such as IMDB positive sentiment generation and TL;DR summarization.
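As a concrete illustration, here is a minimal sketch of how a reverse term might be attached to a policy-gradient loss, assuming the sampled action acts as a one-hot pseudo-target and $\log 0$ is clamped to a constant, as in the supervised formulation above. The function name, the advantage weighting, and the default coefficients are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def symmetric_policy_loss(logits, actions, advantages,
                          alpha=1.0, beta=0.1, log_zero=-4.0):
    """Sketch of a symmetric policy-gradient loss (illustrative, not the
    paper's exact objective).

    logits:     (batch, num_actions) policy logits
    actions:    (batch,) sampled action indices (long), used as one-hot targets
    advantages: (batch,) advantage estimates
    """
    log_probs = F.log_softmax(logits, dim=-1)          # log pi(.|s)
    probs = log_probs.exp()

    # Forward term: -log pi(a|s), the usual policy-gradient cross-entropy.
    ce = -log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Reverse term: prediction and one-hot target swap roles,
    # -sum_k pi(k|s) * log 1{k = a}, with log 0 clamped to `log_zero`;
    # this reduces to -log_zero * (1 - pi(a|s)), which is bounded.
    target = F.one_hot(actions, num_classes=logits.size(-1)).float()
    rce = -(probs * torch.log(target).clamp(min=log_zero)).sum(dim=-1)

    # Advantage-weighted symmetric combination, averaged over the batch.
    return (advantages * (alpha * ce + beta * rce)).mean()

# Toy usage on random data.
logits = torch.randn(8, 4)              # 8 states, 4 discrete actions
actions = torch.randint(0, 4, (8,))
advantages = torch.randn(8)
loss = symmetric_policy_loss(logits, actions, advantages)
```

Because the reverse term is bounded (at most `beta * -log_zero` per sample), it contributes a finite correction alongside the unbounded forward term, which is one intuition for why a symmetric loss can damp gradient variance.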