🤖 AI Summary
This work addresses instability and the lack of last-iterate convergence guarantees in existing reinforcement learning from human feedback (RLHF) methods under parameterized policies, particularly in multi-objective safety alignment settings. The authors propose a unified primal-dual framework that encompasses mainstream safe alignment algorithms and introduce an Optimistic Primal-Dual (OPD) algorithm, which stabilizes saddle-point dynamics through predictive updates. For the first time, the paper establishes last-iterate convergence guarantees for safe large language model (LLM) alignment under parameterized policies, highlighting the critical role of optimism in mitigating oscillations between competing objectives and bridging the gap between theory and practice. Under exact optimization, last-iterate convergence is rigorously proven; under parameterized policies, the method converges to a neighborhood of the optimal solution, with error bounds determined by approximation error and estimation bias, thereby substantially improving training stability.
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) plays a significant role in aligning Large Language Models (LLMs) with human preferences. While RLHF with expected reward constraints can be formulated as a primal-dual optimization problem, standard primal-dual methods guarantee convergence only for distributional policies, where the saddle-point problem is convex-concave. Moreover, standard primal-dual methods may exhibit instability or divergence in the last iterate under policy parameterization in practical applications. In this work, we propose a universal primal-dual framework for safe RLHF that unifies a broad class of existing alignment algorithms, including safe-RLHF as well as one-shot and multi-shot methods. Building on this framework, we introduce an optimistic primal-dual (OPD) algorithm that incorporates predictive updates for both primal and dual variables to stabilize saddle-point dynamics. We establish last-iterate convergence guarantees for the proposed method, covering both exact policy optimization in the distributional space and, under parameterized policies, convergence to a neighborhood of the optimal solution whose size depends on approximation error and estimation bias. Our analysis reveals that optimism plays a crucial role in mitigating oscillations inherent to constrained alignment objectives, thereby closing a key theoretical gap between constrained RL and practical RLHF.
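To give intuition for why the predictive (optimistic) correction stabilizes the last iterate, here is a minimal toy sketch — not the paper's OPD algorithm, which operates on LLM policies and Lagrange multipliers — contrasting plain gradient descent-ascent (GDA) with optimistic GDA on the bilinear saddle-point objective L(θ, λ) = θ·λ, whose unique saddle point is (0, 0). Plain GDA's last iterate spirals outward, while the optimistic update 2·gₜ − gₜ₋₁ (a one-step gradient prediction) pulls it in:

```python
def gda(theta, lam, eta, steps):
    """Plain gradient descent-ascent on L(theta, lam) = theta * lam:
    descend in the primal variable theta, ascend in the dual variable lam."""
    for _ in range(steps):
        g_theta, g_lam = lam, theta  # dL/dtheta = lam, dL/dlam = theta
        theta, lam = theta - eta * g_theta, lam + eta * g_lam
    return theta, lam


def ogda(theta, lam, eta, steps):
    """Optimistic GDA: replace the gradient with the extrapolation
    2 * g_t - g_{t-1}, i.e. a one-step prediction of the next gradient."""
    prev_g_theta, prev_g_lam = lam, theta  # initialize the past gradient
    for _ in range(steps):
        g_theta, g_lam = lam, theta
        theta = theta - eta * (2 * g_theta - prev_g_theta)
        lam = lam + eta * (2 * g_lam - prev_g_lam)
        prev_g_theta, prev_g_lam = g_theta, g_lam
    return theta, lam


if __name__ == "__main__":
    t1, l1 = gda(1.0, 1.0, 0.1, 500)
    t2, l2 = ogda(1.0, 1.0, 0.1, 500)
    print("GDA  distance from saddle:", (t1**2 + l1**2) ** 0.5)  # grows
    print("OGDA distance from saddle:", (t2**2 + l2**2) ** 0.5)  # shrinks
```

On this toy problem one can check directly that each GDA step multiplies the squared distance to the saddle by (1 + η²), so the last iterate diverges, whereas the optimistic iterate contracts toward (0, 0) — the one-dimensional analogue of the oscillation-damping role the abstract attributes to optimism.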