🤖 AI Summary
This work addresses the catastrophic forgetting of safety behaviors acquired during reinforcement learning from human feedback (RLHF) when large language models undergo downstream fine-tuning. To mitigate this issue, the authors propose Fine-tuning Robust Policy Optimization (FRPO), a novel approach that proactively enhances policy robustness during the RLHF stage. FRPO employs a max-min optimization framework to maximize the worst-case reward within a KL-divergence-constrained policy neighborhood, thereby stabilizing the policy against subsequent fine-tuning. The method incurs no additional computational overhead and consistently alleviates safety degradation across diverse base models and fine-tuning settings—including supervised fine-tuning (SFT) and reinforcement learning—while preserving performance on downstream tasks and mathematical reasoning benchmarks.
📝 Abstract
Large language models are commonly trained through multi-stage post-training: first aligned via RLHF, then fine-tuned for other downstream objectives. Yet even small downstream updates can compromise earlier-learned behaviors (e.g., safety), exposing a brittleness known as catastrophic forgetting. This suggests that standard RLHF objectives do not guarantee robustness to future adaptation. To address this, most prior work designs downstream-time methods to preserve previously learned behaviors. We argue that preventing this degradation requires pre-finetuning robustness: the base policy should avoid brittle high-reward solutions whose reward drops sharply under standard fine-tuning. We propose Fine-tuning Robust Policy Optimization (FRPO), a robust RLHF framework that optimizes reward not only at the current policy but across a KL-bounded neighborhood of policies reachable by downstream adaptation. The key idea is to ensure reward stability under policy shifts via a max-min formulation. By modifying GRPO, we develop an algorithm that adds no extra computation, and we empirically show that it substantially reduces safety degradation across multiple base models and downstream fine-tuning regimes (SFT and RL) while preserving downstream task performance. We further study a math-focused RL setting, demonstrating that FRPO preserves accuracy under subsequent fine-tuning.
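To make the max-min formulation described above concrete, the following is a minimal sketch of the kind of objective the abstract suggests: maximize the worst-case expected reward over policies within a KL ball around the current policy. All notation here is an illustrative assumption rather than the paper's own (reward model r, prompt distribution D, KL radius ε, and the direction of the KL constraint).

```latex
% Illustrative max-min RLHF objective (notation assumed, not taken from the paper):
%   \pi_\theta -- policy being trained with RLHF
%   \pi'       -- perturbed policy reachable by downstream fine-tuning
%   r(x, y)    -- reward assigned to response y for prompt x
%   \epsilon   -- KL radius bounding how far fine-tuning may move the policy
\max_{\theta} \;
\min_{\pi' \,:\, D_{\mathrm{KL}}\left(\pi' \,\|\, \pi_{\theta}\right) \le \epsilon} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi'(\cdot \mid x)}
\left[ r(x, y) \right]
```

Read this way, ε controls how large a downstream policy shift the RLHF-trained policy is asked to tolerate: the inner minimization penalizes high-reward solutions whose reward collapses for nearby policies, which is the pre-finetuning robustness the abstract argues for.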