Reinforcement Unlearning via Group Relative Policy Optimization

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of unlearning sensitive or copyrighted data unintentionally memorized by large language models during pretraining, which poses a barrier to compliance with regulatory requirements such as the “right to be forgotten” under GDPR. The study formalizes model unlearning as a verifiable reinforcement learning task and introduces a novel forgetting framework based on Group Relative Policy Optimization. By designing an intrinsic reward mechanism that penalizes any mention of target concepts, the method achieves efficient and secure information removal without relying on external reward models. Experimental results on the RWKU benchmark demonstrate that the approach attains 11% forgetting effectiveness while preserving 98% of the model’s original utility, reduces usage of target tokens by 46-fold, improves generation fluency by 5.48%, and enhances adversarial robustness by 12.02%.

📝 Abstract
During pretraining, LLMs inadvertently memorize sensitive or copyrighted data, posing significant compliance challenges under legal frameworks like the GDPR and the EU AI Act. Fulfilling these mandates demands techniques that can remove information from a deployed model without retraining from scratch. Existing unlearning approaches attempt to address this need, but often leak the very data they aim to erase, sacrifice fluency and robustness, or depend on costly external reward models. We introduce PURGE (Policy Unlearning through Relative Group Erasure), a novel method grounded in the Group Relative Policy Optimization framework that formulates unlearning as a verifiable problem. PURGE uses an intrinsic reward signal that penalizes any mention of forbidden concepts, allowing safe and consistent unlearning. Our approach reduces token usage per target by up to a factor of 46 compared with state-of-the-art methods, while improving fluency by 5.48 percent and adversarial robustness by 12.02 percent over the base model. On the Real World Knowledge Unlearning (RWKU) benchmark, PURGE achieves 11 percent unlearning effectiveness while preserving 98 percent of original utility. PURGE shows that framing LLM unlearning as a verifiable task enables more reliable, efficient, and scalable forgetting, suggesting a promising new direction for unlearning research that combines theoretical guarantees, improved safety, and practical deployment efficiency.
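The abstract's core idea can be sketched in a few lines: score each sampled completion with an intrinsic reward that needs no external reward model, then normalize rewards within the sampled group, GRPO-style. This is a minimal illustrative sketch, not the paper's implementation; the exact reward shape, the forbidden-term matching, and all names below (`intrinsic_reward`, `group_relative_advantages`, the example concept) are assumptions.

```python
import statistics

def intrinsic_reward(completion: str, forbidden: set[str]) -> float:
    """Hypothetical intrinsic reward: penalize any mention of a
    forbidden concept, reward completions that avoid it entirely."""
    text = completion.lower()
    return -1.0 if any(term in text for term in forbidden) else 1.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each reward against the mean
    and standard deviation of its sampled group, so no learned value
    function or external reward model is required."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: four sampled completions for one prompt about a target concept.
forbidden = {"harry potter"}
group = [
    "Harry Potter is a fictional wizard.",
    "I cannot discuss that topic.",
    "The book features Harry Potter.",
    "That subject is outside my knowledge.",
]
rewards = [intrinsic_reward(c, forbidden) for c in group]
advantages = group_relative_advantages(rewards)
# Completions mentioning the target get negative advantage; the policy
# gradient step would push probability mass toward the others.
```

Because the reward is computed directly from the completion text, it is cheaply verifiable, which is what lets the abstract frame unlearning as a "verifiable problem" without a costly external reward model.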
Problem

Research questions and friction points this paper is trying to address.

LLM unlearning
data privacy
compliance
machine unlearning
sensitive data removal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Unlearning
Group Relative Policy Optimization
Intrinsic Reward
LLM Unlearning
Verifiable Forgetting