RULE: Reinforcement UnLEarning Achieves Forget-Retain Pareto Optimality

📅 2025-06-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Selective forgetting of sensitive, copyrighted, or illegal content in large language models (LLMs) remains challenging without costly retraining. Method: This paper proposes RULE, a reinforcement-learning framework that formulates unlearning as a refusal boundary optimization problem. Using only a small set of authentic forget samples (12% of the forget set) and lightweight synthesized boundary queries (8%), it optimizes a verifiable forget-retain reward function to achieve Pareto-optimal trade-offs. Contribution/Results: The method enables semantically generalizable refusal while improving forget quality (+17.5%) and response naturalness (+16.3%), with no degradation in general capabilities, and it generalizes well to unseen but semantically related queries. Notably, it introduces a verifiable reward mechanism for forgetting, improving the training efficiency and controllability of the unlearning process.
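To make the "verifiable reward" idea concrete, here is a minimal hypothetical sketch (not the paper's actual implementation): the reward can be checked mechanically by testing whether the model refuses forget-related queries and answers permissible ones. The refusal markers and labels below are illustrative assumptions.

```python
# Hypothetical verifiable forget-retain reward: +1 for refusing a
# forget-related query or answering a retain query, -1 otherwise.
# Marker phrases are an illustrative assumption, not from the paper.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm unable", "i am unable")

def is_refusal(response: str) -> bool:
    """Cheap, mechanically verifiable check for a refusal response."""
    return response.strip().lower().startswith(REFUSAL_MARKERS)

def reward(query_label: str, response: str) -> float:
    """query_label: 'forget' (model must refuse) or 'retain' (must answer)."""
    refused = is_refusal(response)
    if query_label == "forget":
        return 1.0 if refused else -1.0
    return 1.0 if not refused else -1.0

print(reward("forget", "I cannot help with that."))   # 1.0
print(reward("retain", "Sure, here is the answer."))  # 1.0
```

Because the reward is a deterministic check rather than a learned judge, it can be verified per rollout, which is what makes RL training on it efficient and controllable.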

๐Ÿ“ Abstract
The widespread deployment of Large Language Models (LLMs) trained on massive, uncurated corpora has raised growing concerns about the inclusion of sensitive, copyrighted, or illegal content. This has led to increasing interest in LLM unlearning: the task of selectively removing specific information from a model without retraining from scratch or degrading overall utility. However, existing methods often rely on large-scale forget and retain datasets, and suffer from unnatural responses, poor generalization, or catastrophic utility loss. In this work, we propose Reinforcement UnLearning (RULE), an efficient framework that formulates unlearning as a refusal boundary optimization problem. RULE is trained with a small portion of the forget set and synthesized boundary queries, using a verifiable reward function that encourages safe refusal on forget-related queries while preserving helpful responses on permissible inputs. We provide both theoretical and empirical evidence demonstrating the effectiveness of RULE in achieving targeted unlearning without compromising model utility. Experimental results show that, with only 12% of the forget set and 8% synthesized boundary data, RULE outperforms existing baselines by up to 17.5% in forget quality and 16.3% in response naturalness while maintaining general utility, achieving forget-retain Pareto optimality. Remarkably, we further observe that RULE improves the naturalness of model outputs, enhances training efficiency, and exhibits strong generalization ability, extending refusal behavior to semantically related but unseen queries.
Problem

Research questions and friction points this paper is trying to address.

Selectively remove sensitive content from LLMs without retraining
Optimize refusal boundary to balance forget-retain performance
Improve unlearning efficiency with minimal data and enhanced generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

RULE optimizes refusal boundary for unlearning
Uses verifiable reward for safe refusal
Achieves Pareto optimality with minimal data
Chenlong Zhang
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing, Large Language Models
Zhuoran Jin
Institute of Automation, Chinese Academy of Sciences
Large Language Models, Natural Language Processing, Knowledge Engineering
Hongbang Yuan
Institute of Automation, Chinese Academy of Sciences
Large Language Models, Natural Language Processing
Jiaheng Wei
The Hong Kong University of Science and Technology (Guangzhou)
Tong Zhou
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Kang Liu
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Jun Zhao
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Yubo Chen
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing, Information Extraction, Event Extraction, Large Language Models