Scalable and Robust LLM Unlearning by Correcting Responses with Retrieved Exclusions

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are prone to memorizing and leaking sensitive information from their training data, and existing unlearning methods rely predominantly on model fine-tuning, which suffers from poor robustness and limited scalability. This paper proposes CURE, a dynamic knowledge-unlearning framework that operates entirely at inference time: a lightweight corrector verifies whether the base model's outputs contain target knowledge and conditionally rewrites any detected leakage into a safe response. Its core innovation lies in decoupling unlearning into a three-stage pipeline of retrieval, verification, and rewriting: unlearning targets relevant to the initial response are retrieved and supplied as in-context references to the corrector, enabling scalable, continual unlearning under diverse and evolving requests without additional training. Experiments demonstrate that CURE substantially reduces both direct and indirect leakage compared with state-of-the-art baselines, while preserving generation quality and general-purpose model capabilities.

📝 Abstract
Language models trained on web-scale corpora risk memorizing and exposing sensitive information, prompting the need for effective machine unlearning. Prior methods mainly focus on input queries to suppress sensitive outputs, yet this often fails to eliminate the underlying knowledge and limits scalability. To address this, we propose Corrective Unlearning with Retrieved Exclusions (CURE), a novel unlearning framework that verifies model outputs for leakage and revises them into safe responses. Specifically, CURE employs a lightweight corrector that is applied to the original model to verify whether outputs contain target knowledge and to rewrite them if any leakage is detected. To efficiently handle large-scale unlearning requests, CURE retrieves unlearning targets that are relevant to the initial response and provides them as in-context references to the corrector for detection and conditional revision. By leveraging this retrieval augmentation, the corrector can adapt to new unlearning requests without additional training. Extensive evaluations demonstrate that CURE substantially reduces information leakage, even from indirect queries where prior works fall short, while maintaining response quality and general utility. Moreover, it demonstrates robustness under continual unlearning scenarios, making it practical for real-world applications.
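The retrieve→verify→rewrite loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the token-overlap retriever, the substring-based leakage check, and the redaction-style rewrite all stand in for CURE's actual retriever and lightweight corrector model, and every name here (`cure_respond`, `retrieve_exclusions`, etc.) is a hypothetical placeholder.

```python
import re

def retrieve_exclusions(draft, exclusion_index, top_k=3):
    """Toy retriever: rank unlearning targets by token overlap with the
    draft response (retrieval keys off the response, not the query, so
    indirect leakage can also be caught)."""
    draft_tokens = set(draft.lower().split())
    scored = [(len(draft_tokens & set(t.lower().split())), t)
              for t in exclusion_index]
    scored.sort(key=lambda pair: -pair[0])
    return [t for score, t in scored[:top_k] if score > 0]

def verify_leakage(draft, exclusions):
    """Toy verifier: flag the draft if it repeats any excluded phrase.
    In CURE this role is played by the lightweight corrector model."""
    return any(e.lower() in draft.lower() for e in exclusions)

def rewrite_safe(draft, exclusions):
    """Toy rewrite: redact excluded phrases, standing in for the
    corrector's conditional revision into a safe response."""
    safe = draft
    for e in exclusions:
        safe = re.sub(re.escape(e), "[redacted]", safe, flags=re.IGNORECASE)
    return safe

def cure_respond(generate, query, exclusion_index):
    draft = generate(query)                                  # 1. initial response
    exclusions = retrieve_exclusions(draft, exclusion_index) # 2. retrieval
    if verify_leakage(draft, exclusions):                    # 3. verification
        return rewrite_safe(draft, exclusions)               # 4. conditional rewrite
    return draft                                             # safe drafts pass through

# Usage: a stub generator that has memorized a secret.
index = ["alice's password is hunter2"]
model = lambda q: ("Sure: alice's password is hunter2"
                   if "password" in q else "Hello!")
print(cure_respond(model, "what is alice's password?", index))  # Sure: [redacted]
print(cure_respond(model, "say hi", index))                     # Hello!
```

Because only the exclusion index changes when new unlearning requests arrive, the corrector itself needs no retraining, which is the property the abstract highlights for continual unlearning.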
Problem

Research questions and friction points this paper is trying to address.

Preventing language models from leaking sensitive memorized information
Enhancing scalability and robustness in machine unlearning systems
Correcting model outputs containing target knowledge without retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses retrieval-augmented corrector for leakage verification
Rewrites unsafe responses using retrieved exclusions as context
Enables scalable unlearning without additional model training