AI Summary
To address privacy preservation and harmful-knowledge erasure in large language models (LLMs) under data-constrained settings, this paper proposes DRAGON, a novel, data-free, inference-time contextual unlearning framework that requires no access to the original training data. Methodologically, DRAGON combines a lightweight detection module for forget-worthy prompts with a chain-of-thought (CoT)-guided safety intervention mechanism, enabling dynamic identification and suppression of target content during forward inference. Its key contributions are: (i) an inference-only unlearning framework that eliminates reliance on training or retain data; (ii) novel unlearning evaluation metrics and a continual unlearning setting; and (iii) strong unlearning accuracy and safety without compromising general language capabilities. Extensive experiments across three representative unlearning tasks demonstrate DRAGON's clear superiority over baselines, along with strong scalability and practical deployability.
Abstract
Unlearning in Large Language Models (LLMs) is crucial for protecting private data and removing harmful knowledge. Most existing approaches rely on fine-tuning to balance unlearning efficacy with general language capability. However, these methods typically require training, or access to retain data, which is often unavailable in real-world scenarios. Although such methods perform well when both forget and retain data are available, few works have demonstrated comparable capability in more practical, data-limited settings. To overcome these limitations, we propose Detect-Reasoning Augmented GeneratiON (DRAGON), a systematic, reasoning-based framework that uses in-context chain-of-thought (CoT) instructions to guard deployed LLMs at inference time. Instead of modifying the base model, DRAGON leverages the inherent instruction-following ability of LLMs and introduces a lightweight detection module that identifies forget-worthy prompts without any retain data. Flagged prompts are then routed through a dedicated CoT guard model that enforces safe and accurate in-context intervention. To evaluate unlearning performance robustly, we introduce novel metrics along with a continual unlearning setting. Extensive experiments across three representative unlearning tasks validate the effectiveness of DRAGON, demonstrating its strong unlearning capability, scalability, and applicability in practical scenarios.
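The abstract describes a two-stage inference-time pipeline: a lightweight detector flags forget-worthy prompts, and flagged prompts are routed through a CoT guard before generation. The sketch below illustrates that routing logic only; the function names, the substring-matching detector, and the guard instruction text are all illustrative assumptions, not the authors' actual implementation (the paper's detection module and guard model are learned components).

```python
# Hypothetical sketch of DRAGON-style inference-time routing, based only on
# the abstract. All identifiers here are illustrative placeholders.

def detect_forget_worthy(prompt: str, forget_topics: list[str]) -> bool:
    """Stand-in for the lightweight detection module: flag any prompt that
    touches a forget-set topic. (The real module is learned, not keyword-based.)"""
    lowered = prompt.lower()
    return any(topic.lower() in lowered for topic in forget_topics)

def build_guard_context(prompt: str) -> str:
    """Stand-in for the CoT guard: wrap the prompt with in-context
    instructions that steer the base model away from the target knowledge."""
    return (
        "Before answering, reason step by step about whether this request "
        "involves knowledge that must be withheld. If it does, answer "
        "without revealing that knowledge.\n\nUser prompt: " + prompt
    )

def dragon_inference(prompt: str, base_model, forget_topics: list[str]) -> str:
    """Route the prompt: guarded path for flagged prompts, direct path
    otherwise, so general language capability is untouched. `base_model`
    is any callable mapping a prompt string to a completion string."""
    if detect_forget_worthy(prompt, forget_topics):
        return base_model(build_guard_context(prompt))
    return base_model(prompt)  # base model is never modified
```

The key property this mirrors from the abstract is that the base model itself is never fine-tuned: only the context of flagged prompts is altered, which is what makes the approach data-free and compatible with continual unlearning (new forget topics just extend the detector's scope).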