LLM Unlearning on Noisy Forget Sets: A Study of Incomplete, Rewritten, and Watermarked Data

📅 2025-10-10

🤖 AI Summary

This work addresses a practical challenge in large language model (LLM) unlearning: forget data are often low-quality, rewritten, or watermarked, a noisy setting largely unexplored in prior studies. Method: The study benchmarks state-of-the-art unlearning methods (RMU and NPO) on noisy forget sets and uses a saliency-based interpretation to examine their robustness, finding that these methods rely primarily on deep semantic features rather than surface-level lexical cues. Contribution/Results: Experiments show that mainstream unlearning methods are unexpectedly robust to diverse noise types, provided the core semantic content remains intact, and that semantically critical components stay consistently influential despite substantial variation in surface form. The study fills a gap in understanding the reliability of LLM unlearning under realistic conditions and offers empirical evidence to support more trustworthy model-update mechanisms.

📝 Abstract

Large language models (LLMs) exhibit remarkable generative capabilities but raise ethical and security concerns by memorizing sensitive data, reinforcing biases, and producing harmful content. These risks have spurred interest in LLM unlearning, the task of removing knowledge associated with undesirable data from pre-trained models. However, most existing methods assume access to clean, well-defined forget data samples, whereas real-world forget data are often low-quality, synthetically rewritten, or watermarked, casting doubt on the reliability of unlearning. This work presents the first study of unlearning under perturbed or low-fidelity forget data, referred to as noisy forget sets. By systematically benchmarking state-of-the-art LLM unlearning methods, RMU and NPO, on such noisy forget sets, we find that unlearning remains surprisingly robust to perturbations, provided that core semantic signals are preserved. To explain this robustness, we propose a saliency-based interpretation: key semantic components that drive forgetting remain consistently influential despite substantial variation in surface form. This suggests that unlearning algorithms are primarily guided by deep semantic cues rather than shallow lexical patterns.
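To make the idea of an "incomplete" noisy forget set concrete, the sketch below randomly drops whitespace-delimited tokens from each forget sample. This is only an illustration under stated assumptions: the abstract does not specify how incomplete data are constructed, and the function names `drop_tokens` and `make_incomplete_forget_set`, the `drop_rate` parameter, and the sample strings are all hypothetical.

```python
import random


def drop_tokens(text: str, drop_rate: float, seed: int = 0) -> str:
    """Return `text` with each whitespace token dropped with probability `drop_rate`.

    Assumption: token dropping is one plausible way to simulate incomplete
    forget data; it is not taken from the paper.
    """
    if not 0.0 <= drop_rate <= 1.0:
        raise ValueError("drop_rate must be in [0, 1]")
    rng = random.Random(seed)  # local RNG keeps the perturbation reproducible
    tokens = text.split()
    kept = [tok for tok in tokens if rng.random() >= drop_rate]
    return " ".join(kept)


def make_incomplete_forget_set(samples, drop_rate, seed=0):
    """Perturb every sample, using a different seed per sample."""
    return [drop_tokens(s, drop_rate, seed + i) for i, s in enumerate(samples)]


# Hypothetical forget samples, for illustration only.
clean_forget_set = [
    "a hypothetical passage the model should forget",
    "another hypothetical sample with sensitive content",
]
noisy_forget_set = make_incomplete_forget_set(clean_forget_set, drop_rate=0.3)
```

An unlearning method could then be run once on `clean_forget_set` and once on `noisy_forget_set`, and the two resulting models compared on the same forgetting and utility metrics.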

Problem

Research questions and friction points this paper is trying to address.

Studying LLM unlearning robustness on noisy forget sets

Benchmarking unlearning methods with perturbed forget data

Analyzing semantic preservation impact on knowledge removal effectiveness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unlearning with noisy forget sets

Robustness to semantics-preserving perturbations

Saliency-based interpretation of unlearning robustness
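The saliency argument can be illustrated with a toy example: if the tokens that matter most for a forgetting signal are the same in a clean sample and in its rewritten version, the two samples should steer unlearning similarly. The sketch below uses generic leave-one-out saliency with a hand-written scorer. It is not the paper's saliency method, which this summary does not describe; `toy_score`, `KEYWORD_WEIGHTS`, and the example sentences are hypothetical stand-ins for a real model-based score.

```python
from typing import Callable, List, Sequence, Set


def leave_one_out_saliency(
    tokens: Sequence[str], score: Callable[[Sequence[str]], float]
) -> List[float]:
    """Saliency of token i = drop in `score` when token i is removed."""
    base = score(tokens)
    return [
        base - score(list(tokens[:i]) + list(tokens[i + 1:]))
        for i in range(len(tokens))
    ]


def top_k_tokens(tokens: Sequence[str], saliency: Sequence[float], k: int) -> Set[str]:
    """Return the set of the k most salient tokens."""
    ranked = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    return {tokens[i] for i in ranked[:k]}


# Hypothetical scorer: a stand-in for a model-derived quantity such as a forget loss.
KEYWORD_WEIGHTS = {"secret": 3.0, "formula": 2.0}


def toy_score(tokens: Sequence[str]) -> float:
    return sum(KEYWORD_WEIGHTS.get(tok.lower(), 0.1) for tok in tokens)


clean = "the secret formula is stored in the vault".split()
rewritten = "inside the vault they keep the formula that is secret".split()

top_clean = top_k_tokens(clean, leave_one_out_saliency(clean, toy_score), k=2)
top_rewritten = top_k_tokens(rewritten, leave_one_out_saliency(rewritten, toy_score), k=2)

# Fraction of the top-2 salient tokens shared by the clean and rewritten samples.
overlap = len(top_clean & top_rewritten) / 2
```

In this toy setup the rewrite changes word order and phrasing but keeps the two high-weight tokens, so the top-ranked tokens coincide. With a real model, `score` would be replaced by a quantity computed from the model itself.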

🔎 Similar Papers

Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning