🤖 AI Summary
This work identifies a previously overlooked “core-set effect” in large language model (LLM) unlearning: on mainstream benchmarks (e.g., WMDP, MUSE), unlearning from just 5% of the forget set, sampled at random, achieves forgetting fidelity comparable to full-dataset unlearning. Method: The authors systematically validate this phenomenon and propose a keyword-driven explanation, in which forgetting efficacy stems primarily from a small number of high-impact tokens rather than from dataset size. Using state-of-the-art unlearning algorithms (e.g., NPO, RMU), they conduct randomized and heuristic subset sampling, keyword extraction, mode connectivity analysis, and adversarial robustness evaluation. Results: Core-set unlearning consistently preserves high forgetting fidelity across diverse methods; the resulting models exhibit robustness against jailbreaking attacks comparable to full-set unlearning while maintaining strong mode connectivity with the original model. This challenges prevailing unlearning evaluation paradigms and offers a new pathway toward efficient, interpretable, and token-aware model editing.
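The core-set construction described above is, in its simplest form, just a random subsample of the forget set. A minimal sketch, assuming a forget set represented as a list of documents (the `sample_coreset` helper and its defaults are hypothetical, not from the paper's codebase):

```python
import random

def sample_coreset(forget_set, fraction=0.05, seed=0):
    """Draw a random 'coreset' fraction of the forget set.

    Illustrative only: the paper reports that unlearning on as little as
    5% of the forget set, even when selected at random, matches the
    forgetting fidelity of full-set unlearning on WMDP/MUSE-style
    benchmarks.
    """
    rng = random.Random(seed)
    k = max(1, int(len(forget_set) * fraction))
    return rng.sample(forget_set, k)

# Usage: pass `coreset` (instead of the full forget set) to an unlearning
# algorithm such as NPO or RMU.
forget_set = [f"doc_{i}" for i in range(1000)]
coreset = sample_coreset(forget_set, fraction=0.05)
print(len(coreset))  # 50
```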
📝 Abstract
Large language model unlearning has become a critical challenge in ensuring safety and controlled model behavior by removing undesired data-model influences from the pretrained model while preserving general utility. Significant recent efforts have been dedicated to developing LLM unlearning benchmarks such as WMDP (Weapons of Mass Destruction Proxy) and MUSE (Machine Unlearning Six-way Evaluation), facilitating standardized unlearning performance assessment and method comparison. Despite their usefulness, we uncover for the first time a novel coreset effect within these benchmarks. Specifically, we find that LLM unlearning achieved with the original (full) forget set can be effectively maintained using a significantly smaller subset (functioning as a "coreset"), e.g., as little as 5% of the forget set, even when selected at random. This suggests that LLM unlearning in these benchmarks can be performed surprisingly easily, even in an extremely low-data regime. We demonstrate that this coreset effect remains strong regardless of the LLM unlearning method used, such as NPO (Negative Preference Optimization) and RMU (Representation Misdirection Unlearning), both popular in these benchmarks. The surprisingly strong coreset effect is also robust across various data selection methods, ranging from random selection to more sophisticated heuristic approaches. We explain the coreset effect in LLM unlearning through a keyword-based perspective, showing that keywords extracted from the forget set alone contribute significantly to unlearning effectiveness and indicating that current unlearning is driven by a compact set of high-impact tokens rather than the entire dataset. We further justify the faithfulness of coreset-unlearned models along additional dimensions, such as mode connectivity and robustness to jailbreaking attacks. Code is available at https://github.com/OPTML-Group/MU-Coreset.
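The keyword-based perspective rests on extracting high-impact tokens from the forget set. A minimal sketch of one plausible extraction scheme, ranking tokens by frequency after stopword filtering (the `extract_keywords` function, the stopword list, and the frequency criterion are illustrative assumptions; the paper's actual extraction procedure may differ):

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def extract_keywords(documents, top_k=10):
    """Rank candidate high-impact tokens by frequency across the forget set.

    Illustrative only: the paper argues that a compact set of high-impact
    tokens, rather than the full forget set, drives unlearning efficacy.
    """
    counts = Counter()
    for doc in documents:
        for tok in re.findall(r"[a-z]+", doc.lower()):
            if tok not in STOPWORDS:
                counts[tok] += 1
    return [tok for tok, _ in counts.most_common(top_k)]

# Toy forget-set documents (hypothetical).
docs = [
    "synthesis of the agent",
    "agent synthesis protocol",
    "storage of the agent",
]
print(extract_keywords(docs, top_k=3))  # ['agent', 'synthesis', 'protocol']
```

Unlearning on sequences containing only these extracted keywords, per the paper's analysis, already recovers much of the full forget-set's effect.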