Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Soft token attacks (STAs) are widely misused as evaluation tools for machine unlearning in large language models (LLMs), despite lacking causal validity. Method: We conduct the first systematic analysis showing that STAs function not as information-extraction mechanisms but as overly aggressive interference attacks: with only 1-10 soft tokens, they reliably elicit irrelevant or hallucinated outputs of more than 400 characters, irrespective of whether the target knowledge was ever trained on or subsequently unlearned. Contribution/Results: We challenge the prevailing unlearning evaluation paradigm by demonstrating the fundamental inadequacy of STAs as an audit signal; we advocate abandoning STAs as a baseline and call for rigorously grounded, causally interpretable benchmarks. Experiments on standard unlearning benchmarks, including *Who Is Harry Potter?* and TOFU, confirm that these findings generalize across diverse unlearning algorithms and data configurations.
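
To make the mechanism concrete, the sketch below shows how a soft token attack can be set up in PyTorch with Hugging Face transformers: a handful of trainable embeddings are prepended to the input and optimized against a frozen model until it emits an arbitrary target string. This is a minimal illustration, not the paper's code; the model name, learning rate, and step count are illustrative assumptions.

```python
# Minimal soft token attack (STA) sketch: optimize a few continuous
# "soft token" embeddings so a frozen LLM emits a chosen target string.
# Model, learning rate, and step count are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper attacks unlearned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():
    p.requires_grad_(False)  # the attack trains only the soft tokens

n_soft = 10  # the paper finds 1-10 soft tokens suffice
emb = model.get_input_embeddings()
soft = torch.nn.Parameter(torch.randn(1, n_soft, emb.embedding_dim) * 0.01)
opt = torch.optim.Adam([soft], lr=0.1)

# The target can be anything, including text never seen in training.
target_ids = tok("any 400+ character string works here",
                 return_tensors="pt").input_ids
target_emb = emb(target_ids)  # (1, T, d), frozen

for step in range(500):
    inputs = torch.cat([soft, target_emb], dim=1)  # [soft ; target]
    logits = model(inputs_embeds=inputs).logits
    # Logits at position i predict token i+1, so target token j is
    # predicted from position n_soft + j - 1.
    pred = logits[:, n_soft - 1 : -1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
# After optimization, generating from the soft tokens reproduces the
# target, whether or not the model ever trained on it.
```

Because the loss is defined directly over the target tokens, success says nothing about whether the model ever stored that content; the soft tokens simply steer the frozen network to produce it.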

📝 Abstract
Large language models (LLMs) have become increasingly popular. Their emergent capabilities can be attributed to their massive training datasets. However, these datasets often contain undesirable or inappropriate content, e.g., harmful texts, personal information, and copyrighted material. This has prompted research into machine unlearning, which aims to remove information from trained models. In particular, approximate unlearning seeks to achieve information removal by strategically editing the model rather than retraining it from scratch. Recent work has shown that soft token attacks (STAs) can successfully extract purportedly unlearned information from LLMs, thereby exposing limitations in current unlearning methodologies. In this work, we reveal that STAs are an inadequate tool for auditing unlearning. Through systematic evaluation on common unlearning benchmarks (Who Is Harry Potter? and TOFU), we demonstrate that such attacks can elicit any information from the LLM, regardless of (1) the deployed unlearning algorithm and (2) whether the queried content was ever present in the training corpus. Furthermore, we show that an STA with just a few soft tokens (1-10) can elicit random strings over 400 characters long. STAs are thus too powerful and misrepresent the effectiveness of unlearning methods. Our work highlights the need for better evaluation baselines and more appropriate auditing tools for assessing the effectiveness of unlearning in LLMs.
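
The control implied by this argument can be stated as a short protocol: an attack is a valid audit signal only if it separates a model that once held the target knowledge from one that never did. The sketch below is our illustration of that protocol; the function and parameter names are hypothetical, not from the paper.

```python
# Illustration of the causal-validity check the paper argues STAs fail.
# All names here are hypothetical; `attack` is any callable that returns
# True when it elicits `target` from `model` (e.g., the STA loop above).
import random
import string

def random_target(n_chars: int = 400) -> str:
    """Random string like those the paper shows STAs can elicit."""
    return "".join(random.choices(string.ascii_letters + " ", k=n_chars))

def success_rate(model, targets, attack) -> float:
    """Fraction of targets the attack elicits from the model."""
    return sum(attack(model, t) for t in targets) / len(targets)

def audit_gap(unlearned_model, control_model, targets, attack) -> float:
    """A valid audit must separate a model that once held the target
    knowledge (unlearned_model) from one that never did (control_model).
    The paper's finding: for STAs both rates approach 1.0, so the gap
    carries no signal about whether unlearning actually worked."""
    return (success_rate(unlearned_model, targets, attack)
            - success_rate(control_model, targets, attack))
```
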
Problem

Research questions and friction points this paper is trying to address.

Can soft token attacks extract genuinely unlearned information?
How can unlearning effectiveness be reliably audited?
What evaluation tools are appropriate for unlearning in large language models?
Innovation

Methods, ideas, or system contributions that make the work stand out.

First systematic evidence that soft token attacks are inadequate audits of unlearning
Demonstration that STAs elicit arbitrary content regardless of training or unlearning status
Case for better-grounded evaluation baselines and auditing tools for unlearning