Unlearning What Matters: Token-Level Attribution for Precise Language Model Unlearning

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

career value

151K/year

🤖 AI Summary

Existing unlearning methods for large language models update uniformly at the sequence level, making it difficult to precisely identify key tokens associated with knowledge to be removed, which results in high gradient noise and significant utility loss. This work proposes TokenUnlearn, a novel framework that, for the first time, refines unlearning operations to the token level. It computes token importance scores using knowledge-aware and entropy-aware signals and introduces two strategies—hard selection and soft weighting—to enable selective forgetting of critical tokens. By integrating mask-guided training, token-level gradient modulation, and established unlearning mechanisms, TokenUnlearn consistently outperforms sequence-level approaches across three mainstream model architectures on the TOFU and WMDP benchmarks, achieving higher unlearning accuracy while better preserving model utility.

📝 Abstract

Machine unlearning has emerged as a critical capability for addressing privacy, safety, and regulatory concerns in large language models (LLMs). Existing methods operate at the sequence level, applying uniform updates across all tokens despite only a subset encoding the knowledge targeted for removal. This introduces gradient noise, degrades utility, and leads to suboptimal forgetting. We propose TokenUnlearn, a token-level attribution framework that identifies and selectively targets critical tokens. Our approach combines knowledge-aware signals via masking, and entropy-aware signals to yield importance scores for precise token selection. We develop two complementary strategies: hard selection, applying unlearning only to high-importance tokens, and soft weighting, modulating gradient contributions based on importance scores. Both extend existing methods to token-level variants. Theoretical analysis shows token-level selection improves gradient signal-to-noise ratio. Experiments on TOFU and WMDP benchmarks across three model architectures demonstrate consistent improvements over sequence-level baselines in both forgetting effectiveness and utility preservation.

Problem

Research questions and friction points this paper is trying to address.

machine unlearning

token-level attribution

large language models

forgetting effectiveness

gradient noise

Innovation

Methods, ideas, or system contributions that make the work stand out.

token-level unlearning

attribution

gradient signal-to-noise ratio