Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills

📅 2025-06-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Sensitive information persists in intermediate Chain-of-Thought (CoT) steps of Large Reasoning Models (LRMs), and conventional unlearning methods—focused solely on final outputs—fail to achieve thorough forgetting. Method: We propose Reasoning-aware Representation Misdirection for Unlearning (R²MU), the first unlearning method designed specifically for reasoning trajectories. R²MU jointly optimizes over CoT paths via representation-space perturbation and gradient constraints, suppressing sensitive reasoning steps while preserving multi-step reasoning capability, and introduces a joint safety–capability objective for end-to-end controllable unlearning. Results: Evaluated on the DeepSeek-R1-Distill series, R²MU significantly reduces sensitive information leakage, improves safety metrics by over 40%, and maintains reasoning accuracy above 98%. It is the first method to achieve CoT-level unlearning without compromising strong reasoning performance.
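The representation-misdirection idea underlying R²MU can be sketched with an RMU-style training objective: hidden activations on forget data are steered toward a fixed random direction, while activations on retain data are anchored to those of a frozen copy of the model. This is a minimal illustrative sketch, not the paper's implementation; the function name, tensor shapes, and constants `c` and `alpha` are assumptions.

```python
import torch
import torch.nn.functional as F

def rmu_loss(forget_acts: torch.Tensor,
             retain_acts: torch.Tensor,
             frozen_retain_acts: torch.Tensor,
             steering_vec: torch.Tensor,
             c: float = 6.5,
             alpha: float = 1200.0) -> torch.Tensor:
    """RMU-style unlearning loss (illustrative sketch).

    forget_acts:        hidden states of the updated model on forget data
    retain_acts:        hidden states of the updated model on retain data
    frozen_retain_acts: hidden states of a frozen reference model on retain data
    steering_vec:       fixed random unit direction in representation space
    """
    # Forget term: push activations on forget data toward a scaled random direction,
    # scrambling the representations that encode the sensitive content.
    forget_loss = F.mse_loss(forget_acts, c * steering_vec.expand_as(forget_acts))
    # Retain term: keep activations on retain data close to the frozen model,
    # preserving general (multi-step reasoning) capability.
    retain_loss = F.mse_loss(retain_acts, frozen_retain_acts)
    return forget_loss + alpha * retain_loss
```

R²MU's contribution, per the summary, is to apply this kind of objective to intermediate CoT trajectories rather than only to final-answer representations; the sketch above shows only the generic representation-misdirection trade-off between the forget and retain terms.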

📝 Abstract
Recent advances in large reasoning models (LRMs) have enabled strong chain-of-thought (CoT) generation through test-time computation. While these multi-step reasoning capabilities represent a major milestone in language model performance, they also introduce new safety risks. In this work, we present the first systematic study to revisit the problem of machine unlearning in the context of LRMs. Machine unlearning refers to the process of removing the influence of sensitive, harmful, or undesired data or knowledge from a trained model without full retraining. We show that conventional unlearning algorithms, originally designed for non-reasoning models, are inadequate for LRMs. In particular, even when final answers are successfully erased, sensitive information often persists within the intermediate reasoning steps, i.e., CoT trajectories. To address this challenge, we extend conventional unlearning and propose Reasoning-aware Representation Misdirection for Unlearning ($R^2MU$), a novel method that effectively suppresses sensitive reasoning traces and prevents the generation of associated final answers, while preserving the model's reasoning ability. Our experiments demonstrate that $R^2MU$ significantly reduces sensitive information leakage within reasoning traces and achieves strong performance across both safety and reasoning benchmarks, evaluated on state-of-the-art models such as DeepSeek-R1-Distill-LLaMA-8B and DeepSeek-R1-Distill-Qwen-14B.
Problem

Research questions and friction points this paper is trying to address.

Unlearning sensitive data in large reasoning models
Preserving reasoning skills while removing harmful traces
Addressing leakage of sensitive information in reasoning steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends unlearning to reasoning model traces
Proposes Reasoning-aware Representation Misdirection
Preserves reasoning skills while removing sensitive data