Selective Forgetting for Large Reasoning Models

📅 2026-04-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large reasoning models can leak sensitive information from their training data during intermediate reasoning steps, and existing unlearning methods often degrade overall reasoning ability. To address this, the paper introduces a selective unlearning framework that targets the intermediate steps of large reasoning models. It uses multiple LLMs with retrieval-augmented generation (RAG) to identify sensitive segments within reasoning chains, replaces them with benign placeholders, and trains with a new feature replacement unlearning loss. This loss suppresses the generation of sensitive content while preserving the logical structure of the reasoning process. Experiments on synthetic and medical datasets are reported to show that the method unlearns targeted knowledge without compromising the model’s general reasoning performance.
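
The segment-identification step can be illustrated with a minimal sketch. It is not the paper's pipeline: a plain keyword match stands in for the LLM-plus-RAG analysis, and the names `mask_sensitive_steps`, `forget_terms`, and `PLACEHOLDER` are hypothetical.

```python
import re

# Hypothetical placeholder text; the paper's actual placeholders are not specified here.
PLACEHOLDER = "[REDACTED STEP]"

def mask_sensitive_steps(cot, forget_terms, placeholder=PLACEHOLDER):
    """Split a chain of thought into sentence-like steps and replace flagged ones.

    Assumption: a case-insensitive keyword match stands in for the
    LLM + RAG analysis that decides which steps are forget-relevant.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    steps = [s.strip() for s in re.split(r"(?<=[.!?])\s+", cot) if s.strip()]
    lowered_terms = [t.lower() for t in forget_terms]
    masked = []
    for step in steps:
        is_sensitive = any(t in step.lower() for t in lowered_terms)
        # Keep one entry per step so the reasoning structure is preserved.
        masked.append(placeholder if is_sensitive else step)
    return masked

example_cot = "The patient is John Doe. He has a fever. Therefore prescribe rest."
masked_steps = mask_sensitive_steps(example_cot, ["John Doe"])
```

Keeping one output entry per input step means the masked trace has the same number of steps as the original, which is one simple way to retain the chain's structure.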

📝 Abstract

Large Reasoning Models (LRMs) generate structured chains of thought (CoTs) before producing final answers, making them especially vulnerable to knowledge leakage through intermediate reasoning steps. The memorization of sensitive information in the training data, such as copyrighted and private content, has raised ethical and legal concerns. To address these issues, selective forgetting (also known as machine unlearning) has emerged as a potential remedy for LRMs. However, existing unlearning methods primarily target final answers and may degrade the overall reasoning ability of LRMs, and applying unlearning directly to entire CoTs could likewise degrade general reasoning capabilities. The key challenge for LRM unlearning is to precisely unlearn targeted knowledge while preserving general reasoning capabilities. To bridge this gap, we propose a novel LRM unlearning framework that selectively removes sensitive reasoning components while leaving the rest of the reasoning intact. Our approach leverages multiple LLMs with retrieval-augmented generation (RAG) to analyze CoT traces, identify forget-relevant segments, and replace them with benign placeholders that maintain logical structure. We also introduce a new feature replacement unlearning loss for LRMs, which simultaneously suppresses the probability of generating forgotten content and reinforces structurally valid replacements. Extensive experiments on both synthetic and medical datasets verify the desired properties of the proposed method.
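
The abstract does not give the form of the feature replacement unlearning loss. As a rough illustration of the idea it describes (suppress forgotten tokens, reinforce replacement tokens), the sketch below combines an unlikelihood term with a negative log-likelihood term. The function name, the weighting parameter `alpha`, and the choice of these two terms are assumptions, not the authors' formulation.

```python
import math

def replacement_unlearning_loss(forget_logprobs, replace_logprobs, alpha=1.0):
    """Illustrative two-term loss (an assumption, not the paper's definition).

    forget_logprobs:  log p(token) for the tokens of the segment to forget.
    replace_logprobs: log p(token) for the tokens of the benign replacement.
    alpha:            hypothetical weight on the suppression term.
    """
    eps = 1e-12  # avoids log(0) when a forgotten token has probability near 1
    # Unlikelihood term: small when forgotten tokens are improbable.
    suppress = -sum(
        math.log(max(1.0 - math.exp(lp), eps)) for lp in forget_logprobs
    ) / len(forget_logprobs)
    # Negative log-likelihood term: small when replacement tokens are probable.
    reinforce = -sum(replace_logprobs) / len(replace_logprobs)
    return alpha * suppress + reinforce

# Before unlearning: sensitive tokens likely, placeholder tokens unlikely.
loss_before = replacement_unlearning_loss([math.log(0.9)], [math.log(0.1)])
# After unlearning: the situation is reversed, so the loss is lower.
loss_after = replacement_unlearning_loss([math.log(0.1)], [math.log(0.9)])
```

In a real training setup the log-probabilities would come from the model's output distribution over the masked and original CoT tokens; here they are passed in directly to keep the example self-contained.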

Problem

Research questions and friction points this paper is trying to address.

Selective Forgetting

Large Reasoning Models

Chain of Thought

Knowledge Leakage

Machine Unlearning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective Forgetting

Large Reasoning Models

Chain of Thought