🤖 AI Summary
This work addresses a common failure of LLM unlearning: selectively forgetting specific knowledge often degrades general capabilities, reflecting a trade-off between forgetting and retention. The paper recasts unlearning as an asymmetric two-task problem in which retention is the primary objective and forgetting is auxiliary. Within a retention-prioritized gradient synthesis framework, it adapts PCGrad to resolve gradient conflicts and introduces SAGO, which aligns the combined update with the retain gradient through constructive sign-constrained synthesis, reshaping gradient geometry rather than re-balancing losses. Experiments show that SAGO advances the Pareto frontier: on WMDP Bio (SimNPO+GD), recovery of the target model's MMLU performance rises from 44.6% (naive) to 96.0% while forgetting strength remains comparable.
📝 Abstract
Machine unlearning for large language models (LLMs) aims to remove targeted knowledge while preserving general capability. In this paper, we recast LLM unlearning as an asymmetric two-task problem: retention is the primary objective and forgetting is an auxiliary one. From this perspective, we propose a retention-prioritized gradient synthesis framework that decouples task-specific gradient extraction from conflict-aware combination. Instantiating the framework, we adapt the established PCGrad method to resolve gradient conflicts, and introduce SAGO, a novel retention-prioritized gradient synthesis method. Theoretically, both variants ensure non-negative cosine similarity with the retain gradient, while SAGO achieves strictly tighter alignment through constructive sign-constrained synthesis. Empirically, on the WMDP Bio/Cyber and RWKU benchmarks, SAGO consistently pushes the Pareto frontier: e.g., on WMDP Bio (SimNPO+GD), recovery of the target model's MMLU performance progresses from 44.6% (naive) to 94.0% (+PCGrad) and further to 96.0% (+SAGO), while maintaining comparable forgetting strength. Our results show that re-shaping gradient geometry, rather than re-balancing losses, is the key to mitigating unlearning-retention trade-offs.
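To make the asymmetric PCGrad-style idea concrete, here is a minimal sketch of one way a retention-prioritized projection could look. It is an illustration under stated assumptions, not the paper's implementation: gradients are plain Python lists, the function names (`project_forget_onto_retain`, `synthesize`) are hypothetical, and the combination rule (retain gradient plus projected forget gradient) is an assumed choice. SAGO's sign-constrained synthesis is not shown, since the abstract does not specify it.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_forget_onto_retain(g_retain, g_forget):
    """Asymmetric PCGrad-style step (assumed form): only the forget
    gradient is altered; the retain gradient is never modified."""
    rr = dot(g_retain, g_retain)
    fr = dot(g_forget, g_retain)
    # No conflict (or a zero retain gradient): leave the forget gradient as is.
    if rr == 0.0 or fr >= 0.0:
        return list(g_forget)
    # Conflict: remove the component of g_forget that opposes g_retain.
    scale = fr / rr
    return [f - scale * r for f, r in zip(g_forget, g_retain)]


def synthesize(g_retain, g_forget):
    """Combine the retain gradient with the projected forget gradient
    (a plain sum is an assumption made for this sketch)."""
    g_f = project_forget_onto_retain(g_retain, g_forget)
    return [r + f for r, f in zip(g_retain, g_f)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 if either vector is zero."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


# Toy example: the forget gradient conflicts with the retain gradient.
g_retain = [1.0, 0.0]
g_forget = [-2.0, 1.0]

naive = [r + f for r, f in zip(g_retain, g_forget)]
combined = synthesize(g_retain, g_forget)

print(cosine(naive, g_retain))     # negative: the naive sum opposes retention
print(cosine(combined, g_retain))  # non-negative after the projection
```

Because the projected forget gradient has a non-negative inner product with the retain gradient, adding it to the retain gradient cannot produce an update that points against retention, which is the non-negative cosine similarity property the abstract states for both variants.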