🤖 AI Summary
Foundation models exhibit strong generalization but remain vulnerable to backdoor attacks, and existing mitigations rely on fine-tuning, which often degrades multi-task performance. This paper proposes a fine-tuning-free linear task decomposition framework that, for the first time, reveals that backdoor behavior is intrinsically disentangled from benign tasks in weight space. Leveraging this insight, we design a decouple–localize–unlearn mechanism: backdoor subspaces are identified via reverse-engineered (inverted) triggers, enabling precise removal of the malicious influence while preserving clean task representations. Evaluated on CLIP-based models, our method achieves near-perfect backdoor removal under both known and unknown attacks while retaining, on average, 96% of the original accuracy across tasks. It breaks the usual forgetting–performance trade-off, offering a scalable, low-overhead detoxification paradigm for large multimodal models.
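For intuition on the "localize" step, below is a minimal, generic sketch of trigger reverse-engineering in the style of Neural Cleanse, adapted to a CLIP image encoder. It is not the paper's exact procedure; `model` (a CLIP model exposing `encode_image`), `images` (a batch of preprocessed clean images), and `target_embed` (the normalized text embedding of the suspected target class) are assumed inputs.

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, images, target_embed, steps=200, lr=0.1, lam=1e-3):
    """Generic trigger-inversion sketch (assumed, not the paper's algorithm):
    optimize a small patch (pattern + mask) so that patched images are pulled
    toward the attacker's target text embedding, while keeping the mask small."""
    for p in model.parameters():          # freeze the encoder; only the trigger is optimized
        p.requires_grad_(False)
    pattern = torch.rand_like(images[:1], requires_grad=True)    # candidate trigger pixels
    mask = torch.zeros_like(images[:1, :1], requires_grad=True)  # where the trigger is applied
    opt = torch.optim.Adam([pattern, mask], lr=lr)
    for _ in range(steps):
        m = torch.sigmoid(mask)
        patched = (1 - m) * images + m * pattern
        emb = F.normalize(model.encode_image(patched), dim=-1)
        # push embeddings toward the target class; L1 on the mask keeps the trigger small
        loss = -(emb @ target_embed).mean() + lam * m.sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.sigmoid(mask).detach(), pattern.detach()
```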
📝 Abstract
Foundation models have revolutionized computer vision by enabling broad generalization across diverse tasks. Yet, they remain highly susceptible to adversarial perturbations and targeted backdoor attacks. Mitigating such vulnerabilities remains an open challenge, especially since the scale of these models makes retraining for safety prohibitive. Existing backdoor removal approaches rely on costly fine-tuning to override the harmful behavior and often degrade performance on unrelated tasks. This raises the question of whether backdoors can be removed without compromising the general capabilities of the model. In this work, we address this question by studying how backdoors are encoded in the model weight space, and find that they are disentangled from benign tasks. This separation enables the isolation and erasure of the backdoor's influence with minimal impact on clean performance. Building on this insight, we introduce a simple unlearning method that leverages such disentanglement. Through extensive experiments with CLIP-based models and common adversarial triggers, we show that, given knowledge of the attack, our method achieves approximately perfect unlearning while retaining, on average, 96% of clean accuracy. Moreover, even when the attack and its presence are unknown, our method successfully unlearns backdoors by estimating them with reverse-engineered triggers. Overall, our method consistently yields a better unlearning–clean accuracy tradeoff than current state-of-the-art defenses.
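For illustration only, the sketch below shows a task-arithmetic-style "isolate and erase" step under our own assumption (the abstract does not spell out the estimator) that the backdoor's weight-space component has been isolated as a direction `bd_vector`, e.g. the weight delta obtained by briefly fitting the trigger-to-target mapping; removal then amounts to subtracting that direction from the poisoned weights. The helper `finetune_on_triggered_data` in the usage note is hypothetical.

```python
import torch

def estimate_backdoor_vector(poisoned_state, trigger_tuned_state):
    """Hypothetical isolation step: treat the weight change induced by fitting
    the trigger->target behavior as the backdoor's weight-space direction."""
    return {k: trigger_tuned_state[k] - poisoned_state[k] for k in poisoned_state}

def erase_backdoor(poisoned_state, bd_vector, alpha=1.0):
    """Unlearning step: negate the isolated backdoor direction in weight space,
    leaving the remaining (benign-task) components of the weights untouched."""
    return {k: w - alpha * bd_vector[k] for k, w in poisoned_state.items()}

# Hypothetical usage with a CLIP checkpoint:
# poisoned = model.state_dict()
# tuned = finetune_on_triggered_data(model).state_dict()   # hypothetical helper
# repaired = erase_backdoor(poisoned, estimate_backdoor_vector(poisoned, tuned))
# model.load_state_dict(repaired)
```

In this sketch, `alpha` controls how aggressively the estimated backdoor direction is removed, trading residual attack success against any overlap with benign-task components.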