Why LoRA Fails to Forget: Regularized Low-Rank Adaptation Against Backdoors in Language Models

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the failure of Low-Rank Adaptation (LoRA) in backdoor removal tasks, which often stems from insufficient spectral strength and poor spectral alignment in its update matrices. From a spectral analysis perspective, the study introduces Regularized Low-Rank Adaptation (RoRA), a novel method that enhances spectral strength and refines alignment direction through regularized fine-tuning, trigger-insensitive constraints, and spectral rescaling. Theoretical analysis establishes precise thresholds for spectral strength and alignment required for effective backdoor forgetting. Extensive experiments demonstrate that RoRA significantly reduces attack success rates across diverse NLP benchmarks and attack configurations while preserving model performance on clean data.

📝 Abstract
Low-Rank Adaptation (LoRA) is widely used for parameter-efficient fine-tuning of large language models, but it is notably ineffective at removing backdoor behaviors from poisoned pretrained models when fine-tuning on a clean dataset. Contrary to the common belief that this weakness is caused primarily by low rank, we show that LoRA's vulnerability is fundamentally spectral. Our analysis identifies two key factors: LoRA updates (i) possess insufficient spectral strength, with singular values far below those of the pretrained weights, and (ii) exhibit unfavorable spectral alignment, weakly matching clean-task directions while retaining overlap with trigger-sensitive subspaces. We further establish a critical scaling threshold beyond which LoRA can theoretically suppress trigger-induced activations, and we show empirically that standard LoRA rarely reaches this regime. We introduce Regularized Low-Rank Adaptation (RoRA), which improves forgetting by increasing spectral strength and correcting alignment through clean-strengthened regularization, trigger-insensitive constraints, and post-training spectral rescaling. Experiments across multiple NLP benchmarks and attack settings show that RoRA substantially reduces attack success rates while maintaining clean accuracy.
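The abstract's two diagnostics can be made concrete with a small numerical sketch. The snippet below (an illustration under assumed toy dimensions, not the paper's implementation) compares the singular values of a LoRA update ΔW = BA against a pretrained weight, then applies a post-hoc spectral rescaling of the kind the abstract describes; the 0.5 target fraction is a hypothetical threshold chosen for demonstration.

```python
import numpy as np

# Toy setup (assumed sizes, not from the paper): pretrained weight W and
# a LoRA update dW = B @ A with small initialization, as is typical.
rng = np.random.default_rng(0)
d, r = 64, 4                           # hidden size, LoRA rank
W = rng.normal(size=(d, d))            # stand-in for a pretrained weight
B = rng.normal(scale=0.01, size=(d, r))
A = rng.normal(scale=0.01, size=(r, d))
dW = B @ A                             # low-rank LoRA update

# Diagnostic: spectral strength of the update vs. the pretrained weight.
s_W = np.linalg.svd(W, compute_uv=False)
s_dW = np.linalg.svd(dW, compute_uv=False)
print("top singular value, pretrained W:", s_W[0])
print("top singular value, LoRA update :", s_dW[0])   # far smaller here

# Post-hoc spectral rescaling: lift the update's singular values so its
# top value reaches a target fraction (hypothetical) of W's spectrum.
U, s, Vt = np.linalg.svd(dW, full_matrices=False)
target = 0.5 * s_W[0]
dW_rescaled = U @ np.diag(s * (target / s[0])) @ Vt
```

With the small initialization above, `s_dW[0]` is orders of magnitude below `s_W[0]`, matching the abstract's observation that standard LoRA updates rarely reach the spectral regime needed to suppress trigger-induced activations.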
Problem

Research questions and friction points this paper is trying to address.

LoRA · backdoor · language models · fine-tuning · forgetting

Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-Rank Adaptation · Backdoor Forgetting · Spectral Analysis · Regularized Fine-Tuning · Language Model Security
Hoang-Chau Luong
Golisano College of Computing and Information Sciences, Rochester Institute of Technology, Rochester, NY, United States
Lingwei Chen
Rochester Institute of Technology
Trustworthy Machine Learning · Security · Machine Learning