BackWeak: Backdooring Knowledge Distillation Simply with Weak Triggers and Fine-tuning

📅 2025-11-15
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
In knowledge distillation (KD), reliance on third-party teacher models introduces backdoor vulnerabilities. Existing attacks require surrogate models and simulated distillation, and they yield strong but conspicuous triggers with poor stealth. This paper proposes BackWeak, a lightweight, surrogate-free backdoor attack paradigm that bypasses simulated distillation entirely. Its core idea is to fine-tune a benign teacher model with an extremely low learning rate to embed "weak triggers": minimal, near-imperceptible input perturbations whose adversarial effect is negligible. BackWeak formally defines the weak trigger concept for the first time and achieves high transferability while significantly improving stealth and computational efficiency. Extensive experiments across multiple datasets and model architectures demonstrate high attack success rates, and the resulting backdoor is substantially harder to detect than those of state-of-the-art methods.

๐Ÿ“ Abstract
Knowledge Distillation (KD) is essential for compressing large models, yet relying on pre-trained "teacher" models downloaded from third-party repositories introduces serious security risks, most notably backdoor attacks. Existing KD backdoor methods are typically complex and computationally intensive: they employ surrogate student models and simulated distillation to guarantee transferability, and they construct triggers in a way similar to universal adversarial perturbations (UAPs), which are not stealthy in magnitude and inherently exhibit strong adversarial behavior. This work questions whether such complexity is necessary and constructs stealthy "weak" triggers: imperceptible perturbations that have negligible adversarial effect. We propose BackWeak, a simple, surrogate-free attack paradigm. BackWeak shows that a powerful backdoor can be implanted by simply fine-tuning a benign teacher with a weak trigger using a very small learning rate. We demonstrate that this delicate fine-tuning is sufficient to embed a backdoor that reliably transfers to diverse student architectures during a victim's standard distillation process, yielding high attack success rates. Extensive empirical evaluations on multiple datasets, model architectures, and KD methods show that BackWeak is efficient, simpler, and often stealthier than previous, more elaborate approaches. This work calls on researchers studying KD backdoor attacks to pay particular attention to the trigger's stealthiness and its potential adversarial characteristics.
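
To make the notion of a "weak" trigger concrete, the following is a minimal sketch, not the paper's construction. It assumes the trigger is an additive pattern clipped to a small L-infinity budget, and that backdoor samples are relabeled to an attacker-chosen class. The names `apply_weak_trigger` and `poison_dataset` and the values `epsilon=4/255` and `poison_rate=0.1` are hypothetical choices for illustration; the abstract does not specify any of them.

```python
import random


def apply_weak_trigger(pixels, pattern, epsilon=4 / 255):
    """Add a trigger pattern whose per-pixel magnitude is capped at epsilon.

    pixels and pattern are flat lists of floats; pixel values live in [0, 1].
    The epsilon budget is an assumed stand-in for "imperceptible".
    """
    triggered = []
    for pixel, delta in zip(pixels, pattern):
        delta = max(-epsilon, min(epsilon, delta))        # cap the perturbation
        triggered.append(max(0.0, min(1.0, pixel + delta)))  # stay a valid pixel
    return triggered


def poison_dataset(samples, pattern, target_label,
                   poison_rate=0.1, epsilon=4 / 255, seed=0):
    """Return a copy of samples where a random fraction carries the trigger.

    samples is a list of (pixels, label) pairs. Triggered samples are
    relabeled to target_label (an assumption about the attack objective).
    """
    rng = random.Random(seed)
    poisoned = []
    for pixels, label in samples:
        if rng.random() < poison_rate:
            poisoned.append(
                (apply_weak_trigger(pixels, pattern, epsilon), target_label))
        else:
            poisoned.append((list(pixels), label))
    return poisoned
```

Because the perturbation is capped per pixel, a triggered input stays within `epsilon` of the clean input, which is the property that distinguishes it from a large-magnitude UAP-style trigger in this sketch.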
Problem

Research questions and friction points this paper is trying to address.

Developing stealthy weak triggers for backdoor attacks in knowledge distillation
Eliminating surrogate models and complex adversarial perturbations in KD attacks
Ensuring backdoor transferability across diverse student architectures during standard distillation
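
For readers unfamiliar with what "standard distillation" optimizes, here is a minimal sketch of the classic temperature-scaled KD objective (soft-target KL term plus hard-label cross-entropy). It is one common formulation chosen as an assumption; the source does not say which KD methods were evaluated, and `temperature=4.0` and `alpha=0.9` are illustrative values.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label,
            temperature=4.0, alpha=0.9):
    """Weighted sum of KL(teacher || student) on softened outputs and
    cross-entropy on the ground-truth label."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    ce = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft-target gradients on a comparable scale.
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce
```

The soft-target term trains the student to reproduce the teacher's output distribution on every input it sees, which is why behavior embedded in the teacher can carry over to a student with a different architecture.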
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses weak triggers with negligible adversarial effects
Fine-tunes benign teacher models with a very small learning rate
Eliminates surrogate models and complex trigger construction
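
The role of the small learning rate can be illustrated with a toy sketch. It assumes a logistic-regression model as a stand-in for the teacher network and plain SGD as the optimizer; `fine_tune`, `weight_drift`, and the learning-rate values are hypothetical and are not taken from the paper. The only point shown is that a smaller step size keeps the fine-tuned parameters closer to the benign starting point.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune(weights, bias, data, lr=1e-4, epochs=5):
    """SGD fine-tuning of a logistic-regression model on (features, label) pairs.

    Labels are 0 or 1. Returns new parameters; the inputs are not modified.
    """
    w = list(weights)
    b = bias
    for _ in range(epochs):
        for features, label in data:
            pred = sigmoid(sum(wi * xi for wi, xi in zip(w, features)) + b)
            grad = pred - label  # gradient of the log loss w.r.t. the pre-activation
            w = [wi - lr * grad * xi for wi, xi in zip(w, features)]
            b -= lr * grad
    return w, b


def weight_drift(old_weights, new_weights):
    """Largest absolute change in any single weight."""
    return max(abs(o - n) for o, n in zip(old_weights, new_weights))
```

Calling `fine_tune` on the same data with two different `lr` values and comparing `weight_drift` against the original weights shows how the step size bounds the departure from the benign model in this toy setting.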