Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation

📅 2025-10-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a critical security blind spot in knowledge distillation: the failure of backdoor transfer from teacher to student models. Existing large language model (LLM) backdoors—relying on infrequent or semantically incongruous trigger tokens—are often eliminated during distillation-induced model compression. To address this, we propose T-MTB, the first composite multi-token backdoor explicitly designed for distillation robustness. T-MTB constructs semantically coherent and highly隐蔽 triggers using naturally co-occurring high-frequency phrases. Through systematic adversarial analysis and cross-model-family evaluation across four major LLM families, we demonstrate that T-MTB significantly improves backdoor survival and transfer success rates in both jailbreaking and content-control scenarios. Our study is the first to empirically validate and mitigate the backdoor security risk inherent in knowledge distillation, establishing a foundation for secure model compression.

Technology Category

Application Category

📝 Abstract
LLMs are often used by downstream users as teacher models for knowledge distillation, compressing their capabilities into memory-efficient models. However, as these teacher models may stem from untrusted parties, distillation can raise unexpected security risks. In this paper, we investigate the security implications of knowledge distillation from backdoored teacher models. First, we show that prior backdoors mostly do not transfer onto student models. Our key insight is that this is because existing LLM backdooring methods choose trigger tokens that rarely occur in usual contexts. We argue that this underestimates the security risks of knowledge distillation and introduce a new backdooring technique, T-MTB, that enables the construction and study of transferable backdoors. T-MTB carefully constructs a composite backdoor trigger, made up of several specific tokens that often occur individually in anticipated distillation datasets. As such, the poisoned teacher remains stealthy, while during distillation the individual presence of these tokens provides enough signal for the backdoor to transfer onto the student. Using T-MTB, we demonstrate and extensively study the security risks of transferable backdoors across two attack scenarios, jailbreaking and content modulation, and across four model families of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Investigating security risks of knowledge distillation from backdoored teacher models
Developing transferable backdoors that survive model compression processes
Demonstrating backdoor transfer across jailbreaking and content modulation scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs composite triggers from common tokens
Enables backdoor transfer during knowledge distillation
Maintains stealth while poisoning teacher models
🔎 Similar Papers
No similar papers found.