How to Backdoor the Knowledge Distillation

📅 2025-04-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional knowledge distillation is widely assumed to be secure provided the teacher model is clean and the distillation dataset is untainted. This work challenges that assumption by proposing the first backdoor attack on the distillation process that requires no modification to the teacher model. Method: The attack stealthily compromises the student model by injecting trigger-embedded adversarial examples into the distillation dataset, while preserving the teacher's performance and integrity. It integrates adversarial example generation, trigger embedding, and reverse engineering of the distillation pipeline. Contribution/Results: Evaluated across multiple datasets and model architectures, the attack achieves high success rates, strong stealthiness, and cross-architecture transferability. This reveals a fundamental flaw in the "clean teacher implies security" assumption and exposes a critical, long-overlooked vulnerability in knowledge distillation protocols, with urgent implications for trustworthy model compression.

📝 Abstract
Knowledge distillation has become a cornerstone in modern machine learning systems, celebrated for its ability to transfer knowledge from a large, complex teacher model to a more efficient student model. Traditionally, this process is regarded as secure, assuming the teacher model is clean. This belief stems from conventional backdoor attacks relying on poisoned training data with backdoor triggers and attacker-chosen labels, which are not involved in the distillation process. Instead, knowledge distillation uses the outputs of a clean teacher model to guide the student model, inherently preventing recognition or response to backdoor triggers as intended by an attacker. In this paper, we challenge this assumption by introducing a novel attack methodology that strategically poisons the distillation dataset with adversarial examples embedded with backdoor triggers. This technique allows for the stealthy compromise of the student model while maintaining the integrity of the teacher model. Our innovative approach represents the first successful exploitation of vulnerabilities within the knowledge distillation process using clean teacher models. Through extensive experiments conducted across various datasets and attack settings, we demonstrate the robustness, stealthiness, and effectiveness of our method. Our findings reveal previously unrecognized vulnerabilities and pave the way for future research aimed at securing knowledge distillation processes against backdoor attacks.
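The mechanism the abstract describes, in which the student learns only from the teacher's soft outputs on the distillation inputs, can be sketched as a minimal response-based distillation loop. The linear teacher and student, the synthetic dataset, and the hyperparameters below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: linear "teacher" and "student" over 16-dim inputs, 3 classes.
# (Hypothetical stand-ins for real networks; shapes are illustrative.)
D, C, N = 16, 3, 200
W_teacher = rng.normal(size=(C, D))     # fixed, "clean" teacher

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

# Distillation dataset: the student never sees ground-truth labels,
# only the teacher's soft outputs on these inputs.
X = rng.normal(size=(N, D))
P_teacher = softmax(X @ W_teacher.T)

# Train the student by gradient descent on cross-entropy with soft targets.
W_student = np.zeros((C, D))
lr = 0.5
for _ in range(500):
    P_student = softmax(X @ W_student.T)
    grad = (P_student - P_teacher).T @ X / N   # dCE/dW for soft targets
    W_student -= lr * grad

# Fraction of distillation inputs where student and teacher agree.
agreement = np.mean(
    np.argmax(X @ W_student.T, axis=1) == np.argmax(P_teacher, axis=1)
)
```

Because the student imitates whatever soft labels the teacher emits on the distillation inputs, an attacker who controls those inputs controls what associations the student learns, even though the teacher itself is untouched.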
Problem

Research questions and friction points this paper is trying to address.

Exposing vulnerabilities in the supposedly secure knowledge distillation process
Introducing backdoor attacks via adversarial poisoning of the distillation dataset
Stealthily compromising student models while the teacher model remains clean
Innovation

Methods, ideas, or system contributions that make the work stand out.

Poisoning the distillation dataset with adversarial examples
Embedding backdoor triggers stealthily
Exploiting the distillation process using only a clean teacher model
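The poisoning idea above can be sketched with a toy targeted gradient attack combined with a fixed trigger patch. The linear "teacher", the 8x8 inputs, the 2x2 corner trigger, and the step sizes are all assumptions for illustration; the paper's actual generation procedure is more involved. The key property is that the clean teacher itself assigns the attacker's target label to the crafted input, so its soft labels carry the backdoor to the student.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean teacher": a fixed linear classifier over flattened 8x8 inputs.
# (Hypothetical stand-in for a real network; shapes are illustrative.)
D, C = 64, 3
W = rng.normal(size=(C, D))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def teacher_probs(x):
    return softmax(W @ x)

def stamp_trigger(x, value=1.0):
    """Overwrite a 2x2 corner patch of the 8x8 image with a fixed pattern."""
    img = x.reshape(8, 8).copy()
    img[:2, :2] = value
    return img.reshape(-1)

TRIGGER_MASK = stamp_trigger(np.zeros(D))   # 1.0 on trigger pixels, 0 elsewhere

def craft_poison(x, target, lr=0.02, steps=300):
    """Targeted gradient attack on the non-trigger pixels: nudge the input
    until the clean teacher itself predicts the attacker's target class."""
    x_adv = stamp_trigger(x)
    onehot = np.eye(C)[target]
    for _ in range(steps):
        grad = W.T @ (teacher_probs(x_adv) - onehot)      # dCE/dx, linear model
        x_adv = x_adv - lr * grad * (1.0 - TRIGGER_MASK)  # leave trigger intact
    return x_adv

x = rng.normal(size=D)   # a benign distillation input
target = 2               # attacker-chosen class
x_poison = craft_poison(x, target)
```

Poisoned inputs like `x_poison` are simply added to the distillation dataset; the distiller never sees attacker-chosen labels, only the clean teacher's (now corrupted) soft outputs on them.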