DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) deployed via APIs are vulnerable to knowledge distillation (KD) attacks that exploit raw output text, enabling adversaries to train high-fidelity student models. Method: We propose a lightweight, real-time defensive output generation mechanism that fine-tunes only the final linear layer of the teacher model, introduces adversarial perturbations in the output space, and optimizes the perturbation strategy via adversarial training—without degrading utility for legitimate users. Contribution/Results: To our knowledge, this is the first real-time defense specifically designed for pure-text-output KD that requires tuning only a single model layer. Experiments demonstrate that the teacher’s performance remains stable or slightly improves, while distilled student models suffer catastrophic performance degradation—validating the method’s strong robustness, practicality, and deployment efficiency.
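The core idea — fine-tune only the final linear layer with a task loss plus a sign-flipped distillation term so a would-be student learns poorly — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the proxy student, the loss weighting `lam = 0.1`, and the tiny linear models are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins (hypothetical): "teacher" = frozen backbone + trainable output
# head; "student" = a proxy imitator the defender trains against.
vocab, hidden = 16, 8
backbone = nn.Linear(4, hidden)     # frozen feature extractor
lm_head = nn.Linear(hidden, vocab)  # the ONLY layer that is fine-tuned
student = nn.Linear(4, vocab)       # proxy student used during adversarial training

for p in backbone.parameters():
    p.requires_grad_(False)

opt_t = torch.optim.Adam(lm_head.parameters(), lr=1e-2)
opt_s = torch.optim.Adam(student.parameters(), lr=1e-2)

x = torch.randn(32, 4)
y = torch.randint(0, vocab, (32,))
lam = 0.1  # illustrative weight on the adversarial term

for step in range(50):
    # 1) Proxy student tries to imitate the teacher's output distribution.
    with torch.no_grad():
        t_logits = lm_head(backbone(x))
    s_loss = F.kl_div(F.log_softmax(student(x), -1),
                      F.softmax(t_logits, -1), reduction="batchmean")
    opt_s.zero_grad(); s_loss.backward(); opt_s.step()

    # 2) Teacher head: stay accurate on the task (cross-entropy) while making
    #    the student's imitation objective LARGE (note the minus sign).
    t_logits = lm_head(backbone(x))
    task_loss = F.cross_entropy(t_logits, y)
    imitate = F.kl_div(F.log_softmax(student(x), -1).detach(),
                       F.softmax(t_logits, -1), reduction="batchmean")
    loss = task_loss - lam * imitate
    opt_t.zero_grad(); loss.backward(); opt_t.step()
```

Only `lm_head`'s parameters receive gradients on the teacher side, which is what keeps the defense lightweight enough to apply at deployment time.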

📝 Abstract
Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation via knowledge distillation (KD). In practical scenarios, competitors can distill proprietary LLM capabilities by simply observing publicly accessible outputs, akin to reverse-engineering a complex performance by observation alone. Existing protective methods like watermarking only identify imitation post-hoc, while other defenses assume the student model mimics the teacher's internal logits, rendering them ineffective against distillation purely from observed output text. This paper confronts the challenge of actively protecting LLMs within the realistic constraints of API-based access. We introduce an effective and efficient Defensive Output Generation (DOGe) strategy that subtly modifies the output behavior of an LLM. Its outputs remain accurate and useful for legitimate users, yet are designed to be misleading for distillation, significantly undermining imitation attempts. We achieve this by fine-tuning only the final linear layer of the teacher LLM with an adversarial loss. This targeted training approach anticipates and disrupts distillation attempts during inference time. Our experiments show that, while the teacher model preserves or even improves its original performance, student models distilled from the defensively generated teacher outputs suffer catastrophically reduced performance, demonstrating our method's effectiveness as a practical safeguard against KD-based model imitation.
Problem

Research questions and friction points this paper is trying to address.

Protecting LLMs from imitation via knowledge distillation
Preventing distillation from publicly accessible outputs
Ensuring outputs mislead imitation but remain useful
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial fine-tuning of final layer
Subtle output modification for defense
Preserves accuracy while disrupting distillation
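Restricting adversarial fine-tuning to the final layer amounts to freezing every other parameter before training. A small helper sketching this, under the assumption of a HuggingFace-style causal LM that exposes its output head as an attribute (the `lm_head` name and the `ToyLM` model are illustrative):

```python
import torch.nn as nn

def freeze_all_but_head(model: nn.Module, head_attr: str = "lm_head"):
    """Freeze every parameter except those of the output head.

    Returns the list of parameters left trainable, suitable for passing
    directly to an optimizer.
    """
    for p in model.parameters():
        p.requires_grad_(False)
    head = getattr(model, head_attr)  # assumes the model exposes this attribute
    for p in head.parameters():
        p.requires_grad_(True)
    return [p for p in model.parameters() if p.requires_grad]

# Illustrative toy model standing in for a causal LM.
class ToyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.lm_head = nn.Linear(8, 32)

model = ToyLM()
trainable = freeze_all_but_head(model)  # only lm_head.weight and lm_head.bias
```

Because only the head's weight and bias are updated, the backbone's representations, and hence the teacher's utility for legitimate users, stay largely intact.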