Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the intellectual property risks posed by unauthorized knowledge distillation from large language models (LLMs) by dynamically rewriting the teacher model's inference trajectories. The approach reduces the training utility of distilled data while embedding a verifiable watermark, without compromising answer correctness or semantic coherence. It integrates anti-distillation defense with API-level watermarking, using instruction-based LLM rewriting and gradient-based optimization to inject dynamic perturbations into the inference process. Experiments show that the proposed scheme significantly degrades the effectiveness of unauthorized distillation, achieves high watermark detection accuracy with near-zero false positives, and maintains, or even slightly improves, the teacher model's original performance.
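The instruction-based rewriting idea described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's actual implementation: the prompt template, function names, and the `call_llm` stand-in are all assumptions.

```python
# Hypothetical sketch of instruction-based trace rewriting for anti-distillation.
# The teacher's reasoning trace is passed back through an LLM with an instruction
# to perturb its style and structure while keeping the final answer intact.
# `call_llm` is a stand-in for any chat-completion API; all names are illustrative.

REWRITE_INSTRUCTION = (
    "Rewrite the reasoning below so that it remains correct and coherent, "
    "but restructure the steps and vary the wording so the original phrasing "
    "is not reproduced. Do not change the final answer, which must remain "
    "exactly: {answer}"
)

def build_rewrite_prompt(trace: str, answer: str) -> str:
    """Compose the rewriting request sent to the rewriter LLM."""
    return REWRITE_INSTRUCTION.format(answer=answer) + "\n\n---\n" + trace

def rewrite_trace(trace: str, answer: str, call_llm) -> str:
    """Return a perturbed trace; fall back to the original if the answer is lost."""
    rewritten = call_llm(build_rewrite_prompt(trace, answer))
    return rewritten if answer in rewritten else trace
```

The fallback check is one plausible way to enforce the paper's "answer correctness" constraint: a rewrite that drops the final answer is discarded rather than served.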

📝 Abstract
Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) *anti-distillation*, or degrading the training usefulness of query responses, and (2) *API watermarking*, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables highly reliable watermark detection with essentially no false alarms.
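The watermark-detection claim (reliable detection, essentially no false alarms) can be illustrated with a generic keyed-greenlist z-test. This is an assumed construction for illustration only, not the scheme from the paper: the idea is that rewritten traces are secretly biased toward tokens whose keyed hash lands in a "green" half of the vocabulary, and a suspect student model's outputs are then scored against the null rate of 1/2.

```python
# Illustrative sketch of API-watermark detection via a keyed token "greenlist".
# This is a generic hypothesis-test construction, not the paper's exact method.
import hashlib
import math

def is_green(token: str, key: str) -> bool:
    """Keyed pseudorandom partition of the vocabulary (about half is 'green')."""
    digest = hashlib.sha256((key + token).encode()).digest()
    return digest[0] % 2 == 0

def watermark_z_score(tokens: list[str], key: str) -> float:
    """Z-score of the observed green-token count against the null rate of 1/2."""
    n = len(tokens)
    hits = sum(is_green(t, key) for t in tokens)
    return (hits - n / 2) / math.sqrt(n / 4)

def is_watermarked(tokens: list[str], key: str, threshold: float = 4.0) -> bool:
    """A conservative threshold keeps the false-positive rate near zero."""
    return watermark_z_score(tokens, key) > threshold
```

Setting a high z-score threshold is what drives false positives toward zero: unwatermarked text hits the greenlist at roughly the chance rate, so exceeding four standard deviations by accident is vanishingly unlikely.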
Problem

Research questions and friction points this paper is trying to address.

unauthorized distillation
language model protection
anti-distillation
API watermarking
knowledge distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

anti-distillation
API watermarking
trace rewriting
language model protection
knowledge distillation defense