🤖 AI Summary
Large language models (LLMs) that emit long reasoning traces expose a security risk: those traces can be harvested for knowledge distillation, enabling unauthorized model extraction. Method: the paper proposes a distillation-resistant sampling strategy that actively "poisons" inference trajectories by probabilistically reweighting the next-token distribution under controllable entropy constraints, without degrading the original model's performance. Crucially, it decouples security enforcement from functional integrity and dynamically modulates perturbation intensity via path-sensitivity analysis. Contribution/Results: experiments across multiple benchmarks show that the method preserves ≥99% of the teacher model's accuracy while cutting the distilled student model's performance by 35–62%, substantially improving robustness against distillation attacks and establishing a new paradigm for copyright protection of generative AI models.
📝 Abstract
Frontier models that generate extended reasoning traces inadvertently produce rich token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. *Antidistillation sampling* provides exactly this capability. By strategically modifying a model's next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model's practical utility. For further details, see https://antidistillation.com.
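To make the idea of "strategically modifying a model's next-token probability distribution" concrete, here is a minimal sketch of one way such a reweighting could look. This is not the paper's actual algorithm: the function name, the `poison_scores` term (a stand-in for whatever per-token penalty the method computes to estimate a token's usefulness to a distilling student), and the trade-off weight `lam` are all illustrative assumptions.

```python
import numpy as np

def antidistillation_sample(teacher_logits, poison_scores, lam=0.5,
                            temperature=1.0, rng=None):
    """Sample a next token from a perturbed next-token distribution.

    teacher_logits : unmodified next-token logits from the model.
    poison_scores  : hypothetical per-token penalties estimating how
                     useful each token would be to a distilling student
                     (an assumption for illustration, not the paper's term).
    lam            : trade-off between utility and distillation resistance;
                     lam=0 recovers ordinary temperature sampling.
    """
    rng = rng or np.random.default_rng()
    # Penalize tokens that would most benefit a student model.
    adjusted = np.asarray(teacher_logits) / temperature - lam * np.asarray(poison_scores)
    # Numerically stable softmax over the adjusted logits.
    probs = np.exp(adjusted - adjusted.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs
```

With `lam > 0`, tokens carrying high poison scores are down-weighted, so sampled traces drift away from the sequences most informative for distillation while the rest of the distribution stays close to the teacher's.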