🤖 AI Summary
Existing knowledge distillation methods typically combine a distillation loss with cross-entropy via manually tuned weighting coefficients, introducing hyperparameter sensitivity. This paper proposes Plackett–Luce Distillation (PLD), which incorporates the Plackett–Luce model, a well-established choice-theoretic ranking model, into knowledge distillation. PLD interprets teacher logits as per-class "worth" scores and formulates a listwise ranking loss over all classes, weighting each ranked class by the teacher's confidence. The resulting loss is convex and translation-invariant, and it integrates label priority and teacher confidence without requiring additional hyperparameters. On standard image classification benchmarks, PLD achieves consistent improvements: under homogeneous teacher-student settings, it outperforms DIST and vanilla KD by +0.42% and +1.04% in Top-1 accuracy, respectively; under heterogeneous settings, the gains are +0.48% and +1.09%. These results indicate more effective knowledge transfer and better generalization.
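For readers unfamiliar with the Plackett–Luce model, the block below sketches its ranking likelihood and the kind of confidence-weighted listwise loss the summary describes. The notation (worth scores $s$, student logits $z$, teacher-confidence weights $w_k$, ranking $\pi$) is ours for illustration and is not taken verbatim from the paper.

```latex
% Plackett-Luce probability of a full ranking \pi under worth scores s:
P(\pi \mid s) \;=\; \prod_{k=1}^{K} \frac{\exp\!\big(s_{\pi(k)}\big)}{\sum_{j=k}^{K} \exp\!\big(s_{\pi(j)}\big)}

% A confidence-weighted listwise surrogate of the kind PLD describes,
% where \pi ranks the true label first and then classes by descending
% teacher confidence, w_k is the teacher's confidence for the class at
% rank k, and z are the student's logits:
\mathcal{L}_{\mathrm{PLD}} \;=\; -\sum_{k=1}^{K} w_{k}\,
    \log \frac{\exp\!\big(z_{\pi(k)}\big)}{\sum_{j=k}^{K} \exp\!\big(z_{\pi(j)}\big)}
```

With uniform weights and a ranking that only fixes the true label at the top, the first factor of this product reduces to standard cross-entropy, which is consistent with the claim that PLD subsumes weighted cross-entropy.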
📝 Abstract
Knowledge distillation is a model compression technique in which a compact "student" network is trained to replicate the predictive behavior of a larger "teacher" network. In logit-based knowledge distillation, augmenting cross-entropy with a distillation term has become the de facto approach. Typically this term is either a KL divergence matching marginal probabilities or a correlation-based loss capturing intra- and inter-class relationships, but in every case it sits as an add-on to cross-entropy with its own weight that must be carefully tuned. In this paper we adopt a choice-theoretic perspective and recast knowledge distillation under the Plackett-Luce model by interpreting teacher logits as "worth" scores. We introduce Plackett-Luce Distillation (PLD), a weighted list-wise ranking loss in which the teacher transfers knowledge of its full ranking of classes, weighting each ranked choice by its own confidence. PLD directly optimizes a single teacher-optimal ranking, with the true label first followed by the remaining classes in descending teacher confidence, yielding a convex, translation-invariant surrogate that subsumes weighted cross-entropy. Empirically, on standard image classification benchmarks, PLD improves Top-1 accuracy by an average of +0.42% over DIST (arXiv:2205.10536) and +1.04% over KD (arXiv:1503.02531) in homogeneous settings, and by +0.48% and +1.09% over DIST and KD, respectively, in heterogeneous settings.
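As a rough illustration of the loss the abstract describes, the PyTorch sketch below builds the teacher-optimal ranking (true label first, then remaining classes in descending teacher confidence) and evaluates a confidence-weighted Plackett-Luce negative log-likelihood on the student logits. The function name `pld_loss` and the specific choice of teacher softmax probabilities as per-rank weights are assumptions made for illustration, not the authors' reference implementation.

```python
import torch


def pld_loss(student_logits, teacher_logits, targets):
    """Sketch of a Plackett-Luce listwise distillation loss (assumed form).

    Ranks classes by the teacher-optimal permutation (ground-truth class
    first, remaining classes by descending teacher confidence) and returns
    the confidence-weighted negative log-likelihood of that ranking under
    the student's logits.
    """
    teacher_probs = teacher_logits.softmax(dim=-1)              # (B, C)

    # Teacher-optimal permutation: give the ground-truth class infinite
    # priority so it is ranked first, then sort by teacher confidence.
    priority = teacher_probs.clone()
    priority.scatter_(1, targets.unsqueeze(1), float("inf"))
    order = priority.argsort(dim=-1, descending=True)           # (B, C)

    # Reorder student logits and teacher weights into that ranking.
    s = torch.gather(student_logits, 1, order)                  # (B, C)
    w = torch.gather(teacher_probs, 1, order)                   # (B, C)

    # Plackett-Luce log-likelihood: at rank k the chosen class competes
    # against all not-yet-chosen classes, i.e. a suffix log-sum-exp.
    suffix_lse = torch.flip(
        torch.logcumsumexp(torch.flip(s, dims=[-1]), dim=-1), dims=[-1]
    )
    log_pl = s - suffix_lse                                      # (B, C)

    # Confidence-weighted negative log-likelihood, averaged over the batch.
    return -(w * log_pl).sum(dim=-1).mean()


# Example usage with random tensors (batch of 8, 100 classes):
if __name__ == "__main__":
    student = torch.randn(8, 100)
    teacher = torch.randn(8, 100)
    labels = torch.randint(0, 100, (8,))
    print(pld_loss(student, teacher, labels))
```

The suffix log-sum-exp (flip, `logcumsumexp`, flip back) is just the standard O(C) way to evaluate a Plackett-Luce likelihood over a full ranking; whether the paper weights each rank exactly by the teacher's softmax probability is an assumption here.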