🤖 AI Summary
Conventional knowledge distillation uses the teacher model only during training and discards it entirely at inference, which caps the student model's performance ceiling.
Method: We propose "Query-Aware Knowledge Distillation," a framework in which the student both acquires teacher knowledge during training and dynamically invokes the teacher during inference, based on real-time difficulty assessment and operational constraints, jointly optimizing *what* to learn and *when* to query. We formulate this as an entropy-regularized value optimization problem and solve it by combining Path Consistency Learning, on-policy and off-policy demonstrations, constrained reinforcement learning, and dynamic difficulty awareness.
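The entropy-regularized objective and the consistency condition that Path Consistency Learning exploits can be sketched in their standard soft-RL form (this is the generic formulation from the PCL literature, not necessarily the paper's exact notation or reward definition):

```latex
% Entropy-regularized value objective: maximize expected reward
% plus a policy-entropy bonus weighted by \alpha.
J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[ \sum_t r(s_t, a_t)
         + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]

% Path consistency (single-step form): the optimal soft value V^*
% and optimal policy \pi^* satisfy, along any transition,
V^*(s_t) - \gamma\, V^*(s_{t+1})
  = r(s_t, a_t) - \alpha \log \pi^*(a_t \mid s_t)
```

Because the path-consistency identity holds along *any* trajectory, it can be enforced on both on-policy rollouts and off-policy (e.g., teacher-generated) demonstrations, which is what makes it a natural fit for distillation.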
Contribution/Results: Our approach significantly improves the accuracy versus teacher-invocation trade-off curve on neural machine translation and text summarization tasks, balancing performance gains against computational overhead more effectively than speculative decoding and unlocking operating points that speculative decoding cannot reach.
📝 Abstract
Knowledge distillation is used in generative language modeling to train a smaller student model with the help of a larger teacher model, improving the student's capabilities. In this paper, we formulate a more general framework for knowledge distillation in which the student learns from the teacher during training and also learns to ask for the teacher's help at test time, following rules that specify test-time restrictions. Towards this, we first formulate knowledge distillation as an entropy-regularized value optimization problem. Adopting Path Consistency Learning to solve this leads to a new knowledge distillation algorithm using on-policy and off-policy demonstrations. We extend this, using constrained reinforcement learning, to a framework that incorporates the teacher model as a test-time reference within constraints. In this setting, akin to a human learner, the model must learn not only the material itself but also the relative difficulty of different sections, in order to prioritize where to seek teacher help. We examine the efficacy of our method through experiments on translation and summarization tasks, observing trends in accuracy and teacher use, and noting that our approach unlocks operating points not available to the popular Speculative Decoding approach.
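The test-time mechanism described above can be sketched as a decoding loop in which the student defers to the teacher on steps it finds difficult, subject to an invocation budget. This is a minimal illustration under stated assumptions: all names (`decode_with_budget`, the toy models, the entropy threshold) are hypothetical, and the paper learns the query policy rather than using a fixed threshold.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decode_with_budget(student, teacher, prompt, max_len,
                       entropy_threshold, teacher_budget):
    """Greedy decoding where the student queries the teacher only when
    its own predictive entropy is high and budget remains (hypothetical
    stand-in for a learned query policy)."""
    tokens = list(prompt)
    teacher_calls = 0
    for _ in range(max_len):
        probs = student(tokens)  # student's next-token distribution
        if entropy(probs) > entropy_threshold and teacher_calls < teacher_budget:
            probs = teacher(tokens)  # defer to the teacher on a hard step
            teacher_calls += 1
        tokens.append(max(range(len(probs)), key=probs.__getitem__))
    return tokens, teacher_calls

# Toy stand-ins: a student that is always uncertain (uniform over 4
# tokens, entropy = ln 4 ~ 1.39) and a teacher confident about token 0.
def student_toy(tokens):
    return [0.25, 0.25, 0.25, 0.25]

def teacher_toy(tokens):
    return [0.90, 0.05, 0.03, 0.02]

tokens, calls = decode_with_budget(student_toy, teacher_toy, prompt=[1, 2],
                                   max_len=8, entropy_threshold=1.0,
                                   teacher_budget=3)
# calls == 3: the teacher is consulted until the budget runs out.
```

Varying `entropy_threshold` and `teacher_budget` traces out an accuracy versus teacher-invocation curve of the kind the paper evaluates.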