AI Summary
Knowledge distillation faces two key challenges: gradient conflict between the teacher's guidance and the task objective, and objective imbalance, both further exacerbated by the representational disparity between teacher and student. This paper proposes a multi-objective gradient alignment framework that formulates distillation as a collaborative optimization task, coupled with a learnable subspace feature projection mechanism that mitigates gradient dominance and representation mismatch. For the first time, the student model comprehensively surpasses same-scale from-scratch baselines on both ImageNet-1K (classification) and COCO (detection), attaining new state-of-the-art accuracy and efficiency on both tasks. The core contributions are: (1) a dynamic multi-objective balancing mechanism operating at the gradient level; (2) a learnable subspace projection that bridges the teacher-student representation gap; and (3) an end-to-end distillation paradigm requiring no additional parameters or overhead at inference.
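The summary's subspace projection can be pictured as a learnable map that lifts student features into the teacher's representation space so a distillation loss is computable despite the dimensionality gap. The sketch below is a minimal, hedged illustration with hypothetical dimensions (64-d student, 256-d teacher) and a random-initialized linear map; the paper's actual projection is trained jointly with the student and its exact form is not given here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration: the student produces 64-d
# features while the teacher produces 256-d features.
d_student, d_teacher, batch = 64, 256, 8

# Learnable projection matrix (random-initialized here; in practice it
# would be optimized jointly with the student during distillation).
W = rng.normal(scale=d_student ** -0.5, size=(d_student, d_teacher))

f_student = rng.normal(size=(batch, d_student))
f_teacher = rng.normal(size=(batch, d_teacher))

# Lift student features into the teacher's space, then compare them with
# an L2 matching loss (one common choice for feature distillation).
projected = f_student @ W                              # (batch, d_teacher)
distill_loss = np.mean((projected - f_teacher) ** 2)
```

Because the mismatch is absorbed by the projection, the student backbone itself needs no architectural change, which is consistent with the claim that inference carries no extra parameters.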
Abstract
Compact models can be trained effectively through Knowledge Distillation (KD), a technique that transfers knowledge from larger, high-performing teacher models. Two key challenges in KD are: 1) balancing learning from the teacher's guidance against the task objective, and 2) handling the disparity in knowledge representation between teacher and student models. To address these, we propose Multi-Task Optimization for Knowledge Distillation (MoKD). MoKD tackles two main gradient issues: a) gradient conflict, where task-specific and distillation gradients are misaligned, and b) gradient dominance, where one objective's gradient overwhelms the other, causing imbalance. MoKD reformulates KD as a multi-objective optimization problem, enabling a better balance between objectives. It also introduces a subspace learning framework that projects feature representations into a high-dimensional space, improving knowledge transfer. Extensive experiments on image classification with ImageNet-1K and object detection with COCO show that MoKD outperforms existing methods, achieving state-of-the-art performance with greater efficiency. To the best of our knowledge, MoKD models are also the first distilled students to reach state-of-the-art performance relative to same-scale models trained from scratch.
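The two gradient issues named above can be made concrete with a small sketch. The abstract does not spell out MoKD's exact update rule, so the code below illustrates the failure modes with generic remedies: a PCGrad-style projection to remove the conflicting component when the task and distillation gradients point in opposing directions, and a norm rescaling so neither gradient dominates the combined update. Treat it as a conceptual stand-in, not the paper's algorithm.

```python
import numpy as np

def combine_gradients(g_task, g_kd, eps=1e-12):
    """Combine a task gradient and a distillation gradient.

    Conflict: if the gradients have a negative dot product, project the
    KD gradient onto the normal plane of the task gradient (PCGrad-style)
    so it no longer opposes the task update.
    Dominance: rescale the KD gradient to the task gradient's norm so one
    objective cannot overwhelm the other.
    """
    g_t = np.asarray(g_task, dtype=float)
    g_d = np.asarray(g_kd, dtype=float)
    # Gradient conflict: cosine of the angle between the two is negative.
    if np.dot(g_t, g_d) < 0:
        g_d = g_d - (np.dot(g_d, g_t) / (np.dot(g_t, g_t) + eps)) * g_t
    # Gradient dominance: match the KD gradient's norm to the task's.
    nt, nd = np.linalg.norm(g_t), np.linalg.norm(g_d)
    if nd > eps:
        g_d = g_d * (nt / nd)
    return g_t + g_d

# Conflicting pair: the KD gradient's x-component opposes the task gradient.
combined = combine_gradients([1.0, 0.0], [-1.0, 1.0])  # -> [1.0, 1.0]
```

In the example, the opposing x-component of the KD gradient is projected away and the remainder is rescaled, so the combined step still makes progress on the task objective.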