MoKD: Multi-Task Optimization for Knowledge Distillation

📅 2025-05-13
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Knowledge distillation faces two key challenges: gradient conflict between the teacher and student objectives, and objective imbalance, both exacerbated by representational disparity. This paper proposes a multi-objective gradient alignment framework that formulates distillation as a collaborative optimization task, coupled with a learnable subspace feature projection mechanism to mitigate gradient dominance and representation mismatch. The authors report that, for the first time, the student model comprehensively surpasses same-scale from-scratch baselines on both ImageNet-1K (classification) and COCO (detection), attaining new state-of-the-art accuracy and efficiency on both tasks. The core contributions are: (1) a dynamic multi-objective balancing mechanism operating at the gradient level; (2) a learnable subspace projection that bridges the teacher–student representation gap; and (3) an end-to-end distillation paradigm requiring no additional parameters or inference overhead.
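The gradient-level balancing described above can be illustrated with a small sketch. The snippet below is an assumption-laden stand-in (a PCGrad-style projection, not necessarily MoKD's exact update rule): when the task gradient and the distillation gradient point in conflicting directions, the distillation gradient is projected onto the normal plane of the task gradient before the two are combined.

```python
# Hedged sketch of gradient-conflict resolution in knowledge distillation.
# This uses a PCGrad-style projection as an illustrative stand-in for the
# paper's multi-objective balancing mechanism; function names and the
# combination rule are assumptions, not MoKD's published algorithm.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def resolve_conflict(g_task, g_kd):
    """Combine task and distillation gradients; if they conflict
    (negative dot product), project g_kd onto the normal plane of
    g_task so it no longer opposes the task direction."""
    d = dot(g_task, g_kd)
    if d < 0:  # conflicting directions
        scale = d / dot(g_task, g_task)
        g_kd = [gk - scale * gt for gk, gt in zip(g_kd, g_task)]
    return [gt + gk for gt, gk in zip(g_task, g_kd)]
```

With `g_task = [1, 0]` and a conflicting `g_kd = [-1, 1]`, the projection removes the opposing component, so the combined update no longer cancels the task gradient.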

๐Ÿ“ Abstract
Compact models can be effectively trained through Knowledge Distillation (KD), a technique that transfers knowledge from larger, high-performing teacher models. Two key challenges in KD are: 1) balancing learning from the teacher's guidance against the task objective, and 2) handling the disparity in knowledge representation between teacher and student models. To address these, we propose Multi-Task Optimization for Knowledge Distillation (MoKD). MoKD tackles two main gradient issues: a) Gradient Conflicts, where task-specific and distillation gradients are misaligned, and b) Gradient Dominance, where one objective's gradient dominates, causing imbalance. MoKD reformulates KD as a multi-objective optimization problem, enabling a better balance between objectives. Additionally, it introduces a subspace learning framework that projects feature representations into a high-dimensional space, improving knowledge transfer. MoKD is demonstrated to outperform existing methods through extensive experiments on image classification using the ImageNet-1K dataset and object detection using the COCO dataset, achieving state-of-the-art performance with greater efficiency. To the best of our knowledge, MoKD models also achieve state-of-the-art performance compared to models trained from scratch.
Problem

Research questions and friction points this paper is trying to address.

Balancing teacher guidance and task objectives in KD
Resolving gradient conflicts and dominance in distillation
Improving knowledge transfer via a subspace learning framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task optimization balances distillation and task objectives
Subspace learning improves knowledge transfer efficiency
Addresses gradient conflicts and dominance in distillation
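The subspace-learning idea listed above can be sketched minimally: a learnable linear map lifts student features into the teacher's (typically higher-dimensional) feature space, where a distillation loss is computed. The dimensions, the plain linear map, and the mean-squared-error loss below are illustrative assumptions, not the paper's exact design; the projection would be discarded at inference, consistent with the no-overhead claim.

```python
# Hedged sketch of subspace feature projection for distillation.
# A (d_out x d_in) weight matrix, stored as nested lists, projects
# student features into the teacher's feature space; the MSE loss
# choice here is an assumption for illustration.

def project(features, weights):
    """Linear projection: weights is a (d_out x d_in) matrix."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def feature_distill_loss(student_feat, teacher_feat, weights):
    """Mean squared error between the projected student features
    and the teacher features."""
    proj = project(student_feat, weights)
    return sum((p - t) ** 2 for p, t in zip(proj, teacher_feat)) / len(teacher_feat)
```

In training, the weight matrix would be learned jointly with the student; only the student network is kept for inference.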