The Surprising Effectiveness of Canonical Knowledge Distillation for Semantic Segmentation

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses an ambiguity in how knowledge distillation (KD) methods for semantic segmentation are evaluated: recent methods rely on complex hand-crafted objectives that raise per-iteration cost, so their reported gains may reflect either stronger distillation signals or simply more compute. Under matched wall-clock training budgets, the authors compare canonical logit- and feature-based distillation against recent segmentation-specific methods and find that the canonical approaches perform better, challenging the prevailing assumption that specialized mechanisms are necessary. Under extended training, feature-based distillation achieves state-of-the-art ResNet-18 performance on Cityscapes and ADE20K. Notably, a PSPNet ResNet-18 student reaches 99% and 92% of its ResNet-101 teacher's mIoU on Cityscapes and ADE20K, respectively, while using only one quarter of the teacher's parameters.
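To make the budget-matching idea concrete, here is a minimal sketch of converting a shared wall-clock budget into per-method iteration counts. The method names, the per-iteration costs, and the 40,000-iteration reference schedule are hypothetical placeholders chosen for illustration; they are not measurements or settings from the paper.

```python
def iterations_for_budget(budget_ms: int, ms_per_iteration: int) -> int:
    """Return how many whole training iterations fit in a wall-clock budget."""
    if ms_per_iteration <= 0:
        raise ValueError("ms_per_iteration must be positive")
    return budget_ms // ms_per_iteration


# Hypothetical per-iteration costs in milliseconds (illustrative only,
# NOT taken from the paper).
per_iteration_ms = {
    "canonical_kd": 500,  # assumed: simple logit/feature loss
    "complex_kd": 800,    # assumed: costlier hand-crafted objective
}

# Define the budget as the time the costlier method needs for an
# assumed 40,000-iteration schedule.
budget_ms = 40_000 * per_iteration_ms["complex_kd"]

# Under an equal wall-clock budget, the cheaper method gets more iterations.
matched_iterations = {
    name: iterations_for_budget(budget_ms, cost)
    for name, cost in per_iteration_ms.items()
}
```

With these assumed costs, an equal-iteration comparison would give the costlier method 60% more training time than the cheaper one, which is the kind of mismatch the summary describes.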

📝 Abstract

Recent knowledge distillation (KD) methods for semantic segmentation introduce increasingly complex hand-crafted objectives, yet are typically evaluated under fixed iteration schedules. These objectives substantially increase per-iteration cost, meaning equal iteration counts do not correspond to equal training budgets. It is therefore unclear whether reported gains reflect stronger distillation signals or simply greater compute. We show that iteration-based comparisons are misleading: when wall-clock compute is matched, canonical logit- and feature-based KD outperform recent segmentation-specific methods. Under extended training, feature-based distillation achieves state-of-the-art ResNet-18 performance on Cityscapes and ADE20K. A PSPNet ResNet-18 student closely approaches its ResNet-101 teacher despite using only one quarter of the parameters, reaching 99% of the teacher's mIoU on Cityscapes (79.0 vs. 79.8) and 92% on ADE20K. Our results challenge the prevailing assumption that KD for segmentation requires task-specific mechanisms and suggest that scaling, rather than complex hand-crafted objectives, should guide future method design.
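For readers unfamiliar with what logit-based and feature-based KD usually mean, the following is a minimal sketch of the two textbook losses applied per pixel: a temperature-softened KL divergence between teacher and student class distributions, and a mean squared error between feature vectors. The abstract does not spell out the paper's exact formulation, so the temperature scaling, the MSE choice, and the assumption that student features are already projected to the teacher's channel width are all assumptions made here for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of one pixel's class logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum_exp for s in scaled]


def logit_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean over pixels of KL(teacher || student), scaled by T^2.

    Both arguments are lists of per-pixel class-logit lists.
    """
    total = 0.0
    for s, t in zip(student_logits, teacher_logits):
        log_p = log_softmax(t, temperature)  # teacher distribution
        log_q = log_softmax(s, temperature)  # student distribution
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * total / len(student_logits)


def feature_kd_loss(student_feats, teacher_feats):
    """Mean squared error between per-pixel feature vectors.

    Assumes student features already match the teacher's channel count
    (in practice a learned projection is typically needed first).
    """
    squared, count = 0.0, 0
    for s, t in zip(student_feats, teacher_feats):
        for s_c, t_c in zip(s, t):
            squared += (s_c - t_c) ** 2
            count += 1
    return squared / count


# Tiny usage example: two pixels, two classes.
teacher = [[0.0, 2.0], [1.0, 1.0]]
student = [[2.0, 0.0], [1.0, 1.0]]
loss_logit = logit_kd_loss(student, teacher, temperature=2.0)
loss_feat = feature_kd_loss([[1.0, 2.0]], [[1.0, 4.0]])
```

In a real training loop, either term would be added to the usual segmentation cross-entropy with a weighting coefficient; that weighting is likewise not specified in the abstract.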

Problem

Research questions and friction points this paper is trying to address.

knowledge distillation

semantic segmentation

training budget

compute efficiency

iteration-based evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge distillation

semantic segmentation

canonical distillation