Training a Student Expert via Semi-Supervised Foundation Model Distillation

📅 2026-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of vision foundation models and their reliance on expensive pixel-level annotations by proposing a semi-supervised knowledge distillation framework that compresses a large teacher model into a lightweight student using only limited labeled data and abundant unlabeled data. The approach comprises three stages (domain adaptation, multi-objective knowledge transfer, and student refinement) and introduces an instance-aware pixel-level contrastive loss that leverages both mask and class scores to construct high-quality negative samples and preserves the contrastive signal during distillation to align teacher and student embeddings. Combined with self-training, contrastive calibration, a unified multi-objective loss, and a pseudo-label bias-mitigation mechanism, the resulting student model, roughly 11× smaller than the teacher, outperforms the zero-shot teacher by 11.9/8.6 AP and the fine-tuned teacher by 3.4/1.5 AP on Cityscapes and ADE20K, significantly surpassing existing semi-supervised distillation methods.
📝 Abstract
Foundation models deliver strong perception but are often too computationally heavy to deploy, and adapting them typically requires costly annotations. We introduce a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data, and instantiate it for instance segmentation where per-pixel labels are particularly expensive. The framework unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to our approach is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and more effectively leverage unlabeled images. On Cityscapes and ADE20K, our $\approx 11\times$ smaller student improves over its zero-shot VFM teacher(s) by +11.9 and +8.6 AP, surpasses adapted teacher(s) by +3.4 and +1.5 AP, and outperforms state-of-the-art SSKD methods on benchmarks.
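The instance-aware pixel-wise contrastive loss described above can be sketched as an InfoNCE-style objective in which, for each pixel, positives come from the same instance and negatives are drawn from other instances ranked by a fused mask-times-class confidence, so that only high-quality pixels serve as negatives. This is a minimal illustrative sketch, not the paper's implementation: the multiplicative score fusion, the `top_k` negative selection, and all function and parameter names here are assumptions.

```python
import numpy as np

def pixel_contrastive_loss(emb, inst_ids, mask_scores, cls_scores,
                           tau=0.1, top_k=8):
    """InfoNCE-style pixel-wise contrastive loss (illustrative sketch).

    emb:         (N, D) pixel embeddings (L2-normalized internally)
    inst_ids:    (N,)   instance id per pixel
    mask_scores: (N,)   per-pixel mask confidence in [0, 1]
    cls_scores:  (N,)   per-pixel class confidence in [0, 1]

    Negatives for each anchor pixel come from other instances and are
    ranked by the fused confidence mask_scores * cls_scores
    (hypothetical fusion rule), keeping only the top_k most confident.
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    quality = mask_scores * cls_scores          # fused confidence score
    sim = emb @ emb.T / tau                     # (N, N) cosine sims / temperature
    losses = []
    for i in range(len(emb)):
        same = inst_ids == inst_ids[i]
        pos = same.copy()
        pos[i] = False                          # exclude the anchor itself
        neg_idx = np.where(~same)[0]
        if not pos.any() or neg_idx.size == 0:
            continue
        # keep the top_k highest-quality negatives for this anchor
        neg_idx = neg_idx[np.argsort(-quality[neg_idx])[:top_k]]
        for j in np.where(pos)[0]:
            logits = np.concatenate(([sim[i, j]], sim[i, neg_idx]))
            logits -= logits.max()              # numerical stability
            # -log softmax probability of the positive pair
            losses.append(-logits[0] + np.log(np.exp(logits).sum()))
    return float(np.mean(losses))
```

In the paper's pipeline this signal is maintained across both the adaptation and distillation stages; here it is shown as a standalone loss over one batch of pixel embeddings for clarity.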
Problem

Research questions and friction points this paper is trying to address.

semi-supervised learning
knowledge distillation
vision foundation models
instance segmentation
label efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

semi-supervised knowledge distillation
vision foundation model
instance-aware contrastive loss
instance segmentation
pseudo-label bias mitigation