CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high cost of pixel-level annotations in instance segmentation and the deployment challenges posed by large-scale vision foundation models, this paper proposes CAST, a semi-supervised knowledge distillation framework that compresses foundation models into lightweight expert models using limited labeled and abundant unlabeled data. Its key contributions are: (1) an instance-aware pixel-wise contrastive loss that jointly leverages mask structure and class confidence to mine hard negative samples, enabling fine-grained embedding alignment; and (2) a three-stage collaborative pipeline—contrastive adaptation → contrastive distillation → bias correction—that effectively mitigates pseudo-label noise. On Cityscapes and ADE20K, the student reaches 33.9 AP (vs. its adapted teacher's 30.5) and 16.7 AP (vs. 15.2), respectively, while being only about 1/11 the teacher's size, significantly outperforming existing semi-supervised approaches.

📝 Abstract
Instance segmentation demands costly per-pixel annotations and large models. We introduce CAST, a semi-supervised knowledge distillation (SSKD) framework that compresses pretrained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data. CAST unfolds in three stages: (1) domain adaptation of the VFM teacher(s) via self-training with contrastive pixel calibration, (2) distillation into a compact student via a unified multi-objective loss that couples standard supervision and pseudo-labels with our instance-aware pixel-wise contrastive term, and (3) fine-tuning on labeled data to remove residual pseudo-label bias. Central to CAST is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to mine informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and fully leverage unlabeled images. On Cityscapes and ADE20K, our ~11X smaller student surpasses its adapted VFM teacher(s) by +3.4 AP (33.9 vs. 30.5) and +1.5 AP (16.7 vs. 15.2) and outperforms state-of-the-art semi-supervised approaches.
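The abstract does not spell out the loss in equations, but the description (positives within an instance, negatives mined across instances and weighted by fused mask-and-class confidence) suggests a supervised-contrastive form. Below is a minimal NumPy sketch of one plausible version; the exact weighting, function name, and interface are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def instance_contrastive_loss(emb, inst_ids, conf, tau=0.1, eps=1e-8):
    """Hypothetical instance-aware pixel-wise contrastive loss (sketch).

    emb:      (N, D) pixel embeddings (L2-normalized internally)
    inst_ids: (N,)   instance id per pixel
    conf:     (N,)   fused mask*class confidence per pixel, in [0, 1]

    Pixels of the same instance are positives; pixels of other instances
    are negatives, up-weighted by their confidence as a stand-in for
    hard-negative mining.
    """
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + eps)
    sim = emb @ emb.T / tau                        # (N, N) similarity logits
    same = inst_ids[:, None] == inst_ids[None, :]  # positive-pair mask
    np.fill_diagonal(same, False)
    diff = ~same
    np.fill_diagonal(diff, False)

    exp_sim = np.exp(sim)
    neg_w = conf[None, :] * diff                   # confidence-weighted negatives
    losses = []
    for i in range(len(emb)):
        pos = exp_sim[i][same[i]]
        if pos.size == 0:                          # pixel with no positive partner
            continue
        denom = pos + (neg_w[i] * exp_sim[i]).sum()
        losses.append(-np.log(pos / denom).mean())
    return float(np.mean(losses))
```

As a sanity check, embeddings that cluster tightly by instance should incur a much lower loss than unstructured ones.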
Problem

Research questions and friction points this paper is trying to address.

Reducing costly pixel annotations in instance segmentation
Compressing large vision models with limited labeled data
Improving segmentation accuracy via contrastive adaptation and distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-supervised knowledge distillation framework
Instance-aware pixel-wise contrastive loss
Domain adaptation via self-training
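The three-stage pipeline from the abstract (contrastive adaptation → contrastive distillation → bias correction) can be outlined as a training skeleton. All function names below are illustrative placeholders paraphrasing the stage descriptions, not the authors' code.

```python
def self_train(teacher, labeled, unlabeled, log):
    # Stage 1 (sketch): adapt the VFM teacher to the target domain via
    # self-training with a contrastive pixel-calibration term.
    log.append("adapt")
    return teacher

def distill(teacher, student, labeled, unlabeled, log):
    # Stage 2 (sketch): train the compact student with a multi-objective
    # loss coupling supervision, teacher pseudo-labels, and the
    # instance-aware pixel-wise contrastive term.
    log.append("distill")
    return student

def finetune(student, labeled, log):
    # Stage 3 (sketch): fine-tune on labeled data only to remove
    # residual pseudo-label bias.
    log.append("bias-correct")
    return student

def train_cast(teacher, student, labeled, unlabeled):
    """Run the three CAST stages in order; returns the student and a stage log."""
    log = []
    teacher = self_train(teacher, labeled, unlabeled, log)
    student = distill(teacher, student, labeled, unlabeled, log)
    student = finetune(student, labeled, log)
    return student, log
```

The key design point the paper emphasizes is that the contrastive signal persists through stages 1 and 2, keeping teacher and student embeddings aligned before the final supervised correction.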
Pardis Taghavi — Texas A&M University
Tian Liu — Texas A&M University
Renjie Li — Texas A&M University
Reza Langari — Texas A&M University
Zhengzhong Tu — Texas A&M University, Google Research, University of Texas at Austin
Agentic AI · Trustworthy AI · Embodied AI