CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing self-supervised pretraining paradigms—such as contrastive learning and masked modeling—suffer from isolated representation learning, model bloat, and poor deployability in resource-constrained settings. To address these issues, this paper proposes CoMAD, a lightweight, parameter-free self-supervised knowledge distillation framework. Its core contributions are: (1) a multi-teacher ViT collaborative guidance mechanism that fuses teacher features via asymmetric masking and joint consensus gating, where gating weights are dynamically computed from cosine similarity and inter-teacher consistency; (2) a dual-level KL divergence loss enforcing distributional alignment at both token-level and global representation levels; and (3) linear adapters to harmonize heterogeneous teacher feature spaces. On ImageNet-1K, the ViT-Tiny student achieves 75.4% top-1 accuracy. CoMAD also sets new state-of-the-art results for compact self-supervised models on ADE20K semantic segmentation and MS-COCO object detection and instance segmentation.
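The summary above names a dual-level KL divergence loss with token-level and global alignment, but does not spell out the formula. As a minimal NumPy sketch of one plausible form — the mean-pooling for the global term and the weight `lam` are assumptions, not details from the paper:

```python
import numpy as np

def kl_div(p_logits, q_logits, axis=-1):
    """KL(softmax(p) || softmax(q)), via log-softmax for numerical stability."""
    def log_softmax(x):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))
    log_p, log_q = log_softmax(p_logits), log_softmax(q_logits)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=axis)

def dual_level_kl(teacher_tokens, student_tokens, lam=1.0):
    """Dual-level distillation loss sketch: teacher/student token features of
    shape (num_tokens, dim). `lam` balances the two levels (assumed)."""
    # Token level: per-token distributional alignment over the feature dim.
    token_loss = kl_div(teacher_tokens, student_tokens).mean()
    # Global level: align pooled representations (mean pooling is assumed).
    global_loss = kl_div(teacher_tokens.mean(axis=0), student_tokens.mean(axis=0))
    return token_loss + lam * global_loss
```

Both terms vanish when student and teacher distributions coincide, so the loss drives the student toward the teachers' token-wise and pooled statistics simultaneously.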

📝 Abstract
Numerous self-supervised learning paradigms, such as contrastive learning and masked image modeling, learn powerful representations from unlabeled data but are typically pretrained in isolation, overlooking complementary insights and yielding large models that are impractical for resource-constrained deployment. To overcome these challenges, we introduce Consensus-oriented Masked Distillation (CoMAD), a lightweight, parameter-free framework that unifies knowledge from multiple current state-of-the-art self-supervised Vision Transformers into a compact student network. CoMAD distills from three pretrained ViT-Base teachers, MAE, MoCo v3, and iBOT, each offering distinct semantic and contextual priors. Rather than naively averaging teacher outputs, we apply asymmetric masking: the student sees only 25 percent of patches while each teacher receives a progressively lighter, unique mask, forcing the student to interpolate missing features under richer contexts. Teacher embeddings are aligned to the student's space via a linear adapter and layer normalization, then fused through our joint consensus gating, which weights each token by combining cosine affinity with inter-teacher agreement. The student is trained with dual-level KL divergence on visible tokens and reconstructed feature maps, capturing both local and global structure. On ImageNet-1K, CoMAD's ViT-Tiny achieves 75.4 percent Top-1, a 0.4-point gain over the previous state of the art. In dense-prediction transfers, it attains 47.3 percent mIoU on ADE20K, and 44.5 percent box average precision and 40.5 percent mask average precision on MS-COCO, establishing a new state-of-the-art in compact SSL distillation.
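The joint consensus gating described in the abstract weights each token by combining cosine affinity with inter-teacher agreement. A minimal NumPy sketch of one way this could work — the additive combination of the two scores and the softmax over teachers are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def cosine(a, b):
    """Per-token cosine similarity between two (num_tokens, dim) arrays."""
    return (a * b).sum(-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8)

def consensus_gate(student, teachers):
    """Fuse adapted teacher token features for a student.

    student:  (N, D) student token features.
    teachers: list of (N, D) teacher features, already mapped to the
              student's space (e.g. by a linear adapter + layer norm).
    """
    # Cosine affinity of each teacher's tokens to the student's tokens.
    affinity = np.stack([cosine(student, t) for t in teachers])   # (T, N)
    # Inter-teacher agreement: mean pairwise cosine per token.
    T = len(teachers)
    agreement = np.stack([
        np.mean([cosine(teachers[i], teachers[j])
                 for j in range(T) if j != i], axis=0)
        for i in range(T)])                                        # (T, N)
    # Combine scores and normalize over teachers (additive + softmax assumed).
    logits = affinity + agreement
    weights = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    fused = sum(w[:, None] * t for w, t in zip(weights, teachers)) # (N, D)
    return fused, weights
```

Tokens where teachers both agree with each other and align with the student receive higher gating weight, so noisy or contradictory teacher signals are down-weighted per token.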
Problem

Research questions and friction points this paper is trying to address.

How to unify complementary knowledge from multiple independently pretrained self-supervised Vision Transformers
How to distill large models into a compact student network without sacrificing representation quality
How to deliver strong performance in resource-constrained deployment scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multiple-teacher self-supervised distillation framework
Asymmetric masking for feature interpolation
Joint consensus gating for token weighting
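The asymmetric masking idea above can be sketched in a few lines: the paper states the student keeps only 25 percent of patches while each teacher gets a progressively lighter mask. The specific teacher visibility ratios below are illustrative assumptions, since the paper's exact schedule is not given here:

```python
import numpy as np

def asymmetric_masks(num_patches, student_visible=0.25,
                     teacher_visible=(0.5, 0.65, 0.8), seed=0):
    """Sample visible-patch index sets for asymmetric masking.

    The student sees `student_visible` of the patches (25% per the paper);
    each teacher sees a progressively larger fraction (ratios assumed).
    """
    rng = np.random.default_rng(seed)

    def sample(ratio):
        k = int(num_patches * ratio)
        return np.sort(rng.choice(num_patches, size=k, replace=False))

    return sample(student_visible), [sample(r) for r in teacher_visible]
```

Because every teacher view retains more context than the student's view, the student must infer the features of patches it never sees, which is what forces the interpolation behavior the abstract describes.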
Sriram Mandalika
Hasso Plattner Institute, University of Potsdam
Deep Learning · Computer Vision · Learning Methods · Decision Making
Lalitha V
Department of Electronics and Communication Engineering, Faculty of Engineering and Technology, SRM Institute of Science and Technology, Kattankulathur, Tamil Nadu, 603203, India