CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing self-supervised pretraining paradigms—such as contrastive learning and masked modeling—suffer from isolated representation learning, model bloat, and poor deployability in resource-constrained settings. To address these issues, this paper proposes CoMAD, a lightweight, parameter-free self-supervised knowledge distillation framework. Its core contributions are: (1) a multi-teacher ViT collaborative guidance mechanism that fuses teacher features via asymmetric masking and joint consensus gating, where gating weights are dynamically computed from cosine similarity and inter-teacher consistency; (2) a dual-level KL divergence loss enforcing distributional alignment at both token-level and global representation levels; and (3) linear adapters to harmonize heterogeneous teacher feature spaces. On ImageNet-1K, the ViT-Tiny student achieves 75.4% top-1 accuracy. CoMAD also sets new state-of-the-art results for compact self-supervised models on ADE20K semantic segmentation and MS-COCO object detection and instance segmentation.
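The summary above names a dual-level KL divergence loss with token-level and global alignment, but does not spell out the formula. As a minimal NumPy sketch of one plausible form — the mean-pooling for the global term and the weight `lam` are assumptions, not details from the paper:

```python
import numpy as np

def kl_div(p_logits, q_logits, axis=-1):
    """KL(softmax(p) || softmax(q)), via log-softmax for numerical stability."""
    def log_softmax(x):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))
    log_p, log_q = log_softmax(p_logits), log_softmax(q_logits)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=axis)

def dual_level_kl(teacher_tokens, student_tokens, lam=1.0):
    """Dual-level distillation loss sketch: teacher/student token features of
    shape (num_tokens, dim). `lam` balances the two levels (assumed)."""
    # Token level: per-token distributional alignment over the feature dim.
    token_loss = kl_div(teacher_tokens, student_tokens).mean()
    # Global level: align pooled representations (mean pooling is assumed).
    global_loss = kl_div(teacher_tokens.mean(axis=0), student_tokens.mean(axis=0))
    return token_loss + lam * global_loss
```

Both terms vanish when student and teacher distributions coincide, so the loss drives the student toward the teachers' token-wise and pooled statistics simultaneously.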

📝 Abstract
Numerous self-supervised learning paradigms, such as contrastive learning and masked image modeling, learn powerful representations from unlabeled data but are typically pretrained in isolation, overlooking complementary insights and yielding large models that are impractical for resource-constrained deployment. To overcome these challenges, we introduce Consensus-oriented Masked Distillation (CoMAD), a lightweight, parameter-free framework that unifies knowledge from multiple current state-of-the-art self-supervised Vision Transformers into a compact student network. CoMAD distills from three pretrained ViT-Base teachers, MAE, MoCo v3, and iBOT, each offering distinct semantic and contextual priors. Rather than naively averaging teacher outputs, we apply asymmetric masking: the student sees only 25 percent of patches while each teacher receives a progressively lighter, unique mask, forcing the student to interpolate missing features under richer contexts. Teacher embeddings are aligned to the student's space via a linear adapter and layer normalization, then fused through our joint consensus gating, which weights each token by combining cosine affinity with inter-teacher agreement. The student is trained with dual-level KL divergence on visible tokens and reconstructed feature maps, capturing both local and global structure. On ImageNet-1K, CoMAD's ViT-Tiny achieves 75.4 percent Top-1, a 0.4-point gain over the previous state of the art. In dense-prediction transfers, it attains 47.3 percent mIoU on ADE20K, and 44.5 percent box average precision and 40.5 percent mask average precision on MS-COCO, establishing a new state-of-the-art in compact SSL distillation.
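The joint consensus gating described in the abstract weights each token by combining cosine affinity with inter-teacher agreement. A minimal NumPy sketch of one way this could work — the additive combination of the two scores and the softmax over teachers are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def cosine(a, b):
    """Per-token cosine similarity between two (num_tokens, dim) arrays."""
    return (a * b).sum(-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8)

def consensus_gate(student, teachers):
    """Fuse adapted teacher token features for a student.

    student:  (N, D) student token features.
    teachers: list of (N, D) teacher features, already mapped to the
              student's space (e.g. by a linear adapter + layer norm).
    """
    # Cosine affinity of each teacher's tokens to the student's tokens.
    affinity = np.stack([cosine(student, t) for t in teachers])   # (T, N)
    # Inter-teacher agreement: mean pairwise cosine per token.
    T = len(teachers)
    agreement = np.stack([
        np.mean([cosine(teachers[i], teachers[j])
                 for j in range(T) if j != i], axis=0)
        for i in range(T)])                                        # (T, N)
    # Combine scores and normalize over teachers (additive + softmax assumed).
    logits = affinity + agreement
    weights = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    fused = sum(w[:, None] * t for w, t in zip(weights, teachers)) # (N, D)
    return fused, weights
```

Tokens where teachers both agree with each other and align with the student receive higher gating weight, so noisy or contradictory teacher signals are down-weighted per token.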
Problem

Research questions and friction points this paper is trying to address.

How to unify complementary knowledge from multiple independently pretrained self-supervised Vision Transformers
How to distill large models into a compact student network without sacrificing representation quality
How to deliver strong performance in resource-constrained deployment scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multiple-teacher self-supervised distillation framework
Asymmetric masking for feature interpolation
Joint consensus gating for token weighting
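The asymmetric masking idea above can be sketched in a few lines: the paper states the student keeps only 25 percent of patches while each teacher gets a progressively lighter mask. The specific teacher visibility ratios below are illustrative assumptions, since the paper's exact schedule is not given here:

```python
import numpy as np

def asymmetric_masks(num_patches, student_visible=0.25,
                     teacher_visible=(0.5, 0.65, 0.8), seed=0):
    """Sample visible-patch index sets for asymmetric masking.

    The student sees `student_visible` of the patches (25% per the paper);
    each teacher sees a progressively larger fraction (ratios assumed).
    """
    rng = np.random.default_rng(seed)

    def sample(ratio):
        k = int(num_patches * ratio)
        return np.sort(rng.choice(num_patches, size=k, replace=False))

    return sample(student_visible), [sample(r) for r in teacher_visible]
```

Because every teacher view retains more context than the student's view, the student must infer the features of patches it never sees, which is what forces the interpolation behavior the abstract describes.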
Sriram Mandalika
Hasso Plattner Institute, University of Potsdam
Deep Learning · Computer Vision · Learning Methods · Decision Making
Lalitha V
Department of Electronics and Communication Engineering, Faculty of Engineering and Technology, SRM Institute of Science and Technology, Kattankulathur, Tamil Nadu, 603203, India