🤖 AI Summary
To address the underexploited potential of intermediate-layer representations in masked image modeling (MIM) pre-trained models, this paper proposes MIM-Refiner, a short contrastive refinement stage that exploits the strong semantic features residing in intermediate layers of MIM pre-trained Vision Transformers (ViTs). It attaches multiple contrastive heads to different intermediate layers and trains with a modified nearest-neighbor objective that constructs semantic clusters. The method requires no additional annotations and is compatible with diverse MIM models (e.g., a data2vec 2.0 ViT-H). After only a few epochs of refinement, the ViT-H reaches 84.7% on ImageNet-1K linear probing, a new state of the art among models pre-trained on ImageNet-1K. It also improves performance across downstream tasks, including low-shot classification, long-tailed classification, unsupervised clustering, and semantic segmentation, demonstrating the strength and transferability of intermediate-layer representations.
📝 Abstract
We introduce MIM-Refiner, a contrastive learning boost for pre-trained masked image modeling (MIM) models. MIM-Refiner is motivated by the insight that strong representations within MIM models generally reside in intermediate layers. Accordingly, MIM-Refiner attaches multiple contrastive heads to different intermediate layers. In each head, a modified nearest-neighbor objective constructs semantic clusters that improve performance on downstream tasks, in both off-the-shelf and fine-tuning settings. The refinement process is short and simple, yet highly effective: within a few epochs, we refine the features of MIM models from subpar to state-of-the-art off-the-shelf features. Refining a ViT-H pre-trained with data2vec 2.0 on ImageNet-1K sets a new state of the art in linear probing (84.7%) and low-shot classification among models pre-trained on ImageNet-1K. MIM-Refiner efficiently combines the advantages of MIM and instance discrimination (ID) objectives and compares favorably against previous state-of-the-art SSL models on a variety of benchmarks, including low-shot classification, long-tailed classification, clustering, and semantic segmentation.
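The core mechanism described above can be sketched in PyTorch: a projection head attached to an intermediate ViT layer, trained with a nearest-neighbor swap (in the spirit of NNCLR) before a standard InfoNCE-style loss. All names, the queue-based support set, and hyperparameters are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

class ContrastiveHead(torch.nn.Module):
    """Illustrative contrastive head for one intermediate ViT block.

    One such head would be attached per selected intermediate layer;
    the queue serves as the support set for nearest-neighbor lookup.
    """

    def __init__(self, dim: int, proj_dim: int = 256, queue_size: int = 4096):
        super().__init__()
        self.proj = torch.nn.Sequential(
            torch.nn.Linear(dim, dim),
            torch.nn.GELU(),
            torch.nn.Linear(dim, proj_dim),
        )
        # FIFO queue of past embeddings (randomly initialized here)
        self.register_buffer(
            "queue", F.normalize(torch.randn(queue_size, proj_dim), dim=1)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Project intermediate-layer features and L2-normalize
        return F.normalize(self.proj(feats), dim=1)

def nn_swap_loss(z1, z2, queue, temperature: float = 0.2):
    """Swap each anchor for its nearest neighbor in the queue, then
    contrast that neighbor against the other view's embedding."""
    nn_idx = (z1 @ queue.T).argmax(dim=1)   # nearest neighbor per anchor
    nn1 = queue[nn_idx]                     # (B, proj_dim)
    logits = nn1 @ z2.T / temperature       # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))       # positives on the diagonal
    return F.cross_entropy(logits, labels)
```

In a full refinement loop, losses from all heads would be summed and backpropagated into the pre-trained backbone for a few epochs; the heads are discarded afterwards.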