🤖 AI Summary
To address the underexploited potential of intermediate-layer representations in masked image modeling (MIM) pre-trained models, this paper proposes MIM-Refiner, a short contrastive refinement stage that exploits the strong semantic features residing in intermediate layers of MIM pre-trained Vision Transformers (ViTs). It attaches multiple contrastive heads to different intermediate layers and trains with a modified nearest-neighbor objective that constructs semantic clusters. The method requires no additional annotations and is compatible with diverse MIM models (e.g., a data2vec 2.0 ViT-H). After only a few epochs of refinement, the ViT-H reaches 84.7% on ImageNet-1K linear probing, a new state of the art among models pre-trained on ImageNet-1K. It also improves performance across downstream tasks, including low-shot classification, long-tailed classification, unsupervised clustering, and semantic segmentation, demonstrating the strength and transferability of intermediate-layer representations.
📝 Abstract
We introduce MIM-Refiner, a contrastive learning boost for pre-trained masked image modeling (MIM) models. MIM-Refiner is motivated by the insight that strong representations within MIM models generally reside in intermediate layers. Accordingly, MIM-Refiner attaches multiple contrastive heads to different intermediate layers. In each head, a modified nearest-neighbor objective constructs semantic clusters that improve performance on downstream tasks, in both off-the-shelf and fine-tuning settings. The refinement process is short and simple, yet highly effective: within a few epochs, we refine the features of MIM models from subpar to state-of-the-art off-the-shelf features. Refining a ViT-H pre-trained with data2vec 2.0 on ImageNet-1K sets a new state of the art in linear probing (84.7%) and low-shot classification among models pre-trained on ImageNet-1K. MIM-Refiner efficiently combines the advantages of MIM and instance discrimination (ID) objectives and compares favorably against previous state-of-the-art SSL models on a variety of benchmarks, including low-shot classification, long-tailed classification, clustering, and semantic segmentation.
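The core mechanism described above can be sketched in PyTorch: a projection head attached to an intermediate ViT layer, trained with a nearest-neighbor swap (in the spirit of NNCLR) before a standard InfoNCE-style loss. All names, the queue-based support set, and hyperparameters are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

class ContrastiveHead(torch.nn.Module):
    """Illustrative contrastive head for one intermediate ViT block.

    One such head would be attached per selected intermediate layer;
    the queue serves as the support set for nearest-neighbor lookup.
    """

    def __init__(self, dim: int, proj_dim: int = 256, queue_size: int = 4096):
        super().__init__()
        self.proj = torch.nn.Sequential(
            torch.nn.Linear(dim, dim),
            torch.nn.GELU(),
            torch.nn.Linear(dim, proj_dim),
        )
        # FIFO queue of past embeddings (randomly initialized here)
        self.register_buffer(
            "queue", F.normalize(torch.randn(queue_size, proj_dim), dim=1)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Project intermediate-layer features and L2-normalize
        return F.normalize(self.proj(feats), dim=1)

def nn_swap_loss(z1, z2, queue, temperature: float = 0.2):
    """Swap each anchor for its nearest neighbor in the queue, then
    contrast that neighbor against the other view's embedding."""
    nn_idx = (z1 @ queue.T).argmax(dim=1)   # nearest neighbor per anchor
    nn1 = queue[nn_idx]                     # (B, proj_dim)
    logits = nn1 @ z2.T / temperature       # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))       # positives on the diagonal
    return F.cross_entropy(logits, labels)
```

In a full refinement loop, losses from all heads would be summed and backpropagated into the pre-trained backbone for a few epochs; the heads are discarded afterwards.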