Which Direction to Choose? An Analysis on the Representation Power of Self-Supervised ViTs in Downstream Tasks

📅 2025-09-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the direct representational capacity of raw features—i.e., unmodified ViT token embeddings without added projection layers or lightweight heads—in downstream image classification and semantic segmentation. We systematically evaluate, under both standard and few-shot settings, the performance of distinct token types (key, query, value, FFN) paired with simple decision rules (hyperplane-based classification, cosine similarity). Crucially, we provide the first comprehensive analysis of how diverse self-supervised pre-training objectives—including MAE and DINO—affect the transferability of frozen features to downstream tasks. Experiments on ImageNet-1K and ADE20K demonstrate that carefully selected token combinations coupled with minimal decision mechanisms can match or even surpass conventional fine-tuned baselines equipped with task-specific heads. These findings establish theoretical grounding and practical guidelines for efficient, lightweight reuse of ViT features in downstream vision tasks.
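To make the four token types concrete, here is a minimal NumPy sketch of a single-head attention block. The weights are random stand-ins, not pre-trained ViT parameters, and the dimensions are toy-sized; the point is only to show where the key, query, value, and FFN tokens the paper compares come from.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 8                          # embedding dim, number of patch tokens
X = rng.normal(size=(n, d))           # token embeddings entering the last block

# Random stand-ins for the block's pre-trained projection weights
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

# The four token types compared in the paper:
Q, K, V = X @ Wq, X @ Wk, X @ Wv      # query / key / value tokens

S = Q @ K.T / np.sqrt(d)              # scaled dot-product scores
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)     # row-wise softmax attention
H = A @ V                             # attention output
FFN = np.maximum(H @ W1, 0) @ W2      # tokens after the feed-forward layer

for name, T in [("query", Q), ("key", K), ("value", V), ("ffn", FFN)]:
    print(name, T.shape)
```

In the paper's setting these tokens are taken frozen from the final transformer block and fed directly to a simple decision rule, with no further transformation layers.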

📝 Abstract
Self-Supervised Learning (SSL) for Vision Transformers (ViTs) has recently demonstrated considerable potential as a pre-training strategy for a variety of computer vision tasks, including image classification and segmentation, both in standard and few-shot downstream contexts. Two pre-training objectives dominate the landscape of SSL techniques: Contrastive Learning and Masked Image Modeling. Features (or tokens) extracted from the final transformer attention block -- specifically, the keys, queries, and values -- as well as features obtained after the final block's feed-forward layer, have become a common foundation for addressing downstream tasks. However, in many existing approaches, these pre-trained ViT features are further processed through additional transformation layers, often involving lightweight heads or combined with distillation, to achieve superior task performance. Although such methods can improve task outcomes, to the best of our knowledge, a comprehensive analysis of the intrinsic representation capabilities of unaltered ViT features has yet to be conducted. This study aims to bridge this gap by systematically evaluating the use of these unmodified features across image classification and segmentation tasks, in both standard and few-shot contexts. The classification and segmentation rules that we use are either hyperplane based (as in logistic regression) or cosine-similarity based, both of which rely on the presence of interpretable directions in the ViT's latent space. Based on the previous rules and without the use of additional feature transformations, we conduct an analysis across token types, tasks, and pre-trained ViT models. This study provides insights into the optimal choice for token type and decision rule based on the task, context, and the pre-training objective, while reporting detailed findings on two widely-used datasets.
Problem

Research questions and friction points this paper is trying to address.

Analyzing intrinsic representation power of unaltered self-supervised ViT features
Evaluating token types and decision rules for downstream tasks
Comparing contrastive learning versus masked image modeling objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unaltered ViT features without extra transformations
Hyperplane and cosine-similarity based decision rules
Analysis across token types, tasks, and models
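The two decision rules above can be sketched on frozen features as follows. The features here are synthetic Gaussian stand-ins for frozen ViT tokens of two classes, not outputs of an actual pre-trained model; the hyperplane rule uses scikit-learn's logistic regression, and the cosine rule compares against per-class mean prototypes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
d, n_per_class = 32, 20
# Synthetic stand-ins for frozen ViT token features of two classes
feats = np.vstack([rng.normal(+0.5, 1.0, size=(n_per_class, d)),
                   rng.normal(-0.5, 1.0, size=(n_per_class, d))])
labels = np.array([0] * n_per_class + [1] * n_per_class)

# Rule 1: hyperplane-based (logistic regression on frozen features)
clf = LogisticRegression(max_iter=1000).fit(feats, labels)

# Rule 2: cosine similarity to per-class mean prototypes
protos = np.stack([feats[labels == c].mean(axis=0) for c in (0, 1)])
protos /= np.linalg.norm(protos, axis=1, keepdims=True)
normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
cos_pred = (normed @ protos.T).argmax(axis=1)

print("hyperplane acc:", clf.score(feats, labels))
print("cosine acc:", (cos_pred == labels).mean())
```

Both rules rely only on interpretable directions in the feature space (a separating normal vector, or a class-prototype direction), which is why they serve as probes of the intrinsic representation power of unaltered features.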
Yannis Kaltampanidis
Information Technologies Institute, Centre for Research and Technology Hellas, 1st Km Charilaou - Thermi Road, Thessaloniki, Greece
Alexandros Doumanoglou
Visual Computing Lab, Information Technologies Institute, Centre For Research and Technology HELLAS
Computer Vision · Machine Learning · Computer Graphics · Explainable AI
Dimitrios Zarpalas
Information Technologies Institute, Centre for Research and Technology Hellas, 1st Km Charilaou - Thermi Road, Thessaloniki, Greece