🤖 AI Summary
This work identifies a "semantic selection gap" in few-shot semantic segmentation with DINOv3: an oracle shows that some intermediate layers outperform the default last layer, yet existing unsupervised and support-guided layer-selection strategies consistently fall below the last-layer baseline. The authors propose FSSDINO, a training-free method built on a frozen DINOv3 backbone that segments with class-specific prototypes and Gram-matrix refinement, and they use an oracle-guided layer analysis to measure the semantic potential of individual layers. Experiments show that last-layer features alone achieve performance competitive with more complex adaptation approaches across binary, multi-class, and cross-domain few-shot segmentation benchmarks. These findings establish the last layer as a strong baseline and expose the limitations of current feature selection strategies in this setting.
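To make the prototype-plus-Gram idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the summary does not specify how the Gram matrix is used, so the refinement step below (smoothing class scores with a softmax over query patch-to-patch cosine similarities) is an assumption, as are all function names, the `temperature` parameter, and the use of one foreground and one background prototype. Patch features are assumed to be already extracted from a frozen backbone as `(num_patches, dim)` arrays.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def masked_prototype(support_feats, support_mask):
    """Average the support patch features selected by a boolean mask.

    support_feats: (N, D) patch features; support_mask: (N,) bool.
    Assumes the mask selects at least one patch.
    """
    return support_feats[support_mask].mean(axis=0)


def gram_refine(query_feats, scores, temperature=0.1):
    """ASSUMED refinement: propagate scores between similar query patches.

    The Gram matrix of unit-normalized query features holds pairwise
    cosine similarities; a row-wise softmax turns it into mixing weights.
    """
    f = l2_normalize(query_feats)
    gram = f @ f.T                                   # (N, N)
    weights = np.exp(gram / temperature)
    weights /= weights.sum(axis=1, keepdims=True)    # rows sum to 1
    return weights @ scores                          # (N, C)


def segment(query_feats, fg_proto, bg_proto, temperature=0.1):
    """Label each query patch: 1 = foreground, 0 = background."""
    q = l2_normalize(query_feats)
    protos = l2_normalize(np.stack([bg_proto, fg_proto]))  # (2, D)
    scores = q @ protos.T                                  # cosine scores
    scores = gram_refine(query_feats, scores, temperature)
    return scores.argmax(axis=1)
```

In this sketch no parameter is trained; the only inputs are frozen features and the support mask, which is what "training-free" refers to here.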
📝 Abstract
Recent self-supervised Vision Transformers (ViTs), such as DINOv3, provide rich feature representations for dense vision tasks. This study investigates the intrinsic few-shot semantic segmentation (FSS) capabilities of frozen DINOv3 features through a training-free baseline, FSSDINO, utilizing class-specific prototypes and Gram-matrix refinement. Our results across binary, multi-class, and cross-domain (CDFSS) benchmarks demonstrate that this minimal approach, applied to the final backbone layer, is highly competitive with specialized methods involving complex decoders or test-time adaptation. Crucially, we conduct an Oracle-guided layer analysis, identifying a significant performance gap between the standard last-layer features and globally optimal intermediate representations. We reveal a "Safest vs. Optimal" dilemma: while the Oracle proves higher performance is attainable, matching the results of compute-intensive adaptation methods, current unsupervised and support-guided selection metrics consistently yield lower performance than the last-layer baseline. This characterizes a "Semantic Selection Gap" in Foundation Models, a disconnect where traditional heuristics fail to reliably identify high-fidelity features. Our work establishes the "Last-Layer" as a deceptively strong baseline and provides a rigorous diagnostic of the latent semantic potentials in DINOv3. The code is publicly available at https://github.com/hussni0997/fssdino.
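The "Safest vs. Optimal" comparison can be illustrated with a short sketch, given here under stated assumptions rather than as the paper's evaluation code. It assumes that segmentation quality is measured by mean IoU over episodes, that predictions are available for every backbone layer, and that the oracle picks the single layer with the best mean score using ground-truth masks. The function names and the returned dictionary keys are hypothetical.

```python
import numpy as np


def iou(pred, gt):
    """Intersection over union of two boolean masks (1.0 if both empty)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 1.0


def layer_miou(preds_per_layer, gts):
    """Mean IoU per layer.

    preds_per_layer: list over layers; each entry is a list of boolean
    masks, one per episode, aligned with the ground-truth list `gts`.
    """
    return np.array([
        np.mean([iou(p, g) for p, g in zip(preds, gts)])
        for preds in preds_per_layer
    ])


def oracle_vs_last(preds_per_layer, gts):
    """Compare the oracle-selected layer with the last layer.

    The oracle uses ground truth to choose the layer, so it is a
    diagnostic upper bound and not a usable selection rule.
    """
    scores = layer_miou(preds_per_layer, gts)
    best = int(scores.argmax())
    return {
        "oracle_layer": best,
        "oracle_miou": float(scores[best]),
        "last_miou": float(scores[-1]),
        "gap": float(scores[best] - scores[-1]),
    }
```

A practical selection metric would have to choose a layer without access to `gts`; the abstract reports that current unsupervised and support-guided metrics end up below `last_miou` rather than closing the `gap`.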