🤖 AI Summary
This work targets the complexity of task-specific fine-tuning in few-shot learning by proposing a non-parametric, backpropagation-free approach that classifies directly on frozen DINOv2-L general-purpose representations. Starting from the hypothesis that sufficiently powerful generic embeddings make intricate fine-tuning unnecessary, the method improves feature quality through layer selection and manifold refinement (PCA combined with ICA) and uses a k-nearest neighbors classifier for efficient inference. Evaluated on four established few-shot benchmarks, the approach surpasses existing meta-learning methods, achieving state-of-the-art performance and providing the first systematic validation of the potential of high-quality frozen representations in few-shot scenarios.
📝 Abstract
The field of deep visual recognition is undergoing a paradigm shift toward universal representations. The Platonic Representation Hypothesis suggests that diverse architectures trained on massive datasets are converging toward a shared, "ideal" latent space. This again raises a critical question: is a "good embedding all you need"? In this paper, we leverage this convergence to show that off-the-shelf embeddings are inherently "good enough" for complex tasks, rendering intensive task-specific fine-tuning unnecessary. We explore this hypothesis within the few-shot learning framework, proposing a straightforward, non-parametric pipeline that entirely bypasses backpropagation. Using a k-Nearest Neighbor classifier on frozen DINOv2-L features, we conduct a layer-wise characterization to identify the optimal layer for feature extraction. We further show that manifold refinement via PCA and ICA provides a beneficial regularizing effect. Across four major benchmarks, our approach consistently surpasses sophisticated meta-learning algorithms, achieving state-of-the-art performance.
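To make the shape of such a pipeline concrete, the following is a minimal sketch of one few-shot episode, not the authors' implementation. It assumes features have already been extracted from a frozen backbone; here they are replaced by synthetic Gaussian clusters. The function names, the choice to fit PCA on the support set, the cosine-similarity metric, and all hyperparameters (`n_components`, `k`, episode sizes) are illustrative assumptions. The ICA step and the layer-wise search are left out to keep the sketch short.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def fit_pca(x, n_components):
    """Fit PCA on x of shape (n_samples, dim); return the mean and projection."""
    mean = x.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components].T  # projection has shape (dim, n_components)


def knn_predict(support_x, support_y, query_x, k=1):
    """Cosine-similarity k-NN with majority voting over the k nearest supports."""
    sims = l2_normalize(query_x) @ l2_normalize(support_x).T
    nearest = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar
    votes = support_y[nearest]                  # shape (n_query, k)
    n_classes = int(support_y.max()) + 1
    return np.array([np.bincount(v, minlength=n_classes).argmax() for v in votes])


def run_episode(n_way=5, n_shot=5, n_query=15, dim=64, n_components=8, seed=0):
    """Run one synthetic N-way K-shot episode and return (accuracy, predictions)."""
    rng = np.random.default_rng(seed)
    # Stand-in for frozen backbone features: one Gaussian cluster per class.
    centers = 5.0 * rng.standard_normal((n_way, dim))
    support_y = np.repeat(np.arange(n_way), n_shot)
    query_y = np.repeat(np.arange(n_way), n_query)
    support_x = centers[support_y] + rng.standard_normal((support_y.size, dim))
    query_x = centers[query_y] + rng.standard_normal((query_y.size, dim))

    # Refinement step (assumption): PCA fitted on the support set only.
    mean, proj = fit_pca(support_x, n_components)
    support_z = (support_x - mean) @ proj
    query_z = (query_x - mean) @ proj

    preds = knn_predict(support_z, support_y, query_z, k=1)
    return float((preds == query_y).mean()), preds


if __name__ == "__main__":
    accuracy, _ = run_episode()
    print(f"episode accuracy: {accuracy:.3f}")
```

Nothing in this sketch involves gradients or trainable parameters: the only fitted quantities are the support-set mean and the PCA projection, which is what makes the approach non-parametric and backpropagation-free in the sense described above. In a real setting, `support_x` and `query_x` would be replaced by features taken from the chosen layer of the frozen model.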