🤖 AI Summary
Distilling a vision foundation model typically requires access to the teacher’s original training data and substantial compute, which raises both data and hardware barriers. Method: This paper proposes Proteus, a data-efficient distillation framework that operates solely on ImageNet-1K (1.2M images) and does not require the teacher’s training data. It removes the components of conventional knowledge distillation that introduce dataset bias and trains a smaller ViT student with objectives at three levels (token, patch, and feature) to maximize knowledge transfer. Results: With DINOv2-g/14 as the teacher, Proteus-L/14 matches DINOv2-L/14 across 19 downstream benchmarks, even though DINOv2 was trained on 142M images, and outperforms models trained on far larger datasets, such as CLIP-L/14 (400M). Training cost drops to ImageNet scale, which makes training foundation models more accessible.
📝 Abstract
Vision foundation models are renowned for their generalization ability, which stems from massive training data. Nevertheless, they demand tremendous training resources, and their training data is often inaccessible (e.g., for CLIP and DINOv2), which poses great challenges to developing derivatives that could facilitate research. In this work, we offer a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data. Specifically, we remove the designs from conventional knowledge distillation settings that result in dataset bias and present three levels of training objectives, i.e., token, patch, and feature, to maximize the efficacy of knowledge transfer. In this manner, Proteus is trained at ImageNet-level costs with surprising ability, making the training of foundation models more accessible to the broader research community. When leveraging DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 (142M training data) across 19 benchmarks and outperforms other vision foundation models, including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B), and SynCLR-L/14 (600M), with a significantly smaller training set of 1.2M images.
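The three levels of training objectives named in the abstract can be illustrated with a minimal NumPy sketch. Everything beyond the three level names is an assumption made for illustration: the abstract does not define the losses, so the sketch takes "token" to mean the class token, "patch" to mean the patch tokens, and "feature" to mean the full token sequence; it uses mean squared error as the distance; and it uses one hypothetical linear projection head per level to map student width to teacher width. The function name `three_level_loss`, the `heads` dictionary, and the equal loss weights are likewise hypothetical.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))


def three_level_loss(student, teacher, heads, weights=(1.0, 1.0, 1.0)):
    """Sum of token-, patch-, and feature-level alignment losses.

    student: (batch, 1 + num_patches, student_dim) ViT outputs
    teacher: (batch, 1 + num_patches, teacher_dim) ViT outputs
    heads:   dict of (student_dim, teacher_dim) projection matrices,
             one per level (hypothetical; not taken from the paper)
    Index 0 along axis 1 is assumed to be the class token.
    """
    w_token, w_patch, w_feature = weights
    parts = {
        # Assumed token level: align the class token only.
        "token": mse(student[:, 0] @ heads["token"], teacher[:, 0]),
        # Assumed patch level: align the patch tokens only.
        "patch": mse(student[:, 1:] @ heads["patch"], teacher[:, 1:]),
        # Assumed feature level: align the whole token sequence.
        "feature": mse(student @ heads["feature"], teacher),
    }
    total = (w_token * parts["token"]
             + w_patch * parts["patch"]
             + w_feature * parts["feature"])
    return total, parts


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, num_patches, student_dim, teacher_dim = 2, 4, 8, 16
    student = rng.normal(size=(batch, 1 + num_patches, student_dim))
    teacher = rng.normal(size=(batch, 1 + num_patches, teacher_dim))
    heads = {name: rng.normal(size=(student_dim, teacher_dim))
             for name in ("token", "patch", "feature")}
    total, parts = three_level_loss(student, teacher, heads)
    print(total, parts)
```

In this sketch no ground-truth labels enter the loss: the student is supervised only by the teacher's outputs on the distillation images. In an actual training loop the projection heads would be learned jointly with the student, and the teacher would stay frozen.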