Accessing Vision Foundation Models via ImageNet-1K

📅 2024-07-15
📈 Citations: 1
Influential: 1
🤖 AI Summary
Vision foundation model distillation typically requires access to the teacher’s original training data and substantial computational resources, posing high data and hardware barriers. Method: This paper proposes Proteus, a data-efficient distillation framework that operates solely on ImageNet-1K (1.2M images) without requiring the teacher’s training data. It removes the conventional distillation designs that introduce dataset bias and adds a three-level knowledge transfer objective (token, patch, and feature level), integrated with multi-granularity feature alignment, self-supervised representation transfer, and a lightweight ViT architecture. Results: With DINOv2-g/14 as the teacher, Proteus-L/14 matches DINOv2-L/14’s performance across 19 downstream benchmarks, even though DINOv2 was trained on 142M images, and outperforms models trained on far larger datasets, such as CLIP-L/14 (400M). Training cost is reduced to ImageNet scale, which lowers the barrier to training and using foundation-model derivatives.

📝 Abstract
Vision foundation models are renowned for their generalization ability, which stems from massive training data. Nevertheless, they demand tremendous training resources, and the training data is often inaccessible (e.g., CLIP, DINOv2), posing great challenges to developing derivatives that could facilitate research. In this work, we offer a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data. Specifically, we remove the designs from conventional knowledge distillation settings that result in dataset bias and present three levels of training objectives, i.e., token, patch, and feature, to maximize the efficacy of knowledge transfer. In this manner, Proteus is trained at ImageNet-level costs with surprising ability, making the training of foundation models more accessible to the broader research community. When leveraging DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 (142M training data) across 19 benchmarks and outperforms other vision foundation models, including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B), and SynCLR-L/14 (600M), with a significantly smaller training set of 1.2M images.
Problem

Research questions and friction points this paper is trying to address.

Distill vision foundation models efficiently
Reduce training resources and data dependency
Enhance model accessibility for broader research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distills foundation models on ImageNet-1K
Utilizes token, patch, feature objectives
Reduces training data to 1.2M images
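
The token, patch, and feature objectives named above can be illustrated with a small framework-free sketch. The source does not spell out how each level is computed, so everything here is an assumption made for illustration: index 0 is treated as the class token, the remaining entries as patch tokens, mean squared error is used as the distance, the three terms are weighted equally, and the student is assumed to already match the teacher's width (no projection heads). The names `mse` and `three_level_loss` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def mse(a: List[Vector], b: List[Vector]) -> float:
    """Mean squared error over two equally shaped lists of vectors."""
    if len(a) != len(b) or any(len(x) != len(y) for x, y in zip(a, b)):
        raise ValueError("student and teacher outputs must have the same shape")
    count = sum(len(va) for va in a)
    if count == 0:
        raise ValueError("cannot compute MSE over empty outputs")
    total = sum((x - y) ** 2 for va, vb in zip(a, b) for x, y in zip(va, vb))
    return total / count


def three_level_loss(
    student: List[Vector],
    teacher: List[Vector],
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of token-, patch-, and feature-level terms.

    Assumed layout (not confirmed by the source): index 0 holds the class
    token and indices 1.. hold the patch tokens.
    """
    if len(student) < 2:
        raise ValueError("need one class token and at least one patch token")
    token_term = mse(student[:1], teacher[:1])  # class token only
    patch_term = mse(student[1:], teacher[1:])  # patch tokens only
    feature_term = mse(student, teacher)        # all tokens together
    w_token, w_patch, w_feature = weights
    return w_token * token_term + w_patch * patch_term + w_feature * feature_term


# Toy outputs: one class token followed by two patch tokens, width 2.
teacher_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_out = [[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
loss = three_level_loss(student_out, teacher_out)
```

In this toy case only the class token differs, so the token-level and feature-level terms are non-zero while the patch-level term is zero. In a real training loop the same three terms would be computed on batched tensors from a frozen teacher and a trainable student.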
Authors
Yitian Zhang
Northeastern University
computer vision
Xu Ma
Department of Electrical and Computer Engineering, Northeastern University
Yue Bai
Northwestern University, Northeastern University
Multi-modal learning, Sparse network training, Mask learning
Huan Wang
Department of Electrical and Computer Engineering, Northeastern University
Yun Fu
Khoury College of Computer Science, Northeastern University