🤖 AI Summary
Current vision foundation models (VFMs) rely heavily on large-scale labeled datasets, which hinders their adoption by resource-constrained institutions. Although domain-specific pre-trained models encode transferable generic visual knowledge, their potential for collaboratively building general-purpose VFMs remains largely unexplored. This paper proposes a model-driven VFM training paradigm: first, multiple teacher models are aligned in a shared latent space to mitigate the imbalanced knowledge transfer induced by their distributional gaps; second, lightweight adapter modules enable cross-domain knowledge fusion while preserving generic representations during knowledge distillation and multi-task training. Experiments show that the approach consistently outperforms mainstream data-driven baselines on four core vision tasks (image classification, object detection, semantic segmentation, and instance segmentation), with significant gains in generalization and multi-task adaptability.
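The summary does not spell out how the teachers are aligned. A minimal sketch of one plausible reading, using hypothetical feature dimensions, randomly initialized (in practice learnable) projections, and NumPy in place of a deep-learning framework: each teacher's features are projected into a shared latent space and L2-normalized so that no teacher's feature scale dominates the transfer, and the student is trained to match each teacher there.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions: two domain-specific teachers and a student.
D_TEACHER = {"seg_teacher": 768, "det_teacher": 1024}
D_STUDENT, D_SHARED = 512, 256

# Projections into the shared latent space (learnable in practice,
# randomly initialized here for illustration).
proj_t = {name: rng.normal(0, 0.02, (d, D_SHARED)) for name, d in D_TEACHER.items()}
proj_s = rng.normal(0, 0.02, (D_STUDENT, D_SHARED))

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def alignment_losses(student_feat, teacher_feats):
    """Mean-squared error between the student's shared-space embedding and each
    teacher's, after L2 normalization so that no single teacher's feature
    magnitude dominates (one way to counter 'imbalanced' transfer)."""
    z_s = l2_normalize(student_feat @ proj_s)
    losses = {}
    for name, f_t in teacher_feats.items():
        z_t = l2_normalize(f_t @ proj_t[name])
        losses[name] = float(np.mean((z_s - z_t) ** 2))
    return losses

batch = 4
student = rng.normal(size=(batch, D_STUDENT))
teachers = {n: rng.normal(size=(batch, d)) for n, d in D_TEACHER.items()}
per_teacher = alignment_losses(student, teachers)
total_loss = sum(per_teacher.values())
```

In a real implementation the per-teacher losses would drive gradient updates of the student and the projection heads; the normalization step is an assumption about how the distributional gaps are equalized, not a detail taken from the paper.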
📝 Abstract
Vision foundation models (VFMs) are predominantly developed with data-centric methods, which require training on vast amounts of data, usually with high-quality labels; this poses a bottleneck for most institutions that lack both large-scale data and high-end GPUs. On the other hand, many open-source vision models have been pre-trained on domain-specific data, enabling them to distill and represent core knowledge in a form that is transferable across diverse applications. Although these models are highly valuable assets, their potential for empowering the development of a general-purpose VFM remains largely under-explored. In this paper, we present a new model-driven approach to training VFMs through joint knowledge transfer and preservation. Our method unifies multiple pre-trained teacher models in a shared latent space to mitigate the "imbalanced transfer" issue caused by their distributional gaps. In addition, we introduce a knowledge preservation strategy that treats a general-purpose teacher as a knowledge base and integrates knowledge from the remaining purpose-specific teachers via an adapter module. By unifying and aggregating existing models, we build a powerful VFM that inherits the teachers' expertise without training on large amounts of labeled data. The resulting model not only provides generalizable visual features but also inherently supports multiple downstream tasks. Extensive experiments show that our VFM outperforms existing data-centric models on four fundamental vision tasks: image classification, object detection, semantic segmentation, and instance segmentation.
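The abstract does not describe the adapter's architecture. A common design that fits the stated goal (integrating purpose-specific knowledge while preserving the general-purpose teacher's representations) is a residual bottleneck adapter; the sketch below is an assumption in that spirit, not the paper's module. Zero-initializing the up-projection makes the adapter an exact identity map at the start of training, so the base features pass through unchanged until the adapter learns to inject new knowledge.

```python
import numpy as np

rng = np.random.default_rng(0)

class BottleneckAdapter:
    """Hypothetical lightweight adapter: down-projection, nonlinearity,
    up-projection, plus a residual connection around the whole block."""
    def __init__(self, dim, bottleneck=64):
        self.w_down = rng.normal(0, 0.02, (dim, bottleneck))
        self.w_up = np.zeros((bottleneck, dim))  # zero init -> identity at start

    def __call__(self, x):
        h = np.maximum(x @ self.w_down, 0.0)  # ReLU bottleneck
        return x + h @ self.w_up              # residual preserves base features

dim = 512
adapter = BottleneckAdapter(dim)
x = rng.normal(size=(4, dim))   # features from the frozen general-purpose teacher
y = adapter(x)                  # same shape; identical to x at initialization
```

Because the residual path carries the original features untouched, the frozen knowledge base degrades only as much as the (small) learned update demands, which is the usual rationale for adapters in knowledge-preservation settings.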