🤖 AI Summary
Current vision foundation models (VFMs) rely heavily on large-scale labeled datasets, which hinders their adoption by resource-constrained institutions. Although domain-specific pre-trained models encode transferable generic visual knowledge, their potential for collaboratively building general-purpose VFMs remains largely unexplored. This paper proposes a model-driven VFM training paradigm: first, multiple teacher models are aligned in a shared latent space to mitigate the imbalanced knowledge transfer induced by their distributional gaps; second, lightweight adapter modules enable cross-domain knowledge fusion while preserving generic representations during knowledge distillation and multi-task training. Experiments show that the approach consistently outperforms mainstream data-driven baselines on four core vision tasks (image classification, object detection, semantic segmentation, and instance segmentation), with significant gains in generalization and multi-task adaptability.
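The summary does not spell out how the teachers are aligned. A minimal sketch of one plausible reading, using hypothetical feature dimensions, randomly initialized (in practice learnable) projections, and NumPy in place of a deep-learning framework: each teacher's features are projected into a shared latent space and L2-normalized so that no teacher's feature scale dominates the transfer, and the student is trained to match each teacher there.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions: two domain-specific teachers and a student.
D_TEACHER = {"seg_teacher": 768, "det_teacher": 1024}
D_STUDENT, D_SHARED = 512, 256

# Projections into the shared latent space (learnable in practice,
# randomly initialized here for illustration).
proj_t = {name: rng.normal(0, 0.02, (d, D_SHARED)) for name, d in D_TEACHER.items()}
proj_s = rng.normal(0, 0.02, (D_STUDENT, D_SHARED))

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def alignment_losses(student_feat, teacher_feats):
    """Mean-squared error between the student's shared-space embedding and each
    teacher's, after L2 normalization so that no single teacher's feature
    magnitude dominates (one way to counter 'imbalanced' transfer)."""
    z_s = l2_normalize(student_feat @ proj_s)
    losses = {}
    for name, f_t in teacher_feats.items():
        z_t = l2_normalize(f_t @ proj_t[name])
        losses[name] = float(np.mean((z_s - z_t) ** 2))
    return losses

batch = 4
student = rng.normal(size=(batch, D_STUDENT))
teachers = {n: rng.normal(size=(batch, d)) for n, d in D_TEACHER.items()}
per_teacher = alignment_losses(student, teachers)
total_loss = sum(per_teacher.values())
```

In a real implementation the per-teacher losses would drive gradient updates of the student and the projection heads; the normalization step is an assumption about how the distributional gaps are equalized, not a detail taken from the paper.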
📝 Abstract
Vision foundation models (VFMs) are predominantly developed with data-centric methods, which require training on vast amounts of data, usually with high-quality labels; this poses a bottleneck for most institutions that lack both large-scale data and high-end GPUs. On the other hand, many open-source vision models have been pre-trained on domain-specific data, enabling them to distill and represent core knowledge in a form that is transferable across diverse applications. Although these models are highly valuable assets, their potential for empowering the development of a general-purpose VFM remains largely under-explored. In this paper, we present a new model-driven approach to training VFMs through joint knowledge transfer and preservation. Our method unifies multiple pre-trained teacher models in a shared latent space to mitigate the "imbalanced transfer" issue caused by their distributional gaps. In addition, we introduce a knowledge preservation strategy that treats a general-purpose teacher as a knowledge base and integrates knowledge from the remaining purpose-specific teachers via an adapter module. By unifying and aggregating existing models, we build a powerful VFM that inherits the teachers' expertise without training on large amounts of labeled data. The resulting model not only provides generalizable visual features but also inherently supports multiple downstream tasks. Extensive experiments show that our VFM outperforms existing data-centric models on four fundamental vision tasks: image classification, object detection, semantic segmentation, and instance segmentation.
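The abstract does not describe the adapter's architecture. A common design that fits the stated goal (integrating purpose-specific knowledge while preserving the general-purpose teacher's representations) is a residual bottleneck adapter; the sketch below is an assumption in that spirit, not the paper's module. Zero-initializing the up-projection makes the adapter an exact identity map at the start of training, so the base features pass through unchanged until the adapter learns to inject new knowledge.

```python
import numpy as np

rng = np.random.default_rng(0)

class BottleneckAdapter:
    """Hypothetical lightweight adapter: down-projection, nonlinearity,
    up-projection, plus a residual connection around the whole block."""
    def __init__(self, dim, bottleneck=64):
        self.w_down = rng.normal(0, 0.02, (dim, bottleneck))
        self.w_up = np.zeros((bottleneck, dim))  # zero init -> identity at start

    def __call__(self, x):
        h = np.maximum(x @ self.w_down, 0.0)  # ReLU bottleneck
        return x + h @ self.w_up              # residual preserves base features

dim = 512
adapter = BottleneckAdapter(dim)
x = rng.normal(size=(4, dim))   # features from the frozen general-purpose teacher
y = adapter(x)                  # same shape; identical to x at initialization
```

Because the residual path carries the original features untouched, the frozen knowledge base degrades only as much as the (small) learned update demands, which is the usual rationale for adapters in knowledge-preservation settings.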