🤖 AI Summary
To address the unclear knowledge-transfer dynamics, low data efficiency, and high computational cost of multi-teacher distillation in vision foundation model training, this paper proposes an Agglomerative Mixture-of-Experts (AMoE) architecture that jointly distills complementary knowledge from SigLIP2 and DINOv3. We introduce an Asymmetric Relation-Knowledge Distillation loss to preserve each teacher's geometric structure, and design a token-balanced batching mechanism coupled with hierarchical clustering-based sampling to improve training stability and sample efficiency. Trained on the newly curated OpenLVD200M dataset of 200 million images, the proposed model achieves significant gains in cross-resolution representation capability and data utilization efficiency. All code, pretrained models, and the OpenLVD200M dataset are publicly released.
📝 Abstract
Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce Agglomerative Mixture-of-Experts Vision Foundation Models (AMoE), which distill knowledge from SigLIP2 and DINOv3 simultaneously into a Mixture-of-Experts student. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching, which packs varying-resolution images into sequences with uniform token budgets, stabilizes representation learning across resolutions without sacrificing performance, and (3) hierarchical clustering and sampling of training data, a strategy typically reserved for self-supervised learning, substantially improves sample efficiency over random sampling for multi-teacher distillation. Combining these findings, we curate OpenLVD200M, a 200M-image corpus that enables markedly more efficient multi-teacher distillation. We release OpenLVD200M and the distilled models.
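To make the relational-distillation idea concrete, here is a minimal sketch of what a relation-knowledge distillation term with one loss per teacher could look like. This is a hypothetical illustration, not the paper's actual loss: the exact relation function, asymmetry, and per-teacher heads are not specified in the abstract, so a simple pairwise cosine-similarity match with a stop-gradient on the teacher side is assumed.

```python
import torch
import torch.nn.functional as F

def relational_kd_loss(student_feats, teacher_feats):
    """Match the student's pairwise similarity structure to the teacher's.

    Hypothetical sketch: uses cosine-similarity relation matrices and an
    MSE match; the paper's exact formulation may differ.
    student_feats, teacher_feats: (N, D) batches of embeddings.
    """
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    sim_s = s @ s.T               # (N, N) student relation matrix
    sim_t = (t @ t.T).detach()    # teacher relations, no gradient flows back
    return F.mse_loss(sim_s, sim_t)

torch.manual_seed(0)
student = torch.randn(8, 64, requires_grad=True)   # shared student features
siglip_t = torch.randn(8, 64)                      # stand-in SigLIP2 features
dino_t = torch.randn(8, 64)                        # stand-in DINOv3 features

# One relational term per teacher, so each teacher's geometry is matched
# separately rather than through a single averaged target.
loss = relational_kd_loss(student, siglip_t) + relational_kd_loss(student, dino_t)
loss.backward()
```

Keeping one term per teacher is one plausible way to preserve each teacher's geometric properties independently, which is the stated goal of the loss.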
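The token-balanced batching described in finding (2) can be sketched as a packing problem: each image contributes a number of patch tokens determined by its resolution, and images are grouped so every sequence stays within a uniform token budget. The greedy first-fit scheme, patch size, and budget below are assumptions for illustration; the paper's actual batching algorithm is not given in the abstract.

```python
def tokens_for(h, w, patch=16):
    """Number of ViT patch tokens for an h x w image (assumed 16px patches)."""
    return (h // patch) * (w // patch)

def token_balanced_batches(image_sizes, budget=4096, patch=16):
    """Greedy first-fit packing sketch (hypothetical): place each image into
    the first sequence whose total token count stays within `budget`,
    opening a new sequence when none fits."""
    batches, loads = [], []
    for idx, (h, w) in enumerate(image_sizes):
        n = tokens_for(h, w, patch)
        for b, load in enumerate(loads):
            if load + n <= budget:
                batches[b].append(idx)
                loads[b] += n
                break
        else:
            batches.append([idx])
            loads.append(n)
    return batches, loads

# Mixed resolutions: 224x224 -> 196 tokens, 448x448 -> 784, 224x448 -> 392,
# 896x896 -> 3136. The last image does not fit with the rest under 4096 tokens.
batches, loads = token_balanced_batches([(224, 224), (448, 448), (224, 448), (896, 896)])
```

Packing this way keeps the per-sequence compute roughly constant across batches even as image resolutions vary, which is the stability property the abstract attributes to token-balanced batching.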