AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the unclear knowledge-transfer dynamics, low data efficiency, and high computational cost of multi-teacher distillation for vision foundation models, this paper proposes AMoE, an Agglomerative Mixture-of-Experts architecture that jointly distills complementary knowledge from SigLIP2 and DINOv3. We introduce an asymmetric relational knowledge distillation loss to preserve each teacher's feature geometry, and design a token-balanced batching mechanism coupled with hierarchical clustering-based sampling to improve training stability and sample efficiency. Trained on the newly constructed OpenLVD200M dataset of 200 million images, the model achieves significant gains in cross-resolution representation capability and data-utilization efficiency. All code, pretrained models, and the OpenLVD200M dataset are publicly released.

📝 Abstract
Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce Agglomerative Mixture-of-Experts Vision Foundation Models (AMoE), which distill knowledge from SigLIP2 and DINOv3 simultaneously into a Mixture-of-Experts student. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching, which packs varying-resolution images into sequences with uniform token budgets, stabilizes representation learning across resolutions without sacrificing performance, and (3) hierarchical clustering and sampling of training data, a technique typically reserved for self-supervised learning, substantially improves sample efficiency over random sampling for multi-teacher distillation. By combining these findings, we curate OpenLVD200M, a 200M-image corpus that demonstrates superior efficiency for multi-teacher distillation. We release OpenLVD200M and the distilled Mixture-of-Experts models.
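The Asymmetric Relation-Knowledge Distillation loss is only named in the abstract, not specified. As a minimal, hypothetical sketch of relation-based distillation in PyTorch, the snippet below matches the student's pairwise-similarity structure to that of a frozen teacher; the paper's exact asymmetric formulation may differ, and the function names and two-teacher combination are assumptions.

```python
# Hypothetical relation-based distillation sketch (not the paper's exact
# loss): match the pairwise-similarity geometry of teacher features.
import torch
import torch.nn.functional as F

def relation_matrix(feats: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine-similarity matrix over (N, D) feature vectors."""
    feats = F.normalize(feats, dim=-1)
    return feats @ feats.T

def relational_kd_loss(student_feats: torch.Tensor,
                       teacher_feats: torch.Tensor) -> torch.Tensor:
    """Penalize deviations of the student's relational structure from the
    teacher's. 'Asymmetric' here only in the sense that the teacher side
    is detached, so gradients flow through the student alone."""
    r_student = relation_matrix(student_feats)
    r_teacher = relation_matrix(teacher_feats).detach()  # teacher is frozen
    return F.mse_loss(r_student, r_teacher)

# With two teachers (SigLIP2 and DINOv3), one plausible setup is a term per
# teacher, each fed from a teacher-specific student projection head:
# loss = relational_kd_loss(proj_siglip(s), t_siglip) \
#      + relational_kd_loss(proj_dino(s), t_dino)
```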
Problem

Research questions and friction points this paper is trying to address.

Unclear knowledge-transfer dynamics and high computational cost of multi-teacher distillation for vision models
Low data efficiency of random sampling when distilling from multiple teachers, addressed via hierarchical clustering and sampling
Unstable representation learning across varying image resolutions, addressed via token-balanced batching (sketched below)
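As a hedged sketch of what such token-balanced batching could look like, the following packs variable-resolution images into sequences that each stay under a fixed token budget using greedy first-fit-decreasing bin packing. The patch size, token budget, and function names are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of token-balanced batching: pack variable-resolution
# images into sequences that each respect a fixed token budget, so every
# training step processes a comparable number of patch tokens.
from typing import List, Tuple

PATCH = 16           # assumed ViT patch size
TOKEN_BUDGET = 4096  # assumed per-sequence token budget

def num_tokens(h: int, w: int, patch: int = PATCH) -> int:
    """Number of patch tokens an (h, w) image contributes."""
    return (h // patch) * (w // patch)

def pack_token_balanced(sizes: List[Tuple[int, int]],
                        budget: int = TOKEN_BUDGET) -> List[List[int]]:
    """Greedy first-fit-decreasing packing of image indices into sequences.

    sizes: (height, width) of each image in the pool.
    Returns lists of image indices whose total token counts fit the budget.
    """
    order = sorted(range(len(sizes)),
                   key=lambda i: num_tokens(*sizes[i]), reverse=True)
    bins: List[List[int]] = []   # image indices per sequence
    loads: List[int] = []        # current token count per sequence
    for i in order:
        t = num_tokens(*sizes[i])
        for b, load in enumerate(loads):
            if load + t <= budget:   # first sequence with room
                bins[b].append(i)
                loads[b] += t
                break
        else:                        # no room anywhere: open a new sequence
            bins.append([i])
            loads.append(t)
    return bins

# Example: mixed resolutions pack into sequences of near-uniform token count.
batches = pack_token_balanced([(768, 512), (512, 512), (256, 384), (224, 224)])
```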
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-teacher distillation from SigLIP2 and DINOv3 into a Mixture-of-Experts student
Asymmetric Relation-Knowledge Distillation loss preserves teacher geometric properties
Token-balanced batching and hierarchical data clustering improve efficiency and stability (clustering-based sampling sketched below)
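Hierarchical clustering and sampling of the training data is likewise only summarized above. A minimal sketch, assuming a two-level k-means over precomputed image embeddings followed by cluster-balanced sampling (in the spirit of self-supervised data-curation pipelines); the cluster counts and all names are illustrative.

```python
# Hypothetical two-level clustering + balanced sampling for data curation:
# cluster image embeddings, then draw samples evenly across leaf clusters
# so rare visual concepts are not drowned out by random sampling.
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_cluster(embeds: np.ndarray,
                         k1: int = 10, k2: int = 5) -> np.ndarray:
    """Coarse k-means into k1 clusters, each refined into <= k2 sub-clusters.

    Returns one integer leaf-cluster label per image.
    """
    coarse = KMeans(n_clusters=k1, n_init=10).fit_predict(embeds)
    labels = np.zeros(len(embeds), dtype=int)
    for c in range(k1):
        idx = np.where(coarse == c)[0]
        if len(idx) == 0:
            continue
        k = min(k2, len(idx))  # guard against tiny coarse clusters
        fine = KMeans(n_clusters=k, n_init=10).fit_predict(embeds[idx])
        labels[idx] = c * k2 + fine
    return labels

def balanced_sample(labels: np.ndarray, n_total: int,
                    seed: int = 0) -> np.ndarray:
    """Sample roughly n_total indices, spread evenly over leaf clusters."""
    rng = np.random.default_rng(seed)
    leaves = np.unique(labels)
    per_leaf = max(1, n_total // len(leaves))
    picks = [rng.choice(np.where(labels == leaf)[0],
                        size=min(per_leaf, int((labels == leaf).sum())),
                        replace=False)
             for leaf in leaves]
    return np.concatenate(picks)
```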
Sofian Chaybouti
Technology Innovation Institute, Abu Dhabi, UAE
Sanath Narayan
Technology Innovation Institute, Abu Dhabi
Computer Vision, Machine Learning
Yasser Dahou
Dublin City University, Technology Innovation Institute
Deep Learning, Vision Language Models, Visual Attention Modelling
Phúc H. Lê Khac
Technology Innovation Institute, Abu Dhabi, UAE
Ankit Singh
Technology Innovation Institute, Abu Dhabi, UAE
Ngoc Dung Huynh
Technology Innovation Institute, Abu Dhabi, UAE
Wamiq Reyaz Para
Technology Innovation Institute, Abu Dhabi, UAE
Hilde Kuehne
Tuebingen AI Center, University of Tuebingen, MIT-IBM Watson AI Lab
Multimodal Learning, Video Understanding, Action Recognition, Computer Vision, Machine Learning
Hakim Hacid
Technology Innovation Institute (TII), UAE
Machine Learning, LLM, Databases, Information Retrieval, Edge ML