Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures

📅 2026-03-07
🤖 AI Summary
This work addresses the challenges of deploying Mixture-of-Experts (MoE) large language models on hardware, where sparsity leads to poor memory locality, high communication overhead, and low resource utilization. To overcome these issues, the authors propose Mozart, an algorithm-hardware co-design framework tailored for 3.5D wafer-scale chiplet architectures. Mozart introduces a brain-inspired modular paradigm into MoE training, enabling adaptive co-placement of heterogeneous modules on dedicated chiplets through chiplet-aware expert assignment, streaming token and expert scheduling, a 2.5D NoP-Tree interconnect topology, and a hierarchical memory architecture. Experiments on three mainstream MoE models demonstrate that Mozart significantly improves parallel efficiency and resource utilization, enabling highly efficient large-scale modular MoE-LLM training.

📝 Abstract
Mixture-of-Experts (MoE) architecture offers enhanced efficiency for Large Language Models (LLMs) with modularized computation, yet its inherent sparsity poses significant hardware deployment challenges, including memory locality issues, communication overhead, and inefficient computing resource utilization. Inspired by the modular organization of the human brain, we propose Mozart, a novel algorithm-hardware co-design framework tailored for efficient training of MoE-based LLMs on 3.5D wafer-scale chiplet architectures. On the algorithm side, Mozart exploits the inherent modularity of chiplets and introduces: (1) an expert allocation strategy that enables efficient on-package all-to-all communication, and (2) a fine-grained scheduling mechanism that improves communication-computation overlap through streaming tokens and experts. On the architecture side, Mozart adaptively co-locates heterogeneous modules on specialized chiplets with a 2.5D NoP-Tree topology and hierarchical memory structure. Evaluation across three popular MoE models demonstrates significant efficiency gains, enabling more effective parallelization and resource utilization for large-scale modularized MoE-LLMs.
Problem

Research questions and friction points this paper addresses: Mixture-of-Experts, hardware deployment, memory locality, communication overhead, resource utilization.
Innovation

Methods, ideas, and system contributions that make the work stand out: Mixture-of-Experts, chiplet architecture, algorithm-hardware co-design, 3.5D wafer-scale integration, communication-computation overlap.
👥 Authors

Shuqing Luo
University of North Carolina at Chapel Hill

Ye Han
Doctoral Candidate, Tongji University
Artificial Intelligence, Reinforcement Learning, Autonomous Driving, Decision Making, Game Theory

Pingzhi Li
Ph.D. student @UNC-Chapel Hill
Deep Learning

Jiayin Qin
University of Minnesota - Twin Cities

Jie Peng
University of North Carolina at Chapel Hill

Yang (Katie) Zhao
University of Minnesota - Twin Cities

Yu (Kevin) Cao
University of Minnesota - Twin Cities

Tianlong Chen
Assistant Professor, CS@UNC Chapel Hill; Chief AI Scientist, hireEZ
Machine Learning, AI4Science, Computer Vision, Sparsity