Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures

📅 2026-03-07
🤖 AI Summary
This work addresses the challenges of deploying Mixture-of-Experts (MoE) large language models on hardware, where sparsity leads to poor memory locality, high communication overhead, and low resource utilization. To overcome these issues, the authors propose Mozart, an algorithm-hardware co-design framework tailored for 3.5D wafer-scale chiplet architectures. Mozart introduces a brain-inspired modular paradigm into MoE training, enabling adaptive co-placement of heterogeneous modules on dedicated chiplets through chiplet-aware expert assignment, streaming token and expert scheduling, a 2.5D NoP-Tree interconnect topology, and a hierarchical memory architecture. Experiments on three mainstream MoE models demonstrate that Mozart significantly improves parallel efficiency and resource utilization, enabling highly efficient large-scale modular MoE-LLM training.

📝 Abstract
Mixture-of-Experts (MoE) architecture offers enhanced efficiency for Large Language Models (LLMs) with modularized computation, yet its inherent sparsity poses significant hardware deployment challenges, including memory locality issues, communication overhead, and inefficient computing resource utilization. Inspired by the modular organization of the human brain, we propose Mozart, a novel algorithm-hardware co-design framework tailored for efficient training of MoE-based LLMs on 3.5D wafer-scale chiplet architectures. On the algorithm side, Mozart exploits the inherent modularity of chiplets and introduces: (1) an expert allocation strategy that enables efficient on-package all-to-all communication, and (2) a fine-grained scheduling mechanism that improves communication-computation overlap through streaming tokens and experts. On the architecture side, Mozart adaptively co-locates heterogeneous modules on specialized chiplets with a 2.5D NoP-Tree topology and hierarchical memory structure. Evaluation across three popular MoE models demonstrates significant efficiency gains, enabling more effective parallelization and resource utilization for large-scale modularized MoE-LLMs.
Problem

Research questions and friction points this paper addresses: Mixture-of-Experts, hardware deployment, memory locality, communication overhead, resource utilization.
Innovation

Methods, ideas, and system contributions that make the work stand out: Mixture-of-Experts, chiplet architecture, algorithm-hardware co-design, 3.5D wafer-scale integration, communication-computation overlap.
👥 Authors

Shuqing Luo
University of North Carolina at Chapel Hill

Ye Han
Doctoral Candidate, Tongji University
Artificial Intelligence, Reinforcement Learning, Autonomous Driving, Decision Making, Game Theory

Pingzhi Li
Ph.D. student @UNC-Chapel Hill
Deep Learning

Jiayin Qin
University of Minnesota - Twin Cities

Jie Peng
University of North Carolina at Chapel Hill

Yang (Katie) Zhao
University of Minnesota - Twin Cities

Yu (Kevin) Cao
University of Minnesota - Twin Cities

Tianlong Chen
Assistant Professor, CS@UNC Chapel Hill; Chief AI Scientist, hireEZ
Machine Learning, AI4Science, Computer Vision, Sparsity