Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional Mixture-of-Experts (MoE) models run their experts in parallel and independently, with no inter-expert collaboration, which limits representational capacity. Method: We propose Chain-of-Experts (CoE), a novel architecture that organizes the experts within each layer into a chain. Through intra-layer multi-step routing with a dynamic, iterative routing mechanism, tokens traverse multiple experts over several steps and can be re-allocated to different experts at each step. CoE further introduces an iterative residual structure and a dynamic expert selection algorithm, opening a new scaling dimension, "expert iteration depth", without increasing computational cost. Results: On mathematical reasoning tasks under fixed FLOPs, CoE reduces validation loss from 1.20 to 1.12. With only two iterations it matches the performance of tripling the number of expert selections (width scaling) while reducing memory overhead by 17.6%–42%. CoE substantially improves collaborative efficiency and resource utilization.
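To make the iterative residual structure concrete, one plausible way to write the per-layer update is sketched below, assuming top-k gating with a separate router g^(t) at each iteration step; the exact gating and normalization used in the released code may differ.

```latex
% Standard MoE layer: a single routed mixture added to the residual stream.
y = x + \sum_{i \in \mathrm{TopK}(g(x))} g_i(x)\, E_i(x)

% CoE-style layer (sketch): T routed steps reuse the same experts E_i,
% each step with its own router g^{(t)} and a residual update.
x^{(0)} = x, \qquad
x^{(t)} = x^{(t-1)} + \sum_{i \in \mathrm{TopK}\left(g^{(t)}(x^{(t-1)})\right)}
          g^{(t)}_i\!\left(x^{(t-1)}\right) E_i\!\left(x^{(t-1)}\right),
\qquad y = x^{(T)}
```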

📝 Abstract
We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. Unlike traditional MoE models, where experts operate independently in parallel, CoE processes tokens iteratively across a chain of experts inside a layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each iteration step within a layer. This design allows tokens to re-evaluate and select different experts during each iteration, rather than being statically assigned. As a result, CoE introduces a flexible routing mechanism that increases the diversity of expert combinations and enriches the model's representational capacity. CoE demonstrates improved performance under fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to 1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling axis: depth through expert iteration, which complements conventional width/depth scaling. For example, using 2x iterations matches the performance of 3x expert selections (in width), while reducing memory usage by 17.6-42% relative to other scaling strategies. Our analysis reveals that CoE's benefits stem from its iterative residual structure and enhanced expert specialization empowered by iterative routing, which together unlock more expressive representations. Code is available at https://github.com/ZihanWang314/coe.
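As a reading aid, here is a minimal PyTorch-style sketch of a layer with this structure: the same experts are reused across several routing iterations, each iteration has its own router, and each step's output is added back residually. Class and parameter names (ChainOfExpertsLayer, num_iters, top_k) and the dense per-expert dispatch loop are illustrative assumptions, not the released implementation linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A standard two-layer feed-forward expert."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))


class ChainOfExpertsLayer(nn.Module):
    """Experts are reused across num_iters routing steps inside one layer;
    each step has its own router, and each step's output is added residually."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2, num_iters=2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(num_experts)])
        # A dedicated router per iteration step, so tokens can be re-allocated.
        self.routers = nn.ModuleList([nn.Linear(d_model, num_experts) for _ in range(num_iters)])
        self.top_k = top_k

    def forward(self, h):                       # h: (num_tokens, d_model)
        for router in self.routers:             # intra-layer multi-step routing
            probs = router(h).softmax(dim=-1)   # (num_tokens, num_experts)
            weights, idx = probs.topk(self.top_k, dim=-1)
            weights = weights / weights.sum(dim=-1, keepdim=True)
            step_out = torch.zeros_like(h)
            for k in range(self.top_k):         # dense loop for clarity, not efficiency
                for e, expert in enumerate(self.experts):
                    mask = idx[:, k] == e       # tokens routed to expert e at rank k
                    if mask.any():
                        step_out[mask] += weights[mask, k].unsqueeze(-1) * expert(h[mask])
            h = h + step_out                    # iterative residual update
        return h


# Tiny usage example with random token states.
layer = ChainOfExpertsLayer()
tokens = torch.randn(16, 512)
print(layer(tokens).shape)                      # torch.Size([16, 512])
```

A production implementation would dispatch tokens with batched gather/scatter kernels instead of the per-expert Python loop, but the routing and residual logic would be the same.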
Problem

Research questions and friction points this paper is trying to address.

Enabling sequential expert communication in MoE models
Dynamic expert selection via iterative routing mechanism
Improving performance and reducing memory usage when scaling MoE models (see the sketch after this list)
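A rough back-of-envelope illustration of the memory side of this friction point, under purely hypothetical layer sizes: widening an MoE layer by adding experts grows the expert parameters that must be held in memory, whereas iterating over the same experts (as CoE does) adds routing steps without adding expert weights.

```python
# Illustrative parameter counting only; sizes are hypothetical, not the paper's configs.
d_model, d_ff = 1024, 4096
params_per_expert = 2 * d_model * d_ff         # up- and down-projection weights

def expert_params(num_experts: int) -> int:
    """Expert parameters held in memory for one MoE layer."""
    return num_experts * params_per_expert

baseline = expert_params(8)    # 8 experts, routed once per layer
widened = expert_params(24)    # width scaling: 3x as many experts resident in memory
iterated = expert_params(8)    # CoE-style: 2 routing iterations reuse the same 8 experts

print(f"baseline {baseline/1e6:.0f}M, widened {widened/1e6:.0f}M, iterated {iterated/1e6:.0f}M")
# baseline 67M, widened 201M, iterated 67M
```

The per-iteration routers add only small linear layers, which is why scaling by iteration can be far cheaper in memory than width-style scaling.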
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential expert communication within layers
Dynamic expert selection via iterative routing
Iterative residual structure enhances specialization
👥 Authors
Zihan Wang (Northwestern University)
Rui Pan (University of Illinois Urbana-Champaign)
Jiarui Yao (CS, UIUC; Reinforcement Learning, Machine Learning, Large Language Models)
Robert Csordas (Stanford University)
Linjie Li (Microsoft; Vision and Language)
Lu Yin (University of Surrey)
Jiajun Wu (Stanford University)
Tong Zhang (University of Illinois Urbana-Champaign)
Manling Li (Assistant Professor at Northwestern University; Natural Language Processing, Vision-Language, Embodied Agents)
Shiwei Liu (University of Oxford)