Revealing the Challenges of Attention-FFN Disaggregation for Modern MoE Models and Hardware Systems

πŸ“… 2026-02-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the performance limitations of large-scale Mixture-of-Experts (MoE) models, which are constrained by memory and interconnect bandwidth during expert activation. While the Attention-FFN Disaggregation (AFD) architecture has been proposed as an alternative to Expert Parallelism (EP), its performance boundaries and the conditions under which it outperforms EP remain unclear. This study is the first to extend the roofline model to the communication level for AFD analysis, integrating hardware FLOPS utilization (HFU), arithmetic intensity, and load-imbalance metrics. The analysis systematically reveals a "dead zone" on standard clusters where AFD cannot improve HFU due to bandwidth bottlenecks. AFD demonstrates advantages only under high-bandwidth Superpod-class hardware and coarse-grained, low-sparsity configurations, thereby precisely delineating its applicability and performance potential.
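As a worked illustration of the roofline extension (our own sketch; the symbols below are assumptions, not notation from the paper), a communication-level roofline caps achievable throughput by the product of interconnect bandwidth and the operator's arithmetic intensity:

```latex
% Communication-level roofline, minimal sketch (symbols are ours, not the paper's):
%   P_peak : aggregate peak hardware FLOPS of the FFN pool
%   B      : scale-out interconnect bandwidth (bytes/s)
%   I      : arithmetic intensity (FLOPs per byte communicated)
P_{\text{achieved}} = \min\!\left(P_{\text{peak}},\; B \cdot I\right),
\qquad
\mathrm{HFU} = \frac{P_{\text{achieved}}}{P_{\text{peak}}}
             = \min\!\left(1,\; \frac{B \cdot I}{P_{\text{peak}}}\right).
```

In this form, the dead zone is the regime where B · I < P_peak: adding FFN instances raises P_peak but neither B nor I, so HFU cannot improve and can only fall.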

πŸ“ Abstract
Deploying large-scale MoE models presents challenges in memory capacity and bandwidth for expert activation. While Attention-FFN Disaggregation (AFD) has emerged as a potential architecture to decouple compute and memory resources, its performance boundaries compared to standard large-scale Expert Parallelism (EP) remain underexplored. In this paper, we conduct a systematic analysis of AFD by extending the roofline model to the communication level, correlating interconnect bandwidth, arithmetic intensity, and Hardware FLOPS Utilization (HFU). Our analysis reveals a dead zone on standard clusters: increasing FFN instance count fails to improve HFU as computational workload is capped by scale-out bandwidth, causing operator active time to shrink relative to the fixed latency budget. We further show that AFD's discrete node-level scaling incurs higher imbalance penalties than EP's continuous batch adjustment. Nevertheless, these limitations diminish under specific conditions: Superpod-class hardware with abundant interconnect bandwidth and models with coarse-grained experts and lower sparsity are more likely to benefit from AFD. These findings position AFD as a promising approach for specific hardware-model combinations rather than a universal solution.
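To make the dead-zone mechanism concrete, here is a toy numeric sketch (our own construction; the helper afd_hfu and every parameter value are illustrative assumptions, not figures or code from the paper). It models deliverable FLOP/s as capped by the scale-out fabric, so that beyond the cap, extra FFN instances only shrink per-instance active time against the fixed latency budget:

```python
# Toy model of the AFD "dead zone" described in the abstract.
# All parameter values below are illustrative assumptions, not numbers
# from the paper.

def afd_hfu(n_ffn_instances: int,
            peak_flops: float = 1e14,       # per-instance peak (FLOP/s), assumed
            scaleout_bw: float = 400e9,     # scale-out bandwidth (bytes/s), assumed
            intensity: float = 300.0,       # FLOPs per byte communicated, assumed
            fixed_latency: float = 200e-6,  # fixed per-step latency budget (s), assumed
            work_flops: float = 2e11) -> float:
    """Toy HFU estimate for an AFD deployment.

    The FLOP/s deliverable to the FFN pool is capped by what the scale-out
    fabric can feed it (bandwidth * arithmetic intensity). Once that cap
    binds, adding instances shrinks per-instance active time against the
    fixed latency budget instead of raising utilization.
    """
    deliverable = min(n_ffn_instances * peak_flops, scaleout_bw * intensity)
    active_time = work_flops / deliverable   # time spent computing
    step_time = active_time + fixed_latency  # active time + fixed overhead
    achieved = work_flops / step_time        # effective FLOP/s over the step
    return achieved / (n_ffn_instances * peak_flops)

print("Standard-cluster bandwidth:")
for n in (1, 2, 4, 8, 16):
    print(f"  FFN instances={n:2d}  HFU={afd_hfu(n):.3f}")

print("Superpod-class bandwidth (10x scale-out):")
for n in (1, 2, 4, 8, 16):
    print(f"  FFN instances={n:2d}  HFU={afd_hfu(n, scaleout_bw=4e12):.3f}")
```

In the standard-bandwidth run, HFU collapses as soon as n * peak_flops exceeds scaleout_bw * intensity, while the Superpod-like run pushes the cap out so scaling remains useful longer; this mirrors the abstract's claim that AFD pays off only under abundant interconnect bandwidth.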
Problem

Research questions and friction points this paper is trying to address.

Attention-FFN Disaggregation
Mixture-of-Experts
Hardware FLOPS Utilization
Expert Parallelism
Interconnect Bandwidth
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-FFN Disaggregation
Mixture-of-Experts
Roofline Model
Hardware FLOPS Utilization
Interconnect Bandwidth
πŸ”Ž Similar Papers
No similar papers found.
Guowei Liu
Baige AI Team, Baidu Inc.
Hongming Li
Baige AI Team, Baidu Inc.
Yaning Guo
Baige AI Team, Baidu Inc.
Yongxi Lyu
Baige AI Team, Baidu Inc.
Mo Zhou
Baige AI Team, Baidu Inc.
Yi Liu
Baidu Inc.
Zhaogeng Li
Baige AI Team, Baidu Inc.
Yanpeng Wang
Baige AI Team, Baidu Inc.