Revealing the Challenges of Attention-FFN Disaggregation for Modern MoE Models and Hardware Systems

πŸ“… 2026-02-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the performance limitations of large-scale Mixture-of-Experts (MoE) models, which are constrained by memory and interconnect bandwidth during expert activation. While the Attention-FFN Disaggregation (AFD) architecture has been proposed as an alternative to Expert Parallelism (EP), its performance boundaries and the conditions under which it outperforms EP remain unclear. This study is the first to extend the roofline model to the communication level for AFD analysis, integrating hardware FLOPS utilization (HFU), arithmetic intensity, and load-imbalance metrics. The analysis systematically reveals a "dead zone" on standard clusters where AFD cannot improve HFU due to bandwidth bottlenecks. AFD demonstrates advantages only under high-bandwidth Superpod-class hardware and coarse-grained, low-sparsity configurations, thereby precisely delineating its applicability and performance potential.
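As a worked illustration of the roofline extension (our own sketch; the symbols below are assumptions, not notation from the paper), a communication-level roofline caps achievable throughput by the product of interconnect bandwidth and the operator's arithmetic intensity:

```latex
% Communication-level roofline, minimal sketch (symbols are ours, not the paper's):
%   P_peak : aggregate peak hardware FLOPS of the FFN pool
%   B      : scale-out interconnect bandwidth (bytes/s)
%   I      : arithmetic intensity (FLOPs per byte communicated)
P_{\text{achieved}} = \min\!\left(P_{\text{peak}},\; B \cdot I\right),
\qquad
\mathrm{HFU} = \frac{P_{\text{achieved}}}{P_{\text{peak}}}
             = \min\!\left(1,\; \frac{B \cdot I}{P_{\text{peak}}}\right).
```

In this form, the dead zone is the regime where B · I < P_peak: adding FFN instances raises P_peak but neither B nor I, so HFU cannot improve and can only fall.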

πŸ“ Abstract
Deploying large-scale MoE models presents challenges in memory capacity and bandwidth for expert activation. While Attention-FFN Disaggregation (AFD) has emerged as a potential architecture to decouple compute and memory resources, its performance boundaries compared to standard large-scale Expert Parallelism (EP) remain underexplored. In this paper, we conduct a systematic analysis of AFD by extending the roofline model to the communication level, correlating interconnect bandwidth, arithmetic intensity, and Hardware FLOPS Utilization (HFU). Our analysis reveals a dead zone on standard clusters: increasing FFN instance count fails to improve HFU as computational workload is capped by scale-out bandwidth, causing operator active time to shrink relative to the fixed latency budget. We further show that AFD's discrete node-level scaling incurs higher imbalance penalties than EP's continuous batch adjustment. Nevertheless, these limitations diminish under specific conditions: Superpod-class hardware with abundant interconnect bandwidth and models with coarse-grained experts and lower sparsity are more likely to benefit from AFD. These findings position AFD as a promising approach for specific hardware-model combinations rather than a universal solution.
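To make the dead-zone mechanism concrete, here is a toy numeric sketch (our own construction; the helper afd_hfu and every parameter value are illustrative assumptions, not figures or code from the paper). It models deliverable FLOP/s as capped by the scale-out fabric, so that beyond the cap, extra FFN instances only shrink per-instance active time against the fixed latency budget:

```python
# Toy model of the AFD "dead zone" described in the abstract.
# All parameter values below are illustrative assumptions, not numbers
# from the paper.

def afd_hfu(n_ffn_instances: int,
            peak_flops: float = 1e14,       # per-instance peak (FLOP/s), assumed
            scaleout_bw: float = 400e9,     # scale-out bandwidth (bytes/s), assumed
            intensity: float = 300.0,       # FLOPs per byte communicated, assumed
            fixed_latency: float = 200e-6,  # fixed per-step latency budget (s), assumed
            work_flops: float = 2e11) -> float:
    """Toy HFU estimate for an AFD deployment.

    The FLOP/s deliverable to the FFN pool is capped by what the scale-out
    fabric can feed it (bandwidth * arithmetic intensity). Once that cap
    binds, adding instances shrinks per-instance active time against the
    fixed latency budget instead of raising utilization.
    """
    deliverable = min(n_ffn_instances * peak_flops, scaleout_bw * intensity)
    active_time = work_flops / deliverable   # time spent computing
    step_time = active_time + fixed_latency  # active time + fixed overhead
    achieved = work_flops / step_time        # effective FLOP/s over the step
    return achieved / (n_ffn_instances * peak_flops)

print("Standard-cluster bandwidth:")
for n in (1, 2, 4, 8, 16):
    print(f"  FFN instances={n:2d}  HFU={afd_hfu(n):.3f}")

print("Superpod-class bandwidth (10x scale-out):")
for n in (1, 2, 4, 8, 16):
    print(f"  FFN instances={n:2d}  HFU={afd_hfu(n, scaleout_bw=4e12):.3f}")
```

In the standard-bandwidth run, HFU collapses as soon as n * peak_flops exceeds scaleout_bw * intensity, while the Superpod-like run pushes the cap out so scaling remains useful longer; this mirrors the abstract's claim that AFD pays off only under abundant interconnect bandwidth.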
Problem

Research questions and friction points this paper is trying to address.

Attention-FFN Disaggregation
Mixture-of-Experts
Hardware FLOPS Utilization
Expert Parallelism
Interconnect Bandwidth
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-FFN Disaggregation
Mixture-of-Experts
Roofline Model
Hardware FLOPS Utilization
Interconnect Bandwidth
πŸ”Ž Similar Papers
No similar papers found.
Guowei Liu
Baige AI Team, Baidu Inc.
Hongming Li
Baige AI Team, Baidu Inc.
Yaning Guo
Baige AI Team, Baidu Inc.
Yongxi Lyu
Baige AI Team, Baidu Inc.
Mo Zhou
Baige AI Team, Baidu Inc.
Yi Liu
Baidu Inc.
Zhaogeng Li
Baige AI Team, Baidu Inc.
Yanpeng Wang
Baige AI Team, Baidu Inc.