How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

📅 2026-05-27
🤖 AI Summary
This work addresses efficiency bottlenecks in large mixture-of-experts (MoE) language model inference that arise from the distinct resource demands of memory-bound attention, compute-intensive expert feed-forward networks (FFNs), and MoE dispatch/combine communication. It systematically explores the design space of operator-level Attention-FFN Disaggregation (AFD), using chunked prefill and prefill-decode (P/D) disaggregation as baselines. Combining on-device kernel measurements with high-fidelity network simulation under realistic workloads, the study quantifies the latency and throughput trade-offs of AFD, identifies when operator-level disaggregation pays off, and derives principles for partitioning attention and FFN across GPUs as a function of workload and model architecture. Under strict time-to-first-token (TTFT) and time-per-output-token (TPOT) service-level objectives, AFD sustains around 4k tokens/s of system throughput on DeepSeek-V3.2 across chat, coding, and agentic-coding workloads, where non-AFD deployments are infeasible.
📝 Abstract
Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT service-level objectives: from chunked-prefill aggregation, to prefill-decode (P/D) disaggregation, and most recently to operator-level Attention-FFN Disaggregation (AFD). This trend is especially important for mixture-of-experts (MoE) models, where memory-bound attention, compute-intensive expert FFNs, and MoE dispatch/combine communication create distinct resource demands. AFD further exposes this heterogeneity by placing attention and MoE-FFN execution on separate GPU groups. Each level of disaggregation deepens the scheduling design space across workload characteristics, resource allocation, and interconnect topology, raising the central question: when does each level actually pay off? We systematically characterize this trade-off for MoE inference across realistic workloads spanning input/output sequence lengths, prefix-KV reuse, and per-user latency constraints. Using chunked-prefill and P/D disaggregation as baselines, we study the benefits and limits of AFD at scale through a framework that fuses on-device kernel measurements with high-fidelity network simulation. Under strict TTFT/TPOT SLOs, AFD sustains around 4k tokens/s of system throughput on DeepSeek-V3.2 across chat, coding, and agentic-coding workloads, where non-AFD deployments are infeasible. We distill concrete takeaways for jointly optimizing throughput and interactivity, including how to partition attention and FFN across GPUs as a function of workload and model architecture, providing design principles for current rack- and cluster-scale deployments as well as future disaggregated AI infrastructure.
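The abstract evaluates deployments against TTFT and TPOT SLOs, so a small sketch of how those two metrics are commonly computed from per-token timestamps may help. This is not the paper's evaluation framework: the `RequestTrace` type, the function names, and the threshold values used below are hypothetical, and counting only SLO-compliant requests toward throughput is one common convention that the abstract does not confirm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request record; not a structure from the paper."""
    arrival_s: float            # time the request arrived
    token_times_s: List[float]  # emission time of each output token, ascending


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token emission minus request arrival."""
    return trace.token_times_s[0] - trace.arrival_s


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between tokens after the first."""
    n = len(trace.token_times_s)
    if n < 2:
        return 0.0  # a single-token response has no inter-token gap
    return (trace.token_times_s[-1] - trace.token_times_s[0]) / (n - 1)


def meets_slo(trace: RequestTrace, ttft_slo_s: float, tpot_slo_s: float) -> bool:
    """True when the request satisfies both latency targets."""
    return ttft(trace) <= ttft_slo_s and tpot(trace) <= tpot_slo_s


def slo_throughput(traces: List[RequestTrace], ttft_slo_s: float,
                   tpot_slo_s: float, window_s: float) -> float:
    """Output tokens per second, counting only SLO-compliant requests
    (an assumed convention, see the note above)."""
    good_tokens = sum(len(t.token_times_s) for t in traces
                      if meets_slo(t, ttft_slo_s, tpot_slo_s))
    return good_tokens / window_s
```

For example, a request arriving at t = 0 whose four tokens are emitted at 0.5, 0.55, 0.60, and 0.65 s has a TTFT of 0.5 s and a TPOT of about 0.05 s. It meets an illustrative SLO of 1.0 s TTFT and 0.1 s TPOT, but misses one with a 0.04 s TPOT target.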
Problem

Research questions and friction points this paper is trying to address.

Attention-FFN Disaggregation
Mixture-of-Experts
LLM Serving
TTFT/TPOT SLOs
Model Disaggregation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-FFN Disaggregation
Mixture-of-Experts
LLM Inference
Design-Space Exploration
Efficient Serving