Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses inefficiencies in large language model inference caused by suboptimal resource allocation in attention-FFN disaggregated (AFD) architectures, where imbalanced provisioning leads to step-level blocking and device idling. The study presents the first joint probabilistic model that captures the dynamics of nonstationary attention workloads alongside the stable batching behavior of the feedforward network (FFN). Leveraging queueing theory and closed-form optimization, it derives the optimal attention/FFN resource-allocation ratio that maximizes system throughput. Validation with an AFD simulator calibrated on real-world traces shows that the theoretically derived optimal ratio deviates by less than 10% from the empirical optimum across diverse workloads, substantially reducing device idle time and improving throughput over conventional static allocation strategies.

📝 Abstract
Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communication. While AFD enables independent scaling of memory and compute resources, its performance is highly sensitive to the Attention/FFN provisioning ratio: mis-sizing induces step-level blocking and costly device idle time. We develop a tractable analytical framework for sizing AFD bundles in an $r$A-$1$F topology, where the key difficulty is that Attention-side work is nonstationary (token context grows and requests are continuously replenished with random lengths), while FFN work is stable given the aggregated batch. Using a probabilistic workload model, we derive closed-form rules for the optimal A/F ratio that maximize average throughput per instance across the system. A trace-calibrated AFD simulator validates the theory: across workloads, the theoretical optimal A/F ratio matches the simulation-optimal ratio to within 10%, and consistently reduces idle time.
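The provisioning trade-off in the abstract can be illustrated with a toy model. This is a minimal sketch, not the paper's queueing-theoretic derivation: the linear cost functions and all constants below are invented for illustration, and the optimum is found by brute force rather than in closed form. It captures the core mechanism, though: in an $r$A-$1$F bundle each decode step completes only when both sides finish, so the step time is the max of the two, and mis-sizing leaves one side idle.

```python
# Toy illustration of rA-1F ratio selection (hypothetical cost constants,
# not the paper's actual model): r Attention instances feed one FFN
# instance per decode step; a step takes the max of the two side times.

def step_time_attention(batch_per_instance, mean_context):
    # Assumed linear cost in per-instance batch size and context length
    # (KV-cache-dominated, so context length matters). Illustrative only.
    return 1.0 + 0.0001 * batch_per_instance * mean_context

def step_time_ffn(total_batch):
    # FFN cost depends only on the aggregated batch across the r
    # Attention instances (stateless and compute-bound).
    return 0.5 + 0.01 * total_batch

def throughput_per_instance(r, batch_per_instance=32, mean_context=512):
    # Tokens decoded per step, divided by the step time (the slower side
    # blocks the other) and normalized by the r + 1 devices in the bundle.
    total_batch = r * batch_per_instance
    step = max(step_time_attention(batch_per_instance, mean_context),
               step_time_ffn(total_batch))
    return total_batch / (step * (r + 1))

def best_ratio(candidates=range(1, 17)):
    # Brute-force search over candidate A/F ratios; the paper instead
    # derives the optimum in closed form from a probabilistic model.
    return max(candidates, key=throughput_per_instance)
```

With these invented constants, throughput rises with `r` while the Attention side is the bottleneck, then falls once the aggregated batch makes the FFN side dominate the step time, so `best_ratio()` lands at the crossover.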
Problem

Research questions and friction points this paper is trying to address.

Attention-FFN disaggregation
LLM serving
resource provisioning
throughput optimization
A/F ratio
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-FFN Disaggregation
Optimal Resource Ratio
Nonstationary Workload Modeling
Closed-form Throughput Optimization
LLM Serving Architecture
Chendong Song
Department of Industrial Engineering and Decision Analytics, HKUST, Clear Water Bay, Hong Kong, China
Meixuan Wang
Department of Computer Science and Technology, Tsinghua University, Haidian District, 100084, Beijing, China
Hang Zhou
Tsinghua University
Hong Liang
Aramco Americas
Yuan Lyu
Huawei Hong Kong Research Center, Hong Kong, China
Zixi Chen
School of Mathematical Sciences, Peking University, Yiheyuan Road, 100871, Beijing, China
Yuwei Fan
Huawei Hong Kong Research Center, Hong Kong, China
Zijie Zhou
Department of Industrial Engineering and Decision Analytics, HKUST, Clear Water Bay, Hong Kong, China