AI Summary
In distributed large language model (LLM) inference, all-to-all communication among experts has become a critical performance bottleneck. This work proposes the Federation of Experts (FoE) architecture, which, for the first time, partitions the Mixture-of-Experts (MoE) block into expert clusters aligned with attention heads: each cluster handles only a single key-value (KV) head. By combining intra-cluster expert parallelism with a summed synchronization of post-attention residuals that drives routing for the next MoE block, FoE eliminates all-to-all communication entirely in single-node settings and confines it to the intra-node fabric in multi-node settings. Experiments on LongBench show that FoE reduces end-to-end forward-pass latency by up to 5.2×, time to first token (TTFT) by 3.62×, and time between tokens (TBT) by 1.95×, while maintaining comparable generation quality.
Abstract
Mixture of Experts (MoE) has emerged as the primary mechanism for making Large Language Models (LLMs) computationally efficient. However, in distributed settings, communicating token embeddings between experts is a significant bottleneck.
We present the novel Federation of Experts (FoE) architecture. FoE restructures the MoE block of a transformer layer into multiple MoE clusters. Each cluster is responsible for only one of the KV heads, and expert parallelism is applied among the experts within that cluster. Between clusters, a sum synchronizes the post-attention residuals, which then drive routing and dispatch for the next MoE block. In a single-node setting, FoE completely eliminates all-to-all communication, because all experts within a cluster reside on the same GPU. In multi-node settings, FoE confines all-to-all communication to the intra-node fabric, significantly reducing communication overhead.
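As a rough illustration of the data flow described above, the following NumPy sketch models one FoE block on a single device. It is not the authors' implementation: the `Cluster` class, `foe_block` function, top-1 routing, `tanh` experts, the weight shapes, and the exact order of operations inside a cluster are all assumptions made for illustration. The only properties taken from the abstract are that each cluster owns one KV head and its own experts, that tokens are dispatched only to experts inside their cluster, and that a single sum across clusters synchronizes the post-attention residual that feeds the next block's routing.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class Cluster:
    """Hypothetical FoE cluster: one KV head plus its own local experts."""

    def __init__(self, rng, d_model, d_head, n_experts):
        s = 1.0 / np.sqrt(d_model)
        self.wq = rng.normal(0.0, s, (d_model, d_head))
        self.wk = rng.normal(0.0, s, (d_model, d_head))
        self.wv = rng.normal(0.0, s, (d_model, d_head))
        self.wo = rng.normal(0.0, s, (d_head, d_model))
        self.router = rng.normal(0.0, s, (d_model, n_experts))
        self.experts = rng.normal(0.0, s, (n_experts, d_model, d_model))

    def local_moe(self, h):
        # Top-1 routing (an assumption). Tokens only ever go to experts
        # owned by this cluster, so no cross-cluster dispatch is needed.
        choice = (h @ self.router).argmax(axis=-1)
        out = np.zeros_like(h)
        for e, w in enumerate(self.experts):
            rows = choice == e
            out[rows] = np.tanh(h[rows] @ w)
        return out, choice

    def attention(self, y):
        # Causal single-head attention over this cluster's one KV head.
        q, k, v = y @ self.wq, y @ self.wk, y @ self.wv
        scores = q @ k.T / np.sqrt(q.shape[-1])
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
        return softmax(scores) @ v @ self.wo

    def forward(self, h):
        # Assumed wiring: local MoE first, then this cluster's attention head.
        y, choice = self.local_moe(h)
        return self.attention(h + y), choice

def foe_block(h, clusters):
    # Every cluster starts from the same synchronized residual h.
    partials, choices = zip(*(c.forward(h) for c in clusters))
    # The only cross-cluster step: a sum of post-attention outputs,
    # standing in for an all-reduce in a real distributed setting.
    return h + np.sum(partials, axis=0), choices

rng = np.random.default_rng(0)
clusters = [Cluster(rng, d_model=16, d_head=4, n_experts=3) for _ in range(2)]
h0 = rng.normal(size=(5, 16))  # 5 tokens, model width 16
h1, choices = foe_block(h0, clusters)
print(h1.shape, [c.tolist() for c in choices])
```

In this toy version the clusters run in a Python loop on one machine. The point is only that token dispatch stays inside `local_moe`, and that the sum in `foe_block` is the sole place where clusters exchange data.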
Evaluated on LongBench, our implementation of FoE significantly improves inference throughput and latency in both single-node and multi-node settings, reducing end-to-end forward-pass latency by up to 5.2x, time to first token (TTFT) by 3.62x, and time between tokens (TBT) by 1.95x. It does so while achieving generation quality comparable to that of an MoE model of the same size and training configuration.