Federation of Experts: Communication Efficient Distributed Inference for Large Language Models

📅 2026-05-07

🤖 AI Summary
In distributed large language model inference, all-to-all communication among experts has become a critical performance bottleneck. This work proposes the Federation of Experts (FoE) architecture, which, for the first time, partitions the Mixture-of-Experts (MoE) block into expert clusters aligned with attention heads: each cluster handles a single key-value (KV) head. By combining intra-cluster expert parallelism with a sum that synchronizes post-attention residuals before routing, FoE eliminates all-to-all communication entirely in single-node settings and confines it to the intra-node fabric in multi-node settings. On LongBench, FoE reduces end-to-end forward-pass latency by up to 5.2×, time to first token (TTFT) by 3.62×, and time between tokens (TBT) by 1.95×, while maintaining generation quality comparable to an MoE model of the same size and training configuration.
📝 Abstract
Mixture of experts (MoE) has emerged as the primary mechanism for making Large Language Models (LLMs) computationally efficient. However, in distributed settings, communicating token embeddings between experts is a significant bottleneck. We present the novel Federation of Experts (FoE) architecture. FoE restructures the MoE block of a transformer layer into multiple MoE clusters. Each cluster is responsible for only one of the KV heads, and expert parallelism is applied among the experts within that cluster. Between clusters, a sum synchronizes the post-attention residuals, which then drives routing and dispatch for the next MoE block. In a single-node setting, FoE completely eliminates all-to-all communication, as all experts within a cluster are contained on the same GPU. In multi-node settings, FoE confines all-to-all communication to the intra-node fabric, thus significantly reducing communication overhead. Our implementation shows that, on LongBench, FoE significantly improves inference throughput and latency in both single-node and multi-node settings, reducing the end-to-end forward-pass latency by up to 5.2x, time to first token (TTFT) by 3.62x, and time between tokens (TBT) by 1.95x. It does so while achieving generation quality comparable to a mixture-of-experts model of the same size and training configuration.
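The communication pattern described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: integers stand in for token embeddings, and `route`, `gpu_of_expert`, `foe_step`, the cluster count, and the one-cluster-per-GPU placement are all hypothetical choices made for illustration. It only shows why routing restricted to a cluster's own experts produces no cross-GPU dispatch, whereas routing over all experts generally does.

```python
# Toy sketch: integers stand in for token embeddings; the attention, router,
# and placement functions are hypothetical placeholders, not the paper's code.

NUM_CLUSTERS = 4         # assumption: one cluster per KV head, one per GPU
EXPERTS_PER_CLUSTER = 2  # assumption


def route(value, num_experts):
    """Placeholder top-1 router: map a token to an expert index."""
    return value % num_experts


def gpu_of_expert(global_expert_id):
    """Experts are placed in contiguous blocks, EXPERTS_PER_CLUSTER per GPU."""
    return global_expert_id // EXPERTS_PER_CLUSTER


def baseline_cross_gpu_dispatches(tokens, token_gpu):
    """Conventional expert parallelism: route over all experts and count the
    tokens that must leave their home GPU (the all-to-all traffic)."""
    total = NUM_CLUSTERS * EXPERTS_PER_CLUSTER
    return sum(1 for tok, gpu in zip(tokens, token_gpu)
               if gpu_of_expert(route(tok, total)) != gpu)


def foe_step(residuals):
    """FoE-style step: per-cluster attention, sum across clusters, then
    routing restricted to each cluster's own experts."""
    # Each cluster computes a partial update from its single KV head.
    partials = [[r * (c + 1) for r in residuals] for c in range(NUM_CLUSTERS)]
    # The sum across clusters synchronizes the post-attention residuals.
    synced = [r + sum(p[i] for p in partials) for i, r in enumerate(residuals)]
    # Every cluster routes the synced residuals among its local experts only.
    dispatches = []  # (cluster_gpu, global_expert_id) pairs
    for c in range(NUM_CLUSTERS):
        for s in synced:
            local = route(s, EXPERTS_PER_CLUSTER)
            dispatches.append((c, c * EXPERTS_PER_CLUSTER + local))
    return synced, dispatches


synced, dispatches = foe_step([1, 2, 3])
cross_gpu = sum(1 for gpu, e in dispatches if gpu_of_expert(e) != gpu)
print(synced, cross_gpu)
```

In this sketch the only cross-cluster operation is the sum over `partials`, which in a real system would be a reduction rather than an all-to-all exchange. How the per-cluster expert outputs are combined afterwards is not specified in the abstract and is left out here.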
Problem

Research questions and friction points this paper is trying to address.

Mixture of Experts
Large Language Models
Distributed Inference
Communication Bottleneck
Token Embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Federation of Experts
Mixture of Experts
Communication Efficiency
Distributed Inference
Large Language Models