Distributed Generative Inference of LLM at Internet Scales with Multi-Dimensional Communication Optimization

📅 2026-04-22

🤖 AI Summary
This work addresses the communication bottleneck in internet-scale distributed inference of large language models (LLMs), where low cross-node bandwidth limits system efficiency. The authors propose BloomBee, a framework that treats layer assignment, micro-batching, and tensor offloading as a single joint optimization problem and solves it with dynamic programming. BloomBee also adapts lossless compression and speculative decoding to low-bandwidth network settings. Across a range of network environments, BloomBee improves serving throughput by up to 1.76× and reduces average latency by up to 43.20% compared to state-of-the-art decentralized LLM inference systems.

📝 Abstract
Decentralized LLM inference distributes computation among heterogeneous nodes across the internet, offering a performant and cost-efficient alternative to traditional centralized inference. However, low cross-node network bandwidth makes communication the primary bottleneck. In this paper, we introduce BloomBee, an internet-scale distributed LLM inference framework. BloomBee integrates LLM-layer assignment, micro-batching, and tensor offloading to optimize communication along multiple dimensions. BloomBee formulates the coordination of these techniques as an optimization problem and solves it using dynamic programming. It also customizes lossless compression and speculative decoding for low-bandwidth network settings to reduce communication overhead. We evaluate BloomBee across a spectrum of network environments and show that, compared to state-of-the-art decentralized LLM inference systems, it improves service throughput by up to 1.76x and reduces average latency by up to 43.20%. BloomBee is open-sourced.
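
The abstract does not spell out the dynamic program, so the following is only a minimal sketch of the general idea, not BloomBee's actual formulation. It assumes a fixed chain of nodes, equal per-layer compute cost, a fixed transfer time per link, and no overlap between compute and communication; it covers layer assignment only and ignores micro-batching and tensor offloading. The function name `assign_layers` and all of its parameters are hypothetical.

```python
from functools import lru_cache


def assign_layers(num_layers, layer_cost, node_speed, link_cost):
    """Split layers into contiguous blocks, one block per node in chain order,
    minimizing the slowest pipeline stage.

    Assumptions (illustrative, not from the paper):
      - layer_cost: compute cost of one layer on a node of speed 1.0
      - node_speed[i]: relative speed of node i
      - link_cost[i]: time to send activations from node i to node i + 1
      - stage time = compute time + outgoing transfer time (no overlap)
      - num_layers >= len(node_speed), so every node gets at least one layer
    """
    num_nodes = len(node_speed)

    @lru_cache(maxsize=None)
    def best(start, node):
        # Best (bottleneck_time, split_points) for layers[start:] on nodes[node:].
        if node == num_nodes - 1:
            # The last node takes every remaining layer and sends nothing onward.
            stage = (num_layers - start) * layer_cost / node_speed[node]
            return stage, (num_layers,)

        best_time, best_split = float("inf"), None
        # Leave at least one layer for each node further down the chain.
        last_end = num_layers - (num_nodes - node - 1)
        for end in range(start + 1, last_end + 1):
            compute = (end - start) * layer_cost / node_speed[node]
            stage = compute + link_cost[node]
            rest_time, rest_split = best(end, node + 1)
            bottleneck = max(stage, rest_time)
            if bottleneck < best_time:
                best_time, best_split = bottleneck, (end,) + rest_split
        return best_time, best_split

    return best(0, 0)
```

Under these assumptions, `assign_layers(4, 1.0, [1.0, 3.0], [0.0])` returns `(1.0, (1, 4))`: the slower first node holds one layer and the faster second node holds the remaining three. A fuller model in the spirit of the abstract would add micro-batch size and offloading decisions to the state, which this sketch leaves out.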
Problem

Research questions and friction points this paper is trying to address.

Distributed LLM Inference
Communication Bottleneck
Low-Bandwidth Networks
Decentralized Inference
Internet-Scale
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed LLM Inference
Multi-Dimensional Communication Optimization
Tensor Offloading
Speculative Decoding
Dynamic Programming