🤖 AI Summary
This work addresses the communication bottleneck in internet-scale distributed inference of large language models (LLMs), where limited cross-node bandwidth constrains system efficiency. The authors propose BloomBee, a framework that treats layer assignment, micro-batching, and tensor offloading as a single joint optimization problem and solves it with dynamic programming. BloomBee also adapts lossless compression and speculative decoding to low-bandwidth network settings. Across a range of network environments, BloomBee improves serving throughput by up to 1.76× and reduces average latency by up to 43.20% compared to state-of-the-art decentralized LLM inference systems.
📝 Abstract
Decentralized LLM inference distributes computation among heterogeneous nodes across the internet, offering a performant and cost-efficient alternative to traditional centralized inference. However, low cross-node network bandwidth makes communication the primary bottleneck. In this paper, we introduce BloomBee, an internet-scale distributed LLM inference framework. BloomBee integrates LLM-layer assignment, micro-batching, and tensor offloading to optimize communication along multiple dimensions. BloomBee formulates the coordination of these techniques as an optimization problem and solves it using dynamic programming. BloomBee also customizes lossless compression and speculative decoding for low-bandwidth network settings to reduce communication overhead. We evaluate BloomBee across a spectrum of network environments and show that it improves service throughput by up to 1.76× and reduces average latency by up to 43.20% compared to state-of-the-art decentralized LLM inference systems. BloomBee is open-sourced.
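To make the dynamic-programming idea concrete, here is a minimal sketch of layer assignment alone. It is not BloomBee's actual formulation, which the abstract does not spell out and which also covers micro-batching and tensor offloading. The function name `assign_layers`, the cost model (a fixed per-layer time and layer capacity per node, plus a constant `hop_cost` for each cross-node transfer), and the assumption that nodes are visited in a fixed order are all hypothetical simplifications.

```python
from math import inf


def assign_layers(num_layers, per_layer_time, capacity, hop_cost):
    """Split consecutive layers across an ordered list of nodes.

    Hypothetical cost model (an assumption, not taken from the paper):
    node j spends per_layer_time[j] on each layer it hosts, can host at
    most capacity[j] layers, and every hand-off between two nodes that
    host layers costs hop_cost. Returns (latency, layers_per_node).
    """
    num_nodes = len(per_layer_time)
    # best[j][i]: minimum latency to place the first i layers on a
    # subset of the first j nodes. take[j][i]: layers given to node j-1.
    best = [[inf] * (num_layers + 1) for _ in range(num_nodes + 1)]
    take = [[0] * (num_layers + 1) for _ in range(num_nodes + 1)]
    best[0][0] = 0.0

    for j in range(1, num_nodes + 1):
        t, cap = per_layer_time[j - 1], capacity[j - 1]
        for i in range(num_layers + 1):
            # Option 1: this node hosts nothing.
            best[j][i] = best[j - 1][i]
            take[j][i] = 0
            # Option 2: this node hosts the last k of the first i layers.
            for k in range(1, min(cap, i) + 1):
                prev = best[j - 1][i - k]
                if prev == inf:
                    continue
                # A hop is paid only if earlier layers live on another node.
                cost = prev + k * t + (hop_cost if i - k > 0 else 0.0)
                if cost < best[j][i]:
                    best[j][i] = cost
                    take[j][i] = k

    if best[num_nodes][num_layers] == inf:
        raise ValueError("nodes do not have enough capacity for all layers")

    # Walk the table backwards to recover how many layers each node hosts.
    counts = [0] * num_nodes
    i = num_layers
    for j in range(num_nodes, 0, -1):
        counts[j - 1] = take[j][i]
        i -= take[j][i]
    return best[num_nodes][num_layers], counts
```

In this toy model, a cheap hop favors splitting layers across both nodes, while an expensive hop pushes every layer onto a single node even if that node is slower. This mirrors the abstract's point that low bandwidth makes communication the dominant cost.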