🤖 AI Summary
This work addresses the scalability limitations of centralized large language model (LLM) services and the underutilization of globally distributed GPU resources. Existing decentralized approaches often overlook competitive dynamics among participants and rely on unrealistic assumptions. To overcome these issues, we propose a novel decentralized framework that, for the first time, explicitly accounts for participant autonomy and competition in LLM serving. Our approach eliminates reliance on fixed software-hardware stacks and strong central coordination, instead leveraging a decentralized network architecture, a self-organizing request scheduling algorithm, and a flexible resource commitment mechanism to enable autonomous collaboration in heterogeneous environments. Experimental results demonstrate that our method improves global service-level objective attainment by up to 1.5×, reduces latency by 27.6%, and matches or surpasses the performance of centralized schedulers while preserving the inherent benefits of decentralization.
📝 Abstract
Large language model (LLM) services are mostly centralized, leading to scalability bottlenecks and underutilization of substantial scattered GPU resources. While decentralization offers a promising alternative, existing frameworks primarily focus on cooperation among GPU providers while overlooking their inherent competitive dynamics, and they impose substantial constraints such as excessive platform-level oversight or rigid requirements to execute all assigned requests with fixed software stacks on fixed hardware configurations. We argue that such assumptions are unrealistic in real-world decentralized environments. To this end, we propose WWW.Serve, a decentralized framework for interconnecting LLM services worldwide. It allows participants to flexibly determine their participation policies and resource commitments, and supports self-organizing request dispatch, enabling the network to allocate requests autonomously without centralized coordination. Empirically, we show that WWW.Serve improves global SLO (service-level objective) attainment by up to 1.5× and lowers latency by 27.6%. Its performance approaches, and in some cases surpasses, centralized scheduling, while fully preserving the benefits of decentralization. These results highlight WWW.Serve as a promising foundation for real-world, decentralized LLM serving.
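To make the abstract's notion of "self-organizing request dispatch" with "flexible resource commitments" concrete, here is a minimal illustrative sketch, not taken from the paper: each provider node autonomously decides whether to accept a request based on capacity it chose to commit, and a requester routes to whichever willing peer offers the lowest estimated latency within the SLO, with no central scheduler holding global state. All names (`ProviderNode`, `dispatch`, `slo_seconds`) and the bidding rule are assumptions for illustration only.

```python
class ProviderNode:
    """Hypothetical provider in a decentralized serving network.

    The node refuses work beyond its self-chosen commitment, modeling
    the participant autonomy the abstract describes. This is an
    illustrative sketch, not the paper's actual protocol."""

    def __init__(self, name, committed_slots, service_time):
        self.name = name
        self.committed_slots = committed_slots  # capacity the provider chose to commit
        self.service_time = service_time        # per-request latency estimate (seconds)
        self.queue = 0                          # requests currently held locally

    def bid(self):
        """Advertise an estimated completion time, or None if unwilling."""
        if self.queue >= self.committed_slots:
            return None  # participation policy: refuse beyond commitment
        return (self.queue + 1) * self.service_time

    def accept(self):
        self.queue += 1


def dispatch(peers, slo_seconds):
    """Self-organizing dispatch: poll peers' bids and route to the
    lowest estimated latency that still meets the SLO. No central
    coordinator is involved; each peer answers from local state only."""
    bids = [(node.bid(), node) for node in peers]
    feasible = [(t, node) for t, node in bids if t is not None and t <= slo_seconds]
    if not feasible:
        return None  # SLO violated everywhere; caller may retry or relax
    _, best = min(feasible, key=lambda pair: pair[0])
    best.accept()
    return best.name


peers = [ProviderNode("gpu-a", committed_slots=2, service_time=1.0),
         ProviderNode("gpu-b", committed_slots=4, service_time=2.5)]
assignments = [dispatch(peers, slo_seconds=6.0) for _ in range(5)]
print(assignments)  # → ['gpu-a', 'gpu-a', 'gpu-b', 'gpu-b', None]
```

The fast node fills up to its commitment first, overflow drifts to the slower node, and the final request is rejected once every feasible bid would miss the SLO; the real system presumably replaces this toy bidding rule with richer, competition-aware policies.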