🤖 AI Summary
Serving large language models (LLMs) on consumer-grade GPUs in low-bandwidth, heterogeneous host environments remains challenging due to communication bottlenecks and inefficient resource utilization. Method: This paper proposes a lightweight distributed inference framework featuring cross-public-network and Ethernet-coordinated tensor parallelism, dynamic load balancing, heterogeneous memory-aware scheduling, and a low-overhead communication protocol, enabling multi-GPU aggregated inference under weak-network conditions for the first time. Contribution/Results: The framework supports 18 mainstream open-source LLMs and offers plug-and-play deployment across Windows, Linux, and containerized VMs. Experiments demonstrate up to 458% higher throughput and up to a 151% increase in profit margin per unit of computational cost compared to state-of-the-art systems, significantly lowering both the technical barrier and the operational cost of LLM serving.
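To make the heterogeneous memory-aware scheduling idea concrete, below is a minimal sketch assuming a simple proportional-split policy: each host receives a contiguous block of transformer layers sized by its free GPU memory. The function `partition_layers` and its inputs are illustrative assumptions, not MoLink's actual algorithm, which the summary does not specify.

```python
# Hypothetical sketch of heterogeneous memory-aware layer partitioning:
# assign each host a contiguous block of model layers proportional to its
# free GPU memory. Names and the policy itself are illustrative only.

def partition_layers(num_layers: int, free_mem_gb: list[float]) -> list[range]:
    total = sum(free_mem_gb)
    bounds, start = [], 0
    for i, mem in enumerate(free_mem_gb):
        # The last host takes the remainder so every layer is covered.
        if i == len(free_mem_gb) - 1:
            count = num_layers - start
        else:
            count = round(num_layers * mem / total)
        bounds.append(range(start, start + count))
        start += count
    return bounds

# Example: a 32-layer model across three uneven consumer GPUs
print(partition_layers(32, [24.0, 12.0, 8.0]))
# -> [range(0, 17), range(17, 26), range(26, 32)]
```

A proportional split like this keeps each host's layer slice within its memory budget; a real scheduler would additionally weigh link bandwidth and per-host compute, per the summary's dynamic load balancing.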
📝 Abstract
Large language models (LLMs) represent a groundbreaking shift in generative AI. Yet these advances come with a significant challenge: the high cost of model serving. Consumer-grade GPUs emerge as a more affordable alternative, presenting an opportunity for more cost-efficient LLM serving.
However, achieving high-efficiency LLM serving on consumer-grade GPUs is non-trivial, mainly due to two challenges: 1) these GPUs are often deployed under limited network conditions, and 2) their host systems often exhibit heterogeneity. To address these challenges, we present MoLink, a distributed serving system for large language models. It incorporates several key techniques that enable efficient LLM serving on heterogeneous and weakly connected consumer-grade GPUs. Our experiments demonstrate throughput improvements of up to 458% and cost-profit margin improvements of up to 151% compared to state-of-the-art systems. MoLink allows users on Windows, Linux, and containerized VMs to seamlessly integrate GPUs over Ethernet or public networks with just a few lines of code. Currently, it supports 18 mainstream open-source LLM architectures.
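To give the "few lines of code" claim a concrete shape, here is a minimal, hypothetical sketch of joining a worker GPU host to a distributed serving group over Ethernet or a public network. It uses plain PyTorch primitives rather than MoLink's actual API, which the abstract does not detail; the address, port, rank, and world size are all illustrative assumptions.

```python
# Hypothetical sketch, NOT MoLink's real API: joining a worker host to a
# distributed serving group over Ethernet or a public network using plain
# PyTorch primitives. Address, port, rank, and world size are placeholders.
import torch
import torch.distributed as dist

COORDINATOR_ADDR = "203.0.113.10"  # illustrative coordinator IP (public or LAN)
PORT = 29500                       # illustrative open TCP port

dist.init_process_group(
    backend="gloo",                # TCP-based backend; needs no NVLink/InfiniBand
    init_method=f"tcp://{COORDINATOR_ADDR}:{PORT}",
    rank=1,                        # this worker; rank 0 is the coordinator
    world_size=2,                  # total number of participating hosts
)

# Smoke test: every host contributes 1; the all-reduced sum equals the group size.
x = torch.ones(1)
dist.all_reduce(x)
print(f"connected hosts: {int(x.item())}")
```

The point of the sketch is only the setup pattern: once the process group spans a slow link, every exchanged tensor pays that link's latency and bandwidth cost, which is precisely the bottleneck that MoLink's low-overhead communication protocol and weak-network coordination target.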