GenTorrent: Scaling Large Language Model Serving with An Overlay Network

📅 2025-04-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scalability bottleneck hindering small organizations and individuals in deploying large language models (LLMs). We propose the first overlay network framework tailored for decentralized LLM services, tackling four core challenges: dynamic overlay topology organization, end-to-end communication privacy, resource-aware efficient inference forwarding, and verifiable quality-of-service (QoS) guarantees. Our approach integrates peer-to-peer (P2P) self-organizing topology management, lightweight homomorphic encryption for secure inter-node communication, load- and latency-aware dynamic routing, and zero-knowledge proof-based QoS attestation. Evaluation of our prototype demonstrates over 50% reduction in end-to-end inference latency, negligible overhead from encryption and verification (<3%), and substantial improvements in throughput and node availability. The framework establishes a scalable, secure, and cryptographically verifiable paradigm for edge-deployed LLM services.

📝 Abstract
While significant progress has been made in research and development on open-source and cost-efficient large-language models (LLMs), serving scalability remains a critical challenge, particularly for small organizations and individuals seeking to deploy and test their LLM innovations. Inspired by peer-to-peer networks that leverage decentralized overlay nodes to increase throughput and availability, we propose GenTorrent, an LLM serving overlay that harnesses computing resources from decentralized contributors. We identify four key research problems inherent to enabling such a decentralized infrastructure: 1) overlay network organization; 2) LLM communication privacy; 3) overlay forwarding for resource efficiency; and 4) verification of serving quality. This work presents the first systematic study of these fundamental problems in the context of decentralized LLM serving. Evaluation results from a prototype implemented on a set of decentralized nodes demonstrate that GenTorrent achieves a latency reduction of over 50% compared to the baseline design without overlay forwarding. Furthermore, the security features introduce minimal overhead to serving latency and throughput. We believe this work pioneers a new direction for democratizing and scaling future AI serving capabilities.
Problem

Research questions and friction points this paper is trying to address.

Scaling LLM serving for small organizations and individuals
Decentralized overlay network for efficient resource utilization
Addressing privacy and quality in decentralized LLM serving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decentralized overlay network for LLM serving
Privacy-aware LLM communication protocol
Efficient overlay forwarding for latency reduction
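The paper does not spell out its forwarding rule here, but the summary describes load- and latency-aware dynamic routing. As a purely illustrative sketch (the node names, measurements, and linear scoring rule below are all assumptions, not the authors' algorithm), a client could pick the next overlay hop by trading off measured round-trip time against each node's current queue depth:

```python
def pick_overlay_node(nodes, rtt_ms, queue_len, alpha=0.5):
    """Score each candidate overlay node by a weighted mix of measured
    round-trip time and current queue length, then pick the node with
    the lowest score. alpha trades latency against load."""
    def score(n):
        return alpha * rtt_ms[n] + (1 - alpha) * queue_len[n]
    return min(nodes, key=score)

# Hypothetical measurements for three overlay nodes.
nodes = ["node-a", "node-b", "node-c"]
rtt_ms = {"node-a": 40.0, "node-b": 15.0, "node-c": 25.0}
queue_len = {"node-a": 2, "node-b": 30, "node-c": 4}

print(pick_overlay_node(nodes, rtt_ms, queue_len))  # -> node-c
```

With equal weighting, the nearest node (node-b) loses to node-c because its long queue outweighs its lower RTT; setting alpha=1.0 would rank on latency alone.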
Authors
Fei Fang (UC Santa Cruz)
Yifan Hua (UC Santa Cruz)
Shengze Wang (UC Santa Cruz)
Ruilin Zhou (UC Santa Cruz)
Yi Liu (UC Santa Cruz)
Chen Qian (UC Santa Cruz)
Xiaoxue Zhang (University of Nevada, Reno)
Computer Networks · Blockchain