GenTorrent: Scaling Large Language Model Serving with An Overlay Network

📅 2025-04-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scalability bottleneck hindering small organizations and individuals in deploying large language models (LLMs). We propose the first overlay network framework tailored for decentralized LLM services, tackling four core challenges: dynamic overlay topology organization, end-to-end communication privacy, resource-aware efficient inference forwarding, and verifiable quality-of-service (QoS) guarantees. Our approach integrates peer-to-peer (P2P) self-organizing topology management, lightweight homomorphic encryption for secure inter-node communication, load- and latency-aware dynamic routing, and zero-knowledge proof-based QoS attestation. Evaluation of our prototype demonstrates over 50% reduction in end-to-end inference latency, negligible overhead from encryption and verification (<3%), and substantial improvements in throughput and node availability. The framework establishes a scalable, secure, and cryptographically verifiable paradigm for edge-deployed LLM services.

📝 Abstract
While significant progress has been made in research and development on open-source and cost-efficient large-language models (LLMs), serving scalability remains a critical challenge, particularly for small organizations and individuals seeking to deploy and test their LLM innovations. Inspired by peer-to-peer networks that leverage decentralized overlay nodes to increase throughput and availability, we propose GenTorrent, an LLM serving overlay that harnesses computing resources from decentralized contributors. We identify four key research problems inherent to enabling such a decentralized infrastructure: 1) overlay network organization; 2) LLM communication privacy; 3) overlay forwarding for resource efficiency; and 4) verification of serving quality. This work presents the first systematic study of these fundamental problems in the context of decentralized LLM serving. Evaluation results from a prototype implemented on a set of decentralized nodes demonstrate that GenTorrent achieves a latency reduction of over 50% compared to the baseline design without overlay forwarding. Furthermore, the security features introduce minimal overhead to serving latency and throughput. We believe this work pioneers a new direction for democratizing and scaling future AI serving capabilities.
Problem

Research questions and friction points this paper is trying to address.

Scaling LLM serving for small organizations and individuals
Decentralized overlay network for efficient resource utilization
Addressing privacy and quality in decentralized LLM serving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decentralized overlay network for LLM serving
Privacy-aware LLM communication protocol
Efficient overlay forwarding for latency reduction
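The paper does not spell out its forwarding rule here, but the summary describes load- and latency-aware dynamic routing. As a purely illustrative sketch (the node names, measurements, and linear scoring rule below are all assumptions, not the authors' algorithm), a client could pick the next overlay hop by trading off measured round-trip time against each node's current queue depth:

```python
def pick_overlay_node(nodes, rtt_ms, queue_len, alpha=0.5):
    """Score each candidate overlay node by a weighted mix of measured
    round-trip time and current queue length, then pick the node with
    the lowest score. alpha trades latency against load."""
    def score(n):
        return alpha * rtt_ms[n] + (1 - alpha) * queue_len[n]
    return min(nodes, key=score)

# Hypothetical measurements for three overlay nodes.
nodes = ["node-a", "node-b", "node-c"]
rtt_ms = {"node-a": 40.0, "node-b": 15.0, "node-c": 25.0}
queue_len = {"node-a": 2, "node-b": 30, "node-c": 4}

print(pick_overlay_node(nodes, rtt_ms, queue_len))  # -> node-c
```

With equal weighting, the nearest node (node-b) loses to node-c because its long queue outweighs its lower RTT; setting alpha=1.0 would rank on latency alone.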
Authors
Fei Fang (UC Santa Cruz)
Yifan Hua (UC Santa Cruz)
Shengze Wang (UC Santa Cruz)
Ruilin Zhou (UC Santa Cruz)
Yi Liu (UC Santa Cruz)
Chen Qian (UC Santa Cruz)
Xiaoxue Zhang (University of Nevada, Reno)
Computer Networks · Blockchain