🤖 AI Summary
To address three critical challenges in private LLM deployment (low GPU utilization, difficult multi-model load scheduling, and insufficient service reliability), this paper proposes a scalable, cost-efficient self-hosted LLM orchestration framework. Methodologically, it introduces: (1) a hybrid routing mechanism that combines keyword-based heuristics with a lightweight DistilBERT classifier; (2) an adaptive "scale-to-zero" auto-scaling policy; and (3) a unified Kubernetes- and Helm-based architecture for seamless deployment and coordinated multi-model scheduling. Experimental evaluation across 31,019 prompts and 163,720 inference requests shows that, compared to static deployment, the framework improves request success rate by up to 21.6%, reduces average latency by 30%, and cuts per-query GPU cost by 33%. It effectively supports collaborative inference across heterogeneous models, including Llama-3, Gemma-3, Qwen-3, and DeepSeek-R1.
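The "scale-to-zero" policy mentioned above can be illustrated with a minimal controller sketch: a model deployment that has received no traffic for an idle timeout is scaled to zero replicas, and the next request triggers a cold-start scale-up. All names here (`ScaleToZeroController`, `scale_fn`, the timeout value) are illustrative assumptions, not the paper's actual API.

```python
import time
from typing import Callable, Dict, Optional

class ScaleToZeroController:
    """Sketch of an idle-timeout scale-to-zero policy (assumed design)."""

    def __init__(self, scale_fn: Callable[[str, int], None],
                 idle_timeout: float = 300.0):
        self.scale_fn = scale_fn          # would wrap the Kubernetes scale API
        self.idle_timeout = idle_timeout  # seconds of inactivity before scale-down
        self.last_seen: Dict[str, float] = {}
        self.replicas: Dict[str, int] = {}

    def on_request(self, model: str) -> None:
        """Record traffic; cold-start the model if it was scaled to zero."""
        self.last_seen[model] = time.monotonic()
        if self.replicas.get(model, 0) == 0:
            self.replicas[model] = 1
            self.scale_fn(model, 1)

    def reconcile(self, now: Optional[float] = None) -> None:
        """Periodic loop: scale idle deployments down to zero replicas."""
        now = time.monotonic() if now is None else now
        for model, last in self.last_seen.items():
            if self.replicas.get(model, 0) > 0 and now - last > self.idle_timeout:
                self.replicas[model] = 0
                self.scale_fn(model, 0)
```

In a real cluster the `scale_fn` hook would patch a Deployment's replica count; the sketch only shows the bookkeeping that decides when to do so.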
📝 Abstract
Self-hosting large language models (LLMs) is increasingly appealing for organizations seeking privacy, cost control, and customization. Yet deploying and maintaining in-house models poses challenges in GPU utilization, workload routing, and reliability. We introduce Pick and Spin, a practical framework that makes self-hosted LLM orchestration scalable and economical. Built on Kubernetes, it integrates a unified Helm-based deployment system, adaptive scale-to-zero automation, and a hybrid routing module that balances cost, latency, and accuracy using both keyword heuristics and a lightweight DistilBERT classifier. We evaluate four models, Llama-3 (90B), Gemma-3 (27B), Qwen-3 (235B), and DeepSeek-R1 (685B), across eight public benchmark datasets, five inference strategies, and two routing variants, encompassing 31,019 prompts and 163,720 inference runs. Pick and Spin achieves up to 21.6% higher success rates, 30% lower latency, and 33% lower GPU cost per query compared with static deployments of the same models.
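The hybrid routing module described in the abstract can be sketched as a two-stage decision: cheap keyword heuristics fire first, and only prompts that match no rule fall through to a learned classifier (DistilBERT in the paper; stubbed here as a plain callable). The keyword rules, tier names, and model mapping below are illustrative assumptions, not the paper's actual configuration.

```python
from typing import Callable, Optional

# Stage-1 heuristics: trigger words mapped to a model tier (names are invented
# for illustration; the paper's actual rule set is not specified here).
KEYWORD_RULES = {
    "prove": "reasoning",      # math/logic prompts go to a large reasoning model
    "derive": "reasoning",
    "translate": "general",
    "summarize": "general",
}

# Tier -> backing model (drawn from the models evaluated in the paper).
MODEL_BY_TIER = {
    "reasoning": "DeepSeek-R1",
    "general": "Llama-3",
}

def route(prompt: str,
          classifier: Optional[Callable[[str], str]] = None) -> str:
    """Return a target model name for the prompt.

    Stage 1: keyword heuristics (near-zero cost).
    Stage 2: fall back to a lightweight classifier when no rule fires;
    without one, default to the cheapest general-purpose tier.
    """
    lowered = prompt.lower()
    for keyword, tier in KEYWORD_RULES.items():
        if keyword in lowered:
            return MODEL_BY_TIER[tier]
    if classifier is not None:
        return MODEL_BY_TIER[classifier(prompt)]
    return MODEL_BY_TIER["general"]
```

The design intent, as the abstract frames it, is that most prompts are resolved by the heuristics at negligible cost, so the classifier's inference overhead is paid only on ambiguous traffic.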