Efficient Multi-Model Orchestration for Self-Hosted Large Language Models

📅 2025-12-26
🤖 AI Summary
To address three critical challenges in private LLM deployment—low GPU utilization, difficulty in multi-model load scheduling, and insufficient service reliability—this paper proposes a scalable, cost-efficient self-hosted LLM orchestration framework. Methodologically, it introduces: (1) a novel hybrid routing mechanism integrating keyword-based heuristics with lightweight DistilBERT classification; (2) an adaptive “scale-to-zero” auto-scaling policy; and (3) a unified Kubernetes- and Helm-based architecture for seamless deployment and coordinated multi-model scheduling. Experimental evaluation across 31,019 prompts and 163,720 inference requests demonstrates that, compared to static deployment, the framework improves request success rate by 21.6%, reduces average latency by 30%, and cuts per-query GPU cost by 33%. The framework supports collaborative inference across heterogeneous models, including Llama-3, Qwen-3, and DeepSeek-R1.
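The hybrid routing idea described above can be sketched as a two-stage decision: cheap keyword heuristics handle clearly-typed prompts, and a lightweight classifier (DistilBERT in the paper, stubbed out here) handles the ambiguous remainder. The keyword rules, model names, and the length-based stand-in classifier below are illustrative assumptions, not the paper's actual routing tables.

```python
# Hypothetical sketch of hybrid routing: keyword heuristics first,
# then a lightweight classifier for prompts no rule matches.
# Rules and model assignments are assumptions for illustration.

KEYWORD_RULES = {
    "code": "Qwen-3",        # code-heavy prompts -> code-capable model
    "prove": "DeepSeek-R1",  # reasoning-heavy prompts -> reasoning model
    "translate": "Llama-3",  # general language tasks -> general model
}

def classify_with_model(prompt: str) -> str:
    """Stand-in for the paper's lightweight DistilBERT classifier.

    A real system would run a fine-tuned encoder here; this sketch
    approximates it with a crude length rule (long prompts are
    assumed to need the heavier reasoning model)."""
    return "DeepSeek-R1" if len(prompt.split()) > 50 else "Llama-3"

def route(prompt: str) -> str:
    """Pick a target model for a prompt: fast heuristic path, then
    classifier fallback."""
    lowered = prompt.lower()
    for keyword, model in KEYWORD_RULES.items():
        if keyword in lowered:
            return model                 # fast path: heuristic match
    return classify_with_model(prompt)   # slow path: classifier

print(route("Please translate this sentence"))   # heuristic hit
print(route("Write code to sort a list"))        # heuristic hit
print(route("What is the capital of France?"))   # classifier fallback
```

Keeping the heuristic stage in front of the classifier means most requests never pay the classifier's inference cost, which matters when the router sits on the hot path of every query.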

📝 Abstract
Self-hosting large language models (LLMs) is increasingly appealing for organizations seeking privacy, cost control, and customization. Yet deploying and maintaining in-house models poses challenges in GPU utilization, workload routing, and reliability. We introduce Pick and Spin, a practical framework that makes self-hosted LLM orchestration scalable and economical. Built on Kubernetes, it integrates a unified Helm-based deployment system, adaptive scale-to-zero automation, and a hybrid routing module that balances cost, latency, and accuracy using both keyword heuristics and a lightweight DistilBERT classifier. We evaluate four models (Llama-3 90B, Gemma-3 27B, Qwen-3 235B, and DeepSeek-R1 685B) across eight public benchmark datasets, with five inference strategies and two routing variants, encompassing 31,019 prompts and 163,720 inference runs. Pick and Spin achieves up to 21.6% higher success rates, 30% lower latency, and 33% lower GPU cost per query compared with static deployments of the same models.
Problem

Research questions and friction points this paper is trying to address.

Optimizes GPU utilization and workload routing for self-hosted LLMs
Reduces deployment and maintenance costs of in-house large language models
Improves reliability and success rates in multi-model LLM orchestration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Kubernetes-based unified Helm deployment system
Adaptive scale-to-zero automation for GPU efficiency
Hybrid routing with keyword heuristics and DistilBERT classifier
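The scale-to-zero policy listed above can be sketched as a small controller that releases a model's replicas after a period of inactivity and spins one back up on the next request. The class name, the `on_request`/`tick` API, and the idle-timeout value are all assumptions for illustration; the paper implements this on Kubernetes rather than in application code.

```python
import time

class ScaleToZeroController:
    """Toy sketch of an adaptive scale-to-zero policy (assumed API,
    not the paper's implementation): a model that receives no
    requests for `idle_timeout` seconds is scaled to zero replicas,
    freeing its GPUs; the next request triggers a cold start."""

    def __init__(self, idle_timeout: float = 300.0):
        self.idle_timeout = idle_timeout      # seconds of allowed idleness
        self.last_request = time.monotonic()  # timestamp of last traffic
        self.replicas = 1                     # start with one warm replica

    def on_request(self) -> None:
        """Record traffic; cold-start the model if it was scaled down."""
        self.last_request = time.monotonic()
        if self.replicas == 0:
            self.replicas = 1  # cold start: spin the model back up

    def tick(self) -> None:
        """Periodic check: scale an idle model down to zero replicas."""
        idle_for = time.monotonic() - self.last_request
        if self.replicas > 0 and idle_for > self.idle_timeout:
            self.replicas = 0  # release GPUs held by an idle model
```

In a Kubernetes setting this loop would be driven by a custom controller or an autoscaler watching request metrics, with the cold-start latency of `on_request` being the cost that the adaptive part of the policy trades against GPU savings.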
Bhanu Prakash Vangala
Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
Tanu Malik
Associate Professor, University of Missouri, Columbia
Data Management Systems · Data Provenance · HPC Systems